I agree with Kiri that there is probably no single best approach for
assessing a classification. If you are doing cluster analysis, then
referring the result to some external criterion is a good approach. I
often need to explain (and re-explain, if there is such a word) this to
clients who are working with cluster analysis for the first time.
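When an external criterion is available in the form of known labels, one common way to score a clustering against it is the adjusted Rand index. Here is a minimal Python sketch (the function name is my own; packages such as scikit-learn provide an equivalent):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items.
    1.0 means identical partitions (up to relabeling); values near 0 mean
    agreement no better than chance."""
    n = len(labels_a)
    # pairs of items falling in each (cluster, class) cell of the contingency table
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # expected value under random labeling
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# identical partitions up to relabeling -> perfect agreement
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

Note the index is invariant to how the clusters are numbered, which is what you want when the clustering algorithm's label names are arbitrary.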
If you are doing a classification (trying to model known group
assignments) then I would recommend you run all the reasonable
discrimination models (LDA, CART, etc.) and select the one with the
smallest cross-validated misclassification rate.
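To make the cross-validation suggestion concrete, here is a toy Python sketch comparing two simple classifiers (nearest centroid, a crude stand-in for LDA, and 1-nearest-neighbor) by their leave-one-out misclassification rates. The data and classifiers are illustrative stand-ins, not a real analysis:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_centroid(train, x):
    """Assign x to the class whose mean (centroid) is closest."""
    by_class = {}
    for p, lab in train:
        by_class.setdefault(lab, []).append(p)
    centroids = {lab: tuple(sum(c) / len(pts) for c in zip(*pts))
                 for lab, pts in by_class.items()}
    return min(centroids, key=lambda lab: dist2(centroids[lab], x))

def one_nn(train, x):
    """Assign x the label of its single nearest training point."""
    return min(train, key=lambda pl: dist2(pl[0], x))[1]

def loocv_error(data, predict):
    """Leave-one-out cross-validated misclassification rate."""
    errors = sum(predict(data[:i] + data[i + 1:], x) != lab
                 for i, (x, lab) in enumerate(data))
    return errors / len(data)

# toy two-class data: two well-separated Gaussian clouds
random.seed(0)
data = (
    [((random.gauss(0, 0.5), random.gauss(0, 0.5)), "A") for _ in range(20)]
    + [((random.gauss(3, 0.5), random.gauss(3, 0.5)), "B") for _ in range(20)]
)

for name, clf in [("nearest centroid", nearest_centroid), ("1-NN", one_nn)]:
    print(name, loocv_error(data, clf))
```

The same loop extends to any classifier you can wrap in a `predict(train, x)` function, so adding LDA, CART, and so on is just more entries in the list.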
I think different cluster/classification methods in general will work
great for some data and fail terribly for other data -- no one approach
will be best across the board.
Kiri also mentions perturbing the data and measuring the stability of the
clusters. I have been working with 'consensus methods' for the last several
months, which is one way to think very rigorously about Kiri's suggestion.
I'm finding it potentially useful and am preparing a paper where we use
consensus methods to cluster tumor types in a novel (and hopefully useful)
way.
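A toy sketch of one way to make the perturbation idea concrete: re-cluster jittered copies of the data (plain Lloyd's k-means here) and record, for every pair of points, the fraction of runs in which they land in the same cluster. A stable clustering pushes those fractions toward 0 or 1. All names are mine, not from any consensus-clustering package:

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, rng=random):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def consensus_matrix(points, k, runs=30, noise=0.1, rng=random):
    """Fraction of perturbed re-clusterings in which each pair co-clusters."""
    n = len(points)
    co = [[0] * n for _ in range(n)]
    for _ in range(runs):
        jittered = [tuple(x + rng.gauss(0, noise) for x in p) for p in points]
        labels = kmeans(jittered, k, rng=rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1
    return [[c / runs for c in row] for row in co]
```

For well-separated data the consensus entries sit near 0 or 1; values smeared in between are a warning that the partition is an artifact of the particular sample.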
William D. Shannon, Ph.D.
Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences and Biostatistics
Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO 63110
e-mail: [log in to unmask]
web page: http://ilya.wustl.edu/~shannon
On Wed, 23 Oct 2002, Kiri Wagstaff wrote:
> On Wed, 23 Oct 2002, Henry Bulley wrote:
> > I am working on different classification methods based on ecological
> > (spatial) variables. I would be glad if any of you could direct me to
> > literature on comparing the different classification methods as to which
> > one is better.
> The problem of validation in clustering is a hard one. I assume by "which
> one is better" you're trying to compare the results of different
> clustering methods on the same data set, and you want to decide which
> results are the best. I doubt there's any consensus about the best
> overall clustering method - the appropriateness of any algorithm depends
> strongly upon the characteristics of your data set as well as what you're
> trying to find. For example, some methods assume that your data has
> Gaussian distributions in it. Others make different assumptions. The
> quality of your results will be affected by how well your data fits the
> assumptions of the method.
> Probably the best way to validate clustering results is to compare the
> classification to some actual known data labels (ground truth), if you
> have it. Unfortunately, for many clustering applications you don't have
> that kind of information (which is why you're using a clustering
> algorithm). Failing that, another approach would be to evaluate the
> robustness of the method - given a small perturbation of your original
> data, how much does the classification change? Or you can compare the
> results of different clustering algorithms just to get a sense of how much
> agreement there is - support from more than one approach for the same
> partition of the data gives you more confidence in the partition.
> Hope this helps!
> ------------ Kiri Wagstaff, Ph.D. -------- [log in to unmask] -------------
> Love is the image you place around your significant other,
> and how close it is to being true love depends on how closely he or she
> fits into the mold. -- Orlando de La Cruz