Thoughts on statistical comparisons of two (or more) clusterings.
My PhD research project will require comparisons of clusterings and
comparisons of relational networks so I am very interested in your query.
I am new to this area and have only tentative thoughts, which amount to an
attempt to conceptualise the questions that need to be addressed so that I
may have some idea where to look for a suitable set of answers. I have made
some progress in moving from some of the questions towards potential
answers, but the process is tentative at this stage.
Chapters 19 (Procrustean Procedures) and 20 (Three-way Procrustean Models)
in Borg and Groenen (1997) provide a useful discussion of techniques for
rotating, rescaling, and comparing multidimensional scaling outcomes
with similar or dissimilar dimensionality. I suspect that much of that
discussion is applicable to comparison of clusterings.
However, for a complete, single stage application of the methods described
by Borg and Groenen the locations of the individual points in the clusters,
rather than the patterns of the clusters, would need to be compared. It
seems as though it would be possible to compare the centroids of
(non-hierarchical) clusters using these methods. Such a treatment would
discount or ignore the characteristics of the clusters themselves. In
particular, such characteristics as density, dispersion, and intra-cluster
relations of elements of each cluster would not be analyzed. It would
seem inappropriate to apply the Procrustean transformations at the
individual cluster level to compare similar clusters in different
clusterings. However, some of the comparison techniques that would
normally follow Procrustean transformations may permit analysis of the
relational distribution of elements of individual clusters that have
similar elements in different clusterings.
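
To make the centroid idea concrete, here is a minimal sketch of my own (it
is not taken from Borg and Groenen). It assumes two-dimensional data, the
same number of clusters in both clusterings, clusters already matched by
position in the list, and rotation only (no reflection). The function names
and the example data are hypothetical.

```python
import math

def centroid(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def standardise(config):
    """Centre a configuration at the origin and scale it to unit sum of squares."""
    cx, cy = centroid(config)
    centred = [(x - cx, y - cy) for x, y in config]
    scale = math.sqrt(sum(x * x + y * y for x, y in centred))
    return [(x / scale, y / scale) for x, y in centred]

def compare_centroids(clustering_a, clustering_b):
    """Rotate the centroids of B onto those of A.

    Assumes cluster i in A corresponds to cluster i in B.
    Returns (rotation angle in radians, residual sum of squares).
    """
    a = standardise([centroid(c) for c in clustering_a])
    b = standardise([centroid(c) for c in clustering_b])
    # Closed-form least-squares rotation for the 2-D case.
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    cross = sum(bx * ay - by * ax for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(cross, dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = [(bx * cos_t - by * sin_t, bx * sin_t + by * cos_t) for bx, by in b]
    residual = sum((ax - rx) ** 2 + (ay - ry) ** 2
                   for (ax, ay), (rx, ry) in zip(a, rotated))
    return theta, residual

# Hypothetical data: B is A rotated by 90 degrees, scaled by 3, and shifted.
clustering_a = [[(0, 0), (1, 1)], [(4, 0), (6, 0)], [(0, 5), (0, 7)]]
clustering_b = [[(-3 * y + 5, 3 * x + 5) for x, y in c] for c in clustering_a]

theta, residual = compare_centroids(clustering_a, clustering_b)
print(theta, residual)
```

A residual near zero suggests that the two sets of centroids have the same
shape up to translation, scaling, and rotation. Values approaching 2 (the
maximum for standardised configurations) indicate poor agreement. The sketch
says nothing about density or dispersion within clusters, which is the
limitation noted above.
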
In short, at this stage, I suspect a two-stage process using centroids and
individual clusters may be useful. If the clusterings represent multiple
sampling across some environmental facet or dimensional property of the
data-generating entities, it should be possible to use the binary results
of the two-stage comparisons to determine whether trends exist at either
level and whether any such trends covary.
However, I would not be surprised if other respondents to your question
suggest a more elegant and suitable approach.
Borg, I. & Groenen, P. (1997). Modern Multidimensional Scaling: Theory and
Applications. New York: Springer.
Queensland University of Technology
> I'm trying to find a way to compare two clusterings (two results
> of a clustering
> Is there any algorithm (or better yet - working software package
> for S-Plus/R/whatever) ?
> Thanks a lot...
> P.S. Comparing the results visually is not an option - too many