FWIW, I had a similar problem a couple of years ago when constructing unit
tests for various clusterers for vector data in R^n (specifically, I was
interested in testing SONN clustering algorithms like k-means, K-SOM, neural
gas, etc.). Here is the "seat of the pants" method that these unit tests used.
The tests amounted to comparing two cluster models directly, rather than
testing their classifications on a set of data. This approach depends on the
fact that most of the SONN clusterers use a Euclidean or Mahalanobis measure
to determine distance and hence produce clusters that are Gaussian
hyperspheres (though I suspect that this only needs to be approximately so).
Given two equal-sized sets of clusters, each represented by a central vector
in R^n and a corresponding covariance matrix ...
1) Find a one-to-one correspondence between the elements of the two sets (so
that you can compare "the same" clusters in each set) that minimizes a
global error function. For smallish sets the best correspondence can be
found exactly, and for large sets it is only practical to find a "good" one.
2) For each pair of corresponding clusters perform a Hotelling's T-square
Test to decide whether the clusters should be judged as "the same" at
whatever confidence level you choose.
3) Accept or reject the proportion of passing cluster pairs using a Z-test
in Binomial or Normal form as appropriate to the number of clusters.
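A minimal Python sketch of the three steps above, for illustration only.
It is not the original unit-test code and rests on several assumptions:
NumPy and SciPy are available; each cluster also carries a sample count
(the two-sample Hotelling test needs one, although the description above
mentions only centers and covariances); the global error is the total
squared distance between matched centers; and step 3 is done as an exact
one-sided binomial test. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import binom, f


def match_clusters(means_a, means_b):
    """Step 1: pair clusters so the summed squared center distance is minimal."""
    diff = means_a[:, None, :] - means_b[None, :, :]
    cost = np.sum(diff ** 2, axis=2)          # cost[i, j] = |a_i - b_j|^2
    rows, cols = linear_sum_assignment(cost)  # exact optimum for a sum cost
    return list(zip(rows.tolist(), cols.tolist()))


def hotelling_p_value(mean_a, cov_a, n_a, mean_b, cov_b, n_b):
    """Step 2: two-sample Hotelling's T-square test with pooled covariance."""
    p = mean_a.shape[0]
    pooled = ((n_a - 1) * cov_a + (n_b - 1) * cov_b) / (n_a + n_b - 2)
    d = mean_a - mean_b
    t2 = (n_a * n_b) / (n_a + n_b) * (d @ np.linalg.solve(pooled, d))
    # Under the null hypothesis the scaled statistic follows F(p, n_a+n_b-p-1).
    f_stat = (n_a + n_b - p - 1) / (p * (n_a + n_b - 2)) * t2
    return float(f.sf(f_stat, p, n_a + n_b - p - 1))


def compare_models(model_a, model_b, alpha=0.05):
    """Step 3: accept or reject based on the number of passing cluster pairs.

    Each model is a (means, covariances, sizes) tuple; sizes are an assumption.
    """
    means_a, covs_a, sizes_a = model_a
    means_b, covs_b, sizes_b = model_b
    pairs = match_clusters(means_a, means_b)
    passes = 0
    for i, j in pairs:
        p_val = hotelling_p_value(means_a[i], covs_a[i], sizes_a[i],
                                  means_b[j], covs_b[j], sizes_b[j])
        if p_val >= alpha:                    # pair judged "the same"
            passes += 1
    # If the models agree, each pair passes with probability 1 - alpha.
    # Reject when so few passes would be unlikely under that assumption.
    tail = binom.cdf(passes, len(pairs), 1 - alpha)
    return pairs, passes, bool(tail >= alpha)
```

Because the assumed error function is a sum of pairwise costs, the
assignment can be solved exactly even for fairly large sets; a different
global error function may still call for the heuristic search mentioned
in step 1. For many clusters the binomial tail could be replaced by its
Normal approximation.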
Hope that helps.
Jim Hanlon * mailto:[log in to unmask]
www.netperceptions.com * ph +1 952 842 5320 * fax +1 952 842 5005
> -----Original Message-----
> From: Sirotkin, Alexander [mailto:[log in to unmask]]
> Sent: Tuesday, February 25, 2003 2:24 AM
> To: [log in to unmask]
> Subject: how to compare two clusterings, continued
> From a few responses I got to my previous email I understand that I
> did not formulate my question clear enough.
> What I meant by "compare two clusterings" is not to find out which
> clustering is better - this information is usually provided
> by the k-means
> algorithm in a form of within-cluster sum of squares.
> What I actually meant was a measure of how much these two clusters
> are similar. In other words - if two points belong to the same cluster
> in both clusterings this means that they are similar.
> Thanks a lot....