FWIW, I had a similar problem a couple of years ago when constructing unit
tests for various clusterers for vector data in R^n (specifically, I was
interested in testing SONN clustering algorithms like k-means, K-SOM, neural
gas, etc.). Here is the "seat of the pants" method that these unit tests
use.

The tests amounted to comparing two cluster models directly, rather than
testing their classifications on a set of data. This approach depends on the
fact that most of the SONN clusterers use a Euclidean or Mahalanobis measure
to determine distance and hence produce clusters that are Gaussian
hyperspheres (though I suspect that this only needs to be approximately so).

Given two equal-sized sets of clusters, each represented by a central vector
in R^n and a corresponding covariance matrix ...

1) Find a one-to-one correspondence between the elements of the two sets (so
that you can compare "the same" clusters in each set) that minimizes a
global error function. For smallish sets the best correspondence can be
found exactly; for large sets it is only practical to find a "good"
correspondence.

2) For each pair of corresponding clusters perform a Hotelling's T-square
Test to decide whether the clusters should be judged as "the same" at
whatever confidence level you choose.

3) Accept or reject the proportion of passing cluster pairs using a Z-test
in Binomial or Normal form as appropriate to the number of clusters.
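
The three steps above can be sketched roughly as follows. This is only an
illustration under assumptions that are not stated above: each cluster also
carries a sample count (Hotelling's test needs one), the global error
function is the summed squared Euclidean distance between matched centers,
the two-sample test uses a pooled covariance, and the cutoff of 30 clusters
for switching from the Binomial to the Normal form is arbitrary. All function
names and parameters are hypothetical; the matching is done with SciPy's
linear_sum_assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import binom, f, norm


def match_clusters(centers_a, centers_b):
    """Step 1: pair clusters so the summed squared center distance is minimal."""
    diff = centers_a[:, None, :] - centers_b[None, :, :]
    cost = np.sum(diff ** 2, axis=2)           # assumed global error function
    rows, cols = linear_sum_assignment(cost)   # exact assignment solver
    return list(zip(rows.tolist(), cols.tolist()))


def hotelling_p_value(mean_a, cov_a, n_a, mean_b, cov_b, n_b):
    """Step 2: two-sample Hotelling T-square test with pooled covariance."""
    p = mean_a.shape[0]
    dof = n_a + n_b - 2
    pooled = ((n_a - 1) * cov_a + (n_b - 1) * cov_b) / dof
    d = mean_a - mean_b
    t2 = (n_a * n_b) / (n_a + n_b) * (d @ np.linalg.solve(pooled, d))
    # Convert T-square to an F statistic with (p, n_a + n_b - p - 1) d.o.f.
    f_stat = (dof - p + 1) / (p * dof) * t2
    return float(f.sf(f_stat, p, dof - p + 1))


def models_agree(model_a, model_b, alpha=0.05, accept_level=0.05):
    """Step 3: test whether too few matched pairs pass the Hotelling test.

    Each model is a tuple (centers, covariances, sample_sizes).
    Returns (accepted, number_of_passing_pairs).
    """
    centers_a, covs_a, sizes_a = model_a
    centers_b, covs_b, sizes_b = model_b
    pairs = match_clusters(centers_a, centers_b)
    passes = 0
    for i, j in pairs:
        p_val = hotelling_p_value(centers_a[i], covs_a[i], sizes_a[i],
                                  centers_b[j], covs_b[j], sizes_b[j])
        if p_val > alpha:                      # pair judged "the same"
            passes += 1
    k = len(pairs)
    p0 = 1.0 - alpha                           # expected pass rate if models match
    if k < 30:                                 # assumed cutoff: Binomial form
        tail = binom.cdf(passes, k, p0)
    else:                                      # Normal form of the Z-test
        z = (passes / k - p0) / np.sqrt(p0 * (1.0 - p0) / k)
        tail = norm.cdf(z)
    return bool(tail > accept_level), passes
```

Calling models_agree(model_a, model_b) with two (centers, covariances,
sample_sizes) tuples returns whether the models are accepted as equivalent,
together with the number of cluster pairs that passed step 2.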

Hope that helps.

--
Jim Hanlon * mailto:[log in to unmask]
www.netperceptions.com * ph +1 952 842 5320 * fax +1 952 842 5005



> -----Original Message-----
> From: Sirotkin, Alexander [mailto:[log in to unmask]]
> Sent: Tuesday, February 25, 2003 2:24 AM
> To: [log in to unmask]
> Subject: how to compare two clusterings, continued
>
>
> Hello.
>
> From a few responses I got to my previous email I understand that I
> did not formulate my question clear enough.
>
> What I meant by "compare two clusterings" is not to find out which
> clustering is better - this information is usually provided by the
> k-means algorithm in a form of within-cluster sum of squares.
>
> What I actually meant was a measure of how much these two clusters
> are similar. In other words - if two points belong to the same cluster
> in both clusterings this means that they are similar.
>
> Thanks a lot....
>