CLASS-L Archives

February 2003


"Hanlon, Jim" <[log in to unmask]>
Reply To:
Classification, clustering, and phylogeny estimation
Tue, 25 Feb 2003 11:29:53 -0600
FWIW, I had a similar problem a couple of years ago when constructing unit
tests for various clusterers for vector data in R^n (specifically, I was
interested in testing SONN clustering algorithms like k-means, K-SOM, neural
gas, etc.). Here is the "seat of the pants" method that these unit tests
used.

The tests amounted to comparing two cluster models directly, rather than
testing their classifications on a set of data. This approach depends on the
fact that most of the SONN clusterers use a Euclidean or Mahalanobis measure
to determine distance and hence produce clusters that are Gaussian
hyperspheres (though I suspect that this only needs to be approximately so).

Given two equal sized sets of clusters, each represented by a central vector
in R^n and a corresponding covariance matrix ...

1) Find a one-to-one correspondence between the elements of the two sets (so
that you can compare "the same" clusters in each set) that minimizes a
global error function. For smallish sets the best correspondence can be
found exactly; for large sets it is only practical to find a "good" one.

2) For each pair of corresponding clusters perform a Hotelling's T-square
Test to decide whether the clusters should be judged as "the same" at
whatever confidence level you choose.

3) Accept or reject the proportion of passing cluster pairs using a Z-test
in Binomial or Normal form as appropriate to the number of clusters.
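
A minimal sketch of the three steps above in Python with NumPy and SciPy,
not the original test code. Everything specific in it is an assumption made
for illustration: the global error function is taken to be the total squared
Euclidean distance between matched centers, the matching is solved exactly
with SciPy's linear_sum_assignment (the Hungarian method), each cluster is
assumed to carry a sample count (the two-sample Hotelling test needs one),
the covariances are pooled, and step 3 uses the exact Binomial form with a
per-pair pass probability of 1 - alpha under the null hypothesis. The names
match_clusters, hotelling_pvalue and compare_models are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import binom, f as f_dist


def match_clusters(centers_a, centers_b):
    """Step 1: one-to-one matching that minimizes the total squared
    Euclidean distance between matched centers (assumed error function)."""
    diff = centers_a[:, None, :] - centers_b[None, :, :]
    cost = np.einsum("ijk,ijk->ij", diff, diff)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))


def hotelling_pvalue(mean1, cov1, n1, mean2, cov2, n2):
    """Step 2: two-sample Hotelling T-square test with pooled covariance.
    n1 and n2 are the (assumed known) point counts of the two clusters."""
    p = mean1.shape[0]
    pooled = ((n1 - 1) * cov1 + (n2 - 1) * cov2) / (n1 + n2 - 2)
    d = mean1 - mean2
    t2 = (n1 * n2) / (n1 + n2) * (d @ np.linalg.solve(pooled, d))
    dof2 = n1 + n2 - p - 1
    f_stat = dof2 / (p * (n1 + n2 - 2)) * t2
    return float(f_dist.sf(f_stat, p, dof2))


def compare_models(model_a, model_b, alpha=0.05):
    """Steps 1-3. Each model is a tuple (centers, covariances, counts)."""
    centers_a, covs_a, counts_a = model_a
    centers_b, covs_b, counts_b = model_b
    pairs = match_clusters(centers_a, centers_b)
    # A pair "passes" when the test does not reject equal means at level alpha.
    passes = sum(
        hotelling_pvalue(centers_a[i], covs_a[i], counts_a[i],
                         centers_b[j], covs_b[j], counts_b[j]) >= alpha
        for i, j in pairs
    )
    # Step 3 (Binomial form): probability of seeing this few passes or fewer
    # if every pair really were "the same" cluster.
    p_overall = float(binom.cdf(passes, len(pairs), 1 - alpha))
    return pairs, passes, p_overall


# Two models of the same two clusters, listed in a different order.
eye = np.eye(2)
model_a = (np.array([[0.0, 0.0], [10.0, 10.0]]), [eye, eye], [50, 50])
model_b = (np.array([[10.0, 10.0], [0.0, 0.0]]), [eye, eye], [50, 50])
pairs, passes, p_overall = compare_models(model_a, model_b)
```

A small p_overall would be grounds to reject the hypothesis that the two
models describe the same clusters. For a large number of clusters the Normal
approximation to the Binomial could be substituted in the last step.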

Hope that helps.

Jim Hanlon * mailto:[log in to unmask] * ph +1 952 842 5320 * fax +1 952 842 5005

> -----Original Message-----
> From: Sirotkin, Alexander [mailto:[log in to unmask]]
> Sent: Tuesday, February 25, 2003 2:24 AM
> To: [log in to unmask]
> Subject: how to compare two clusterings, continued
> Hello.
> From a few responses I got to my previous email I understand that I
> did not formulate my question clearly enough.
> What I meant by "compare two clusterings" is not to find out which
> clustering is better - this information is usually provided by the
> k-means algorithm in the form of the within-cluster sum of squares.
> What I actually meant was a measure of how similar these two clusterings
> are. In other words - if two points belong to the same cluster
> in both clusterings, this means that they are similar.
> Thanks a lot....