I wanted to move the discussion to this list because I know that there
are formal ways of comparing clusterings, but I do not recall them, and I
wanted people with more specific knowledge of the formal aspects to be
able to respond.
- - - -
More specific questions are easier to answer than general questions:
Are the results simple clusters or whole trees? If whole trees, do you
care about the most detailed levels, or are you interested in the levels
closest to the root?
Are you looking to classify all cases or most cases?
Are your clusterings based on the same method-index combinations?
Are the cases input in the same order? Are there necessarily many ties
in the distances? Are they based on the same set of cases but different
variables? Different cases but same variables?
How many cases do you have? In what ways do you perceive the
clusterings to be different?
What kinds of cases do you have? What constructs do your variables
represent? What are the levels of measurement of your variables?
- - - -
One quick and dirty way to compare clusterings is to use SPSS. (Perhaps
others can tell you more formal ways, but you would probably still have
to assemble memberships into a single file.) First, use one of the 3 ways
SPSS has to do clustering.
1) CLUSTER has several algorithms for hierarchical models based on
a few dozen similarity indices. It allows you to save cluster
membership in one of 3 ways. (You can specify a different rootname for
each method and index combination.)
When you want part of the tree, you can give a variable root name, and
SPSS will save membership for each case in variables called
john_3 for the level with 3 groups
john_4 for the level with 4 groups
. . .
john_11 for the level with 11 groups.
When you want to start at the root, simply specify
/save= mary_ (1,21)
If you want a particular slice specify
2) QUICK CLUSTER uses KMEANS to find a specific number of clusters. It
is useful for ratio, interval, ordinal, or dichotomous data.
For each run with a specified number of clusters you can save the
cluster membership of each case and its distance from its cluster
center. You can specify different rootnames for the variables in which
these are stored.
3) TWOSTEP CLUSTER is used when you have a set of categorical variables
and a set of continuous variables. For each run with a specified number
of clusters you can save the cluster membership of each case and its
distance from its cluster center. You can specify different rootnames
for the variables in which these are stored.
Then do a series of CROSSTABS to see which subgroups are pretty much
the same. Remember that the order in which clusters are formed will
depend on the method-index combination, so cluster 1 from one run need
not correspond to cluster 1 from another. For this purpose memberships
are strictly nominal level. In the policy/social/psychological domains it
is customary to try many approaches and accept groups that are recognized
by several methods. The more disparate the set of approaches that find
the same clusters, the more valid the solution is likely to be, since
reliability is considered the upper limit of validity.
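For anyone working outside SPSS, the cross-tabulation idea can be sketched
in a few lines of Python. This is only an illustration, not SPSS output:
the function names crosstab and best_matches and the eight-case membership
vectors are made up for the example, and it assumes both clusterings list
the same cases in the same order.

```python
from collections import Counter


def crosstab(labels_a, labels_b):
    """Count the cases in each (cluster in A, cluster in B) cell."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same cases in the same order")
    return Counter(zip(labels_a, labels_b))


def best_matches(labels_a, labels_b):
    """For each cluster in A, find the cluster in B that shares the most cases.

    Returns {a_label: (b_label, share)}, where share is the fraction of the
    A cluster's cases that fall in that B cluster.
    """
    table = crosstab(labels_a, labels_b)
    sizes_a = Counter(labels_a)
    best = {}
    for (a, b), n in table.items():
        share = n / sizes_a[a]
        if a not in best or share > best[a][1]:
            best[a] = (b, share)
    return best


# Hypothetical memberships for eight cases from two method-index combinations.
john_3 = [1, 1, 1, 2, 2, 3, 3, 3]
mary_3 = [2, 2, 2, 1, 1, 3, 3, 1]

table = crosstab(john_3, mary_3)
matches = best_matches(john_3, mary_3)

for a_label in sorted(matches):
    b_label, share = matches[a_label]
    print(f"john_3 cluster {a_label} -> mary_3 cluster {b_label} ({share:.0%} of its cases)")
```

Because the labels are nominal, a group numbered 1 in one run can line up
with a group numbered 2 in the other; what matters is how completely the
cases overlap, not whether the numbers agree.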
You can also use standalone programs or other packages to produce files
that output case IDs and membership info, then match/merge them into a
single file to do the crosstabs.
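A rough sketch of that match/merge step in Python, assuming each program's
output has already been read into a mapping from case ID to cluster label.
The helper merge_memberships and the case IDs are hypothetical, and this
version keeps only the cases that appear in every source.

```python
def merge_memberships(*sources):
    """Match/merge several {case_id: cluster_label} mappings on case ID.

    Only cases present in every source are kept, so each merged row has
    one label per clustering.
    """
    if not sources:
        return {}
    common = set(sources[0])
    for src in sources[1:]:
        common &= set(src)
    return {cid: tuple(src[cid] for src in sources) for cid in sorted(common)}


# Hypothetical memberships exported by two different programs.
spss_run = {"c01": 1, "c02": 1, "c03": 2, "c04": 2}
other_run = {"c02": "A", "c03": "B", "c04": "B", "c05": "A"}

merged = merge_memberships(spss_run, other_run)

for case_id, labels in merged.items():
    print(case_id, *labels)
```

Matching on case ID rather than on row position guards against the two
files listing the cases in different orders.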
Hope this helps.
Social Research Consultants
University Park, MD USA
Sirotkin, Alexander wrote:
> I'm trying to find a way to compare two clusterings (two results of a
> Is there any algorithm (or better yet - working software package for
> S-Plus/R/whatever) ?
> Thanks a lot...
> P.S. Comparing the results visually is not an option - too many