A lot of the interpretation depends on the nature of the differences in
the sources, the clustering software, and the substantive topic you are
Are your data sets representative samples? Populations? Availability samples?
From two different cultures? French vs US? Males vs females? Two
sections of the same class in a university? From two species of
bacteria? Signatures of aircraft from two brands of radar?
If the sources are very different, and you get the same cluster profiles
in about the same proportions, you have a strong case for stability.
If the sources are very similar, and you get the same cluster
profiles in about the same proportions, you have a weak case for
stability beyond the type of source.
If the sources are very different, and you find substantively different
clusters, then the reason for the difference may be instability, or it
may be the source, or both.
If the sources are very similar, and you find substantively different
clusters, then the reason for the difference would lean more toward
being instability than the source.
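
One rough way to put a number on "about the same proportions" is sketched below. It assumes the clusters in the two sources have already been matched up (for example, because both sets of cases were assigned with the same model), and the membership lists are made-up illustration data. The total variation distance is just one convenient summary, not a standard you must use.

```python
from collections import Counter

def cluster_proportions(labels, clusters):
    """Share of cases falling in each cluster, in a fixed cluster order."""
    counts = Counter(labels)
    n = len(labels)
    return [counts.get(c, 0) / n for c in clusters]

def proportion_gap(labels_a, labels_b):
    """Total variation distance between two cluster-size distributions.

    0.0 means identical proportions; 1.0 means no overlap at all.
    """
    clusters = sorted(set(labels_a) | set(labels_b))
    p = cluster_proportions(labels_a, clusters)
    q = cluster_proportions(labels_b, clusters)
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

# Hypothetical cluster memberships for the two sources (labels already matched).
source1 = [1] * 50 + [2] * 30 + [3] * 20
source2 = [1] * 45 + [2] * 35 + [3] * 20
gap = proportion_gap(source1, source2)
print(f"proportion gap: {gap:.3f}")
```

A small gap only speaks to the proportions; you would still want to compare the cluster profiles (means on the attributes) before calling the solutions the same.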
Are you doing hierarchical or non-hierarchical clustering? What do you
mean by stable? Are you also trying to establish some validity?
Are your results similar within a set?
If your procedures run reasonably quickly did you try different
If you are using k-means did you use a number of starts?
What stopping rules did you use to decide how many clusters to use?
What is the level of measurement of your variables (attributes)? Are
you using attributes in the sense of dichotomous (dummy) variables?
Does your software have provisions for applying the same model to a
different set of cases? E.g., in SPSS you can save the cluster
memberships from different solutions as new variables.
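
To make the questions about multiple starts and about applying a model to other cases concrete, here is a minimal pure-Python sketch. It is not what SPSS or any particular package does internally; the function names (`kmeans`, `assign`), the number of starts, and the tiny data sets are all assumptions for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (centers, inertia)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, inertia

def kmeans(points, k, n_starts=20, seed=0):
    """Keep the run with the smallest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[1])

# Hypothetical cases from two sources with the same two attributes.
source_a = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
source_b = [(0.1, 0.1), (5.1, 5.0), (0.3, 0.2), (4.8, 5.2)]

centers, inertia = kmeans(source_a, k=2)   # model built on source A
labels_b = assign(source_b, centers)       # same model applied to source B
print(labels_b)
```

Comparing the inertia across starts also gives a quick check on whether the results are similar within a set.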
Finally, one approach would be to slip between the horns of the
dilemma. Put all cases in one file, with variables indicating which
file a case came from and which replication it is for. Apply models
from one subset of cases to the other cases. Do 4 clusterings and apply
the models to all of the cases. Consider 4 membership variables:
Source1 half 1, Source1 half 2, Source2 half 1, Source2 half 2. Do a 4
If you have continuous variables among the attributes, explore using GLM
(ignoring the tests) to do a 4-way ANOVA.
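
Once every case carries the four membership variables described above, you can cross-tabulate them pairwise and summarize each table with an agreement index. The sketch below uses the adjusted Rand index, which is an assumption added here for illustration and is not named above; the variable names and the eight made-up cases are also hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb

def crosstab(a, b):
    """Count of cases for each (cluster in a, cluster in b) pair."""
    return Counter(zip(a, b))

def adjusted_rand(a, b):
    """Adjusted Rand index: 1.0 for identical partitions, about 0 for chance."""
    n = len(a)
    sum_cells = sum(comb(c, 2) for c in crosstab(a, b).values())
    sum_rows = sum(comb(c, 2) for c in Counter(a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial (one cluster or all singletons)
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical membership variables: each model applied to the same 8 cases.
memberships = {
    "s1_half1": [0, 0, 0, 1, 1, 1, 2, 2],
    "s1_half2": [1, 1, 1, 0, 0, 0, 2, 2],  # same partition, labels permuted
    "s2_half1": [0, 0, 1, 1, 1, 1, 2, 2],
    "s2_half2": [0, 0, 0, 1, 1, 2, 2, 2],
}

agreement = {
    (x, y): adjusted_rand(memberships[x], memberships[y])
    for x, y in combinations(memberships, 2)
}
for pair, value in agreement.items():
    print(pair, round(value, 3))
```

The index ignores how the clusters happen to be numbered, so you do not have to match labels across the four solutions before comparing them. High agreement within a source but low agreement between sources would point toward the source rather than instability.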
If you have variables external to the clustering on the same cases, do
the same GLM on them. (Here the tests would not have the usual
interpretation but could be used as a rough measure to interpret
Hope this helps.
Social Research Consultants
University Park, MD USA
(Inside the Washington, DC beltway.)
jessie jessie wrote:
>I have a question about the replication analysis. In
>order to carry out a replication analysis, we need to
>have two datasets first. Currently I do have two
>datasets but they are from different sources although
>the attributes(columns) are the same and the number of
>rows are similar. Since the two datasets in the
>replication analyses I read about were obtained by
>dividing a bigger dataset into two halves, I wonder if
>I can still do replication analysis using my two
>datasets for the purpose of validation (maybe after
>some statistical procedures). The expectation I have
>is that if the result is good, then I claim the
>clusters I've found are stable. Could anyone please
>give me some insightful suggestions on this? Thank you
>very much in advance!