Art,
Thank you so much for taking the
time to provide these helpful, detailed suggestions. I will try them out and see
how it goes.
Liza
From: Classification, clustering, and
phylogeny estimation [mailto:[log in to unmask]] On Behalf Of Art
Kendall
Sent: Thursday, September 04, 2008 5:43 AM
To: [log in to unmask]
Subject: Re: cluster analysis validation technique
If you have SPSS, here are some ways to do this.
The squared Euclidean distance is the sum of the squared differences on each
dimension.
If you have 10 z variables, try something like the following (untested) syntax,
which will find the distance of each case from each centroid.
First create 60 variables for the centroids in a file with 1 "case": a
variable called constant set to 1, and 6 sets of 10 centroid means,
cen1z1 to cen1z10, cen2z1 to cen2z10, ... cen6z1 to cen6z10.
Then, in your main file:
compute constant = 1.
* main and centroids stand for the two files (dataset names or file handles).
match files /file=main /table=centroids /by constant.
* distance1 to distance6 will hold each case's squared distance from each centroid.
vector distance(6).
* cen spans all 60 centroid variables, in the cluster-major order they were created in.
vector z = z1 to z10
  /cen = cen1z1 to cen6z10.
loop #i = 1 to 6.
compute distance(#i) = 0.
loop #j = 1 to 10.
* accumulate the squared difference on each of the 10 dimensions.
compute distance(#i) = distance(#i) + (cen((#i - 1)*10 + #j) - z(#j))**2.
end loop.
end loop.
execute.
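To finish step 6 (assigning each case to the closest centroid), a short follow-on
along these lines could work. It is only a sketch: it assumes the distance1 to
distance6 variables created above and writes the index of the nearest centroid to
a new variable called nearest (a hypothetical name).
* pick the centroid with the smallest squared distance.
vector distance = distance1 to distance6.
compute nearest = 1.
compute mindist = distance1.
loop #i = 2 to 6.
do if distance(#i) < mindist.
compute mindist = distance(#i).
compute nearest = #i.
end if.
end loop.
execute.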
If you do not have a huge number of cases and have a fairly powerful machine, a
solution with less effort on your part, but a lot of computation for the
machine, might be this:
just add 6 cases to the top of the main file, each representing a centroid,
run PROXIMITIES on the large matrix, and then delete the columns you do
not want.
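A minimal sketch of that approach, assuming the 6 centroid cases sit at the top
of the active file and the profile variables are z1 to z10 (check the subcommands
against the PROXIMITIES documentation for your version):
* squared Euclidean distance between every pair of cases, written out as a matrix.
proximities z1 to z10
  /view=case
  /measure=seuclid
  /print=proximities
  /matrix=out(*).
Only the first 6 columns of the resulting matrix (the distances to the centroid
cases) are needed; the rest can be deleted.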
Another way to look at the agreement between the two solutions is to run the
clustering separately on each half, with the other half's cases filtered out,
saving the memberships.
Then run two DISCRIMINANTs, each time treating the other half's cases as
unclustered in the classification phase, saving the assignments and probabilities
of membership on each pass.
Then CROSSTAB the assignments from the DFA against those from the original
clustering.
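A rough sketch of one of the two passes, using hypothetical names: a variable
half coded 1 for Sample A and 2 for Sample B, saved memberships clusA and clusB
coded 1 to 3, and profile variables z1 to z10.
* build the functions on Sample A's clusters; unselected (Sample B) cases are still classified.
discriminant
  /groups=clusA(1,3)
  /variables=z1 to z10
  /select=half(1)
  /save=class probs.
* agreement between Sample B's own clustering and its classification from Sample A's functions.
* dis_1 is the usual name of the saved classification variable; check the output for the exact names.
crosstabs /tables=clusB by dis_1 /cells=count row /statistics=kappa.
The second pass is the mirror image: groups=clusB(1,3), select=half(2),
crosstabbed against clusA.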
Art Kendall
Social Research Consultants
Liza Rovniak wrote:
Hi,
I am hoping someone here can help me with a “how to”
question on running McIntyre and Blashfield’s (1980) nearest-centroid
evaluation procedure to validate the stability of my cluster analysis solution.
I am a newbie to cluster analysis, so this is my first time running this
procedure.
I have a sample of about 900 observations and have
randomly split the sample in two (Sample A and Sample B). I conducted
hierarchical cluster analysis and then calculated the centroid vectors for a
3-cluster solution on each of these two subsamples (i.e., steps 1 through 4 of
McIntyre and Blashfield’s evaluation technique).
Step 5 of McIntyre and Blashfield’s technique is to
calculate “the squared Euclidean distance for each of Sample B’s objects from
each of the centroids of Sample A,” and Step 6 is to assign “each object
in Sample B to the closest centroid vector.” At this point, I am not sure
what buttons to press in SPSS to complete the analysis. One possibility I tried
is to use K-means cluster analysis to achieve these two steps, but K-means uses
simple Euclidean distance (not squared Euclidean distance as recommended by
McIntyre and Blashfield) to assign the observations to clusters. Is this okay?
(someone told me it was, but I just want to double-check). I would
greatly appreciate any guidance on what buttons to press in SPSS/appropriate
syntax to complete steps 5 and 6 of this analysis.
Thank you.
Liza Rovniak
Liza S. Rovniak, PhD, MPH
Adjunct Assistant Professor
Center for Behavioral Epidemiology & Community Health
Graduate School of Public Health, San Diego State University
San Diego, CA 92123
Phone: 858-505-4770, ext. 152; Fax: 858-505-8614
Email: [log in to unmask]
----------------------------------------------
CLASS-L list. Instructions: http://www.classification-society.org/csna/lists.html#class-l