CLASS-L Archives

September 2011


From: Matthew Pirritano <[log in to unmask]>
Reply To: Classification, clustering, and phylogeny estimation
Sent: Thu, 22 Sep 2011 20:34:20 -0700


I can do hierarchical on all cases. That is what I'm doing. I've got 1008
couples, 2016 individuals in the data. I've tried clustering individuals
(man and woman in cluster analysis together), couple scores together, and
combinations of couple scores.


I'm doing what I see a lot of other behavioral researchers do: start with
hierarchical to get an idea of the number of clusters, then get cluster
centers from hierarchical and use those as starting centers for k-means, to
get better separation and more easily interpreted clusters. I always
randomly sort my cases before each cluster run.
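
A minimal sketch of the two-stage procedure described above, assuming Euclidean distance and a plain Lloyd-style k-means. The function names and the toy data are hypothetical, and the hierarchical labels are taken as given (for example, exported from a cut dendrogram) rather than computed here.

```python
import math

def centroids_from_labels(points, labels):
    """Mean of the points in each labelled group (e.g. from a cut dendrogram)."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return [
        [sum(col) / len(members) for col in zip(*members)]
        for _, members in sorted(groups.items())
    ]

def kmeans_from_centers(points, centers, max_iter=100):
    """Lloyd's k-means started from the supplied centers instead of random ones."""
    centers = [list(c) for c in centers]
    assign = None
    for _ in range(max_iter):
        # Assignment step: each case goes to its nearest current center.
        new_assign = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        # Update step: recompute each center as the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

# Toy data; hier_labels are hypothetical labels from a 2-cluster dendrogram cut.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (6.0, 6.0)]
hier_labels = [1, 1, 1, 2, 2, 1]
start = centroids_from_labels(points, hier_labels)
assign, centers = kmeans_from_centers(points, start)
print(assign)
```

In this toy run the last case starts in the first hierarchical cluster and is moved to the second by k-means, which is the kind of refinement the second stage is meant to provide.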


I've run a DFA and I think I only got 36% of cases correctly classified. From
a behavioral research perspective, thinking in terms of correlation metrics,
36% variance explained is pretty good. So maybe I should look at it like
that. I'm tending to think that strong separation and cohesion may not be
necessary. This is only one construct made up of four scales, so how could I
expect it to classify people with a higher degree of certainty? I could just
keep adding other variables, and I've tried that, with things like couple
communication, lifestyle, and time spent with friends, but the clusters
become much more general and get away from the idea of clustering on coping.


I don't think my composite variables are ipsative. I have tried difference
scores, which sound like they'd be ipsative, but the strongest solution came
from multiplying men's and women's scores. I think that maybe I'm using
compositional data when I include data for both men and women in the
analysis. But that's just from a quick Google understanding of the terms.


Thanks for the good advice.




From: Classification, clustering, and phylogeny estimation
[mailto:[log in to unmask]] On Behalf Of Art Kendall
Sent: Thursday, September 22, 2011 7:17 AM
To: [log in to unmask]
Subject: Re: good external validity for clusters but bad cohesion and


This sounds like a very interesting question both methodologically and

How many cases did you start with? Too many to do hierarchical clustering
on all cases?
I have not used the silhouette coefficients per se, but I have used an
approach based on the same kind of information, the distance of a case from
the centroids, i.e., the probabilities of membership in each cluster from
the classification phase of a discriminant function analysis. It would be
interesting to hear whether silhouette coefficients have been of use to list
members in ballparking the number of clusters to retain.

Since the mid-70s my habit has been to use several methods of clustering and
find sets of cases that are placed together by several methods. I called
these core clusters. Since k-means and to some extent TWOSTEP are sensitive
to the order of cases within the file, some of the methods would be based on
random re-sequencing of the file. Most clustering methods save variables
that are the cluster assignments. I then use several of the variables saved
in the classification phase of discriminant function analysis (DFA). These
are the predicted cluster that DFA would assign a case to, the scores of
each case on the discriminant functions, and the probability a case would be
so far from the centroid of the cluster.
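
One simple reading of "cases placed together by several methods" can be sketched as follows. This is an assumption for illustration, not necessarily the exact procedure described: it keeps only groups of cases that share a cluster in every run, and treats everything else as ungrouped. The function name, the `min_size` threshold, and the example assignments are hypothetical, and the DFA step is not shown.

```python
from collections import defaultdict

def core_clusters(label_sets, min_size=2):
    """Group cases that share a cluster in every labelling.

    label_sets: one list of cluster labels per clustering run, all in the
    same case order. Label values need not match across runs; only
    co-membership matters.
    """
    groups = defaultdict(list)
    for case, combo in enumerate(zip(*label_sets)):
        groups[combo].append(case)
    cores = [m for m in groups.values() if len(m) >= min_size]
    ungrouped = sorted(c for m in groups.values() if len(m) < min_size for c in m)
    return cores, ungrouped

# Hypothetical saved assignments for six cases from three clustering runs.
ward    = [1, 1, 1, 2, 2, 2]
kmeans  = ["a", "a", "b", "b", "b", "b"]
twostep = [0, 0, 0, 1, 1, 1]
cores, ungrouped = core_clusters([ward, kmeans, twostep])
print(cores, ungrouped)
```

Here the third case is placed with the first two by two runs but not by the k-means run, so it falls outside both core groups.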

I would then create a new membership variable giving a value that would be
ungrouped in the discriminant to cases that are close to more than one
cluster. I would repeat this until the results subjectively stabilized.

Once I have the core clusters I use profile graphs and by-products of the DFA
to give a final interpretation of the clusters. Profile graphs are very
like parallel coordinate plots when the data is aggregated to the

Relation of the new variable found by the clustering to variables that were
not used in the clustering can be a form of external validation.

A caveat. You use the term 'composite'. There may be complications if the
data is ipsative/compositional. Getting up to speed on compositional data
has been on my to-do list since I retired, BUT... I have a feeling that
there are strong analogies for compositional and ipsative data, possibly
these are different names for the same thing.  If that is what you have,
someone else on the list may be more able to address that question.

It might also be interesting to cluster score profiles for each person
separately and explore, e.g., by crosstab, whether members of a couple
would be assigned to the same cluster.
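
A minimal sketch of that crosstab, assuming each person has already been assigned a cluster and that the two lists are in the same couple order. The function name and the example assignments are hypothetical.

```python
from collections import Counter

def couple_crosstab(men_clusters, women_clusters):
    """Count (man's cluster, woman's cluster) pairs and the share that match."""
    if len(men_clusters) != len(women_clusters):
        raise ValueError("both lists must have one entry per couple")
    counts = Counter(zip(men_clusters, women_clusters))
    same = sum(n for (m, w), n in counts.items() if m == w)
    return counts, same / len(men_clusters)

# Hypothetical cluster assignments for five couples.
men   = [1, 1, 2, 3, 2]
women = [1, 2, 2, 3, 1]
table, agreement = couple_crosstab(men, women)
for (m, w), n in sorted(table.items()):
    print(f"man in {m}, woman in {w}: {n}")
print(f"share of couples in the same cluster: {agreement:.2f}")
```

The raw agreement share does not correct for chance; with unequal cluster sizes a chance-corrected measure would be worth considering.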

Art Kendall
Social Research Consultants

On 9/21/2011 9:54 PM, Matthew Pirritano wrote: 



I'm a first time poster. I have data on coping strategies used by couples
undergoing infertility treatment. I have created clusters of the coping
strategies keeping male and female scores separate. There are 4 coping
scores, based on composite scores of 4 subscales (active-avoidance,
active-confronting, passive-avoidance, meaning-based). So I have 8 variables
in my cluster analysis. I've started with Hierarchical clustering using
Ward's method and squared Euclidean distance. I then used those cluster
centers as the starting centers for a k-means cluster analysis. Based on my
dendrogram from the hierarchical analysis and the clinical interpretability
of the k-means solutions I arrived at a 5-cluster solution. These clusters
predict a number of outcome variables well, such as stress. These
predictions are well in line with theory and previous research. That's the
external validity.


I then went to validate the clusters using the average silhouette. I've
tested all solutions between 2 and 12 clusters and my average silhouette is
never greater than .4. I've tried different clustering methods and different
distance measures, with the same results. The highest average silhouette I
get is when I multiply men's and women's scores. I've seen this done before,
but I'm not sure how to interpret the resulting scores. Any ideas? And that
solution was only for 2 clusters.
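
For reference, the average silhouette can be sketched as below. For each case, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the case's silhouette is (b - a) / max(a, b). The sketch assumes plain Euclidean distance (not squared), and the function name and toy data are hypothetical.

```python
import math

def average_silhouette(points, labels):
    """Mean silhouette over all cases, using Euclidean distance."""
    n = len(points)
    total = 0.0
    for i in range(n):
        # Collect distances from case i to every other case, by cluster.
        dists = {}
        for j in range(n):
            if j != i:
                d = math.dist(points[i], points[j])
                dists.setdefault(labels[j], []).append(d)
        own = dists.pop(labels[i], None)
        if not own or not dists:
            continue  # singleton cluster or only one cluster: counts as 0
        a = sum(own) / len(own)                               # cohesion
        b = min(sum(d) / len(d) for d in dists.values())      # separation
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Toy data: two well-separated pairs on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = average_silhouette(points, [0, 0, 1, 1])
bad = average_silhouette(points, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned cases, as in the second labelling.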


So, is it still possible that I could discuss the original 5-cluster
solution despite not finding good separation and cohesion with the average
silhouette? Is all lost, or is there a way to save the situation?


Any help is much appreciated. Please let me know if you need more info or if
I've violated any list protocol.






---------------------------------------------- CLASS-L list. Instructions: 
