I can do hierarchical on all cases. That is what I’m doing. I’ve got 1008 couples, 2016 individuals in the data. I’ve tried clustering individuals (man and woman in cluster analysis together), couple scores together, and combinations of couple scores.


I’m doing what I see a lot of other behavioral researchers do: start with hierarchical clustering to get an idea of the number of clusters, get cluster centers from the hierarchical solution, then use those as the starting centers for k-means, to get better separation and easier-to-interpret clusters. I always randomly sort my cases before each cluster run.
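
For readers who want to see that two-stage workflow spelled out, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: the function names (ward_centers, kmeans) and the toy data are made up, the Ward step is a naive implementation that is far too slow for 2016 cases, and it is not the procedure of any particular statistics package.

```python
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two score profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def ward_centers(points, k):
    """Naive Ward agglomeration; returns the centroids of the k-cluster cut."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            a, b = clusters[i], clusters[j]
            # Ward merge cost: increase in within-cluster sum of squares.
            cost = len(a) * len(b) / (len(a) + len(b)) * sq_dist(centroid(a), centroid(b))
            if best is None or cost < best[0]:
                best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]

def kmeans(points, start_centers, max_iter=100):
    """Lloyd's k-means started from the supplied centers (not random ones)."""
    centers = [list(c) for c in start_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers

# Made-up toy data: two obvious groups in two dimensions.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
seeds = ward_centers(points, k=2)
labels, centers = kmeans(points, seeds)
print(labels)
```

Because the k-means step starts from the hierarchical centroids rather than from random centers, the result depends much less on case order, which is the point of seeding it this way.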


I’ve run a DFA and I think I only got 36% of cases correctly classified. From a behavioral research perspective, thinking in terms of correlation metrics, 36% variance explained is pretty good, so maybe I should look at it like that. I’m tending to think that strong separation and cohesion may not be necessary. This is only one construct made up of four scales; how could I expect it to classify people with a higher degree of certainty? I could just keep adding other variables, such as couple communication, lifestyle, and time spent with friends, and I’ve tried that, but the clusters become much more general and get away from the idea of clustering on coping.


I don’t think my composite variables are ipsative. I have tried difference scores, which sound like they’d be ipsative, but the strongest solution came from multiplying men’s and women’s scores. I think that maybe I’m using compositional data when I include data for both men and women in the analysis, but that’s just from a quick Google understanding of the terms.
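
One quick way to check the ipsative question: in the usual sense of the term, scores are ipsative when every case's scores add up to the same constant, as with rankings or percentages of a whole. The sketch below tests just that property. The function name is_ipsative and the numbers are made up for illustration; they are not the coping data.

```python
def is_ipsative(rows, tol=1e-9):
    """True when every case's scores sum to (nearly) the same constant."""
    totals = [sum(r) for r in rows]
    return max(totals) - min(totals) <= tol

# Made-up examples: shares of a whole versus free-standing scale scores.
shares = [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8]]
scale_scores = [[3.0, 2.5, 4.0, 1.5], [2.0, 2.0, 3.5, 4.0]]
print(is_ipsative(shares), is_ipsative(scale_scores))
```

Separately scored subscales would normally fail this check, in which case the ipsative/compositional caveat would not apply to them.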


Thanks for the good advice.




From: Classification, clustering, and phylogeny estimation On Behalf Of Art Kendall
Sent: Thursday, September 22, 2011 7:17 AM
Subject: Re: good external validity for clusters but bad cohesion and separation


This sounds like a very interesting question both methodologically and substantively.

How many cases did you start with? Too many to do hierarchical clustering on all cases?
I have not used the silhouette coefficients per se, but I have used an approach based on the same kind of information, the distance of a case from the centroids, i.e., the probabilities of membership in each cluster from the classification phase of a discriminant function analysis. It would be interesting to hear whether silhouette coefficients have been of use to list members in ballparking the number of clusters to retain.

Since the mid-70s my habit has been to use several methods of clustering and find sets of cases that are placed together by several methods. I called these core clusters. Since k-means and to some extent TWOSTEP are sensitive to the order of cases within the file, some of the methods would be based on a random re-sequencing of the file. Most clustering methods save variables that are the cluster assignments. I then use several of the variables saved in the classification phase of discriminant function analysis (DFA). These are the predicted cluster that DFA would assign a case to, the scores of each case on the discriminant functions, and the probability that a case would be so far from the centroid of the cluster.
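
The "placed together by several methods" idea can be sketched in a few lines. This is one simple reading of it, not the exact procedure described above: cases form a core cluster when they receive the same combination of assignments across all runs. The name core_clusters, the min_size parameter, and the label vectors are assumptions made up for illustration.

```python
from collections import Counter

def core_clusters(runs, min_size=2):
    """Group cases that share the same assignment in every run.

    runs is a list of label vectors, one per clustering run, all in the
    same case order. Cases whose combination of labels is shared by fewer
    than min_size cases are returned as None (no core cluster).
    """
    signatures = list(zip(*runs))      # one tuple of labels per case
    counts = Counter(signatures)
    core_ids = {}
    result = []
    for sig in signatures:
        if counts[sig] >= min_size:
            # Number the core clusters in order of first appearance.
            result.append(core_ids.setdefault(sig, len(core_ids)))
        else:
            result.append(None)
    return result

# Made-up assignments for five cases from three runs.
runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 2, 2, 0],
]
print(core_clusters(runs))
```

Only co-membership matters here, so it does not matter that the cluster numbers themselves are arbitrary and differ from run to run.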

I would then create a new membership variable that gives cases that are close to more than one cluster a value that would be treated as ungrouped in the discriminant analysis. I would repeat this until the results subjectively stabilized.
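
As a rough stand-in for that step, the sketch below marks a case as ungrouped when its two nearest cluster centroids are almost equally far away. Note the substitution: it uses a simple distance ratio rather than the DFA membership probabilities described above, and the name flag_ambiguous, the 0.8 cutoff, and the data are assumptions made up for illustration.

```python
import math

def flag_ambiguous(points, centers, ratio=0.8):
    """Nearest-centroid label per case, or None when the case is ambiguous.

    A case is ambiguous when (distance to nearest centroid) divided by
    (distance to second-nearest centroid) exceeds ratio. Needs at least
    two centroids.
    """
    labels = []
    for p in points:
        ranked = sorted((math.dist(p, c), i) for i, c in enumerate(centers))
        (d1, nearest), (d2, _) = ranked[0], ranked[1]
        if d2 == 0 or d1 / d2 > ratio:
            labels.append(None)   # close to more than one cluster
        else:
            labels.append(nearest)
    return labels

# Made-up centroids and cases on a line.
centers = [[0, 0], [10, 0]]
cases = [[1, 0], [5, 0], [9, 0]]
print(flag_ambiguous(cases, centers))
```

The ungrouped cases can then be left out when centroids are recomputed, and the step repeated until the memberships stop changing.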

Once I have the core clusters I use profile graphs and by-products of the DFA to give a final interpretation of the clusters. Profile graphs are very like parallel coordinate plots when the data is aggregated to the clusters.

Relation of the new variable found by the clustering to variables that were not used in the clustering can be a form of external validation.

A caveat. You use the term 'composite'. There may be complications if the data is ipsative/compositional. Getting up to speed on compositional data has been on my to-do list since I retired, BUT... I have a feeling that there are strong analogies between compositional and ipsative data; possibly these are different names for the same thing. If that is what you have, someone else on the list may be better able to address that question.

It might also be interesting to cluster score profiles for each person separately and explore, e.g., by crosstab, whether members of a couple would be assigned to the same cluster.
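
A minimal sketch of that crosstab, assuming all individuals were clustered in one solution so that cluster numbers mean the same thing for men and women. The function name couple_crosstab and the label vectors are made up for illustration.

```python
from collections import Counter

def couple_crosstab(men_labels, women_labels):
    """Cross-tabulate partners' clusters and report the share that match.

    Both label vectors must be in the same couple order.
    """
    table = Counter(zip(men_labels, women_labels))
    same = sum(n for (m, w), n in table.items() if m == w)
    return table, same / len(men_labels)

# Made-up cluster assignments for four couples.
men = [0, 0, 1, 2]
women = [0, 1, 1, 0]
table, agreement = couple_crosstab(men, women)
for (m, w), n in sorted(table.items()):
    print(f"man in cluster {m}, woman in cluster {w}: {n}")
print(agreement)
```

The diagonal cells of the table are the couples whose members land in the same cluster; the agreement figure is their share of all couples.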

Art Kendall
Social Research Consultants

On 9/21/2011 9:54 PM, Matthew Pirritano wrote:



I’m a first-time poster. I have data on coping strategies used by couples undergoing infertility treatment. I have created clusters of the coping strategies, keeping male and female scores separate. There are 4 coping scores, based on composite scores of 4 subscales (active-avoidance, active-confronting, passive-avoidance, meaning-based), so I have 8 variables in my cluster analysis. I started with hierarchical clustering using Ward’s method and squared Euclidean distance. I then used those cluster centers as the starting centers for a k-means cluster analysis. Based on my dendrogram from the hierarchical analysis and the clinical interpretability of the k-means solutions, I arrived at a 5-cluster solution. These clusters predict a number of outcome variables well, such as stress. These predictions are well in line with theory and previous research. That’s the external validity.


I then went to validate the clusters using the average silhouette. I’ve tested all solutions between 2 and 12 clusters and my average silhouette is never greater than .4. I’ve tried different clustering methods and different distance measures, with the same results. The highest average silhouette I get is when I multiply men’s and women’s scores. I’ve seen this done before, but I’m not sure how to interpret the resulting scores. Any ideas? And that solution was only for 2 clusters.
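
For reference, the average silhouette can be computed directly from its definition: for each case, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b). The sketch below is a plain-Python illustration using Euclidean distance; the function name average_silhouette and the data are made up, and it assumes at least two clusters.

```python
import math

def average_silhouette(points, labels):
    """Mean silhouette width over all cases, using Euclidean distance."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the case's own cluster
        # (the case's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Made-up one-dimensional data with a sensible and a poor labeling.
data = [[0], [1], [10], [11]]
print(average_silhouette(data, [0, 0, 1, 1]))
print(average_silhouette(data, [0, 1, 0, 1]))
```

Running this over the saved assignments for each solution from 2 to 12 clusters gives the kind of comparison described above.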


So, is it still possible that I could discuss the original 5-cluster solution despite not finding good separation and cohesion with the average silhouette? Is all lost, or is there a way to save the situation?


Any help is much appreciated. Please let me know if you need more info or if I’ve violated any list protocol.






---------------------------------------------- CLASS-L list. Instructions:
