CLASS-L Archives

October 2000


"Classification, clustering, and phylogeny estimation" <[log in to unmask]>
Mark Weeks <[log in to unmask]>
Sat, 14 Oct 2000 12:18:55 -0700
Generally speaking, there are no clusters in this kind of data.  Fisher made
the discovery that, using a dataset of specific measurements,  Irises only
occur with sets of distinct characteristics; there were no "in between"
varieties.  Behavioral and feelings-type scale data are merely attempts to
measure vague and complicated concepts, and by design are trying to get a
decent "spread" of responses (i.e. we wouldn't have a scale where we knew that
everyone would have to say either "always" or "never"; that would be a
straight yes-or-no question).

In fact, in our attempt to cover the full range of behaviors and emotions we
deliberately construct scales that are most unlikely to give us "clusters".  We
want to measure the full range of the human response.  This means that we will
have some people who are, say, extremely aggressive, some extremely passive,
some a little bit aggressive, and so on.  A continuum.

The oddity of cluster analysis is that the most widely used techniques set out
to find clusters by trying to actually cluster the data.  So, in the process of
looking to see if there are clusters, people will be allocated to "cluster"
one, "cluster" two, and so on.  At the end, the statistics might indicate that
there are clear differences between these "groups", but further investigation
would show that they are not actually clusters.  In fact, they are just
arbitrary groups, which display the same degree of differences between each
other as thousands of other arbitrary groupings.
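
To make that concrete, here is a rough sketch (mine, with made-up data, not
anything from the study): 500 scores drawn from a single bell curve, so there
are no clusters by construction.  A bare-bones 2-means routine still hands back
two "groups" whose means are far apart, and so does an arbitrary cut at the
30th percentile.  The function names and the choice of cut point are
assumptions for illustration only.

```python
import random
import statistics

def two_means_1d(values, iterations=50):
    """Lloyd's algorithm for k=2 on one-dimensional data."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        # Assign each value to the nearer of the two centers.
        low = [v for v in values if abs(v - centers[0]) <= abs(v - centers[1])]
        high = [v for v in values if abs(v - centers[0]) > abs(v - centers[1])]
        new_centers = [statistics.mean(low), statistics.mean(high)]
        if new_centers == centers:
            break
        centers = new_centers
    return low, high

def gap_in_sd(group_a, group_b):
    """Difference between group means, in pooled within-group SD units."""
    n = len(group_a) + len(group_b)
    pooled_var = (statistics.pvariance(group_a) * len(group_a)
                  + statistics.pvariance(group_b) * len(group_b)) / n
    return abs(statistics.mean(group_a) - statistics.mean(group_b)) / pooled_var ** 0.5

# One continuum, no clusters: a single normal distribution.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(500)]

# "Clusters" found by 2-means.
low, high = two_means_1d(scores)
kmeans_gap = gap_in_sd(low, high)

# An arbitrary split at the 30th percentile.
cut = sorted(scores)[150]
arb_low = [v for v in scores if v < cut]
arb_high = [v for v in scores if v >= cut]
arbitrary_gap = gap_in_sd(arb_low, arb_high)

print(f"2-means split: groups differ by {kmeans_gap:.2f} within-group SDs")
print(f"arbitrary cut: groups differ by {arbitrary_gap:.2f} within-group SDs")
```

Both splits look "significant" by the usual group-difference yardsticks, which
is the whole problem: the separation comes from the act of splitting, not from
any structure in the data.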

So, what you are dealing with should be something with a name like
"segmentation".  There will be no clusters in the sense of finding groups of
people who are extremely similar to each other, and completely different to all
the other groups.  They will be more like the segments of an orange: a useful
division, but each one pressed right up against its neighbors.

Of course, it might be very useful to pursue the idea of "segmenting" the data,
and the method will probably borrow a lot from cluster analysis.  The big
difference is that you will have to decide subjectively whether the groups you
go with are maximally useful in furthering the understanding of this specific
problem.

And if you acknowledge that this is a "subjective art" as opposed to delegating
the decision to the misuse of cluster analysis, then you need to decide
subjectively what to do about your correlations.  If you think you get a better
outcome by treating them as "discoveries" within the segmentation process then
leave them alone.  If you think you get a better outcome by treating them as
"semantics" (unwittingly, you measured the same thing twice) and removing them,
then so be it.

As you say, the correlations are essentially "weighting" the data.  Two
variables highly correlated will give twice as much weight to potentially the
same issue as if you had only included one.  But then where to stop?
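
A small sketch of what that weighting looks like, assuming squared Euclidean
distance (the usual default) and two hypothetical respondents with invented
ratings.  In the extreme case where "amused" simply repeats "laugh", the
humor issue contributes exactly twice as much to the distance between the two
people.

```python
def contributions(a, b):
    """Per-variable contributions to the squared Euclidean distance."""
    return [(x - y) ** 2 for x, y in zip(a, b)]

# Hypothetical 1-5 ratings, variables in the order (stop, laugh).
person_a = (2, 5)
person_b = (4, 1)

# Same respondents with "amused" added as a perfect copy of "laugh":
# (stop, laugh, amused).
person_a_dup = (2, 5, 5)
person_b_dup = (4, 1, 1)

single = contributions(person_a, person_b)
doubled = contributions(person_a_dup, person_b_dup)

humor_once = single[1]           # laugh only
humor_twice = sum(doubled[1:])   # laugh + amused

print("distance with laugh only:      ", sum(single))
print("distance with laugh and amused:", sum(doubled))
print("humor contribution:", humor_once, "->", humor_twice)
```

With a correlation of 0.696 rather than 1.0 the effect is weaker than this,
but it runs in the same direction.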

Next you need to look at the variance of each scale.  Scales with twice as much
variance will, with most techniques, also contribute twice as much to the final
solution.  So, another subjective decision.  Should you transform the data so
that all the variances are the same and each variable has equal weight, or
should you recognize that people don't discriminate much on certain issues and
so those issues probably aren't important?
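
Here is a sketch of that trade-off, again assuming squared Euclidean distance
and using invented ratings for two scales (one where people spread out, one
where nearly everyone answers 3).  It totals the squared differences over
every pair of respondents, first on the raw scores and then after converting
each scale to z-scores.  The helper names are mine.

```python
import statistics
from itertools import combinations

def pairwise_share(columns):
    """Fraction of the total squared Euclidean distance, summed over all
    pairs of respondents, that each variable accounts for."""
    totals = [sum((x - y) ** 2 for x, y in combinations(col, 2))
              for col in columns]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

def standardize(col):
    """Convert a scale to z-scores (mean 0, population SD 1)."""
    mean = statistics.mean(col)
    sd = statistics.pstdev(col)
    return [(v - mean) / sd for v in col]

# Hypothetical 1-5 ratings from eight respondents.
angry = [1, 5, 1, 5, 3, 1, 5, 3]    # wide spread
afraid = [3, 3, 2, 3, 4, 3, 3, 3]   # almost no spread

raw_share = pairwise_share([angry, afraid])
std_share = pairwise_share([standardize(angry), standardize(afraid)])

print("raw shares (angry, afraid):         ", [round(s, 3) for s in raw_share])
print("standardized shares (angry, afraid):", [round(s, 3) for s in std_share])
```

On the raw scores "angry" drives nearly all of the distances; after
standardizing, the two scales count equally.  Neither answer is correct in
itself, it is the subjective call described above.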

My point is that you can make the outcome of this analysis be pretty much
anything you want by subjectively choosing how to muck around with your source
data.  So get into it and make wise choices, and realize that "implicit
weighting" is just another way of saying that "data exists".

Clare Guse wrote:

> Hi all,
> Thanks to everyone who responded to my query.
> For anyone who may be interested, I will try to summarize the responses I
> have gotten to my original message (shown below).  I make no assertions as
> to the appropriateness of any of these suggestions.  Here is some
> additional background on the study:  males and a small number of females
> were asked to rate on a 5 point scale (1 = never, 5 = always) their
> response to a partner using violence against them.  There are three
> behavioral responses: stop their aggression, increase their aggression,
> laugh at partner's effort, and 5 feelings: angry, afraid, amused, insulted
> and threatened.  The most highly correlated are laugh and amused.  The
> sample size is 61.
> The reason I'm concerned about correlation is that Aldenderfer and
> Blashfield (1984) state that using highly correlated variables is implicit
> weighting.  However, they don't define "high" correlation.
> Suggestions:
> 1) use Mahalanobis distance
> 2) do principal components analysis first
> 3) drop one of the correlated pair
> 4) replace the most highly correlated pair with their average
> 5) it's not a problem
> I think that I may end up doing 3 and/or 4 and comparing the results to
> including all the variables.  I'm reluctant to use principal components
> since I'm not familiar with the technique and it would seem to complicate
> final interpretation.  My reading of Aldenderfer and Blashfield (1984)
> would suggest that using Mahalanobis distance would be a good way to handle
> this situation, but unfortunately I don't have that option in the
> statistical software that I am using.
> Clare
> > I am beginning to perform a cluster analysis with 7 variables reflecting
> > a subject's behaviors and feelings in reaction to a partner's use of
> > violence against them.  However, some of these variables are correlated
> > (correlations range from 0.010 to 0.696) and I'm not sure how to handle
> > this situation.  What level of correlation is a problem?  Should one of
> > the pair of correlated variables be removed from consideration and, if so,
> > how does one choose?
> *******************************************
> Clare Guse, MS
> Biostatistician
> Dept. of Family & Community Medicine
> Division of Research
> Center for Practice-Based Research (CPBR)
> Medical College of Wisconsin
> [log in to unmask]
> [log in to unmask]
> 414-456-8699
> 414-456-6522  (FAX)
> *******************************************