September 2003


leo horseman
Classification, clustering, and phylogeny estimation
Mon, 22 Sep 2003 19:37:35 -0700
     You may wish to check Jacob Cohen's 1960 paper on the kappa statistic
(in Educational and Psychological Measurement).  He briefly discusses the
requirements for independence of observations.  Introductory statistics
texts generally discuss these matters as well.  Pooling a
single individual's data does not affect independence of observations any
more than using a single individual's mean score on a set of tests would.
I'm assuming this is what you did (pooled within individuals).  Do not
presume that a clustering is meaningful just because it is interpretable
and makes sense.  Given the choice, rawer data are usually better, though
they are often not possible to obtain.
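
For reference, since the responses were coded by coders, here is a minimal
sketch of Cohen's kappa for two coders who each assign one category to the
same items.  The function name, the category labels, and the data are all
made up for illustration; they are not from the study or from any
particular package.

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders rating the same items (nominal codes)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(coder_a)
    # Observed proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's marginal frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical codes for four responses.
a = ["stress", "habit", "social", "stress"]
b = ["stress", "habit", "stress", "stress"]
kappa = cohen_kappa(a, b)
```

The chance-agreement term treats each coded item as an independent
observation, which is where the independence requirement comes in.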

                        M. Childress

>I am interested in the question of whether pooling data from the same
>individuals into a single variable, which would violate the assumption of
>the independence of observations in multiple regression, is problematic in
>cluster analysis.
>Briefly, I have data collected at baseline and 4 time points asking whether
>someone smoked and the reasons why. Any individual might give 1-3
>responses, which could range from a single word to a sentence. These
>open-ended responses have been coded by coders. There are therefore 5 time
>periods x potentially 3 responses.
>I have received advice that it is acceptable to pool this data into 1
>variable and have run the analysis using the cluster option in a content
>analysis software program and the results were both interpretable and made
>sense (the analysis was performed using the default options of a similarity
>matrix, average linkage and the Jaccard coefficient).  However, my
>readings and enquiries to date have not been of much assistance in
>providing substantive support for this approach.  Any advice or
>references in relation to this question are appreciated,
>Bob Green
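
To make the pooling step concrete, here is a rough sketch of what "pooled
within individuals" might look like before a Jaccard similarity is
computed.  It assumes each coded response is a (person, time point, code)
record; the record layout, the function names, and the data are invented
for illustration and are not taken from the content analysis program
mentioned above.

```python
def pool_by_individual(records):
    """Collapse (person, time_point, code) records to one code set per person."""
    pooled = {}
    for person, _time_point, code in records:
        # Repeated codes across time points count once per person.
        pooled.setdefault(person, set()).add(code)
    return pooled

def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    if not union:
        # Convention chosen here: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)

# Hypothetical coded responses: (person, time point, code).
records = [
    ("p1", 0, "stress"), ("p1", 1, "stress"), ("p1", 2, "habit"),
    ("p2", 0, "habit"),  ("p2", 3, "social"),
    ("p3", 0, "stress"), ("p3", 4, "habit"),
]
pooled = pool_by_individual(records)
sim_p1_p2 = jaccard(pooled["p1"], pooled["p2"])  # shares "habit" only
sim_p1_p3 = jaccard(pooled["p1"], pooled["p3"])  # identical code sets
```

After pooling, each person contributes a single row to the similarity
matrix, so repeated responses from the same person are not treated as
separate observations.  Note that the time ordering is lost in the process.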

