CLASS-L Archives

July 2003


John Dziak
Reply To: Classification, clustering, and phylogeny estimation
Mon, 7 Jul 2003 13:02:16 -0400
Hello, I'm John Dziak, a graduate student.  I hope this is on topic.

I was listening to another student give a talk about her thesis research on
Bayesian mixture models.

She was doing a Bayesian cluster analysis with normal priors on data about
the levels of certain chemicals in river water at various observation points
along a large river.  She was able to do this, but only by treating the
observations as iid and clustering based on chemical similarity.  The result
was that the best model seemed to be a two-component model, so that there
were two levels of turbidity.  Many of the posts upstream belonged to the
low-turbidity group and many downstream belonged to the high-turbidity group
(or something like that; this could be done in a univariate or multivariate
case).  The fact that geographically close observation points were often
assigned to the same cluster seemed to be heuristic evidence that the
clustering was meaningful.

            A listener pointed out that the observations were in fact almost
certainly autocorrelated, and that he thought the analysis should take this
into account in some way.  I'm not sure what he meant by this;  perhaps (a)
he wanted the algorithm to be changed so that clustering is based partly on
geographical similarity and partly on chemical similarity, in order to divide
the river into regions based on chemical profile, or (b) he was thinking in
some hypothesis-testing sense, that the observations should not be considered
to come from different subpopulations (latent classes) if in fact the
similarity of adjacent observations was only due to autocorrelation.
Objection (b) does not make very much sense in this context because the whole
idea of clustering was to find a way to discretize variations in a continuous
process (i.e., divide the river into regions).  So I thought about objection
(a).

            I do not know very much at all about the algorithm she was using
(Bayesian fitting with MCMC) but I knew about an old modified k-means type
algorithm with EM which I learned about in a class I took from Dr. Lindsay
here at Penn State.  (He talked about much better topics than that, but that
was the best I was able to understand, being only a second-year grad
student.)  In this k-means algorithm we would iteratively (1) assign each
observation i a posterior probability p(j|i) of having come from cluster j,
based on its similarity to the previously estimated centroid (mean) of
cluster j, and (2) update the centroid of each cluster j as the average of
all the observations weighted by their posterior probabilities p(j|i).

            I thought that this could be altered to take into account spatial
as well as chemical similarity.  Basically I thought that the kind of
clustering we might really want to do would be a compromise between k-means
clustering in geographic space (latitude and longitude of each post) and
k-means clustering in chemical space (the vectors of observed data from each
post).

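            If we were just clustering iid observations then at each step we
would estimate

p(j|i) = f(j, y[i]) / sum(f(w, y[i]), w=1..k),

where f(j, y[i]) = p[j] * Normal(y[i] | m[j], sigmahat),

y[i] = observed data for observation i

p[j] = estimated weight for jth cluster

m[j] = estimated mean for jth cluster

sigmahat = estimated spread, taken to be the same for every cluster

so that, for a given cluster weight, p(j|i) shrinks as the squared distance
in data space between observation i and estimated centroid j grows (through
the factor exp(-d^2/2), with the distance d measured in units of sigmahat).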
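
The two-step iteration described earlier, combined with the formula for
p(j|i) above, might be sketched as follows for the univariate case.  This is
only a minimal illustration, not the algorithm from the talk or from the
class: the function names, the made-up readings, the starting centroids, and
the choice to hold sigma fixed at 1 rather than estimate sigmahat are all
assumptions made to keep the example short.

```python
import math

def normal_pdf(y, mean, sigma):
    """Density of Normal(mean, sigma^2) evaluated at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posteriors(y, weights, means, sigma):
    """Step (1): p(j|i) = f(j, y[i]) / sum over w of f(w, y[i])."""
    result = []
    for yi in y:
        f = [p * normal_pdf(yi, m, sigma) for p, m in zip(weights, means)]
        total = sum(f)
        result.append([fj / total for fj in f])
    return result

def update(y, post, k):
    """Step (2): cluster weights and centroids, weighted by p(j|i)."""
    n = len(y)
    weights, means = [], []
    for j in range(k):
        mass = sum(post[i][j] for i in range(n))
        weights.append(mass / n)
        means.append(sum(post[i][j] * y[i] for i in range(n)) / mass)
    return weights, means

def soft_kmeans(y, means, sigma=1.0, iterations=50):
    """Alternate steps (1) and (2); sigma is held fixed (an assumption)."""
    k = len(means)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        post = posteriors(y, weights, means, sigma)
        weights, means = update(y, post, k)
    return weights, means, posteriors(y, weights, means, sigma)

# Hypothetical readings: four "low" posts followed by four "high" posts.
y = [1.0, 1.2, 0.8, 1.1, 4.9, 5.2, 5.0, 4.8]
weights, means, post = soft_kmeans(y, means=[0.0, 6.0])
print(means)
```

With well-separated groups like these, the posteriors are close to 0 or 1 and
the procedure behaves much like ordinary k-means.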

            If we were just clustering the posts on their geographical
location, we could do the same thing with the vectors of longitude and
latitude.

            So perhaps we could compromise and say that for some chosen
constant a in [0,1],

p(j|i) = a*g(j,i) + (1-a)*h(j,i),

where g(j,i) is an index of geographical closeness between observation i and
the geographical centroid of cluster j, and h(j,i) is an index of chemical
similarity between observation i and the chemical centroid of cluster j, with
each index scaled to sum to 1 over the clusters so that p(j|i) is still a
probability.  Then if a=0, we are just doing k-means clustering as before.
If a=1, we are also doing k-means, but on geographical points rather than
points in the sample space.  If a is between zero and one then we are
compromising between the two approaches.  If we get similar solutions
regardless of a, then that might be evidence for spatial heterogeneity.  It
could also be interpreted as evidence for the existence of autocorrelation
but again, that is really the same thing here, only thought of as continuous
rather than discrete.  In a sense this proposal is really cluster analysis on
a larger data vector: chemical concentrations appended to latitude and
longitude, with the importance of the former versus the latter variables
weighted by the size of a.
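
One way the compromise could be tried out is sketched below.  The post does
not say what g and h should be, so this sketch assumes each is a Gaussian
kernel of the distance to the relevant centroid, normalized over clusters;
the function names, the scale parameters, and the toy coordinates are
likewise hypothetical.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def closeness(point, centroids, scale):
    """Assumed form of g or h: Gaussian kernel, normalized over clusters."""
    raw = [math.exp(-0.5 * sq_dist(point, c) / scale ** 2) for c in centroids]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_centroids(points, post, k):
    """Centroid of each cluster as the p(j|i)-weighted average of points."""
    dims = len(points[0])
    out = []
    for j in range(k):
        mass = sum(row[j] for row in post)
        out.append(tuple(
            sum(row[j] * pt[d] for row, pt in zip(post, points)) / mass
            for d in range(dims)))
    return out

def blended_clustering(geo, chem, geo_centroids, chem_centroids, a,
                       geo_scale=1.0, chem_scale=1.0, iterations=50):
    """Iterate p(j|i) = a*g + (1-a)*h, then update both sets of centroids."""
    k = len(geo_centroids)
    post = []
    for _ in range(iterations):
        post = []
        for x, y in zip(geo, chem):
            g = closeness(x, geo_centroids, geo_scale)
            h = closeness(y, chem_centroids, chem_scale)
            post.append([a * g[j] + (1 - a) * h[j] for j in range(k)])
        geo_centroids = weighted_centroids(geo, post, k)
        chem_centroids = weighted_centroids(chem, post, k)
    return post, geo_centroids, chem_centroids

def hard_labels(post):
    """Most probable cluster for each observation."""
    return [max(range(len(row)), key=row.__getitem__) for row in post]

# Hypothetical posts: position along the river, and one chemical reading.
geo = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (7.0, 0.0), (8.0, 0.0), (9.0, 0.0)]
chem = [(1.0,), (1.2,), (0.9,), (5.0,), (5.2,), (4.9,)]

for a in (0.0, 0.5, 1.0):
    post, gc, cc = blended_clustering(
        geo, chem, [(1.0, 0.0), (8.0, 0.0)], [(1.0,), (5.0,)], a)
    print(a, hard_labels(post))
```

Running the loop over several values of a is the comparison suggested above:
if the hard assignments barely change with a, geography and chemistry are
telling the same story.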

            I'm not very knowledgeable about Bayes but I thought that this
compromise (with 0<a<1) might be justified as proposing that, for each
cluster (region), the locations in 2-d geographical space of the observation
posts themselves were generated randomly (and independently, conditional on
j) from a prior mean location, and that also the chemical profiles (locations
in p-dimensional sample space) were generated randomly (and independently,
conditional on j) from a prior mean profile.  The latter assumption was
already there;  I would add the former assumption.  Of course, the former
assumption isn't literally true, but neither is the latter;  they just make
the idea of clustering possible.
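
To make the two assumptions concrete, here is what they would give if both
the location and the chemical profile were normal around their cluster means.
The symbols x_i (location of post i), mu_j (mean location of cluster j),
tau (geographic spread), and pi_j (cluster weight, written p[j] earlier) are
notation introduced only for this sketch, with spherical covariances assumed
for simplicity.

```latex
\begin{align*}
p(j \mid i)
  &= \frac{\pi_j \,\mathcal{N}(x_i \mid \mu_j, \tau^2 I)\,
           \mathcal{N}(y_i \mid m_j, \sigma^2 I)}
          {\sum_{w=1}^{k} \pi_w \,\mathcal{N}(x_i \mid \mu_w, \tau^2 I)\,
           \mathcal{N}(y_i \mid m_w, \sigma^2 I)} \\
  &\propto \pi_j \exp\!\left(
      -\frac{\lVert x_i - \mu_j \rVert^2}{2\tau^2}
      -\frac{\lVert y_i - m_j \rVert^2}{2\sigma^2}
    \right).
\end{align*}
```

Under these assumptions the geographic and chemical terms combine by
multiplication rather than by the weighted sum a*g + (1-a)*h, and the ratio
of 1/tau^2 to 1/sigma^2 plays the role of a.  This is the same as clustering
the appended vector (x_i/tau, y_i/sigma), which matches the "larger data
vector" reading of the proposal.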

            This other student is not using the k-means reasoning, so she was
unsure how my suggestion would fit into her algorithm.

I was unsure about whether my suggestion made sense to anyone other than me,
whether anyone had used it before, and whether it would work.