Hello, I'm John Dziak, a graduate student. I hope this is on topic.
I was listening to another student give a talk about her thesis research on
Bayesian mixture models.
She was doing a Bayesian cluster analysis with normal priors on data about
the levels of certain chemicals
in river water at various observation points along a large river. She was
able to do this, but only by treating
the observations as iid and clustering based on chemical similarity. The
result was that the best model
seemed to be a two-component model, so that there were two levels of
turbidity. Many of the posts
upstream belonged to the low-turbidity group and many downstream belonged to
the high-turbidity group
(or something like that; this could be done in a univariate or multivariate
case). The fact that geographically close
observation points were often assigned to the same cluster seemed to be
heuristic evidence that the clustering
was meaningful.
A listener pointed out that the observations were in fact almost
certainly autocorrelated, and that he
thought the analysis should take this into account in some way. I'm not
sure what he meant by this; perhaps
(a) he wanted the algorithm to be changed so that clustering is based partly
on geographical similarity and
partially on chemical similarity, in order to divide the river into regions
based on chemical profile, or (b) he was
thinking in some hypothesis-testing sense: that one should not posit
distinct subpopulations
(latent classes) of observations if the similarity of adjacent
observations was really just due to autocorrelation.
Objection b does not make very much sense in this context because the whole
idea of clustering was to find
a way to discretize variations in a continuous process (i.e., divide the
river into regions). So I thought about objection a.
I do not know very much at all about the algorithm she was using
(Bayesian fitting with MCMC) but I
knew about an old modified k-means-type algorithm with EM which I learned
about in a class I took from Dr.
Lindsay here at Penn State. (He talked about much better topics than that,
but that was the best I was
able to understand, being only a second-year grad student.) In this k-means
algorithm we would iteratively
(1) assign each observation i a posterior probability p(j|i) of having come
from cluster j, based on its similarity to the
previously estimated centroid (mean) of cluster j, and (2) update the
centroid of each cluster j as the average of
all the observations weighted by their posterior probabilities p(j|i).
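In case it helps to make the two steps concrete, here is a minimal sketch in
Python of that iteration for univariate data, assuming a fixed known sigma and
user-supplied starting centroids (the function and variable names are my own):

```python
import math

def soft_kmeans(ys, init_means, n_iter=50, sigma=1.0):
    """EM-style soft k-means on univariate data: alternate (1) posterior
    assignment probabilities p(j|i) and (2) weighted centroid updates."""
    means = list(init_means)
    k = len(means)
    weights = [1.0 / k] * k  # cluster weights p[j]
    for _ in range(n_iter):
        # Step 1: p(j|i) from similarity to the current centroids.
        post = []
        for y in ys:
            f = [weights[j] * math.exp(-(y - means[j]) ** 2 / (2 * sigma ** 2))
                 for j in range(k)]
            total = sum(f)
            post.append([fj / total for fj in f])
        # Step 2: each centroid becomes the posterior-weighted average.
        for j in range(k):
            wsum = sum(p[j] for p in post)
            means[j] = sum(p[j] * y for p, y in zip(post, ys)) / wsum
            weights[j] = wsum / len(ys)
    return means, weights

# Two well-separated groups; the centroids should settle near 0 and 10.
data = [0.1, -0.2, 0.3, 0.0, 9.8, 10.1, 10.3, 9.9]
means, weights = soft_kmeans(data, init_means=[0.0, 10.0])
```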
I thought that this could be altered to take into account
spatial as well as chemical similarity. Basically
I thought that the kind of clustering we might really want to do would be a
compromise between k-means
clustering in geographic space (latitude and longitude of each post) and
k-means clustering in chemical space
(the vectors of observed data from each post).
If we were just clustering iid observations then at each step we
would estimate
p(j|i) = f(j,y) / sum(f(w,y), w=1..k),
where f(j, y) = p[j] * Normal(y | m[j], sigmahat),
p[j] = estimated weight for jth cluster,
m[j] = estimated mean for jth cluster,
so that p(j|i) decreases with the squared distance in data
space between observation i and estimated centroid j.
If we were just clustering the posts on their geographical
location, we could do the same thing with
the vectors of longitude and latitude.
So perhaps we could compromise and say that for some chosen
constant a in [0,1],
p(j|i) = a*g(j,y) + (1-a)*h(j,y),
where g(j,y) is an index of geographical closeness between observation i and
the geographical centroid of cluster j,
and h(j,y) is an index of chemical similarity between observation i and the
chemical centroid of cluster j. Then if
a=0, we are just doing k-means clustering as before. If a=1, we are also
doing k-means, but on geographical points rather
than points in the sample space. If a is between zero and one then we are
compromising between the two approaches. If we
get similar solutions regardless of a, then that might be evidence for
spatial heterogeneity. It could also be
interpreted as evidence for the existence of autocorrelation but again, that
is really the same thing here, only thought of as
continuous rather than discrete. In a sense this proposal is really cluster
analysis on a larger data vector:
chemical concentrations appended to latitude and longitude, with the
importance of the former versus the latter variables
weighted by the size of a.
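As a sketch of how that larger-data-vector view might look in practice, here
is a hypothetical Python version using hard rather than soft assignments for
brevity: the geographic coordinates are scaled by sqrt(a) and the chemical
measurements by sqrt(1-a), so that squared Euclidean distance in the combined
space works out to a*(geographic distance)^2 + (1-a)*(chemical distance)^2.
This assumes both blocks of variables are already on comparable scales, and
all names and values below are invented for illustration:

```python
import math

def combined_kmeans(geo, chem, a, k, n_iter=25):
    """Hard k-means on the combined (geographic + chemical) feature vector,
    with the compromise between the two governed by a in [0, 1]."""
    # Build the "larger data vector" for each observation post.
    pts = [[math.sqrt(a) * g for g in geo[i]] +
           [math.sqrt(1 - a) * c for c in chem[i]]
           for i in range(len(geo))]
    cents = [list(pts[i]) for i in range(k)]  # naive initialization
    for _ in range(n_iter):
        # Assign each post to the nearest centroid in the combined space.
        labels = [min(range(k),
                      key=lambda j: sum((x - c) ** 2
                                        for x, c in zip(p, cents[j])))
                  for p in pts]
        # Recompute each centroid as the mean of its assigned posts.
        for j in range(k):
            members = [pts[i] for i in range(len(pts)) if labels[i] == j]
            if members:
                cents[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Three upstream posts (low readings) and three downstream (high readings).
geo = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
chem = [(1.0,), (1.1,), (0.9,), (5.0,), (5.2,), (4.8,)]
labels = combined_kmeans(geo, chem, a=0.5, k=2)
```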
I'm not very knowledgeable about Bayes but I thought that this
compromise (with 0<a<1) might be justified as
proposing that, for each cluster (region), the locations in 2-d geographical
space of the observation posts themselves were
generated randomly (and independently, conditional on j) from a prior mean
location, and that also the chemical profiles
(locations in p-dimensional sample space) were generated randomly (and
independently, conditional on j) from a prior
mean profile. The latter assumption was already there; I would add the
former assumption. Of course, the former assumption isn't literally true,
but neither is the latter; they just make the idea of clustering possible.
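To make that generative story concrete, here is a hypothetical simulation of
observation posts under it (in Python; all means, standard deviations, and
names below are invented for illustration): within each cluster, both a
post's 2-d location and its chemical profile are drawn independently around
cluster-specific means.

```python
import random

def simulate_river_posts(n_per_cluster, clusters, sd_geo=0.5, sd_chem=0.3,
                         seed=1):
    """Simulate posts under the proposed generative model: conditional on
    cluster j, location and chemical profile are independent draws around
    that cluster's mean location and mean profile."""
    rng = random.Random(seed)
    geo, chem, z = [], [], []
    for j, (mean_loc, mean_prof) in enumerate(clusters):
        for _ in range(n_per_cluster):
            geo.append(tuple(m + rng.gauss(0, sd_geo) for m in mean_loc))
            chem.append(tuple(m + rng.gauss(0, sd_chem) for m in mean_prof))
            z.append(j)
    return geo, chem, z

clusters = [((0.0, 0.0), (1.0,)),    # upstream region: low readings
            ((10.0, 10.0), (5.0,))]  # downstream region: high readings
geo, chem, z = simulate_river_posts(5, clusters)
```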
This other student is not using this k-means framework, so she
was unsure how my suggestion would fit into her algorithm.
I was unsure about whether my suggestion made sense to anyone other than
me, whether anyone had used it before, and whether it would work.
