CLASS-L Archives

June 2008


Options: Use Monospaced Font
Show Text Part by Default
Show All Mail Headers

Message: [<< First] [< Prev] [Next >] [Last >>]
Topic: [<< First] [< Prev] [Next >] [Last >>]
Author: [<< First] [< Prev] [Next >] [Last >>]

Print Reply
Murray Jorgensen <[log in to unmask]>
Reply To:
Classification, clustering, and phylogeny estimation
Fri, 20 Jun 2008 12:34:35 +1200
text/plain (55 lines)
You can find a discussion on this type of clustering in McLachlan & 
Basford (1988). It's called Classification Maximum Likelihood and there 
are some problems with it. Generally in mixture models it is not 
advisable to treat the component assignments as parameters.

Murray Jorgensen

Torvik, Vetle Ingvald wrote:
> I have what appears to be a fundamental clustering problem but I have not
> been able to find much relevant literature on it: Given an n-by-n matrix
> of probabilities P, find the maximum likelihood clustering solution
> represented by an n-by-n binary matrix X where Xij = 1 if elements i and j
> are assigned to the same cluster, and 0 otherwise. That is,
> maximize product{ [Pij*Xij + (1-Pij)*(1-Xij)], over all pairs (i,j): i < j}
> subject to Xij + Xik + Xjk != 2 for all triples (i,j,k): i<j<k.
> P satisfies symmetry (i.e., Pij = Pji) but may have many transitivity
> violations of the form Pij + Pik  1 > Pjk.
> Do you know of someone who has studied this problem? I am particularly
> interested in fast algorithms that solve it optimally (or near-optimally).
> If not, do you know of any closely related problems (beyond traditional
> clustering) that could shed some light on this? If not, maybe you can
> point me to a person who is likely to know?
> Vetle Torvik
> PS. We have developed a procedure that seems to work well when applied to
> author name disambiguation (Torvik et al., A probabilistic similarity
> metric for Medline records: a model for author name disambiguation. JASIST
> 2005). This paper describes a model to estimate P. To find a clustering
> solution X, we first correct all transitivity violations by weighted least
> squares, and then perform agglomerative clustering (iteratively merge the
> pair of clusters that give the greatest instantaneous increase in
> likelihood until the likelihood hits max).
> ----------------------------------------------
> CLASS-L list.
> Instructions:

Dr Murray Jorgensen
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: [log in to unmask]                                Fax 7 838 4155
Phone  +64 7 838 4773 wk    Home +64 7 825 0441    Mobile 021 1395 862

CLASS-L list.