CLASS-L Archives

June 2008


Options: Use Monospaced Font
Show Text Part by Default
Show All Mail Headers

Message: [<< First] [< Prev] [Next >] [Last >>]
Topic: [<< First] [< Prev] [Next >] [Last >>]
Author: [<< First] [< Prev] [Next >] [Last >>]

Print Reply
"Torvik, Vetle Ingvald" <[log in to unmask]>
Reply To:
Classification, clustering, and phylogeny estimation
Thu, 19 Jun 2008 19:11:32 -0500
text/plain (35 lines)
I have what appears to be a fundamental clustering problem but I have not
been able to find much relevant literature on it: Given an n-by-n matrix
of probabilities P, find the maximum likelihood clustering solution
represented by an n-by-n binary matrix X where Xij = 1 if elements i and j
are assigned to the same cluster, and 0 otherwise. That is,

maximize product{ [Pij*Xij + (1-Pij)*(1-Xij)], over all pairs (i,j): i < j}
subject to Xij + Xik + Xjk != 2 for all triples (i,j,k): i<j<k.

P satisfies symmetry (i.e., Pij = Pji) but may have many transitivity
violations of the form Pij + Pik  1 > Pjk.

Do you know of someone who has studied this problem? I am particularly
interested in fast algorithms that solve it optimally (or near-optimally).
If not, do you know of any closely related problems (beyond traditional
clustering) that could shed some light on this? If not, maybe you can
point me to a person who is likely to know?

Vetle Torvik

PS. We have developed a procedure that seems to work well when applied to
author name disambiguation (Torvik et al., A probabilistic similarity
metric for Medline records: a model for author name disambiguation. JASIST
2005). This paper describes a model to estimate P. To find a clustering
solution X, we first correct all transitivity violations by weighted least
squares, and then perform agglomerative clustering (iteratively merge the
pair of clusters that give the greatest instantaneous increase in
likelihood until the likelihood hits max).

CLASS-L list.