There are many decision rules for the optimal number of classes; see,
for example, Gordon's book "Classification".
However, defining "optimal" is not mainly a mathematical problem.
You have to decide about what is good with respect to the aim of your
particular clustering problem.
I would expect the solution that is "best" in your sense to come from
optimizing a criterion that translates your requirements for a solution
into a mathematical formula as directly as possible.
Therefore, define your own optimality criterion, adapted to your
particular situation! (Or, at least, try to understand the requirements of
your particular problem well enough that you can use this to choose among
the many existing criteria.)
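As an illustration only, here is a minimal Python sketch of a stopping rule
in which the criterion is a replaceable function. It uses the Calinski and
Harabasz index mentioned in the quoted message below as the default, and it
merges only patches listed as adjacent. The function names, the greedy
"take the merge that improves the criterion most" rule, and the toy data are
assumptions made for this example, not the method described in the quoted
message.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Point = Sequence[float]
Criterion = Callable[[List[Point], List[int]], float]


def _mean(points: List[Point]) -> List[float]:
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def _sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(data: List[Point], labels: List[int]) -> float:
    """CH = [B / (k - 1)] / [W / (n - k)], with B the between-cluster and
    W the pooled within-cluster sum of squares."""
    n = len(data)
    clusters: Dict[int, List[Point]] = {}
    for point, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("index is defined only for 2 <= k < n")
    grand = _mean(data)
    between = sum(len(m) * _sq_dist(_mean(m), grand) for m in clusters.values())
    within = sum(_sq_dist(p, _mean(m)) for m in clusters.values() for p in m)
    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def merge_adjacent(
    data: List[Point],
    labels: List[int],
    adjacent: Iterable[Tuple[int, int]],
    criterion: Criterion = calinski_harabasz,
) -> List[int]:
    """Greedily merge adjacent patches while the criterion improves.

    `adjacent` holds pairs of patch labels that touch each other
    (hypothetical input format). Stops at two patches because the
    default criterion is undefined for a single cluster.
    """
    labels = list(labels)
    pairs = {tuple(sorted(p)) for p in adjacent}
    while len(set(labels)) > 2:
        best_score, best_pair = criterion(data, labels), None
        for a, b in sorted(pairs):
            # Tentatively relabel patch b as patch a and score the result.
            trial = [a if lab == b else lab for lab in labels]
            score = criterion(data, trial)
            if score > best_score:
                best_score, best_pair = score, (a, b)
        if best_pair is None:
            break  # no adjacent merge improves the criterion
        a, b = best_pair
        labels = [a if lab == b else lab for lab in labels]
        # Patch b no longer exists: redirect its adjacencies to patch a.
        relabeled = {
            tuple(sorted((a if x == b else x, a if y == b else y)))
            for x, y in pairs
        }
        pairs = {p for p in relabeled if p[0] != p[1]}
    return labels


# Toy chain of four patches (two reaches each), one habitat variable.
data = [[0.0], [0.2], [0.1], [0.3], [5.0], [5.2], [5.1], [5.3]]
start = [0, 0, 1, 1, 2, 2, 3, 3]
result = merge_adjacent(data, start, [(0, 1), (1, 2), (2, 3)])
```

On this toy chain the sketch should end with two patches, the low-valued and
the high-valued reaches. To try a criterion of your own, pass any function of
(data, labels) that returns a number to be maximized as `criterion`.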
On Mon, 11 Apr 2005, Travis Brenden wrote:
> Classification Listserve Members,
> I am working on a clustering method that can be applied to digital river
> networks in a geographic information system. Essentially, the goal is to cluster
> adjoining river reaches (i.e., river reaches that flow into each other) into
> larger habitat units (i.e., patches, valley segments) based on habitat data that
> are attributed to each river reach. I have found that most standard clustering
> methods do not
> work well with this type of data because the methods do not recognize the fact
> that only adjoining reaches should be clustered. I thus have constructed an
> algorithm that will "crawl" through a river network and form patches one at
> a time by iteratively merging adjoining river reaches, until no more adjoining
> reaches satisfy the merging threshold. Once a patch is formed, all reaches
> that comprise that patch are dropped from the candidate list of reaches so that
> they will not get clustered into another habitat patch. Right now, I am basing
> my threshold value on the average (or some other statistic) of the pairwise
> Euclidean differences between all river reaches in the network. Clustering also
> is based on Euclidean differences in the habitat variables.
> The method seems to work fairly well, but I would now like to try and merge
> neighboring patches into larger units until some "optimum" level of patches is
> found (although I realize "optimum" is probably a myth). Essentially, this
> concerns the implementation of a stopping rule for forming clusters. My
> current stopping rule is based on the Calinski and Harabasz (1974) index, which
> in a nutshell is the ratio of the between and pooled within cluster sum of
> squares. I thus iteratively merge the most similar patches until the Calinski
> and Harabasz index can no longer be improved.
> My question is whether the Calinski and Harabasz index is useful for this type
> of application (trying to find an "optimum" number of clusters), or whether
> anybody has other suggestions as to a better stopping rule. I would also
> be interested to hear if anybody had other suggestions concerning how to
> cluster only adjoining river reaches. I have done a number of web searches for
> a better method but I have always come up empty. To me, this is a form of
> spatially-constrained clustering, but I have not come across anything similar
> in other fields.
> Thanks in advance for any suggestions that might be provided.
> Travis Brenden
> School of Natural Resources and Environment
> University of Michigan
> 212 Museums Annex
> Ann Arbor, MI 48109-1084
> 734-663-3554 (Ext. 122)
*** NEW ADDRESS! ***
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
www.homepages.ucl.ac.uk/~ucakche