On Thu, 3 Jan 2002, George Feretzakis wrote:

> In order to compare Clustering methods, I need an artificial data
> set which is formed in true clusters.  I would therefore greatly
> appreciate anyone sending me a data set like this or information
> where I could find this kind of data sets and if there is any
> simpler algorithm for generating artificial data.

Three suggestions for you. First, an algorithm developed by Glenn Milligan:

@Article{Milligan:1985,
  author =       {Glenn W. Milligan},
  title =        {An Algorithm for Generating Artificial Test Clusters},
  journal =      {Psychometrika},
  year =         1985,
  volume =       50,
  number =       1,
  pages =        {123--127},
  month =        {March}
}

A C++ implementation of Milligan's algorithm has been written by Dan
Pape, based on an implementation in C that I wrote. You can get it here:

http://clusutils.sourceforge.net/

Compiling it on a Unix or Linux system should be straightforward with
recent versions of the standard GNU development tools. We haven't heard
much from people who have compiled it on other platforms, but we'd like to.

Dr. Milligan's implementation is still available at the CSNA website
as Fortran source and (I think) a DOS executable:

http://www.pitt.edu/~csna/Milligan/

You might also try the algorithm developed by Waller et al.:

@Article{waller99,
  author =       {Waller, N.G. and Underhill, J. M. and Kaiser, H. A.},
  title =        {A method for generating simulated plasmodes and
                  artificial test clusters with user-defined shape,
                  size, and orientation},
  journal =      {Multivariate Behavioral Research},
  year =         1999,
  volume =       34,
  number =       2,
  pages =        {123--142},
}

Dr. Waller has links to Windows executables and S-PLUS implementations of
the algorithm from this page:

http://peabody.vanderbilt.edu/depts/psych_and_hd/faculty/wallern/

Final suggestion: these algorithms were developed to support the same
kind of research you are proposing to begin. Review the methodologies
Milligan and Waller employed, not just the algorithms for the test
data.  Here are two starting points that you can also use in citation
searches:

@Article{Milligan:1980,
  author =       {Glenn W. Milligan},
  title =        {An Examination of the Effect of Six Types of Error
                  Perturbation on Fifteen Clustering Algorithms},
  journal =      {Psychometrika},
  year =         1980,
  volume =       45,
  number =       3,
  pages =        {325--341},
  month =        {September}
}

@Article{Waller:1998b,
  author =       {Niels G. Waller and Heather A. Kaiser and Janine
                  B. Illian and Mike Manry},
  title =        {A Comparison of the Classification Capabilities of the
                  1-Dimensional Kohonen Neural Network with Two
                  Partitioning and Three Hierarchical Cluster Analysis
                  Algorithms},
  journal =      {Psychometrika},
  year =         1998,
  volume =       63,
  number =       1,
  pages =        {5--22},
  month =        {March}
}
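
On the question of a simpler way to generate artificial data: here is a
minimal sketch in Python. It is not Milligan's or Waller's algorithm. It
only draws spherical Gaussian clusters around well-separated centers and
keeps the true labels, which is enough to check whether a clustering
method recovers a known partition. The function name, its parameters, and
the layout of the centers are my own assumptions for illustration.

```python
import random


def make_clusters(n_clusters=3, points_per_cluster=50, n_dims=2,
                  spread=1.0, separation=10.0, seed=0):
    """Return (points, labels) for spherical Gaussian clusters.

    labels[i] is the true cluster of points[i], so the output can be
    compared against whatever partition a clustering method produces.
    """
    rng = random.Random(seed)  # fixed seed makes the data set reproducible
    points, labels = [], []
    for k in range(n_clusters):
        # Assumed layout: centers spaced along the first axis, so clusters
        # stay distinct as long as separation is much larger than spread.
        center = [k * separation] + [0.0] * (n_dims - 1)
        for _ in range(points_per_cluster):
            points.append([rng.gauss(c, spread) for c in center])
            labels.append(k)
    return points, labels


points, labels = make_clusters()
print(len(points), sorted(set(labels)))  # prints: 150 [0, 1, 2]
```

Shrinking separation or raising spread makes the clusters overlap, which
gives a crude way to vary the difficulty of the recovery problem.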

cheers,

Dave Dubin
ISRL, UIUC