## CLASS-L@LISTS.SUNYSB.EDU


 Subject: Re: I need some artificial data.
 From: John Day
 Reply To: Classification, clustering, and phylogeny estimation
 Date: Thu, 3 Jan 2002 00:13:14 -0500
```
Let me add an additional bit of advice.

In real-world problems, data often does not line up in neat balls or
clusters. Quite often the algorithms fail to aggregate groups which we feel
should have been grouped together, or the algorithm will join groups which
we feel should have been split.

That is because all the algorithms require some sort of 'vigilance
parameter', such as the predetermined number of clusters or some criterion
for splitting and joining, based on some concept of 'how big is big?'  'how
far is far?' etc.

The choice of this parameter thus makes the clustering problem somewhat
subjective.

Therefore, to assess algorithms I recommend that you start with 2- or
3-dimensional data sets, which you can scatter-plot independently to verify
how well the algorithms performed.

Move to higher dimensions, once you've satisfied yourself that the
algorithm is working correctly on the clusters you can perceive with your
own eyes.

HTH,
John Day, Staff Scientist
Computer Science Innovations, Inc
Melbourne, FL
http://www.csi.cc

At 10:36 AM 1/3/02 -0600, you wrote:
>On Thu, 3 Jan 2002, George Feretzakis wrote:
>
> > In order to compare Clustering methods, I need an artificial data
> > set which is formed in true clusters.  I would therefore greatly
> > appreciate anyone sending me a data set like this or information
> > where I could find this kind of data sets and if there is any
> > simpler algorithm for generating artificial data.
>
>Three suggestions for you. First, an algorithm developed by Glenn Milligan:
>
>@Article{Milligan:1985,
>   author =       {Glenn W. Milligan},
>   title =        {An Algorithm for Generating Artificial Test Clusters},
>   journal =      {Psychometrika},
>   year =         1985,
>   volume =       50,
>   number =       1,
>   pages =        {123--127},
>   month =        {March}
>}
>
>A C++ implementation of Milligan's algorithm has been written by Dan
>Pape, based on an implementation in C that I wrote. You can get it here:
>
>http://clusutils.sourceforge.net/
>
>Compiling it on a Unix or Linux system will be easy using recent
>versions of the standard Gnu development tools. We haven't heard much
>from people who have compiled it for other platforms, but we'd like to.
>
>Dr. Milligan's implementation is still available at the CSNA website
>as Fortran source and (I think) a DOS executable:
>
>http://www.pitt.edu/~csna/Milligan/
>
>You might also try the algorithm developed by Waller, et al:
>
>@Article{waller99,
>   author =       {Waller, N.G. and Underhill, J. M. and Kaiser, H. A.},
>   title =        {A method for generating simulated plasmodes and
>                   artificial test clusters with user-defined shape,
>                   size, and orientation},
>   journal =      {Multivariate Behavioral Research},
>   year =         1999,
>   volume =       34,
>   number =       2,
>   pages =        {123--142},
>}
>
>Dr. Waller has links to Windows executables and Splus implementations of the
>algorithm from this page:
>
>  http://peabody.vanderbilt.edu/depts/psych_and_hd/faculty/wallern/
>
>Final suggestion: these algorithms were developed to support the same
>kind of research you are proposing to begin. Review the methodologies
>Milligan and Waller employed, not just the algorithms for the test
>data.  Here are two starting points that you can also use in citation
>searches:
>
>@Article{Milligan:1980,
>   author =       {Glenn W. Milligan},
>   title =        {An Examination of the Effect of Six Types of Error
>                   Perturbation on Fifteen Clustering Algorithms},
>   journal =      {Psychometrika},
>   year =         1980,
>   volume =       45,
>   number =       3,
>   pages =        {325--341},
>   month =        {September}
>}
>
>@Article{Waller:1998b,
>   author =       {Niels G. Waller and Heather A. Kaiser and Janine
>                   B. Illian and Mike Manry},
>   title =        {A Comparison of the Classification Capabilities of the
>                   1-Dimensional Kohonen Neural Network with Two
>                   Partitioning and Three Hierarchical Cluster Analysis
>                   Algorithms},
>   journal =      {Psychometrika},
>   year =         1998,
>   volume =       63,
>   number =       1,
>   pages =        {5--22},
>   month =        {March}
>}
>
>cheers,
>
>Dave Dubin
>ISRL, UIUC
```
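
The advice above (start with low-dimensional data whose true grouping you already know, then check what the algorithm recovers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not Milligan's or Waller's generator, just isotropic Gaussian blobs around hand-picked centers. The function names (`make_clusters`, `kmeans`, `rand_index`), the centers, and the spread are all hypothetical choices, and plain k-means with a farthest-first start stands in for whichever clustering method is under test. Here `k` plays the role of the 'vigilance parameter'.

```python
import random

def make_clusters(centers, n_per_cluster, spread, seed=0):
    """Draw n_per_cluster points around each center from an isotropic Gaussian."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append(tuple(rng.gauss(c, spread) for c in center))
            labels.append(label)
    return points, labels

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """Plain k-means with a deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # next seed: the point farthest from all centroids chosen so far
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same vs. different group)."""
    n = len(a)
    agree = sum(
        (a[i] == a[j]) == (b[i] == b[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return agree / (n * (n - 1) / 2)

# Three well-separated 2-D clusters (illustrative values only).
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points, truth = make_clusters(centers, n_per_cluster=30, spread=0.5)
found = kmeans(points, k=3)
score = rand_index(truth, found)
print(f"Rand index against the known grouping: {score:.3f}")
```

Because the points are 2-dimensional, they can also be scatter-plotted and coloured by `found` to confirm by eye what the Rand index reports. Increasing `spread` or moving the centers closer together gives progressively harder test cases, and setting `k` to the wrong value shows how strongly the result depends on that parameter.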