Making parameter files for Multimix using R or Splus
Murray Jorgensen
The purpose of this message is to introduce a short S program to aid in the
creation of parameter files for the Multimix program. Multimix is a Fortran
77 program for mixture model based cluster analysis written by Lyn Hunt. It
is described in the paper 'Mixture model clustering using the Multimix
program' by L. Hunt and M. Jorgensen (Australian & New Zealand Journal of
Statistics, 41, 1999, pp. 153-171) and also in some papers at
ftp://ftp.math.waikato.ac.nz/pub/maj/ , from where the program itself and
related files may be downloaded, including the program pfile.rs that I am
describing here.
The distributions used by Multimix are built up from building blocks of
discrete distributions, (possibly multivariate) normal distributions, and
location models (also called 'conditional Gaussian' distributions) which
are p-variate normal distributions, except that the means may depend on a
(p+1)st discrete variable.
The model to be fitted is described to Multimix by way of a fully numeric
'parameter file'. An interactive Fortran program 'read3' is available to
create parameter files. These can also be created in a text editor either
from scratch, or by editing older parameter files. An explanation of the
parameter file format may be found in the file notes.ps at the above URL.
When the number of variables (attributes) is large, read3 can be tedious to
use, so I have written an S program that does the same job and may prove
more convenient for those familiar with R or Splus.
The following two examples should demonstrate its use.
Example 1.
Suppose we wish to make the following (within cluster) distributional
assumptions for a data set with nine attributes:
Attribute:  1  2  3  4  5  6  7  8  9
Var type:   C  D  D  C  C  C  D  D  C

Attributes   Distribution
1, 4         Bivariate Normal
2, 5         Location [3-category]
3            Discrete [binary]
7, 6         Location [5-category]
8            Discrete [4-category]
9            Univariate Normal
Here C stands for continuous, D for discrete.
Then in the R or Splus command window make the following assignments:
dvars <- c(3,8)
dlevs <- c(2,4)
nvars <- list(list(1,4),list(9))
lvars <- list(list(2,5),list(7,6))
llevs <- c(3,5)
file <- "d:/writing/multimix/examp.par" # or whatever
and paste in the S program. The file 'examp.par' that is created follows:
ngroups nobs 9 6 2
3 6 1 4 7 9 8 2 5
1 1 2 1 2 2
0 0 2 1 1 1
1 2 3 5 6 8
1 2 4 5 7 9
1 1 2 2 3 3
1 1 2 2 2 3 4 3 4
2 4 0 0 0 3 0 5 0
Replace this text by initial classification (class assignment for each
observation)
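Since each attribute has to be placed in exactly one block, it may be worth checking the assignments before pasting in the program. The following is a minimal sketch only: the helper name check_blocks is hypothetical and is not part of pfile.rs. The last two lines compute the position of each attribute when the blocks are listed in the order discrete, normal, location. For this example that gives 3 6 1 4 7 9 8 2 5, which agrees with the second line of examp.par, although that reading of the file is inferred from the example rather than taken from notes.ps.

```r
# Assignments from Example 1
dvars <- c(3, 8)
nvars <- list(list(1, 4), list(9))
lvars <- list(list(2, 5), list(7, 6))

# Hypothetical helper (not part of pfile.rs): TRUE if every attribute
# 1..nattr is named exactly once across the three kinds of block.
check_blocks <- function(dvars, nvars, lvars, nattr) {
  used <- c(dvars, unlist(nvars), unlist(lvars))
  identical(sort(as.numeric(used)), as.numeric(seq_len(nattr)))
}

ok <- check_blocks(dvars, nvars, lvars, nattr = 9)

# Attributes listed block by block (discrete, normal, location), and the
# position of each original attribute within that listing.
block_order <- c(dvars, unlist(nvars), unlist(lvars))
positions <- order(block_order)
```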
To fit a model with 2 groups to a data set with 100 observations replace
'ngroups' by 2, 'nobs' by 100 and the last two lines of the file by some
possibly random initial assignment like
1 2 1 1 2 2 2 2 1 2 2 1 2 2 1 1 2 2 1 1 1 2 1 2 2 1 2 2 2 1 1 1 1 2 1 2 1 2
1 2 2 1 1 1 2 2 2 1 1 2 1 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 1 1 1 2 1 2 1 2 2 1
2 2 1 2 1 2 1 2 2 2 2 1 2 2 1 1 1 2 1 1 2 2 2 2
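The same edit can be made from R instead of a text editor. What follows is a rough sketch under stated assumptions: fill_parfile is a hypothetical helper, not part of pfile.rs; it assumes the placeholder text starts with "Replace this text" and runs to the end of the file; and writing 38 values per line merely mirrors the layout shown above, since nothing here says that Multimix needs a particular line length. The demonstration works on a temporary file containing only a fragment of examp.par.

```r
# Hypothetical helper: fill in 'ngroups', 'nobs' and a random initial
# classification in a parameter file laid out like examp.par.
fill_parfile <- function(file, ngroups, nobs, per_line = 38) {
  lines <- readLines(file)
  # Substitute the two placeholder words on the first line
  lines[1] <- sub("ngroups", as.character(ngroups), lines[1], fixed = TRUE)
  lines[1] <- sub("nobs", as.character(nobs), lines[1], fixed = TRUE)
  # Drop the placeholder text (assumed to run to the end of the file)
  start <- grep("^Replace this text", lines)
  if (length(start) == 1) lines <- lines[seq_len(start - 1)]
  # One randomly chosen class per observation
  init <- sample(seq_len(ngroups), nobs, replace = TRUE)
  chunks <- split(init, ceiling(seq_along(init) / per_line))
  body <- vapply(chunks, paste, character(1), collapse = " ",
                 USE.NAMES = FALSE)
  writeLines(c(lines, body), file)
  invisible(init)
}

# Demonstration on a temporary file holding a fragment of examp.par
tmp <- tempfile(fileext = ".par")
writeLines(c("ngroups nobs 9 6 2",
             "3 6 1 4 7 9 8 2 5",
             "Replace this text by initial classification (class assignment for each",
             "observation)"), tmp)
set.seed(1)
init <- fill_parfile(tmp, ngroups = 2, nobs = 100)
result <- readLines(tmp)
```

Calling set.seed first makes the random assignment reproducible, which can help when comparing runs started from different initial classifications.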
Example 2.
This is based on data that I am working on now. There are 33 binary
variables in the first 33 input columns, followed by 21 continuous
variables. I am in the early stages of exploration with this data set, so I
am using a model with full local independence. This implies 33 discrete
variables, 21 univariate normals, and no location models. To set this up I
make the initial assignments:
dvars <- 1:33
dlevs <- rep(2,33)
nvars <- lapply(34:54,as.list)
lvars <- NULL
llevs <- numeric(0)
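For anyone unsure what lapply(34:54, as.list) produces, a quick inspection may help. This sketch only examines the objects assigned above; the names n_blocks and covered are introduced here for illustration.

```r
dvars <- 1:33
dlevs <- rep(2, 33)
nvars <- lapply(34:54, as.list)
lvars <- NULL
llevs <- numeric(0)

# Each element of nvars is a one-element list, i.e. a univariate normal
first_block <- nvars[[1]]

# Number of blocks: 33 discrete + 21 univariate normal + 0 location
n_blocks <- length(dvars) + length(nvars) + length(lvars)

# Every attribute 1..54 should appear exactly once
covered <- sort(c(dvars, unlist(nvars), unlist(lvars)))
```

Here first_block is the same as list(34L), n_blocks is 54, and covered runs from 1 to 54, matching the full local independence model described above.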
The output file will need editing to supply the number of groups, the
number of observations, and the initial assignment as in the first example.
The EM algorithm used by Multimix may also be initialized by specifying
starting parameter values, but it is usually easiest to do this from
parameter output files created by Multimix itself.
2000-11-09
Dr Murray Jorgensen http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
*Applications Editor, Australian and New Zealand Journal of Statistics*
Phone +64-7 838 4773 home phone 856 6705 Fax 838 4155
