Classification, clustering, and phylogeny estimation
Sat, 24 Jun 2000 09:10:32 +0300
At 11:34 23.6.2000, you wrote:
>can someone tell me the limits on the number of variables in relation to
>sample size? Are there any good references on this topic? Thanks in advance,
A good empirical rule for discriminant analysis (classification) is:
N is the number of observations (sample size);
p is the number of variables.
However, this rule is appropriate for the two-class case.
In general, the minimum sample size depends on the procedure you use.
Parametric statistical procedures require a smaller N,
while nonparametric ones require a larger N.
But you should use as many training observations as possible,
unless you have a very large data set.
In any case, you need an unbiased estimate of
the classification accuracy (cross-validation; leave-one-out;
test sample) in order to evaluate the particular classifier.
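
As a rough illustration of the leave-one-out estimate mentioned above,
here is a minimal Python sketch. The nearest-centroid classifier, the
function names, and the data are assumptions made only for this example;
any discriminant procedure could be plugged in instead.

```python
from math import dist


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose mean vector (centroid) is closest to x."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    return min(centroids, key=lambda c: dist(centroids[c], x))


def leave_one_out_accuracy(X, y):
    """Hold out each observation once, train on the rest, count the hits."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Made-up two-class data: N = 8 observations, p = 2 variables.
X = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.8, 1.9],
     [4.0, 5.0], [4.2, 4.8], [3.9, 5.3], [4.1, 5.1]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

loo_acc = leave_one_out_accuracy(X, y)
print(loo_acc)
```

Each observation is classified by a model that never saw it, which is
what keeps the estimate from being optimistic in the way the
resubstitution (training-set) accuracy is.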
And the most important questions are:
a) selection of variables (the best subset).
Even if you have many variables and a moderate N,
you can use different variable selection procedures
to decrease p;
b) choice of an appropriate discriminant procedure.
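
For point a), one common family of variable selection procedures is
greedy forward selection. The sketch below is only an illustration under
stated assumptions: the nearest-centroid classifier, the leave-one-out
score used as the selection criterion, the function names, and the data
are all made up for this example.

```python
from math import dist


def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of a nearest-centroid rule on the given columns."""
    correct = 0
    for i in range(len(X)):
        groups = {}
        for k, (row, label) in enumerate(zip(X, y)):
            if k != i:  # hold out observation i
                groups.setdefault(label, []).append([row[c] for c in cols])
        centroids = {lab: [sum(col) / len(rows) for col in zip(*rows)]
                     for lab, rows in groups.items()}
        x = [X[i][c] for c in cols]
        pred = min(centroids, key=lambda lab: dist(centroids[lab], x))
        correct += pred == y[i]
    return correct / len(X)


def forward_select(X, y):
    """Add the variable that most improves the score; stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scores = {c: loo_accuracy(X, y, selected + [c]) for c in remaining}
        cand = max(scores, key=scores.get)
        if scores[cand] <= best:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = scores[cand]
    return selected, best


# Made-up data: variable 0 separates the classes, variables 1 and 2 are noise.
X = [[1.0, 5.0, 0.5], [1.2, -4.0, 0.4], [0.9, 3.0, 0.6], [1.1, -6.0, 0.5],
     [3.0, -5.0, 0.5], [3.2, 4.0, 0.6], [2.9, -3.0, 0.4], [3.1, 6.0, 0.5]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

chosen, acc = forward_select(X, y)
print(chosen, acc)
```

Note that an accuracy used to pick the variables is itself optimistic,
so the final classifier should still be checked on observations that
played no part in the selection.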
Ognian Asparoukhov Phone: ++(359) 2 700-528
Centre of Biomedical Engineering ++(359) 2 700-326
Bulgarian Academy of Sciences Fax: ++(359) 2 723-787
Acad. Georgi Bonchev Street, Bl. 105
1113 Sofia, BULGARIA