CLASS-L Archives

July 2007


From: William Shannon <[log in to unmask]>
Reply To: Classification, clustering, and phylogeny estimation
Date: Mon, 2 Jul 2007 06:25:11 -0700
Hi Peter,

I am unaware of SPINA and am downloading party now to look into that software.  I generally have used rpart (because Salford is so expensive) but have never dealt with this many variables with rpart.

Do you have any way to reduce the number of covariates before partitioning?  I would be concerned about the curse of dimensionality with 900 variables and 1,000 data points.  It would be very easy to find excellent classifiers based on noise.  Some suggest that a split data set (train on one subset randomly selected from the 1,000 data points and test on the remaining points) overcomes this.  However, if X, by chance due to the curse of dimensionality, discriminates well, then it will discriminate well in both the training and test data sets.
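To make the concern about noise concrete, here is a minimal simulation sketch in Python (standard library only). It is an illustration under stated assumptions, not an analysis of Peter's data: the predictors are pure Gaussian noise, the outcome is a balanced binary label unrelated to them, and each variable is scored by a simple median-cutoff rule. The function name `best_noise_accuracy` and the scoring rule are hypothetical choices made for this example.

```python
import random


def best_noise_accuracy(n_subjects, n_vars, seed=0):
    """Best in-sample accuracy of a one-variable median-cutoff rule
    when every predictor is pure noise (illustrative assumption)."""
    rng = random.Random(seed)
    # Balanced binary outcome, generated independently of all predictors.
    labels = [i % 2 for i in range(n_subjects)]
    rng.shuffle(labels)
    best = 0.0
    for _ in range(n_vars):
        # One noise predictor: standard normal draws.
        x = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        cutoff = sorted(x)[n_subjects // 2]
        # Predict class 1 at or above the cutoff, class 0 below it.
        hits = sum((xi >= cutoff) == (yi == 1) for xi, yi in zip(x, labels))
        acc = hits / n_subjects
        # The rule may be used in either direction, so keep the better one.
        best = max(best, acc, 1.0 - acc)
    return best
```

Comparing `best_noise_accuracy(1000, 1)` with `best_noise_accuracy(1000, 900)` shows how the best apparent accuracy can only go up as more noise variables are searched, because the selection step itself favors chance patterns. A full tree, which combines many splits, has even more room to fit noise than this single-split rule.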

Can you reduce the 900 covariates by PCA, or perhaps use an up-front stepwise linear discriminant analysis with a high p-value threshold for retaining a covariate (say p = .2)?  We have a paper where we proposed and tested a genetic algorithm to reduce the number of variables in microarray data that I can send you in a couple of weeks when I get back to St. Louis.  It is being published in Sept. in the Interface Proceedings.
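As a rough sketch of the lenient-threshold idea, the Python snippet below (standard library only) keeps a covariate when a two-group test gives p <= 0.2. Note that this swaps in a simpler technique: it is a univariate screen, not stepwise linear discriminant analysis, PCA, or the genetic algorithm from the paper. It assumes a binary outcome coded 0/1, at least two subjects per group, and a normal approximation to the Welch statistic; the function name `screen_covariates` and the parameter `p_keep` are hypothetical.

```python
from statistics import NormalDist, mean, variance


def screen_covariates(columns, labels, p_keep=0.2):
    """Return indices of covariates whose two-group p-value is <= p_keep.

    columns: list of covariates, each a list with one value per subject.
    labels:  list of 0/1 outcomes, one per subject.
    """
    kept = []
    for j, col in enumerate(columns):
        g0 = [x for x, y in zip(col, labels) if y == 0]
        g1 = [x for x, y in zip(col, labels) if y == 1]
        # Welch standard error of the difference in group means.
        se = (variance(g0) / len(g0) + variance(g1) / len(g1)) ** 0.5
        if se == 0.0:
            continue  # no within-group variation, nothing to test
        z = (mean(g1) - mean(g0)) / se
        # Two-sided p-value under a normal approximation (an assumption).
        p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        if p <= p_keep:
            kept.append(j)
    return kept
```

With 900 pure-noise covariates, a threshold of 0.2 would still be expected to let roughly a fifth of them through, so a screen like this only trims the candidate set; it does not by itself protect against the noise problem described above.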

Good luck.
Bill Shannon
Washington Univ. School of Medicine, St. Louis 

Peter Flom <[log in to unmask]> wrote:

 Tree software

 I have been getting involved with classification trees, and have some questions regarding software.  My data consist of the following:
 about 1,000 subjects - likely to increase but not dramatically
 about 900 independent or predictor variables - all continuous, some highly correlated, all standardized and approximately normally distributed
 outcome which can be dichotomous or categorical, with up to 10 or so categories.
 I have been using software from R - both Torsten Hothorn's party package and Therneau and Atkinson's rpart - but these bog down when the tree is not dichotomous
 I have investigated Salford Systems' software, which is very impressive, but expensive, and may be beyond our budget.
 I've looked briefly at SPINA
 I'd appreciate any advice or references to recent reviews.
 Peter L. Flom, PhD
 Brainscope, Inc.
 212 263 7863 (MTW)
 212 845 4485 (Th)
 917 488 7176 (F)
  ---------------------------------------------- CLASS-L list. Instructions: