Reply To: Classification, clustering, and phylogeny estimation
Date: Mon, 2 Jul 2007 11:04:56 -0700
Content-Type: text/plain
Have you considered the R interface to a public domain version of Random Forests?
http://cran.r-project.org/src/contrib/Descriptions/randomForest.html
You do not need to reduce the number of covariates.
It thrives on correlated covariates.
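
To make the idea concrete, here is a toy sketch in plain Python. It is not the randomForest package linked above, and every function name in it (`fit_forest`, `grow`, `make_data`, and so on) is made up for illustration. It only shows the two ingredients of the method: each tree is grown on a bootstrap sample, and each split considers a random subset of the covariates (here the square root of their number, a rule commonly used for classification). The simulated data deliberately contain three strongly correlated copies of one signal plus six noise columns.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        values = sorted({r[f] for r in rows})
        for t in values[:-1]:  # rule is x <= t, so both sides are non-empty
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow(rows, labels, n_try, depth, rng):
    """Grow one tree; a leaf is a class label, a node is (f, t, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    # Random subset of covariates, drawn afresh at every split.
    features = rng.sample(range(len(rows[0])), n_try)
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow([rows[i] for i in left], [labels[i] for i in left], n_try, depth - 1, rng),
            grow([rows[i] for i in right], [labels[i] for i in right], n_try, depth - 1, rng))

def predict_tree(node, row):
    """Walk one tree down to a leaf label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(rows, labels, n_trees=25, depth=3, seed=0):
    """Fit n_trees trees, each on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    n_try = max(1, int(p ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        forest.append(grow([rows[i] for i in idx], [labels[i] for i in idx],
                           n_try, depth, rng))
    return forest

def predict(forest, row):
    """Majority vote over the trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

def make_data(n=120, seed=0):
    """Simulated data: 3 correlated copies of one signal plus 6 noise columns."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        row = [z + rng.gauss(0, 0.3) for _ in range(3)]  # correlated covariates
        row += [rng.gauss(0, 1) for _ in range(6)]       # pure noise
        rows.append(row)
        labels.append(1 if z > 0 else 0)
    return rows, labels

if __name__ == "__main__":
    rows, labels = make_data()
    forest = fit_forest(rows, labels)
    hits = sum(predict(forest, r) == y for r, y in zip(rows, labels))
    print("training accuracy:", hits / len(rows))
```

The printed figure is training accuracy only, so it is optimistic. The real package also reports an out-of-bag error estimate, which this sketch leaves out.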
Regards,
- Kari Torkkola
William Shannon wrote:
> Hi Peter,
>
> I am unaware of SPINA and am downloading party now to look into that
> software. I generally have used rpart (because Salford is so expensive)
> but have never dealt with this many variables with rpart.
>
> Do you have any way to reduce the number of covariates before
> partitioning? I would be concerned about the curse of dimensionality
> with 900 variables and 1,000 data points. It would be very easy to find
> excellent classifiers based on noise. Some suggest that a split data
> set (train on one subset randomly selected from the 1,000 data points
> and test on the remaining) overcomes this. However, if X by chance due
> to the curse of dimensionality discriminates well then it will
> discriminate well in both the training and test data sets.
>
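
One way to probe the split-sample concern in the paragraph above is a small simulation on pure noise. The sketch below is an illustration only: the function names and the simple rule "predict class 1 when sign * x > 0" are assumptions made for the example, not anything taken from rpart, party, or Salford's software. It compares two procedures: picking the best single variable using all rows and then checking it on the test half (the selection has already seen the test rows), versus picking it on the training half alone.

```python
import random

def accuracy(xs, ys, sign):
    """Accuracy of the rule 'predict class 1 when sign * x > 0'."""
    hits = sum((sign * x > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(ys)

def best_rule(columns, labels, rows):
    """Return (accuracy, column index, sign) of the best single-variable rule
    when scored on the given row indices only."""
    ys = [labels[i] for i in rows]
    best = (0.0, 0, 1)
    for j, col in enumerate(columns):
        xs = [col[i] for i in rows]
        for sign in (1, -1):
            best = max(best, (accuracy(xs, ys, sign), j, sign))
    return best

def simulate(n=1000, p=900, seed=1):
    """All p covariates are noise, so no rule has real predictive value."""
    rng = random.Random(seed)
    labels = [rng.randrange(2) for _ in range(n)]
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    train = list(range(n // 2))
    test = list(range(n // 2, n))
    y_test = [labels[i] for i in test]

    # Leaky: the variable is chosen using every row, test half included.
    _, j, s = best_rule(columns, labels, train + test)
    leaky = accuracy([columns[j][i] for i in test], y_test, s)

    # Honest: the variable is chosen on the training half only.
    train_acc, j, s = best_rule(columns, labels, train)
    honest = accuracy([columns[j][i] for i in test], y_test, s)
    return train_acc, leaky, honest

if __name__ == "__main__":
    train_acc, leaky, honest = simulate()
    print("best rule, training-half accuracy:", train_acc)
    print("test accuracy, variable chosen on all rows:", leaky)
    print("test accuracy, variable chosen on training rows:", honest)
```

Because every covariate here is noise, the training-half figure is inflated by the search over 900 candidates. The test-half figure should tend toward 0.5 when the choice was made on the training rows alone, and should tend to stay above it when the choice was made on all rows. Read that way, the concern bites hardest when variable selection happens before the split.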
> Can you reduce the 900 covariates by PCA or perhaps use an upfront
> stepwise linear discriminant analysis with a high P value threshold to
> retain the covariate (say p = .2)? We have a paper where we proposed
> and tested a genetic algorithm to reduce the number of variables in
> microarray data that I can send you in a couple of weeks when I get back
> to St. Louis. It is being published in Sept. in the Interface Proceedings.
>
> Good luck.
> Bill Shannon
> Washington Univ. School of Medicine, St. Louis
> 314-704-8725
>
> Peter Flom <[log in to unmask]> wrote:
>
> I have been getting involved with classification trees, and have
> some questions regarding software. My data consist of the following:
>
> about 1,000 subjects - likely to increase but not dramatically
>
> about 900 independent or predictor variables - all continuous, some
> highly correlated, all standardized and approximately normally
> distributed
>
> outcome which can be dichotomous or categorical, with up to 10 or so
> categories.
>
> I have been using software from R - both Torsten Hothorn's party
> package and Therneau and Atkinson's rpart - but these bog down when
> the tree is not dichotomous.
>
> I have investigated Salford Systems' software, which is very
> impressive, but expensive, and may be beyond our budget.
>
> I've looked briefly at SPINA.
>
>
> I'd appreciate any advice or references to recent reviews.
>
> Thanks
>
> Peter L. Flom, PhD
> Brainscope, Inc.
> 212 263 7863 (MTW)
> 212 845 4485 (Th)
> 917 488 7176 (F)
>
>
> ---------------------------------------------- CLASS-L list.
> Instructions:
> http://www.classification-society.org/csna/lists.html#class-l