With the SIGNAL mean set to 1, party split only on SIGNAL, eventually getting to 4 terminal nodes, with splits at -0.237, 0.602, and 1.29. The proportion correctly placed was about 0.8 for the two extreme nodes and about 0.6 for the two middle nodes.

With the SIGNAL mean set to 0.5, it made only one split, on SIGNAL at 0.732, with about 55% to 60% correctly placed.
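
As a rough benchmark for these numbers (my own sketch, not something party reports): if the two groups are N(0, 1) and N(d, 1) with equal sizes, the best single cut point is d/2 and the expected proportion correctly placed is pnorm(d/2). The names d, threshold, and accuracy are illustrative only.

```r
# Benchmark for one split on SIGNAL, assuming group 1 ~ N(0, 1),
# group 2 ~ N(d, 1), and equal group sizes (as in the simulation).
d <- c(1, 0.5)             # the two SIGNAL means tried in this thread

threshold <- d / 2         # cut point that minimizes misclassification
accuracy  <- pnorm(d / 2)  # expected proportion correctly placed

benchmark <- round(data.frame(d, threshold, accuracy), 3)
print(benchmark)
```

This gives roughly 0.69 for d = 1 and 0.60 for d = 0.5, so the 55% to 60% seen with the weaker signal is close to what a single cut on SIGNAL can be expected to achieve.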

Peter L. Flom, PhD

Brainscope, Inc.

212 263 7863 (MTW)

212 845 4485 (Th)

917 488 7176 (F)

-----Original Message-----

From: William Shannon

Sent: Tue 7/3/2007 6:18 AM

To: Classification, clustering, and phylogeny estimation

Cc: Peter Flom

Subject: Re: Tree software

That is surprising. I generated the same data as you did and ran

library(rpart)

a=rpart(as.factor(GROUP)~., data=treetestdata)

and obtained a tree with 24 terminal nodes.

Add a true signal variable to your data and let us know how party does. This can be done by adding the third line to your data generation code:

treetestdata <- as.data.frame(mvrnorm(n = 1000, mu=rep(0,900), Sigma = diag(900)))

treetestdata$GROUP <- rep(c("G1","G2"), each=500)

treetestdata$SIGNAL <- c(rnorm(500, mean=0), rnorm(500, mean=1))

With mean = 1, rpart split on SIGNAL first and then on other variables; with mean = 0.5, rpart split on other variables first and on SIGNAL only later.
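
For anyone who wants to check this kind of result without reading the plot, here is a minimal sketch that counts terminal nodes and lists the split variables from the fitted rpart object. Assumptions that are not from the messages above: the seed, the object names (fit, n_leaves, root_var, split_vars), and the use of rnorm in place of mvrnorm, which has the same distribution when Sigma = diag(900).

```r
library(rpart)

set.seed(1)  # assumed seed; none was given in the thread
n <- 1000
p <- 900

# iid N(0, 1) noise covariates, named V1..V900 by as.data.frame()
treetestdata <- as.data.frame(matrix(rnorm(n * p), nrow = n))
treetestdata$GROUP  <- factor(rep(c("G1", "G2"), each = 500))
treetestdata$SIGNAL <- c(rnorm(500, mean = 0), rnorm(500, mean = 1))

fit <- rpart(GROUP ~ ., data = treetestdata)

# Terminal nodes are the rows of fit$frame labelled "<leaf>"
n_leaves <- sum(fit$frame$var == "<leaf>")

# Variable used at the root (this is "<leaf>" if no split was made)
root_var <- as.character(fit$frame$var[1])

# All variables used for splitting anywhere in the tree
split_vars <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

print(n_leaves)
print(root_var)
print(split_vars)
```

Changing mean = 1 to mean = 0.5 in the SIGNAL line reruns the weaker-signal case. The exact tree will vary with the seed.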

Bill

Peter Flom wrote:

RE: Tree software

William Shannon wrote

<<<

Do you have any way to reduce the number of covariates before partitioning? I would be concerned about the curse of dimensionality with 900 variables and 1,000 data points. It would be very easy to find excellent classifiers based on noise. Some suggest that a split data set (train on one subset randomly selected from the 1,000 data points and test on the remaining) overcomes this. However, if X by chance, due to the curse of dimensionality, discriminates well, then it will discriminate well in both the training and test data sets.

>>>

and suggested the following experiment:

<<<

1. Simulate a dataset consisting of 1,000 data points and 900 covariates where each covariate value comes from a normal(0,1) (or any other distribution) -- everything independent from each other.

2. Randomly assign the first 500 data points to group 1 and the second 500 data points to group 2

3. Fit your favorite discriminator to predict these two groups and see how well you can do with random data.

4. After identifying the best-fitting model, remove those covariates and redo the analysis.

>>>
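
The experiment quoted above can be sketched in base R along the following lines. This is only an illustration under stated assumptions: it swaps in a per-covariate two-sample t-test as a stand-in for "your favorite discriminator", and the seed, the helper pval_for, and the other object names are mine, not from the thread.

```r
set.seed(42)  # assumed seed
n <- 1000
p <- 900

# Step 1: 900 independent N(0, 1) covariates
X <- matrix(rnorm(n * p), nrow = n)

# Step 2: first 500 rows are group 1, last 500 are group 2
group <- rep(c(1, 2), each = 500)

# Hypothetical helper: two-sample t-test p-value for one covariate
pval_for <- function(x, g) t.test(x[g == 1], x[g == 2])$p.value

# Step 3 (simplified): screen every covariate on the full data
pvals <- apply(X, 2, pval_for, g = group)
best  <- which.min(pvals)
print(min(pvals))         # smallest p-value found in pure noise
print(sum(pvals < 0.05))  # about 5% of 900 are expected by chance

# Split-sample check: pick the best covariate on a random half,
# then test that same covariate on the held-out half
train      <- sample(n, n / 2)
p_train    <- apply(X[train, ], 2, pval_for, g = group[train])
best_train <- which.min(p_train)
p_test     <- pval_for(X[-train, best_train], group[-train])
print(p_test)

# Step 4: drop the selected covariate and screen again
pvals_again <- apply(X[, -best, drop = FALSE], 2, pval_for, g = group)
```

Comparing min(p_train) with p_test over several seeds is one way to see how a covariate chosen on the training half behaves on the held-out half.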

I did this.

Using party, there were no viable splits in the original data.

Here is my code:

<<<

library(MASS)   # for mvrnorm

library(party)  # for ctree

treetestdata <- as.data.frame(mvrnorm(n = 1000, mu=rep(0,900), Sigma = diag(900)))

treetestdata$GROUP <- rep(c("G1","G2"), each=500)

treetest <- ctree(as.factor(GROUP) ~ ., data = treetestdata)

plot(treetest)

>>>

The result was a single node, with 500 subjects in each of the two groups.

There was, thus, no way to do steps 3 or 4.

Now, it's true that these variables are uncorrelated and mine in real life are correlated. I can play around with that a little bit, but don't have time to do so right now. If others are interested in playing around with this structure, I'd appreciate seeing any results.
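
For anyone who does want to try a correlated structure, one simple starting point is to give every pair of covariates the same correlation. This is a sketch only: the value rho = 0.3, the seed, and the equal-correlation structure are assumptions chosen for illustration, not a description of the real data.

```r
library(MASS)  # for mvrnorm

set.seed(7)    # assumed seed
n   <- 1000
p   <- 900
rho <- 0.3     # assumed common pairwise correlation

# Covariance matrix with 1 on the diagonal and rho everywhere else
Sigma <- matrix(rho, nrow = p, ncol = p)
diag(Sigma) <- 1

treetestdata <- as.data.frame(mvrnorm(n = n, mu = rep(0, p), Sigma = Sigma))

# Sanity check: the average off-diagonal sample correlation should be near rho
R <- cor(treetestdata)
mean_offdiag <- mean(R[upper.tri(R)])
print(mean_offdiag)

treetestdata$GROUP <- rep(c("G1", "G2"), each = 500)
```

The resulting treetestdata can be passed to ctree or rpart exactly as in the uncorrelated case.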

---------------------------------------------- CLASS-L list. Instructions: http://www.classification-society.org/csna/lists.html#class-l