William Shannon wrote:
Do you have any way to reduce the number of covariates before partitioning? I would be concerned about the curse of dimensionality with 900 variables and 1,000 data points. It would be very easy to find excellent classifiers based on noise. Some suggest that a split data set (train on one subset randomly selected from the 1,000 data points and test on the remaining points) overcomes this. However, if X, by chance and because of the curse of dimensionality, discriminates well, then it will discriminate well in both the training and test data sets.
and suggested the following experiment:
1. Simulate a dataset consisting of 1,000 data points and 900 covariates where each covariate value comes from a normal(0,1) (or any other distribution) -- everything independent from each other.
2. Randomly assign the first 500 data points to group 1 and the second 500 data points to group 2.
3. Fit your favorite discriminator to predict these two groups and see how well you can do with random data.
4. After identifying the best-fitting model, remove those covariates and redo the analysis.
I did this.
Using party, there were no viable splits in the original data.
Here is my code:
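library(MASS)   # provides mvrnorm
library(party)  # provides ctree
# 1,000 observations on 900 independent N(0,1) covariates
treetestdata <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, 900), Sigma = diag(900)))
treetestdata$GROUP <- factor(rep(c("G1", "G2"), each = 500))
treetest <- ctree(GROUP ~ ., data = treetestdata)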
The result was a single node containing all 1,000 subjects, 500 from each of the two groups.
There was, thus, no way to do steps 3 or 4.
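For anyone who wants to try step 3 with a discriminator that always returns a fit, here is a rough sketch of one way to do it, together with the split-sample check mentioned in the quoted message. It is not part of what I ran above. The seed, the univariate t-test screen, and the nearest-mean classifier are all assumptions chosen only to keep the example short and in base R.

```r
# Sketch only: pure-noise data, univariate screening, then a held-out check.
set.seed(1)                                      # arbitrary seed (assumption)
n <- 1000
p <- 900
x <- matrix(rnorm(n * p), nrow = n, ncol = p)    # independent N(0,1) covariates
group <- factor(rep(c("G1", "G2"), each = 500))  # labels unrelated to x

train <- sample(n, n / 2)                        # random half for training
test  <- setdiff(seq_len(n), train)              # remaining half for testing
g_train <- group[train]

# Stand-in for "your favorite discriminator": keep the single covariate
# with the smallest two-sample t-test p-value on the training half.
pvals <- apply(x[train, ], 2, function(col) {
  t.test(col[g_train == "G1"], col[g_train == "G2"])$p.value
})
best <- which.min(pvals)

# Classify each point by whichever training-group mean is closer.
m <- tapply(x[train, best], g_train, mean)
classify <- function(v) {
  factor(ifelse(abs(v - m[["G1"]]) < abs(v - m[["G2"]]), "G1", "G2"),
         levels = levels(group))
}

acc_train <- mean(classify(x[train, best]) == g_train)
acc_test  <- mean(classify(x[test, best]) == group[test])
print(c(train = acc_train, test = acc_test))
```

Comparing the two accuracies over several seeds would show how much of the apparent training performance carries over to the held-out half.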
Now, it's true that these variables are uncorrelated and mine in real life are correlated. I can play around with that a little bit, but I don't have time to do so right now. If others are interested in playing around with this structure, I'd appreciate seeing any results.
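As a possible starting point for the correlated case, here is a minimal sketch that swaps the identity covariance for a compound-symmetric one. The common correlation rho = 0.3, the seed, and the object names xcor and cortree are illustrative assumptions, not values taken from my real data. The tree is fitted only if party is installed.

```r
library(MASS)  # provides mvrnorm

set.seed(1)    # arbitrary seed (assumption)
p   <- 900
rho <- 0.3     # assumed common correlation between every pair of covariates

# Compound-symmetric covariance: 1 on the diagonal, rho everywhere else.
Sigma <- matrix(rho, nrow = p, ncol = p)
diag(Sigma) <- 1

xcor <- as.data.frame(mvrnorm(n = 1000, mu = rep(0, p), Sigma = Sigma))
xcor$GROUP <- factor(rep(c("G1", "G2"), each = 500))  # labels unrelated to covariates

# Fit the same kind of tree as above, if party is available.
if (requireNamespace("party", quietly = TRUE)) {
  cortree <- party::ctree(GROUP ~ ., data = xcor)
  print(cortree)
}
```

Other covariance structures (for example, blocks of correlated variables) could be tried by changing only how Sigma is built.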