CLASS-L Archives

February 2005


"Classification, clustering, and phylogeny estimation" <[log in to unmask]>
Leif Peterson <[log in to unmask]>
Sat, 5 Feb 2005 13:49:42 -0600
I have 3 classes and hundreds of attributes, each with ~50 values
(continuous, but discretized based on quartiles).  After calculating
information gain for each attribute, I sort the attributes by information
gain in descending order.  My goal is not to generate a tree, but rather to
perform instance-based learning using the cumulative list of attributes
selected.
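
For readers who want something concrete, the ranking step described above might be sketched as follows. This is only an illustration under stated assumptions: the function names (entropy, quartile_bins, information_gain, rank_by_gain) are hypothetical, the data are synthetic, and the quartile binning shown here (three cut points, four bins) is one plausible reading of "discretized based on quartiles", not necessarily the procedure actually used.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def quartile_bins(x):
    """Map continuous values to bins 0..3 using the sample quartiles (assumed scheme)."""
    cuts = np.quantile(x, [0.25, 0.5, 0.75])
    return np.searchsorted(cuts, x, side="right")

def information_gain(x_binned, y):
    """H(y) minus the size-weighted entropy of y inside each bin."""
    gain = entropy(y)
    for b in np.unique(x_binned):
        mask = x_binned == b
        gain -= mask.mean() * entropy(y[mask])
    return gain

def rank_by_gain(X, y):
    """Return attribute indices sorted by descending gain, plus the gains."""
    gains = np.array([information_gain(quartile_bins(X[:, j]), y)
                      for j in range(X.shape[1])])
    return np.argsort(-gains), gains

if __name__ == "__main__":
    # Synthetic example: 60 objects, 3 classes, 5 attributes (all made up).
    rng = np.random.default_rng(0)
    y = np.repeat([0, 1, 2], 20)
    X = rng.normal(size=(60, 5))
    X[:, 2] += y  # make attribute 2 informative about the class
    order, gains = rank_by_gain(X, y)
    print("attributes by descending gain:", order)
    print("gains (bits):", np.round(gains[order], 3))
```

The cumulative attribute list at step k would then be order[:k].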

Something I have not seen in the literature is what to do if a majority of
the attributes with the greatest information gain are pure (low impurity)
for only one particular class.  Given this problem, is there a commonly used
method for weighting or selecting attributes which are the purest for each
class?  (I have tried selecting attributes with the greatest gain for each
class, looping through the 3 classes each time I select the next best
attribute, and that seemed to work better than just selecting attributes
with the greatest overall gain.)  Do I need to "prune" unwanted attributes?
If so, are there any papers that describe methods and criteria for pruning
unwanted attributes in instance-based learning?
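
The per-class selection mentioned in the parenthetical above could look roughly like the sketch below. The post does not say how the per-class gain was computed, so this assumes a one-vs-rest gain (class c against all other classes) and a simple round-robin over the classes; the names gain and round_robin_select are hypothetical, and the attributes are assumed to be already discretized.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain(x_binned, y):
    """Information gain of a discretized attribute with respect to labels y."""
    g = entropy(y)
    for b in np.unique(x_binned):
        mask = x_binned == b
        g -= mask.mean() * entropy(y[mask])
    return g

def round_robin_select(X_binned, y, n_select):
    """Cycle through the classes, each time taking that class's best unused attribute.

    Assumption: "gain for a class" means the gain for the binary target (y == c).
    """
    n_attr = X_binned.shape[1]
    rankings = []
    for c in np.unique(y):
        g = np.array([gain(X_binned[:, j], y == c) for j in range(n_attr)])
        rankings.append(list(np.argsort(-g)))  # best attribute first
    selected = []
    n_select = min(n_select, n_attr)
    while len(selected) < n_select:
        for ranking in rankings:
            best_unused = next(j for j in ranking if j not in selected)
            selected.append(int(best_unused))
            if len(selected) == n_select:
                break
    return selected

if __name__ == "__main__":
    # Toy data: column 2 marks class 0, column 3 marks class 1,
    # column 0 marks class 2, and column 1 is constant (uninformative).
    y = np.repeat([0, 1, 2], 4)
    X_binned = np.column_stack([y == 2, np.zeros(12), y == 0, y == 1]).astype(int)
    print(round_robin_select(X_binned, y, 3))
```

Whether this beats a single overall ranking depends on the data; the sketch only makes the looping order explicit.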

My last question: because my goal is not really to build a hierarchical
tree, each time I select an attribute I use the accumulated attribute data
and loop through all of the training objects in order to assign each object
to its predicted class.  Each time I add an attribute, a confusion matrix is
generated for the classification of all the objects, from which I obtain
accuracy.  So I get a confusion matrix for the cumulative list of attributes
at each step.  In this scenario, when
should I stop selecting attributes?  Recall that I am not building a tree
for which I can assess purity in each node, but rather picking off
attributes to train and generate a confusion matrix in instance-based