I've been using a similar heuristic with extra trees quite effectively: I
add a random "control" feature and discard every feature with lower
importance than this feature. Quite simple to implement with a few lines of
code using extra trees, but stochastic in nature given how my "control"
feature is generated (at present simply randn).
A meta-estimator sounds like a good fit, though mine isn't currently
implemented as one. I thought the variations might be good as a contrib
package rather than a new feature selection module.
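A minimal sketch of what I mean (assuming scikit-learn's
ExtraTreesClassifier; the repeat count, majority-vote threshold, and the
function name are just illustrative):

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def control_feature_mask(X, y, n_repeats=10, random_state=0):
    # Keep features whose importance beats a random "control" column.
    # The result is stochastic, so votes are averaged over repeats.
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    votes = np.zeros(n_features)
    for i in range(n_repeats):
        control = rng.randn(n_samples, 1)        # the randn control feature
        est = ExtraTreesClassifier(n_estimators=200, random_state=i)
        est.fit(np.hstack([X, control]), y)
        imp = est.feature_importances_
        votes += imp[:n_features] > imp[-1]      # did the feature beat it?
    return votes / n_repeats >= 0.5              # majority-vote mask

X[:, control_feature_mask(X, y)] then gives the retained features.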
Post by Andreas Mueller:
Hi Dan.
I saw that paper, but it is not well-cited.
My question is more how different this is from what we already have.
So it looks like some (5) random control features are added and the
feature importances are compared against the control.
The question is whether the feature importance that is used is different
from ours. Gilles?
If not, this could be hard to add. If it is the same, I think a
meta-estimator would be a nice addition to the feature selection module.
Cheers,
Andy
Hi Andy,
This is the paper: http://www.jstatsoft.org/v36/i11/, which has been cited
79 times according to Google Scholar.
Regarding your second point, I guess the first three questions of the FAQ
on the Boruta website answer it: https://m2.icm.edu.pl/boruta/
1. *So, what's so special about Boruta?* It is an all relevant feature
selection method, while most others are minimal optimal; this means it tries
to find all features carrying information usable for prediction, rather
than finding a possibly compact subset of features on which some classifier
has a minimal error. Here is a paper with the details.
2. *Why should I care?* For a start, when you try to understand the
phenomenon that made your data, you should care about all factors that
contribute to it, not just the bluntest signs of it in the context of your
methodology (yes, the minimal optimal set of features by definition depends on
your classifier choice).
3. *But I only care about good classification accuracy!* So you also
care about having a robust model; in p >> n problems, one can usually
cherry-pick a nonsense subset of features which yields good or even perfect
classification; minimal optimal methods can easily get deceived by that,
leaving you with an overfitted model and no sign that something is wrong.
See this or that for an example.
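To make point 1 more concrete, here is a rough sketch of the shadow-feature
idea at Boruta's core (not the full algorithm, which also runs iterative
statistical tests on the hit counts; the fixed cutoff and names below are
just illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shadow_feature_mask(X, y, n_iter=20, random_state=0):
    # Compare each real feature against "shadow" copies: the same
    # columns with values permuted, which destroys any link to y.
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    hits = np.zeros(n_features)
    for i in range(n_iter):
        shadows = np.apply_along_axis(rng.permutation, 0, X)
        rf = RandomForestClassifier(n_estimators=200, random_state=i)
        rf.fit(np.hstack([X, shadows]), y)
        imp = rf.feature_importances_
        # a "hit": the feature outranks the best of all shadows
        hits += imp[:n_features] > imp[n_features:].max()
    # real Boruta tests these hit counts statistically; a fixed
    # cutoff is a crude stand-in here
    return hits / n_iter > 0.5

Because every feature that reliably beats the shadows is kept, not just a
compact subset, the result is "all relevant" rather than minimal optimal.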
I'm not an ML expert by any means but it seemed reasonable to me. Any
thoughts?
Cheers,
Dan
Hi Daniel.
That sounds potentially interesting.
Is there a widely cited paper for this?
I didn't read the paper, but it looks very similar to
RFE(RandomForestClassifier()).
Is it qualitatively different from that? Does it use a different feature
importance?
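(For concreteness, the comparison I have in mind; a rough sketch with
arbitrary parameters:)

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=0)
# RFE repeatedly refits the forest and drops the least important
# features until a preset number remains: a "minimal optimal" strategy.
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=10)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the retained features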
By the way, your mail got flagged as spam: your link is broken and points
to an Imperial College internal page.
Cheers,
Andy
Hi all,
I needed a multivariate feature selection method for my work. As I'm
working with biological/medical data, where n < p or even n << p, I started
to read up on Random Forest based methods, as in my limited understanding
RF copes pretty well with this suboptimal situation.
I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/
After reading the paper and checking some of the pretty impressive
citations, I thought I'd try it, but it was really slow. So I thought I'd
reimplement it in Python, because I hoped (based on this
http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn)
that it would be faster. And it is :) I mean a LOT faster.
I was wondering if this would be something that you would consider
incorporating into the feature selection module of scikit-learn?
If yes, do you have a tutorial or some sort of guidance about how I should
prepare the code, what conventions I should follow, etc.?
Cheers,
Daniel Homola
STRATiGRAD PhD Programme
Imperial College London