[Scikit-learn-general] Contributing to scikit-learn with a re-implementation of a Random Forest based iterative feature selection method

Discussion:

Daniel Homola

2015-04-15 09:03:35 UTC

Hi all,

I needed a multivariate feature selection method for my work. As I'm
working with biological/medical data, where n < p or even n << p, I
started to read up on Random Forest based methods, as in my limited
understanding RF copes pretty well with this suboptimal situation.

I came across an R package called Boruta: https://m2.icm.edu.pl/boruta/

After reading the paper and checking some of the pretty impressive
citations I thought I'd try it, but it was really slow. So I thought
I'd reimplement it in Python, because I hoped (based on this:
http://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn)
that it would be faster. And it is :) I mean a LOT faster.

I was wondering if this would be something that you would consider
incorporating into the feature selection module of scikit-learn?

If yes, do you have a tutorial or some sort of guidance about how I
should prepare the code, what conventions I should follow, etc.?

Cheers,

Daniel Homola

STRATiGRAD PhD Programme
Imperial College London

Andreas Mueller

2015-04-15 15:23:32 UTC

Hi Daniel.
That sounds potentially interesting.
Is there a widely cited paper for this?
I didn't read the paper, but it looks very similar to
RFE(RandomForestClassifier()).
Is it qualitatively different from that? Does it use a different feature
importance?
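
For reference, the existing approach mentioned here, recursive feature elimination wrapped around a random forest, can be sketched roughly as follows. The synthetic data and all parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data (an assumption for illustration): 5 informative features of 20.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# RFE refits the forest repeatedly and drops the features with the lowest
# feature_importances_ (here 5 per round) until n_features_to_select remain.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=5, step=5)
selector.fit(X, y)

selected = selector.support_   # boolean mask of the kept features
ranking = selector.ranking_    # 1 for kept features, larger = dropped earlier
print(selected.sum(), ranking)
```

Note that RFE has to be told how many features to keep, so it returns a subset of a fixed size rather than deciding feature by feature.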

btw: your mail is flagged as spam as your link is broken and links to
some Imperial College internal page.

Cheers,
Andy

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Daniel Homola

2015-04-15 15:32:16 UTC

Hi Andy,

This is the paper: http://www.jstatsoft.org/v36/i11/ which has been cited
79 times according to Google Scholar.

Regarding your second point, I guess the first 3 questions of the FAQ on
the Boruta website answer it: https://m2.icm.edu.pl/boruta/

1. *So, what's so special about Boruta?* It is an all relevant feature
selection method, while most other are minimal optimal; this means
it tries to find all features carrying information usable for
prediction, rather than finding a possibly compact subset of
features on which some classifier has a minimal error. Here is a
paper with the details.
2. *Why should I care?* For a start, when you try to understand the
phenomenon that made your data, you should care about all factors
that contribute to it, not just the bluntest signs of it in context
of your methodology (yes, minimal optimal set of features by
definition depends on your classifier choice).
3. *But I only care about good classification accuracy!* So you also
care about having a robust model; in p ≫ n problems, one can usually
cherry-pick a nonsense subset of features which yields good or even
perfect classification - minimal optimal methods can easily get
deceived by that, leaving you with an overfitted model and no sign
that something is wrong. See this or that for an example.

I'm not an ML expert by any means but it seemed reasonable to me. Any
thoughts?

Cheers,
Dan


Andreas Mueller

2015-04-15 15:56:56 UTC

Hi Dan.
I saw that paper, but it is not well-cited.
My question is more how different this is from what we already have.
So it looks like some (5) random control features are added and the
feature importances are compared against the control.

The question is whether the feature importance that is used is different
from ours. Gilles?

If it is different, this could be hard to add. If it is the same, I think a
meta-estimator would be a nice addition to the feature selection module.

Cheers,
Andy


Daniel Homola

2015-04-15 16:13:49 UTC

Hi Andy,

So at each iteration the predictor matrix x (n by m) is copied and each
column of the copy is shuffled. This shuffled matrix is then appended to
the original (giving n by 2m) and fed into the RF to get the feature
importances.
Also, at the start of the method, a vector of length m called hitReg is
initialized with zeros.
After the RF training, each feature's importance in x is checked against
the maximum importance of the shuffled ones. Those that are higher are
recorded by incrementing their entry in hitReg.
At each iteration the method checks which features are doing better than
expected by random chance. So if we are in the 10th iteration, and
feature F was better than the max of the shuffled ones 8 times, we get
p = .01 with sp.stats.binom.sf(8, 10, .5). We correct for multiple
testing, and if the feature is still significant, we record it as a
"confirmed" or important one. Conversely, if feature F was only better
once (sp.stats.binom.cdf(1, 10, .5)), we reject it and delete it from
the x matrix. The method ends when all features are either rejected or
confirmed, or when the number of iterations reaches the user-set maximum.
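
The loop described above can be sketched as follows. This is a minimal illustration, not the code from the R package or from the Python reimplementation: the function name, the Bonferroni correction (the message only says "we correct for multiple testing"), the forest size and the synthetic data are all assumptions. One detail worth knowing: `binom.sf(k, n, p)` is P(X > k), so `sf(8, 10, .5)` is about 0.011 and corresponds to 9 or more hits, while P(X >= 8) is about 0.055. The sketch uses `sf(hits - 1, ...)` so that the observed count is included, which is a choice made here and not something stated in the thread.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_selection(X, y, max_iter=20, alpha=0.05, random_state=0):
    """Hypothetical helper sketching the shuffled-copy loop described above."""
    rng = np.random.RandomState(random_state)
    m = X.shape[1]
    hit_reg = np.zeros(m, dtype=int)   # hits per original feature ("hitReg")
    status = np.zeros(m, dtype=int)    # 0 undecided, 1 confirmed, -1 rejected

    for it in range(1, max_iter + 1):
        if not np.any(status == 0):
            break                                  # everything is decided
        active = np.flatnonzero(status >= 0)       # rejected columns are dropped
        X_act = X[:, active]
        # Copy the active columns and shuffle each one independently.
        X_shadow = np.column_stack([rng.permutation(col) for col in X_act.T])
        forest = RandomForestClassifier(n_estimators=50,
                                        random_state=rng.randint(2**31 - 1))
        forest.fit(np.hstack([X_act, X_shadow]), y)
        imp = forest.feature_importances_
        real, shadow = imp[:active.size], imp[active.size:]
        # A hit: the real feature beats the best shuffled feature.
        hit_reg[active[real > shadow.max()]] += 1

        undecided = np.flatnonzero(status == 0)
        hits = hit_reg[undecided]
        p_confirm = stats.binom.sf(hits - 1, it, 0.5)   # P(X >= hits)
        p_reject = stats.binom.cdf(hits, it, 0.5)       # P(X <= hits)
        # Bonferroni correction over the m features (an assumption).
        status[undecided[p_confirm * m < alpha]] = 1
        status[undecided[p_reject * m < alpha]] = -1
    return status, hit_reg


# Synthetic data (an assumption): the first 3 columns are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
status, hit_reg = shadow_feature_selection(X, y)
print(status, hit_reg)
```

Features left at 0 when `max_iter` is reached are simply undecided; raising `max_iter` gives the test more trials to work with.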

Cheers,
Dan


Satrajit Ghosh

2015-04-15 16:16:02 UTC

hi andy and dan,

i've been using a similar heuristic with extra trees quite effectively. i
have to look at the details of this R package and the paper, but in my case
i add a feature that has very low correlation with my target class/value
(depending on the problem) and choose features that have a higher feature
importance than this feature. quite simple to implement with a few lines of
code using extra trees. but stochastic in nature given how my "control"
feature is generated (at present simply randn).
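
A rough sketch of the heuristic described above, assuming a single randn control column appended to the data. The synthetic data, the forest size and the variable names are assumptions for illustration, not the code referred to in the message.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (an assumption): the first 3 columns are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

rng = np.random.RandomState(0)
control = rng.randn(X.shape[0], 1)     # random "control" feature
X_aug = np.hstack([X, control])        # control is the last column

forest = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
importances = forest.feature_importances_

# Keep the original features whose importance exceeds the control's.
keep = importances[:-1] > importances[-1]
X_selected = X[:, keep]
print(keep)
```

Because the control column is random, the mask can change from one seed to the next, which is the stochastic behaviour mentioned above.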

since there are potential variations one can add to this idea, i never
thought of it as a standalone feature transformer, but it could easily be
implemented as one. i thought the variations might be good as a contrib
package rather than a new feature selection module.

cheers,

satra


Gilles Louppe

2015-04-17 09:45:16 UTC

Hi,

In general, I agree that we should at least add a way to compute feature
importances using permutations. This is an alternative, yet standard, way
to do it in comparison to what we do (mean decrease of impurity, which is
also standard).

Assuming we provide permutation importances as a building block, it wouldn't
be difficult for users to add contrast features and rank their features
against those, thereby implementing the algorithms you describe. What do
you think?
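
To make the building-block idea concrete, here is a hand-rolled sketch of permutation importance ranked against one contrast feature. The helper `permutation_importances` is a hypothetical name written for this example, not a scikit-learn function discussed in this thread; the data, the held-out split and the number of repeats are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def permutation_importances(model, X, y, n_repeats=5, random_state=0):
    """Mean drop in accuracy when one column at a time is shuffled."""
    rng = np.random.RandomState(random_state)
    baseline = model.score(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_perm = X.copy()                  # fresh copy: only column j changes
        for _ in range(n_repeats):
            X_perm[:, j] = rng.permutation(X[:, j])
            drops[j] += baseline - model.score(X_perm, y)
    return drops / n_repeats


X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
rng = np.random.RandomState(0)
# Contrast feature: a shuffled copy of column 0, appended as the last column.
X = np.hstack([X, rng.permutation(X[:, 0])[:, None]])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

perm_imp = permutation_importances(model, X_te, y_te)
above_contrast = perm_imp[:-1] > perm_imp[-1]   # rank against the contrast
print(perm_imp, above_contrast)
```

The importances are measured on held-out data here, which is one common choice; measuring them on out-of-bag samples would be another.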

Post by Daniel Homola
this means it tries to find all features carrying information usable for
prediction, rather than finding a possibly compact subset of features on
which some classifier has a minimal error. Here is a paper with the details.

Yes! This is a very good point. And in fact, this can be achieved in
scikit-learn by using totally randomized trees instead
(ExtraTreesClassifier(max_features=1)).
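
A short sketch of that setting, on synthetic data chosen only for illustration (the data and the number of trees are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (an assumption): 3 informative and 2 redundant features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=2, shuffle=False, random_state=0)

# max_features=1: each split looks at a single randomly drawn feature with
# a random threshold, so the trees are totally randomized.
forest = ExtraTreesClassifier(n_estimators=300, max_features=1, random_state=0)
forest.fit(X, y)

# Mean-decrease-of-impurity importances, highest first.
importances = forest.feature_importances_
order = importances.argsort()[::-1]
print(order, importances[order])
```

Since the split variable is not picked by comparing candidates, a strong feature cannot crowd out a correlated one during tree growth, which fits the all-relevant goal quoted above.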

Best,
Gilles


Daniel Homola

2015-05-08 18:34:47 UTC

Hi all,

I wrote a couple of weeks ago about implementing the Boruta all-relevant
feature selection algorithm in Python.

I think it's ready to go now. I wrote fit, transform and fit_transform
methods for it to make it sklearn-like.
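
For readers unfamiliar with the convention, an sklearn-like selector has roughly the following shape. This is a deliberately simplified, single-pass sketch and not the code in the repository linked below: the class name and its parameters are hypothetical, and `fit_transform` is inherited from `TransformerMixin` here rather than written by hand.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


class ShadowThresholdSelector(BaseEstimator, TransformerMixin):
    """Hypothetical selector: keep features that beat the best shuffled copy."""

    def __init__(self, n_estimators=100, random_state=0):
        # Constructor arguments are stored unchanged, as sklearn expects.
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        X = np.asarray(X)
        rng = np.random.RandomState(self.random_state)
        shadow = np.column_stack([rng.permutation(col) for col in X.T])
        forest = RandomForestClassifier(n_estimators=self.n_estimators,
                                        random_state=self.random_state)
        forest.fit(np.hstack([X, shadow]), y)
        imp = forest.feature_importances_
        m = X.shape[1]
        # Learned attributes end with an underscore by convention.
        self.support_ = imp[:m] > imp[m:].max()
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]


# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
selector = ShadowThresholdSelector()
X_sel = selector.fit_transform(X, y)   # provided by TransformerMixin
print(selector.support_, X_sel.shape)
```

Following this interface is what lets a selector be dropped into a Pipeline next to the existing feature selection classes.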

Here it is:
https://bitbucket.org/danielhomola/boruta_py

Let me know what you think. If anyone thinks this might be worth
adding to the feature selection module, the original author Miron is
happy to give his blessing, and I'm happy to work on it further.

Cheers,
Daniel

Andreas Mueller

2015-05-08 19:01:23 UTC

Permalink

Hi Daniel.
That looks cool.
Can you do a github pull request?
See the contributor docs:
http://scikit-learn.org/dev/developers/index.html

Thanks,
Andy

Andreas Mueller

2015-05-08 19:02:21 UTC

Permalink

Btw, an example that compares this against existing feature selection
methods and explains the differences and advantages would help users and
convince us to merge ;)

Daniel Homola

2015-05-08 19:15:26 UTC

Permalink

Hi Andy,

Thanks! I will definitely do a GitHub pull request once Miron has confirmed
that he has benchmarked my implementation by running it on the datasets the
method was published with.

I wrote a blog post about it, which explains the differences, but in a
quite casual and non-rigorous way:
http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/

I guess a more technical write-up using one of the built-in datasets
would be more useful for the sklearn audience. I'm happy to do it if
Miron says everything looks good.

Cheers,
Daniel

Andreas Mueller

2015-05-08 19:22:48 UTC

Permalink

It doesn't need to be super technical, and we try to keep the user guide
"easy to understand". No bonus points for unnecessary LaTeX ;)
The example should be as illustrative and fair as possible, and built-in
datasets are preferred. It shouldn't be too heavyweight, though.
If you like, you can show off some plots in the PR, that is always very
welcome.
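
As a rough idea of the kind of lightweight comparison on a built-in dataset described above, the sketch below scores three selectors that already exist in scikit-learn. Boruta itself is not included, since it is not part of scikit-learn; the dataset, the number of features kept and the forest size are arbitrary assumptions made only to keep the example small.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)


def forest():
    # A small forest keeps the example quick; the size is an arbitrary choice.
    return RandomForestClassifier(n_estimators=50, random_state=0)


selectors = {
    "SelectKBest(f_classif)": SelectKBest(f_classif, k=10),
    "RFE(RandomForest)": RFE(forest(), n_features_to_select=10, step=5),
    "SelectFromModel(RandomForest)": SelectFromModel(forest(), threshold="median"),
}

scores = {}
for name, selector in selectors.items():
    # The selector sits inside the pipeline, so every CV fold selects
    # features on its own training split only (no leakage from test data).
    pipeline = make_pipeline(selector, forest())
    scores[name] = cross_val_score(pipeline, X, y, cv=3).mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```

Using the same downstream classifier and the same folds for every selector is one way to keep the comparison fair; a new method could be added as one more entry in the dictionary.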

Daniel Homola

2016-01-31 14:02:37 UTC

Permalink

Dear all,

I migrated my Python implementation of the Boruta algorithm to:
https://github.com/danielhomola/boruta_py

I also implemented three mutual information based feature selection methods
(JMI, JMIM, MRMR) and wrapped them in a scikit-learn-like interface:
https://github.com/danielhomola/mifs
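
For readers unfamiliar with these methods, here is a minimal sketch of the greedy idea behind MRMR (maximum relevance, minimum redundancy). It is not the mifs implementation: the function name mrmr_sketch is hypothetical, and it relies on scikit-learn's nearest-neighbour mutual information estimators, which is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


def mrmr_sketch(X, y, k, random_state=0):
    """Greedily pick k features: high MI with y, low mean MI with features already picked."""
    n_features = X.shape[1]
    # Relevance: estimated mutual information between each feature and the class label.
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(n_features)
    while len(selected) < k:
        last = selected[-1]
        # Redundancy: MI between every feature and the most recently selected one,
        # accumulated so that dividing by len(selected) gives the mean.
        redundancy_sum += mutual_info_regression(
            X, X[:, last], random_state=random_state
        )
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never pick the same feature twice
        selected.append(int(np.argmax(score)))
    return selected


X, y = load_wine(return_X_y=True)
chosen = mrmr_sketch(X, y, k=5)
print(chosen)
```

JMI and JMIM use different scoring criteria based on joint mutual information, so this sketch covers only the MRMR-style trade-off between relevance and redundancy.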

Could you please have a look at it? I'm writing a blog post
demonstrating their strengths against existing methods. Would you
require anything else to possibly include these in the next release?

Thanks a lot,
Daniel
