Discussion:
Weighted and Balanced Random Forests
Manish Amde
2013-02-07 23:44:01 UTC
Permalink
Fellow sklearners,

I am working on a classification problem with an unbalanced data set and
have been successful using SVM classifiers with the class_weight option.

I have also tried Random Forests and am getting decent ROC performance,
but I am hoping to improve it by using Weighted or Balanced Random
Forests, as suggested in this paper:
http://www.stat.berkeley.edu/tech-reports/666.pdf

I don't see any implementation of these options, but I might be mistaken,
so I wanted to ask the community. Also, I am willing to write code and
contribute it back if it would be useful to other folks.

I have also thought about balancing the data by up/down-sampling the
minority/majority class (with or without replacement), and even SMOTE, but
I couldn't find those implementations in the scikit-learn library yet.
According to the paper, the modified Random Forests outperform these
methods, so I am interested in trying those first.
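(For what it's worth, down-sampling the majority class is easy to do by hand with numpy while no resampler ships with scikit-learn; a minimal sketch, where the helper name is just for illustration:)

```python
import numpy as np

def downsample_majority(X, y, random_state=0):
    """Randomly down-sample every class to the minority class size."""
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # Draw n_min indices without replacement from each class.
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Example: 90 negatives vs. 10 positives -> 10 of each after sampling.
X = np.random.RandomState(1).randn(100, 5)
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = downsample_majority(X, y)
```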

-Manish
Gilles Louppe
2013-02-08 07:33:44 UTC
Permalink
Hello,

You might achieve what you want by passing sample weights when fitting
your forest (see the 'sample_weight' parameter of the fit method). There is
also a 'balance_weights' function in the preprocessing module that
generates such sample weights for you, so that the classes become balanced.

https://github.com/glouppe/scikit-learn/blob/master/sklearn/preprocessing.py#L1221

(This should appear in the API reference; I'll fix that.)

Hope this helps,

Gilles
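(To illustrate the suggestion, here is a minimal sketch. Since the exact location of 'balance_weights' has moved across versions, the balanced weights are computed by hand here; each class ends up contributing the same total weight.)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy unbalanced problem: 90 negatives, 10 positives.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.array([0] * 90 + [1] * 10)

# Balanced sample weights: weight each sample inversely to its class
# frequency, so every class contributes the same total weight.
counts = np.bincount(y)
sample_weight = len(y) / (len(counts) * counts[y].astype(float))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
proba = clf.predict_proba(X)  # shape (n_samples, n_classes)
```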
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Manish Amde
2013-02-08 07:44:17 UTC
Permalink
Thanks Gilles. This definitely helps. I am glad I asked. :-)

-Manish
Jeff Elmore
2013-02-08 15:46:03 UTC
Permalink
I've been wrestling with this same issue in the regression case.

I realize it's not as straightforward to balance a continuous target as it
is discrete output classes.

But I wonder if this list has any thoughts about how it might be approached.

The target I'm predicting is normally distributed, and particularly when
sample sizes are small, the tails tend to be neglected and poorly predicted.

Thoughts?
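(One possible approach, offered as a rough sketch rather than an established recipe: estimate the density of the target with a histogram and weight each sample by the inverse density of its bin, so tail values get more influence during fitting.)

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.randn(500, 4)
y = X[:, 0] + 0.1 * rng.randn(500)  # roughly normally distributed target

# Histogram density estimate of y; every sample falls in a bin with
# count >= 1 (itself), so the inverse is always well defined.
hist, edges = np.histogram(y, bins=20)
bin_idx = np.digitize(y, edges[1:-1])       # bin index in 0..19
density = hist[bin_idx].astype(float) / len(y)
sample_weight = 1.0 / density
sample_weight *= len(y) / sample_weight.sum()  # rescale to mean 1

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y, sample_weight=sample_weight)
```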
Manish Amde
2013-03-20 06:33:54 UTC
Permalink
I have a follow-up question regarding the use of sample_weight when
fitting a RandomForestClassifier. Does the predict_proba method take the
sample weights (used during fitting) into account as well? I spent some
time trying to understand the _tree.pyx and tree.py files in the codebase,
but I am still a little fuzzy about how predict_proba behaves when sample
weights are present.

I have an unbalanced data set (1:12 ratio) and I find that the
probabilities are highly skewed towards the majority class even after using
sample weights.

I am planning to use isotonic regression to calibrate my predictions, but
it would be nice to have a less skewed input to the calibration algorithm.
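(For reference, calibrating the forest's scores with isotonic regression on a held-out split can be sketched like this; the module paths are from current scikit-learn and may differ in older versions.)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

# Synthetic problem with roughly 1:12 class imbalance.
rng = np.random.RandomState(0)
X = rng.randn(1300, 5)
y = (X[:, 0] + rng.randn(1300) > 1.8).astype(int)

X_fit, X_cal, y_fit, y_cal = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_fit, y_fit)

# Fit a monotone mapping from raw scores to observed frequencies on
# held-out data, then apply it to scores for new samples.
raw = clf.predict_proba(X_cal)[:, 1]
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(raw, y_cal)
calibrated = iso.predict(raw)
```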