Discussion:
[Scikit-learn-general] Some problems about gradient boosting module
Li Aodong
2016-04-27 12:46:22 UTC
Dear all,


Recently I have been using scikit-learn as my toolbox for traditional machine learning, and I am very impressed by how powerful it is. However, I have run into some problems with the gradient boosting module, and I hope you can help me figure them out.


First, it is about the subsample parameter description. As shown in the picture below, it says that "Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias". But I think choosing subsample < 1.0 actually increases the variance and decreases the bias.


[attached screenshot of the subsample parameter documentation]
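For concreteness, here is a minimal, made-up example of the setting in question (the data and values are only illustrative, not my actual code):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy data, just for illustration.
X, y = make_classification(n_samples=1000, random_state=0)

# subsample < 1.0 makes each tree fit on a random fraction of the rows;
# the docstring says this reduces variance and increases bias, which is
# the statement I am asking about.
clf = GradientBoostingClassifier(n_estimators=100, subsample=0.5, random_state=0)
clf.fit(X, y)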


Second, subsample < 1.0 is indeed a necessary condition for Stochastic Gradient Boosting (SGB), but SGB itself is not that simple. According to [1], SGB requires that "at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data". I have tried your implementation, and I think it generates the sample mask with replacement at each iteration.
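To make the distinction I mean concrete, here is a small numpy sketch (illustrative only, not scikit-learn's actual code):

import numpy as np

rng = np.random.RandomState(0)
n_samples, subsample = 10, 0.5
n_sub = int(n_samples * subsample)

# Without replacement, as in Friedman's SGB: each chosen row appears exactly once.
idx_without = rng.choice(n_samples, size=n_sub, replace=False)

# With replacement: the same row can be drawn more than once.
idx_with = rng.choice(n_samples, size=n_sub, replace=True)

print(sorted(idx_without))  # all indices distinct
print(sorted(idx_with))     # duplicates possible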


I really like scikit-learn and I want to use it for my research, so I hope to find out whether I am mistaken or whether these are real problems.


Best,

Aodong
Paolo Losi
2016-04-29 10:28:51 UTC
Hi!
Post by Li Aodong
First, it is about the subsample parameter description. As shown in the picture below, it says that "Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias". But I think choosing subsample < 1.0 actually increases the variance and decreases the bias.
Subsampling in GBM can be seen as a form of bagging, and it does indeed reduce variance at the expense of increased bias.
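A rough way to see the variance reduction empirically (just a sketch with synthetic data, not a rigorous experiment):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.utils import resample

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
X_test = X[:50]

def prediction_spread(subsample, n_runs=10):
    # Refit on bootstrap resamples of the training set and measure how much
    # the test predictions move around: a crude proxy for model variance.
    preds = []
    for seed in range(n_runs):
        Xb, yb = resample(X, y, random_state=seed)
        est = GradientBoostingRegressor(subsample=subsample, random_state=0)
        preds.append(est.fit(Xb, yb).predict(X_test))
    return np.std(preds, axis=0).mean()

print("subsample=1.0:", prediction_spread(1.0))
print("subsample=0.5:", prediction_spread(0.5))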
Post by Li Aodong
Second, subsample < 1.0 is indeed a necessary condition for Stochastic Gradient Boosting (SGB), but SGB itself is not that simple. According to [1], SGB requires that "at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data". I have tried your implementation, and I think it generates the sample mask with replacement at each iteration.
I haven't checked the implementation, but I can reasonably state that sampling with or without replacement has only a minimal effect on the performance of the model.
See http://www.stat.washington.edu/wxs/Learning-papers/paper-bag.pdf

Paolo
