Li Aodong

2016-04-27 12:46:22 UTC

Dear all,

Recently I have been using scikit-learn as my toolbox for traditional machine learning, and I am impressed by its strength. However, I have run into some questions about the gradient boosting module, and I hope you can help me figure them out.

First, it is about the description of the subsample parameter. As the picture below shows, it says that "Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias". But I think choosing subsample < 1.0 actually increases the variance and decreases the bias.

[inline image: screenshot of the subsample parameter description]
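For reference, this is the parameter under discussion; a minimal sketch of how it is set (the dataset and values here are illustrative, not from the documentation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# subsample < 1.0 switches on stochastic gradient boosting: each new
# tree is fit on a random fraction of the training rows rather than
# on the full training set.
clf = GradientBoostingClassifier(subsample=0.5, n_estimators=50,
                                 random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```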

Second, subsample < 1.0 is indeed a necessary condition for Stochastic Gradient Boosting (SGB), but SGB is not that simple. According to [1], SGB requires that "at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data". I have looked at your implementation, and I think you generate the sample mask with replacement at each iteration.
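To make the distinction concrete, here is a small NumPy sketch of the two sampling schemes (illustrative only, not scikit-learn's actual code): without replacement every drawn index is distinct, while with replacement the same row can appear twice in one iteration while another row is skipped.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, subsample = 10, 0.5
n_draw = int(n_samples * subsample)

# Without replacement, as Friedman's SGB specifies: all drawn
# indices are distinct.
idx_without = rng.choice(n_samples, size=n_draw, replace=False)
assert len(set(idx_without)) == n_draw

# With replacement: duplicates are possible, so the effective
# subsample a tree sees can contain repeated rows.
idx_with = rng.choice(n_samples, size=n_draw, replace=True)
print(sorted(idx_without), sorted(idx_with))
```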

I really like scikit-learn and want to use it for my research, so I hope to find out whether I am mistaken or whether there is a problem here.

Best,

Aodong
