Discussion:
[Scikit-learn-general] Random forest low score on testing data
muhammad waseem
2016-02-05 16:00:32 UTC
Dear All,
I am trying to train my model using scikit-learn's random forest (regression) and have tried using GridSearchCV with 5-fold cross-validation (cv=5) to tune the hyperparameters. I fixed n_estimators=2000 for all cases. Below are the searches I performed.

1) max_features: [1, 3, 5], max_depth: [1, 5, 10, 15], min_samples_split: [2, 6, 8, 10], bootstrap: [True, False]
The best were max_features=5, max_depth=15, min_samples_split=10, bootstrap=True
Best score = 0.8724

Then I searched around the best parameters from the previous run:
2) max_features: [3, 5, 6], max_depth: [10, 20, 30, 40], min_samples_split: [8, 16, 20, 24], bootstrap: [True, False]
The best were max_features=5, max_depth=30, min_samples_split=20, bootstrap=True
Best score = 0.8722

Again, I searched around the best parameters:
3) max_features: [2, 4, 6], max_depth: [25, 35, 40, 50], min_samples_split: [22, 28, 34, 40], bootstrap: [True, False]
The best were max_features=4, max_depth=25, min_samples_split=22, bootstrap=True
Best score = 0.8725

Then I ran a grid search over the best parameters found in the above runs and found the best combination to be max_features=4, max_depth=15, min_samples_split=10.
Best score = 0.8729

Then I used these parameters to predict on an unseen dataset, but got a much lower score (around 0.72).

My questions are:
1) Am I doing the hyperparameter tuning correctly, or am I missing something?

2) Why is my testing score so low compared to my training and validation scores, and how can I improve it so that I get good predictions out of my model?

Sorry if these are basic questions; I am new to scikit-learn and ML.

Thanks!
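
For reference, a minimal sketch of the kind of search described above, written against the current sklearn.model_selection API (at the time of this thread the same classes lived in sklearn.grid_search and sklearn.cross_validation). The feature matrix X, target y, and the split sizes are assumed to exist already and are purely illustrative.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Illustrative only: assumes X (features) and y (target) are already loaded.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    "max_features": [1, 3, 5],
    "max_depth": [1, 5, 10, 15],
    "min_samples_split": [2, 6, 8, 10],
    "bootstrap": [True, False],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=2000, random_state=0),
    param_grid,
    cv=5,       # 5-fold cross-validation, as in the post
    n_jobs=-1,  # use all cores; 2000 trees per fit is slow
)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)       # best cross-validation score
print(search.best_estimator_.score(X_test, y_test))  # R^2 on held-out data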
Luca Puggini
2016-02-05 16:13:51 UTC
Permalink
To me the score is not that low; the model is slightly overfitting. Try repeating the same process with extremely randomized trees instead of a random forest, and try to keep the depth low.
muhammad waseem
2016-02-05 16:27:21 UTC
Hi Luca,
Could you please explain how I can use these extremely randomized trees in scikit-learn? So are you suggesting that I should not be using random forest?
Luca Puggini
2016-02-05 17:00:07 UTC
Here are the extra trees:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor

They work similarly to random forests. In my experience, RF often tends to overfit.
I suggest you start with the default parameters and cross-validate only over the max_depth parameter. Start with small values of max_depth ([2, 3, 5, 7, 10]) and check how the performance of the model changes.

Good Luck.
Luca
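
A minimal sketch of this suggestion, assuming the same X_train and y_train as in the earlier sketch: only max_depth is cross-validated, and everything else is left at its default. The names and values are illustrative, not a recommended configuration.

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative sketch: assumes X_train and y_train are already defined.
search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    {"max_depth": [2, 3, 5, 7, 10]},  # start shallow, as suggested above
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)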
muhammad waseem
2016-02-05 17:13:21 UTC
Thanks Luca, I will give it a try. When you say extremely randomised, does this mean using a large number of n_estimators?

Also, any idea how to solve the overfitting problem for random forests?

Regards
Waseem
Luca Puggini
2016-02-05 20:46:23 UTC
The number of trees (n_estimators) should be as large as possible; it does not cause overfitting. In a random forest, overfitting is usually caused by the depth and by variables with many unique values. I'd suggest you start with extremely randomized trees and a low depth. If you want to use RF, you can try reducing the number of variables considered at each split (max_features).

Note that if you use the OOB estimate of the prediction error, it may be biased when the number of trees is large.

In addition, I'd suggest shuffling the data at the beginning if you can.
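
A rough sketch combining these suggestions for the RF case (shuffling the data up front, restricting max_features, keeping the depth low, and requesting the OOB estimate with the caveat above in mind). X and y are assumed to be loaded already, and the specific values are illustrative only.

from sklearn.ensemble import RandomForestRegressor
from sklearn.utils import shuffle

# Illustrative sketch: assumes X and y are already loaded.
X, y = shuffle(X, y, random_state=0)  # shuffle the data up front, as suggested

rf = RandomForestRegressor(
    n_estimators=2000,
    max_features=0.3,  # consider fewer variables at each split
    max_depth=5,       # keep the trees shallow
    oob_score=True,    # out-of-bag estimate of generalisation (see caveat above)
    random_state=0,
    n_jobs=-1,
)
rf.fit(X, y)
print(rf.oob_score_)  # OOB R^2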