muhammad waseem
2016-02-05 16:00:32 UTC
Dear All,
I am trying to train a model using scikit-learn's random forest regressor
and have used grid search with cross-validation (cv=5) to tune the
hyperparameters. I fixed n_estimators=2000 in all cases. Below are the
searches that I performed.
1) max_features: [1, 3, 5], max_depth: [1, 5, 10, 15],
min_samples_split: [2, 6, 8, 10], bootstrap: [True, False]
The best were max_features=5, max_depth=15, min_samples_split=10,
bootstrap=True.
Best score = 0.8724
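
For reference, this is roughly how I set up that first search. It is a
minimal sketch: the make_regression data is just a stand-in for my actual
training set, and I rely on the default scoring for a regressor (R^2).

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in data; in my case X_train/y_train come from my own dataset.
X_train, y_train = make_regression(n_samples=500, n_features=6,
                                   noise=10.0, random_state=0)

# First grid: n_estimators fixed at 2000, 5-fold cross-validation.
param_grid = {
    "max_features": [1, 3, 5],
    "max_depth": [1, 5, 10, 15],
    "min_samples_split": [2, 6, 8, 10],
    "bootstrap": [True, False],
}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=2000),
    param_grid,
    cv=5,
    n_jobs=-1,  # 96 candidates x 5 folds, so parallelize
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)  # mean cross-validated R^2
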
Then I searched close to the best parameters:
2) max_features: [3, 5, 6], max_depth: [10, 20, 30, 40],
min_samples_split: [8, 16, 20, 24], bootstrap: [True, False]
The best were max_features=5, max_depth=30, min_samples_split=20,
bootstrap=True.
Best score = 0.8722
Again, I searched close to the best parameters:
3) max_features: [2, 4, 6], max_depth: [25, 35, 40, 50],
min_samples_split: [22, 28, 34, 40], bootstrap: [True, False]
The best were max_features=4, max_depth=25, min_samples_split=22,
bootstrap=True.
Best score = 0.8725
Then I ran one more grid search over just the best parameters found in the
above runs, and the overall best were max_features=4, max_depth=15,
min_samples_split=10.
Best score = 0.8729
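
In code, that final pass was just one more small grid built from the
winning values of the three runs. The exact candidate lists below are my
reconstruction; it uses the same stand-in data as the sketch above.

# Final grid over only the best values from the three searches above.
final_grid = {
    "max_features": [4, 5],
    "max_depth": [15, 25, 30],
    "min_samples_split": [10, 20, 22],
    "bootstrap": [True],  # all three runs preferred bootstrap=True
}
final_search = GridSearchCV(
    RandomForestRegressor(n_estimators=2000),
    final_grid,
    cv=5,
    n_jobs=-1,
)
final_search.fit(X_train, y_train)
print(final_search.best_params_)
print(final_search.best_score_)  # ~0.8729 on my data
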
Then I used these parameters to predict on an unseen dataset, but got a
much lower score (around 0.72).
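
Concretely, the final evaluation looked something like this (again a
sketch; X_test and y_test stand in for the unseen dataset and its targets):

# Refit with the best parameters on all the training data,
# then score on data the search never saw.
best_model = RandomForestRegressor(
    n_estimators=2000,
    max_features=4,
    max_depth=15,
    min_samples_split=10,
)
best_model.fit(X_train, y_train)
print(best_model.score(X_test, y_test))  # R^2; around 0.72 in my case
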
My questions are:
1) Am I doing the hyperparameter tuning correctly, or am I missing
something?
2) Why is my test score so much lower than my training and validation
scores, and how can I improve it so that my model makes good predictions?
Sorry if these are basic questions; I am new to scikit-learn and ML.
Thanks!