Discussion:
[Scikit-learn-general] gridsearchCV - overfitting
A neuman
2016-05-12 09:53:13 UTC
Hello everyone,

I'm having a bit of trouble with the parameters that I've got from
GridSearchCV.


For example:

If I use the parameters that I got from GridSearchCV, for example on
RF or k-NN, and test the model on the train set, I get an AUC value of
about 1.00 or 0.99 every time.
The dataset has 1200 samples.

Does that mean that I can't use the parameters that I got from
GridSearchCV? It was like this in practically every case. I already tried
nested CV to compare the algorithms.
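For the nested CV I did roughly the following (the grid and fold counts are
just placeholders; X and y are the full dataset):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# inner loop: grid search picks parameters by 10-fold cross-validated AUC
inner = GridSearchCV(RandomForestClassifier(random_state=1),
                     {'min_samples_leaf': [1, 2, 5]},
                     scoring='roc_auc', cv=10)

# outer loop: each outer fold scores a grid search run on the remaining data
nested_scores = cross_val_score(inner, X, y, scoring='roc_auc', cv=5)
print(nested_scores.mean())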


Example for RF with the values I got from GridSearchCV (10-fold):

RandomForestClassifier(n_estimators=200, oob_score=True, max_features=None,
                       random_state=1, min_samples_leaf=2,
                       class_weight='balanced_subsample')


Then I'm just using clf.predict(X_train) and scoring it against y_train.

The AUC value from clf.predict(X_test) is about 0.73, so there is a
big difference between the train and test datasets.
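
Roughly, this is what I'm doing (the split is just for illustration; X and y
are the 1200-sample dataset):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

clf = RandomForestClassifier(n_estimators=200, oob_score=True,
                             max_features=None, random_state=1,
                             min_samples_leaf=2,
                             class_weight='balanced_subsample')
clf.fit(X_train, y_train)

print(roc_auc_score(y_train, clf.predict(X_train)))  # ~1.00
print(roc_auc_score(y_test, clf.predict(X_test)))    # ~0.73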

best,
Joel Nothman
2016-05-12 10:53:17 UTC
This would be much clearer if you provided some code, but I think I get
what you're saying.

The final GridSearchCV model is trained on the full training set, so the
fact that it perfectly fits that data with random forests is not altogether
surprising. What you can say about the parameters is that they are also the
best parameters (among those searched) for the RF classifier to predict the
held-out samples under cross-validation.
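
As a rough sketch of what GridSearchCV is doing here (the parameter grid is
only an example):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'min_samples_leaf': [1, 2, 5],
              'max_features': [None, 'sqrt']}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=1),
    param_grid, scoring='roc_auc', cv=10)
search.fit(X_train, y_train)

# best_params_ maximise the mean cross-validated AUC over the held-out folds
print(search.best_params_, search.best_score_)

# but the final model (best_estimator_) is then refit on all of X_train, so
# scoring it on X_train again looks near-perfect for a random forest
print(search.score(X_train, y_train))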
Post by A neuman
Hello everyone,
I'm having a bit of trouble with the parameters that I've got from
GridSearchCV.
If I use the parameters that I got from GridSearchCV, for example on
RF or k-NN, and test the model on the train set, I get an AUC value of
about 1.00 or 0.99 every time.
The dataset has 1200 samples.
Does that mean that I can't use the parameters that I got from
GridSearchCV? It was like this in practically every case. I already tried
nested CV to compare the algorithms.
RandomForestClassifier(n_estimators=200, oob_score=True, max_features=None,
                       random_state=1, min_samples_leaf=2,
                       class_weight='balanced_subsample')
Then I'm just using clf.predict(X_train) and scoring it against y_train.
The AUC value from clf.predict(X_test) is about 0.73, so there is a
big difference between the train and test datasets.
best,
A neuman
2016-05-12 11:02:24 UTC
Thanks for the answer!

But how should I check whether it's overfitted or not?

best,
Olivier Grisel
2016-05-12 11:45:01 UTC
Post by A neuman
Thanks for the answer!
But how should I check whether it's overfitted or not?
Do a development / evaluation split of your dataset, for instance with
the train_test_split utility first. Then train your GridSearchCV model
on the development set and evaluate it both on the development set and
on the evaluation set. If the difference is large, it means that you
are overfitting.
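
Something along these lines (the grid is only a placeholder):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

# development / evaluation split
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    {'min_samples_leaf': [1, 2, 5]}, scoring='roc_auc', cv=10)
search.fit(X_dev, y_dev)

# compare the score on the data used for the search with the score on data
# the search never saw
print(roc_auc_score(y_dev, search.predict_proba(X_dev)[:, 1]))
print(roc_auc_score(y_eval, search.predict_proba(X_eval)[:, 1]))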
--
Olivier
A neuman
2016-05-12 12:07:46 UTC
That's actually what I did.

And the difference is way too big.

Should I do it without GridSearchCV? I'm just wondering why grid search is
giving me overfitted values. I know that these are the best params and so
on... but I thought I could skip the manual part where I test the params on
my own. GridSearchCV gives me just one set of params; if they are
overfitting, can't I use GridSearchCV at all? I'm just having trouble
understanding this.
Post by Olivier Grisel
Post by A neuman
Thanks for the answer!
but how should i check that its overfitted or not?
Do a development / evaluation split of your dataset, for instance with
the train_test_split utility first. Then train your GridSearchCV model
on the development set and evaluate it both on the development set and
on the evaluation set. If the difference is large, it means that you
are overfitting.
--
Olivier
Andreas Mueller
2016-05-12 16:15:04 UTC
How did you evaluate on the development set?
You should use "best_score_", not grid_search.score.
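With the search object from the earlier sketch, the difference is roughly:

# best_score_ is the mean cross-validated score of the best parameters,
# measured on the held-out folds of the development set
print(search.best_score_)

# search.score(X_dev, y_dev) evaluates the refit best_estimator_ on the same
# data it was trained on, so for a random forest it is close to 1.0 and says
# nothing about generalization
print(search.score(X_dev, y_dev))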
Post by A neuman
That's actually what I did, and the difference is way too big.
Should I do it without GridSearchCV? I'm just wondering why grid search
is giving me overfitted values. I know that these are the best params
and so on... but I thought I could skip the manual part where I test
the params on my own. GridSearchCV gives me just one set of params; if
they are overfitting, can't I use GridSearchCV at all? I'm just having
trouble understanding this.
Post by A neuman
Thanks for the answer!
But how should I check whether it's overfitted or not?
Do a development / evaluation split of your dataset, for instance with
the train_test_split utility first. Then train your GridSearchCV model
on the development set and evaluate it both on the development set and
on the evaluation set. If the difference is large, it means that you
are overfitting.
--
Olivier
Josh Vredevoogd
2016-05-12 17:03:20 UTC
Another point of confusion:
You shouldn't be using clf.predict() to calculate ROC AUC; you need
clf.predict_proba(). ROC AUC is a ranking measure, and predict only gives
you the predicted class, not a probability, so the ROC "curve" can only
have points at 0 and 1 instead of at every probability threshold in between.
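
Roughly, reusing clf and the test split from the first message:

from sklearn.metrics import roc_auc_score

# probability of the positive class, not the hard 0/1 label
proba = clf.predict_proba(X_test)[:, 1]

print(roc_auc_score(y_test, proba))                # proper ROC AUC
print(roc_auc_score(y_test, clf.predict(X_test)))  # degenerate: a single threshold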
Post by Andreas Mueller
How did you evaluate on the development set?
You should use "best_score_", not grid_search.score.
That's actually what I did, and the difference is way too big.
Should I do it without GridSearchCV? I'm just wondering why grid search
is giving me overfitted values. I know that these are the best params
and so on... but I thought I could skip the manual part where I test
the params on my own. GridSearchCV gives me just one set of params; if
they are overfitting, can't I use GridSearchCV at all? I'm just having
trouble understanding this.
Post by Olivier Grisel
Post by A neuman
Thanks for the answer!
But how should I check whether it's overfitted or not?
Do a development / evaluation split of your dataset, for instance with
the train_test_split utility first. Then train your GridSearchCV model
on the development set and evaluate it both on the development set and
on the evaluation set. If the difference is large, it means that you
are overfitting.
--
Olivier