Juan Nunez-Iglesias
2016-03-10 06:08:17 UTC
Hi all,
TL;DR: when I run GridSearchCV with RandomForestClassifier and "many"
samples (280K), it falsely shows accuracy of 1.0 for full trees
(max_depth=None). This doesn't happen for fewer samples.
Longer version:
I'm trying to optimise RF hyperparameters using GridSearchCV for the first
time. I have a lot of data (~3M samples, 140 features), so I subsampled it
to do this. First I subsampled to 3000 samples, which finished in 5 min, so
I then ran 70K samples to see whether the result would still hold. That
produced completely different parameter choices, so I ran 280K samples
overnight, to see whether at least the choices would stabilise as n -> inf.
When I printed the top 10 models, I got the following:
In [7]: bests = sorted(random_search.grid_scores_, reverse=True, key=lambda x: x[1])
In [8]: bests[:10]
Out[8]:
[mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'}]
Needless to say, perfect accuracy is suspicious, and indeed, in this case,
completely spurious:
In [16]: rftop = ensemble.RandomForestClassifier(**{'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'})
In [17]: rftop.fit(X[:200000], y[:200000])
In [20]: np.mean(rftop.predict(X[200000:]) == y[200000:])
Out[20]: 0.826125
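(In case it's useful, here's that sanity check as a self-contained snippet. This is just a sketch: it uses a shuffled train_test_split from the old sklearn.cross_validation module instead of my positional 200K/80K split above, and assumes X and y are the 280K subsample defined in my code at the end of this message.)

from sklearn import ensemble
from sklearn.cross_validation import train_test_split  # pre-0.18 module path
import numpy as np

# hold out ~30% of the subsample as an honest test set (shuffled split,
# unlike the positional X[:200000] / X[200000:] split above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
rftop = ensemble.RandomForestClassifier(n_estimators=20, bootstrap=False,
                                        max_features='auto', max_depth=None,
                                        criterion='gini')
rftop.fit(X_train, y_train)
print(np.mean(rftop.predict(X_test) == y_test))  # should land near the
                                                 # 0.826 above, not 1.0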
That's more in line with what's expected for this dataset, and with what the
search over 72K samples found (top model: mean: 0.82640, std: 0.00341,
params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 20,
'max_depth': 20, 'criterion': 'gini'}).
Anyway, here's my code. Any idea why more samples would cause this apparent
overfitting, or testing on the training data?
# [omitted: boilerplate to load the full data into X0, y0]
import numpy as np

# subsample 280K of the ~3M samples, without replacement
idx = np.random.choice(len(y0), size=280000, replace=False)
X, y = X0[idx], y0[idx]

param_dist = {'n_estimators': [20, 100, 200, 500],
              'max_depth': [3, 5, 20, None],
              'max_features': ['auto', 5, 10, 20],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

from sklearn import grid_search as gs
from time import time
from sklearn import ensemble

rf = ensemble.RandomForestClassifier()
random_search = gs.GridSearchCV(rf, param_grid=param_dist, refit=False,
                                verbose=2, n_jobs=12)
start = time(); random_search.fit(X, y); stop = time()
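One thought I haven't ruled out (pure speculation on my part): if the source
data contains many exact-duplicate rows, then a bigger subsample makes it
likelier that copies of the same sample land in both the training and
validation folds of the CV, which would look exactly like testing on training
data. A quick check for that, assuming X fits in memory:

import numpy as np

# count exact duplicate feature rows in the 280K subsample; duplicates
# shared across CV folds would inflate validation scores toward training
# accuracy (requires numpy >= 1.13 for the axis= keyword)
n_unique = np.unique(X, axis=0).shape[0]
print('%d unique rows out of %d' % (n_unique, X.shape[0]))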
Thank you!
Juan.