Juan Nunez-Iglesias
2016-03-10 06:08:17 UTC
Hi all,
TL;DR: when I run GridSearchCV with RandomForestClassifier and "many"
samples (280K), it falsely shows accuracy of 1.0 for full trees
(max_depth=None). This doesn't happen for fewer samples.
Longer version:
I'm trying to optimise RF hyperparameters using GridSearchCV for the first
time. I have a lot of data (~3M samples, 140 features), so I subsampled it
to do this. First I subsampled to 3000 samples, which finished in 5 min, so
I then ran 70K samples to see whether the result would still hold. That
produced completely different parameter choices, so I ran 280K samples
overnight, to see whether at least the choices would stabilise as n -> inf.
When I printed the top 10 models, I got the following:
In [7]: bests = sorted(random_search.grid_scores_, reverse=True, key=lambda x: x[1])
In [8]: bests[:10]
Out[8]:
[mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'}]
Needless to say, perfect accuracy is suspicious, and indeed, in this case,
completely spurious:
In [16]: rftop = ensemble.RandomForestClassifier(**{'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'})
In [17]: rftop.fit(X[:200000], y[:200000])
In [20]: np.mean(rftop.predict(X[200000:]) == y[200000:])
Out[20]: 0.826125
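(In case it's useful, here's that sanity check as a self-contained snippet. This is just a sketch: it uses a shuffled train_test_split from the old sklearn.cross_validation module instead of my positional 200K/80K split above, and assumes X and y are the 280K subsample defined in my code at the end of this message.)

from sklearn import ensemble
from sklearn.cross_validation import train_test_split  # pre-0.18 module path
import numpy as np

# hold out ~30% of the subsample as an honest test set (shuffled split,
# unlike the positional X[:200000] / X[200000:] split above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
rftop = ensemble.RandomForestClassifier(n_estimators=20, bootstrap=False,
                                        max_features='auto', max_depth=None,
                                        criterion='gini')
rftop.fit(X_train, y_train)
print(np.mean(rftop.predict(X_test) == y_test))  # should land near the
                                                 # 0.826 above, not 1.0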
That's more in line with what's expected for this dataset, and with what the
search over 72K samples found (top model: mean: 0.82640, std: 0.00341,
params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 20,
'max_depth': 20, 'criterion': 'gini'}).
Anyway, here's my code. Any idea why more samples would cause this apparent
overfitting, or testing on the training data?
# [omitted: boilerplate to load the full data into X0, y0]
import numpy as np

# subsample 280K of the ~3M samples, without replacement
idx = np.random.choice(len(y0), size=280000, replace=False)
X, y = X0[idx], y0[idx]

param_dist = {'n_estimators': [20, 100, 200, 500],
              'max_depth': [3, 5, 20, None],
              'max_features': ['auto', 5, 10, 20],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

from sklearn import grid_search as gs
from time import time
from sklearn import ensemble

rf = ensemble.RandomForestClassifier()
random_search = gs.GridSearchCV(rf, param_grid=param_dist, refit=False,
                                verbose=2, n_jobs=12)
start = time(); random_search.fit(X, y); stop = time()
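One thought I haven't ruled out (pure speculation on my part): if the source
data contains many exact-duplicate rows, then a bigger subsample makes it
likelier that copies of the same sample land in both the training and
validation folds of the CV, which would look exactly like testing on training
data. A quick check for that, assuming X fits in memory:

import numpy as np

# count exact duplicate feature rows in the 280K subsample; duplicates
# shared across CV folds would inflate validation scores toward training
# accuracy (requires numpy >= 1.13 for the axis= keyword)
n_unique = np.unique(X, axis=0).shape[0]
print('%d unique rows out of %d' % (n_unique, X.shape[0]))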
Thank you!
Juan.