Discussion:
Class Weight Random Forest Classifier
Mamun Rashid
2016-04-14 09:04:42 UTC
Hi Andreas,

By thresholding predict_proba, do you mean thresholding the posterior probability at a level other than 0.5?
But lowering the threshold below 0.5 would simply increase false positives, and raising it would give rise to more false negatives. Right?

I am trying to obtain a biased split at each node of the decision tree. I have two classes, and at every node of the decision tree I
want to give more weight to the positive class. By default, Gini gives the same weight to both classes. Does the class_weight parameter regulate that?
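
A minimal sketch of how class weights could enter the Gini impurity at a node, assuming each sample contributes its class weight instead of a count of 1 when the class proportions are computed. The function name weighted_gini and the example weights are hypothetical and only illustrate the idea; this is not scikit-learn's own code.

```python
from collections import defaultdict


def weighted_gini(labels, class_weight):
    """Gini impurity of a node where each sample counts with its class weight."""
    totals = defaultdict(float)
    for y in labels:
        totals[y] += class_weight[y]
    total = sum(totals.values())
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


# Hypothetical node: three negatives, one positive.
node = [0, 0, 0, 1]

equal = weighted_gini(node, {0: 1.0, 1: 1.0})   # both classes count the same
biased = weighted_gini(node, {0: 1.0, 1: 3.0})  # each positive counts three times

print(equal, biased)
```

With equal weights the node looks fairly pure. With the positive class weighted by 3 the same node is treated as an even mix, so the splits that reduce impurity the most can differ.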

Thanks,
Mamun
Another possibility is to threshold the predict_proba differently, such
that the decision maximizes whatever metric you have defined.
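
A minimal sketch of this suggestion, assuming the metric to maximize is F1 for the positive class and that the probabilities come from held-out data (for example clf.predict_proba(X_val)[:, 1]). The helper names, the toy labels and the toy probabilities are hypothetical.

```python
def f1_at_threshold(y_true, proba_pos, threshold):
    """F1 score of the positive class when predicting 1 for proba >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, proba_pos):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, proba_pos, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, proba_pos, t))


# Toy held-out labels and positive-class probabilities (made up for illustration).
y_val = [0, 0, 0, 0, 1, 1, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.45, 0.30, 0.40, 0.60, 0.90]

candidates = [i / 10 for i in range(1, 10)]
t = best_threshold(y_val, p_val, candidates)
print(t, f1_at_threshold(y_val, p_val, t), f1_at_threshold(y_val, p_val, 0.5))
```

On this toy data a threshold below 0.5 gives a higher F1 than the default 0.5: the recall gained outweighs the one extra false positive. The threshold should be chosen on data that was not used for fitting.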
Hi All,
I asked this question a couple of weeks ago on the list. I have a
two-class problem where my positive class (Class 1) and negative
class (Class 0)
are imbalanced. Secondly, I care much less about the negative class. So
I specified both class_weight (to a random forest classifier) and
sample_weight to
the fit function to give more importance to my positive class.
cl_weight = {0: weight1, 1: weight2}
clf = RandomForestClassifier(n_estimators=400, max_depth=None,
                             min_samples_split=2, random_state=0,
                             oob_score=True, class_weight=cl_weight,
                             criterion="gini")
sample_weight = np.array([weight if m == 1 else 1 for m in df_tr[label_column]])
y_pred = clf.fit(X_tr, y_tr, sample_weight=sample_weight).predict(X_te)
Despite specifying dramatically different class weights, I do not
observe much difference. Example: cl_weight = {0:0.001, 1:0.999} and
cl_weight = {0:0.50, 1:0.50}. Am I passing the class weight correctly?
I am giving examples of two folds from these two runs: Fold 1 and
Fold 2.
## cl_weight = {0:0.001, 1:0.999}
Fold_1 Confusion Matrix
        0      1
0    1681     26
1     636    149
Fold_5 Confusion Matrix
        0      1
0    1670     15
1     734    160
## cl_weight = {0:0.50, 1:0.50}
Fold_1 Confusion Matrix
        0      1
0    1690     15
1     630    163
Fold_5 Confusion Matrix
        0      1
0    1676     14
1     709    170
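
To make the two runs easier to compare, here is a small sketch that turns the Fold_1 matrices above into precision and recall for the positive class. It assumes rows are true labels and columns are predicted labels (the layout sklearn.metrics.confusion_matrix uses); the message does not state the orientation, so that is an assumption. The helper name is hypothetical.

```python
def positive_class_metrics(cm):
    """Precision and recall for class 1 from a 2x2 matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Fold_1 matrices copied from the two runs reported above.
runs = {
    "cl_weight = {0:0.001, 1:0.999}": [[1681, 26], [636, 149]],
    "cl_weight = {0:0.50, 1:0.50}": [[1690, 15], [630, 163]],
}

for name, cm in runs.items():
    precision, recall = positive_class_metrics(cm)
    print(f"{name}: precision={precision:.3f} recall={recall:.3f}")
```

Under that assumption, recall for the positive class is roughly 0.19 and 0.21 in the two runs, which matches the observation that the weights made little difference.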
Thanks,
Mamun
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Message: 3
Date: Tue, 12 Apr 2016 18:51:52 -0400
Subject: Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?
Have you tried to "score" the grid-search on the non-training set?
The cross-validation is using stratified k-fold while your confirmation
used the beginning of the dataset vs the rest.
Your data is probably not IID.
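
A toy sketch of the difference described here, assuming the data is ordered so that the class balance drifts over the dataset. It compares a "first part vs. the rest" split with a simplified stratified assignment. The function stratified_folds is a hypothetical stand-in written for this example, not scikit-learn's StratifiedKFold, and the labels are made up.

```python
def positive_rate(labels):
    """Fraction of samples labelled 1."""
    return sum(labels) / len(labels)


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds round-robin within each class."""
    folds = [[] for _ in range(n_folds)]
    seen = {}
    for i, y in enumerate(labels):
        k = seen.get(y, 0)
        folds[k % n_folds].append(i)
        seen[y] = k + 1
    return folds


# Toy ordered data: positives are rare early on and common later.
y = [0] * 80 + [1] * 20 + [0] * 40 + [1] * 60

# Split by position, as in "beginning of the dataset vs. the rest".
head, tail = y[:100], y[100:]
print("head/tail positive rate:", positive_rate(head), positive_rate(tail))

# Stratified folds keep the class balance the same in every fold.
folds = stratified_folds(y, 4)
fold_rates = [positive_rate([y[i] for i in fold]) for fold in folds]
print("stratified fold positive rates:", fold_rates)
```

If the real data is ordered in a similar way, a score from stratified folds and a score from a positional split are measured on differently distributed test sets, so they need not agree.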
Hi all,
TL;DR: when I run GridSearchCV with RandomForestClassifier and "many"
samples (280K), it falsely shows accuracy of 1.0 for full trees
(max_depth=None). This doesn't happen for fewer samples.
I'm trying to optimise RF hyperparameters using GridSearchCV for the
first time. I have a lot of data (~3M samples, 140 features), so I
subsampled it to do this. First I subsampled to 3000 samples, which
finished in 5min, so I ran 70K samples to see if result would still
hold. This resulted in completely different parameter choices, so I
ran 280K samples overnight, to see whether at least the choices would
stabilise as n -> inf. Then when I printed the top 10 models, I got
In : bests = sorted(random_search.grid_scores_, reverse=True,
                    key=lambda x: x)
In : bests[:10]
[mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True,
     'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True,
     'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True,
     'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True,
     'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True,
     'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False,
     'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False,
     'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False,
     'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False,
     'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': False,
     'max_features': 5, 'max_depth': None, 'criterion': 'gini'}]
Needless to say, perfect accuracy is suspicious, and indeed, in this
20, 'bootstr
'gini'})
In : rftop.fit(X[:200000], y[:200000])
In : np.mean(rftop.predict(X[200000:]) == y[200000:])
Out: 0.826125
That's more in line with what's expected for this dataset, and what
was found by the search with 72K samples (top model: [mean: 0.82640,
std: 0.00341, params: {'n_estimators': 500, 'bootstrap': False,
'max_features': 20, 'max_depth': 20, 'criterion': 'gini'},)
Anyway, here's my code, any idea why more samples would cause this
overfitting / testing on training data?
# [omitted: boilerplate to load full data in X0, y0]
import numpy as np
idx = np.random.choice(len(y0), size=280000, replace=False)
X, y = X0[idx], y0[idx]
param_dist = {'n_estimators': [20, 100, 200, 500],
              'max_depth': [3, 5, 20, None],
              'max_features': ['auto', 5, 10, 20],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}
from sklearn import grid_search as gs
from time import time
from sklearn import ensemble
rf = ensemble.RandomForestClassifier()
random_search = gs.GridSearchCV(rf, param_grid=param_dist, refit=False,
                                verbose=2, n_jobs=12)
start=time(); random_search.fit(X, y); stop=time()
Thank you!
Juan.
End of Scikit-learn-general Digest, Vol 75, Issue 14
****************************************************
Andreas Mueller
2016-04-14 15:36:32 UTC