Discussion:
[Scikit-learn-general] Class Weight Random Forest Classifier
Mamun Rashid
2016-03-15 11:44:20 UTC
Hi All,
I asked this question a couple of weeks ago on the list. I have a two-class problem where my positive class (Class 1) and negative class (Class 0)
are imbalanced, and I care much less about the negative class. So I specified both a class weight (on the random forest classifier) and a sample weight (to
the fit function) to give more importance to my positive class.

cl_weight = {0: weight1, 1: weight2}
clf = RandomForestClassifier(n_estimators=400, max_depth=None, min_samples_split=2,
                             random_state=0, oob_score=True, class_weight=cl_weight,
                             criterion="gini")
# up-weight positive samples; `weight` is the positive-class weight, defined elsewhere
sample_weight = np.array([weight if m == 1 else 1 for m in df_tr[label_column]])
y_pred = clf.fit(X_tr, y_tr, sample_weight=sample_weight).predict(X_te)

Despite specifying dramatically different class weights, I do not observe much difference in the results.
Example: cl_weight = {0: 0.001, 1: 0.999} vs. cl_weight = {0: 0.50, 1: 0.50}.
Am I passing the class weight correctly?

Here are the confusion matrices for two folds (Fold 1 and Fold 5) from each of the two runs.

## cl_weight = {0:0.001, 1:0.999}

Fold_1 Confusion Matrix (rows: actual class, columns: predicted class)
        0     1
0    1681    26
1     636   149

Fold_5 Confusion Matrix
        0     1
0    1670    15
1     734   160

## cl_weight = {0:0.50, 1:0.50}

Fold_1 Confusion Matrix
        0     1
0    1690    15
1     630   163

Fold_5 Confusion Matrix
        0     1
0    1676    14
1     709   170
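
A quick self-contained way to check whether class_weight has any effect at all; the toy dataset below is purely illustrative, not my data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# imbalanced toy problem: 75% class 0, 25% class 1
X, y = make_classification(n_samples=5000, weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for cw in ({0: 0.001, 1: 0.999}, {0: 0.50, 1: 0.50}):
    clf = RandomForestClassifier(n_estimators=400, class_weight=cw, random_state=0)
    n_pos = clf.fit(X_tr, y_tr).predict(X_te).sum()
    print(cw, '-> positives predicted:', n_pos)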


Thanks,
Mamun
Raghav R V
2016-03-15 14:00:30 UTC
Hi Mamun,

Scikit-learn's RandomForestClassifier has an option to set `class_weight`
to "balanced". Have you tried that alone, without specifying
`sample_weight`?

See this documentation -
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
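
A minimal sketch of that suggestion (X_tr, y_tr, X_te as in your original code):

from sklearn.ensemble import RandomForestClassifier

# "balanced" weights classes by n_samples / (n_classes * np.bincount(y)),
# so the minority class is up-weighted automatically.
clf = RandomForestClassifier(n_estimators=400, class_weight='balanced',
                             random_state=0, oob_score=True)
y_pred = clf.fit(X_tr, y_tr).predict(X_te)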


Is there a chance that what you are trying to achieve with `class_weight` is being
undone by your `sample_weight`?
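
For context: inside the forest the two weightings are combined multiplicatively, so the per-sample weight seen by the split criterion is roughly class_weight[y_i] * sample_weight_i. A toy illustration of that combination, not the library's actual code path:

import numpy as np

cl_weight = {0: 0.001, 1: 0.999}
y = np.array([0, 0, 1, 1])
sample_weight = np.array([1.0, 1.0, 5.0, 5.0])

# effective weight per sample: class weight times sample weight
effective = np.array([cl_weight[c] for c in y]) * sample_weight
print(effective)   # [0.001 0.001 4.995 4.995]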


Thanks.
R
Mamun Rashid
2016-03-18 14:30:56 UTC
Hi Raghav,
Thanks for your reply. My Class 1 is smaller than Class 0, so even if the sample_weight introduces any bias,
it should favour Class 1.

Regardless of the relative sizes of Class 1 and Class 0, I want to give more importance to Class 1 when splitting at a tree
node using Gini impurity. The `class_weight` parameter should have taken care of that; however, I do not see any
effect.
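
For reference, class weights enter the Gini criterion through the weighted class proportions: with weight w_k for class k and n_k samples of class k at a node,

    G = 1 - sum_k p_k^2,    where p_k = (w_k * n_k) / (sum_j w_j * n_j)

so a large w_1 should indeed change which splits look best.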

I will try specifying `class_weight` without the sample weight to see whether that changes the situation.

Thanks,
Mamun
Andreas Mueller
2016-04-12 22:47:59 UTC
Another possibility is to threshold the output of predict_proba differently, so
that the decision maximizes whatever metric you have defined.
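
A minimal sketch of that idea; the threshold value of 0.3 here is illustrative and should be tuned against your metric on a validation set:

import numpy as np

# probability of the positive class (class 1)
proba_pos = clf.predict_proba(X_te)[:, 1]

# lower the decision threshold from the default 0.5 to favour recall on class 1
threshold = 0.3
y_pred = (proba_pos >= threshold).astype(int)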