Mamun Rashid

2016-04-14 09:04:42 UTC

Hi Andreas,

Thanks for your reply, and thanks in advance for your patience with a repeated question!

By thresholding predict_proba, do you mean thresholding the posterior probability at a level other than 0.5?

But reducing the threshold below 0.5 would simply increase false positives, and raising it would give rise to false negatives. Right?

I am trying to obtain a biased split at each node of a decision tree. I have two classes, and at every node of the tree I want to give more weight to the positive class. By default, Gini gives the same weight to both classes. Does the class_weight parameter regulate that?
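(Aside: class_weight does feed into the split criterion in scikit-learn trees: each sample counts with its class's weight when the Gini impurity of a candidate split is computed. A minimal sketch on synthetic stand-in data, just to show the direction of the effect:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced problem: roughly 10% positives.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

for cw in (None, {0: 1, 1: 10}):
    tree = DecisionTreeClassifier(max_depth=3, class_weight=cw, random_state=0)
    tree.fit(X, y)
    # With a heavier weight on class 1, the weighted Gini criterion favours
    # splits that isolate positives, so more samples get predicted positive.
    print(cw, "fraction predicted positive:", (tree.predict(X) == 1).mean())
)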

Thanks,

Mamun


Another possibility is to threshold the predict_proba differently, such that the decision maximizes whatever metric you have defined.
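(A minimal sketch of that suggestion, picking the threshold that maximizes F1 on a validation split; the dataset and model here are stand-ins:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Stand-in data and model; substitute your own.
X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# Sweep thresholds and keep the one that maximizes F1 on the validation set.
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # last P/R point has no threshold
y_pred = (proba >= best).astype(int)   # instead of the default 0.5 cut
print("best threshold:", best)
)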


Hi All,

I asked this question a couple of weeks ago on the list. I have a two-class problem where my positive class (Class 1) and negative class (Class 0) are imbalanced. Secondly, I care much less about the negative class. So I specified both a class weight (to a random forest classifier) and a sample weight (to the fit function) to give more importance to my positive class.

cl_weight = {0: weight1, 1: weight2}

clf = RandomForestClassifier(n_estimators=400, max_depth=None,
                             min_samples_split=2, random_state=0,
                             oob_score=True, class_weight=cl_weight,
                             criterion="gini")

sample_weight = np.array([weight if m == 1 else 1 for m in df_tr[label_column]])

y_pred = clf.fit(X_tr, y_tr, sample_weight=sample_weight).predict(X_te)
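(One thing worth knowing when reading the snippet above: scikit-learn multiplies class_weight with any sample_weight passed to fit, so the two settings compound rather than override each other. A tiny sketch of the effective per-sample weight, using hypothetical values:

import numpy as np

# Hypothetical values standing in for the ones above.
weight = 5.0                       # sample weight given to class 1
cl_weight = {0: 0.001, 1: 0.999}   # class_weight passed to the forest

y_tr = np.array([0, 0, 0, 1, 1])   # toy labels
sample_weight = np.array([weight if m == 1 else 1.0 for m in y_tr])

# Inside the tree-building code the two are multiplied, so the weight the
# Gini criterion actually sees for each sample is:
effective = sample_weight * np.array([cl_weight[m] for m in y_tr])
print(effective)  # class 0 samples: 0.001, class 1 samples: 4.995
)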

Despite specifying dramatically different class weights, I do not observe much difference. Example: cl_weight = {0:0.001, 1:0.999} vs. cl_weight = {0:0.50, 1:0.50}. Am I passing the class weight correctly?

Below are two folds (Fold_1 and Fold_5) from these two runs:

## cl_weight = {0:0.001, 1:0.999}

Fold_1 Confusion Matrix:
        0     1
0    1681    26
1     636   149

Fold_5 Confusion Matrix:
        0     1
0    1670    15
1     734   160

## cl_weight = {0:0.50, 1:0.50}

Fold_1 Confusion Matrix:
        0     1
0    1690    15
1     630   163

Fold_5 Confusion Matrix:
        0     1
0    1676    14
1     709   170

Thanks,

Mamun


------------------------------

Message: 3

Date: Tue, 12 Apr 2016 18:51:52 -0400

Subject: Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

Content-Type: text/plain; charset="windows-1252"

Have you tried to "score" the grid-search on the non-training set?

The cross-validation is using stratified k-fold, while your confirmation used the beginning of the dataset vs. the rest.

Your data is probably not IID.
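(A sketch of that check, with stand-in data: score the same model once with stratified k-fold CV, as GridSearchCV does internally, and once on a contiguous first-80%/last-20% split. A large gap between the two suggests the row order carries information, i.e. the data is not IID:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data; substitute the real X, y.
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Score 1: stratified k-fold CV, as GridSearchCV does internally.
cv_score = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5)).mean()

# Score 2: the "confirmation" style -- train on the first 80%,
# score on the last 20%.
n = int(0.8 * len(y))
holdout_score = clf.fit(X[:n], y[:n]).score(X[n:], y[n:])

print(cv_score, holdout_score)
)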

Hi all,

TL;DR: when I run GridSearchCV with RandomForestClassifier and "many" samples (280K), it falsely shows accuracy of 1.0 for full trees (max_depth=None). This doesn't happen for fewer samples.

I'm trying to optimise RF hyperparameters using GridSearchCV for the first time. I have a lot of data (~3M samples, 140 features), so I subsampled it to do this. First I subsampled to 3000 samples, which finished in 5 min, so I ran 70K samples to see if the result would still hold. This resulted in completely different parameter choices, so I ran 280K samples overnight, to see whether at least the choices would stabilise as n -> inf. Then, when I printed the top 10 models, I got:

In [7]: bests = sorted(random_search.grid_scores_, reverse=True, key=lambda x: x[1])

In [8]: bests[:10]
[mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'}]

Needless to say, perfect accuracy is suspicious, and indeed it doesn't hold up: refitting one of the top models (rftop, with n_estimators=20, bootstrap=False, max_depth=None, criterion='gini') on the first 200K samples and scoring on the rest gives

In [17]: rftop.fit(X[:200000], y[:200000])

In [20]: np.mean(rftop.predict(X[200000:]) == y[200000:])

Out[20]: 0.826125

That's more in line with what's expected for this dataset, and with what was found by the search with 72K samples (top model: mean: 0.82640, std: 0.00341, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 20, 'max_depth': 20, 'criterion': 'gini'}).

Anyway, here's my code; any idea why more samples would cause this overfitting / testing on training data?

# [omitted: boilerplate to load full data in X0, y0]
import numpy as np

idx = np.random.choice(len(y0), size=280000, replace=False)
X, y = X0[idx], y0[idx]

param_dist = {'n_estimators': [20, 100, 200, 500],
              'max_depth': [3, 5, 20, None],
              'max_features': ['auto', 5, 10, 20],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

from sklearn import grid_search as gs
from sklearn import ensemble
from time import time

rf = ensemble.RandomForestClassifier()
random_search = gs.GridSearchCV(rf, param_grid=param_dist, refit=False,
                                verbose=2, n_jobs=12)
start = time(); random_search.fit(X, y); stop = time()
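(If the non-IID diagnosis above is right, e.g. near-duplicate rows from the same source ending up on both sides of a fold boundary, a grouped split keeps related rows together and removes the leak. A sketch on synthetic stand-in data with a hypothetical per-row groups array:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score

# Stand-in data: 1000 sources, 10 near-duplicate rows each, labels random
# per source. Any shuffled-CV score above ~0.5 is therefore pure leakage.
rng = np.random.RandomState(0)
base = rng.rand(1000, 14)
X0 = np.repeat(base, 10, axis=0) + 0.01 * rng.randn(10000, 14)
y0 = (rng.rand(1000) > 0.5).astype(int).repeat(10)
groups = np.repeat(np.arange(1000), 10)  # hypothetical per-row source id

rf = RandomForestClassifier(n_estimators=50, max_depth=None, random_state=0)

# Shuffled CV lets near-duplicates straddle the train/validation boundary,
# so a full-depth forest can memorize its way to a spuriously high score.
naive = cross_val_score(rf, X0, y0,
                        cv=StratifiedKFold(5, shuffle=True, random_state=0))

# GroupKFold keeps all rows of a source in one fold, removing the leak;
# the score falls back to chance, as it should here.
grouped = cross_val_score(rf, X0, y0, groups=groups, cv=GroupKFold(n_splits=5))
print(naive.mean(), grouped.mean())
)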

Thank you!

Juan.



------------------------------

_______________________________________________

Scikit-learn-general mailing list

https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

End of Scikit-learn-general Digest, Vol 75, Issue 14

****************************************************