Discussion:
[Scikit-learn-general] RandomForestClassifier - Feature Importances
Piotr Bialecki
2016-02-27 17:30:53 UTC
Hi all,

I am currently working with the RandomForestClassifier performing a
RandomizedSearchCV on the training data set.
The data contains 106 features and approx. 10,000 observations.

The hyperparameter search returns the best parameters as:
{'bootstrap': True,
'class_weight': 'balanced',
'criterion': 'entropy',
'max_depth': 10,
'max_features': 'log2',
'min_samples_leaf': 4,
'min_samples_split': 3,
'n_estimators': 33}

My question is regarding feature_importances_.
When calling this on my RandomForestClassifier (clf) it returns:

clf.feature_importances_
Out[140]:
array([ 0.51036391, 0.03331918, 0.02011316, 0.11259915, 0.17919327,
0.05119163, 0.01932924, 0.03351345, 0.01557083, 0.02480619])

Calling feature_importances_ on the different trees returns:

clf.estimators_[1].feature_importances_
Out[137]:
array([ 0.42919509, 0.0524983 , 0.01913177, 0.13067667, 0.20454586,
0.03236881, 0.06266216, 0.02380507, 0.01972648, 0.02538979])

clf.estimators_[0].feature_importances_
Out[138]:
array([ 0.57415072, 0.02156333, 0.01333293, 0.08907816, 0.20841139,
0.02695001, 0.03061188, 0.02447627, 0.0064503 , 0.00497501])

Since every tree uses different features, the feature importances of
each tree should represent the relative importance of the features used
in that tree.
However, each tree seems to use 10 features, even though max_features is
set to log2, which should give log2(106) ~= 7.

However, what does clf.feature_importances_ return?
Is it the mean of the feature importances of all trees? If so, does it
make sense, since every tree is using a different feature set?

Please let me know if you need more information.


Kind regards
Piotr Bialecki
Nicolas Goix
2016-03-01 00:41:38 UTC
Hi Piotr,

In RandomForestClassifier, max_features is not the number of features
selected from X to train each tree (as it is in bagging methods). It is the
number of features (randomly chosen at each split) to consider when looking
for the best split.
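
A minimal sketch of the point above, assuming a recent scikit-learn and a
synthetic data set from make_classification as a stand-in for the original
data (which is not available here). It checks that a single tree reports
importances for all 106 columns and splits on more distinct features than
max_features allows per split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 106-feature data set (an assumption, not the original data).
X, y = make_classification(n_samples=1000, n_features=106,
                           n_informative=20, random_state=0)

clf = RandomForestClassifier(n_estimators=33, max_features="log2",
                             max_depth=10, random_state=0)
clf.fit(X, y)

tree = clf.estimators_[0]

# The tree is fitted on all columns, so it reports one importance per column.
n_columns = tree.feature_importances_.shape[0]

# tree_.feature stores the feature index used at each node;
# negative values mark leaf nodes, which do not split.
split_features = tree.tree_.feature
n_used = np.unique(split_features[split_features >= 0]).size

# max_features_ is the number of candidate features drawn at each split.
print("candidates per split:", tree.max_features_)
print("columns seen by the tree:", n_columns)
print("distinct features used across all splits:", n_used)
```

Because a fresh random subset of candidates is drawn at every split, one tree
can end up using many more features overall than max_features.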

HTH

Nicolas
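
On the remaining question of what clf.feature_importances_ returns: as far
as I know, the forest averages the per-tree importances. The sketch below
checks this on synthetic data (an assumption, not the original data set),
and it assumes every tree contains at least one split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)

clf = RandomForestClassifier(n_estimators=33, random_state=0).fit(X, y)

# One row of importances per tree; each row sums to 1.
per_tree = np.array([t.feature_importances_ for t in clf.estimators_])

# Average over trees and compare with the forest-level attribute.
mean_importances = per_tree.mean(axis=0)
matches = np.allclose(clf.feature_importances_, mean_importances)
print("forest importances equal the mean over trees:", matches)
```

Since every tree is trained on all columns, the per-tree vectors have the
same length and refer to the same features, so averaging them is well defined.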