Piotr Bialecki
2016-02-27 17:30:53 UTC
Hi all,
I am currently working with the RandomForestClassifier performing a
RandomizedSearchCV on the training data set.
The data contains 106 features and approx. 10,000 observations.
The hyperparameter search returns the best parameters as:
{'bootstrap': True,
'class_weight': 'balanced',
'criterion': 'entropy',
'max_depth': 10,
'max_features': 'log2',
'min_samples_leaf': 4,
'min_samples_split': 3,
'n_estimators': 33}
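For context, this is roughly how I set up the search (the distributions and n_iter below are just illustrative, and X_train / y_train stand for my training data):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'bootstrap': [True, False],
    'class_weight': ['balanced', None],
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(3, 20),
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': randint(1, 10),
    'min_samples_split': randint(2, 10),
    'n_estimators': randint(10, 100),
}
search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=50)
search.fit(X_train, y_train)
clf = search.best_estimator_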
My question is regarding feature_importances_.
When I call it on my RandomForestClassifier (clf), it returns:
clf.feature_importances_
Out[140]:
array([ 0.51036391, 0.03331918, 0.02011316, 0.11259915, 0.17919327,
0.05119163, 0.01932924, 0.03351345, 0.01557083, 0.02480619])
Calling feature_importances_ on the different trees returns:
clf.estimators_[1].feature_importances_
Out[137]:
array([ 0.42919509, 0.0524983 , 0.01913177, 0.13067667, 0.20454586,
0.03236881, 0.06266216, 0.02380507, 0.01972648, 0.02538979])
clf.estimators_[0].feature_importances_
Out[138]:
array([ 0.57415072, 0.02156333, 0.01333293, 0.08907816, 0.20841139,
0.02695001, 0.03061188, 0.02447627, 0.0064503 , 0.00497501])
Since every tree uses a different set of features, the feature
importances of each tree should represent the relative importance of
the features used in that tree.
However, each tree seems to use 10 features, even though max_features
is set to 'log2', which should be log2(106) ≈ 7.
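To count how many distinct features a single tree actually splits on, I used something along these lines (as far as I understand, leaf nodes are marked with a negative value in tree_.feature):

import numpy as np

node_features = clf.estimators_[0].tree_.feature
used = np.unique(node_features[node_features >= 0])  # drop leaf markers
print(len(used))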
So what does clf.feature_importances_ actually return?
Is it the mean of the per-tree feature importances? If so, does that
make sense, given that every tree uses a different feature set?
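To put the question in code, this is the comparison I have in mind (just a sketch, assuming the forest importance is a plain average over the trees):

import numpy as np

per_tree = np.array([t.feature_importances_ for t in clf.estimators_])
print(np.allclose(per_tree.mean(axis=0), clf.feature_importances_))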
Please let me know if you need more information.
Kind regards
Piotr Bialecki