Discussion:
sample_weight and features in a single tree
Aaron Jacques
2013-08-28 10:41:02 UTC
A thread on SO[1] states that class weighting for a random forest can be achieved by passing sample_weight to the fit() function. If I have a dataset with the following (two-dimensional) format:

 
         categorical_1  numeric  categorical_2  ...
row 1    string_a       182      string_x       ...
row 2    string_b       12       string_y       ...
row 3    string_a       3342     string_z       ...
...

How can I pass in sample_weight as class weights for such cases? Passing sample_weight as a multi-dimensional list leads to the following error:
  preprocessing.balance_weights([[1,2,3,4,5][1,2,3,4,4]])

  TypeError: list indices must be integers, not tuple


Or should I pass in a format like [string_a, string_b, string_a, 182, 12, 3342, string_x, ...], with all columns flattened into a single list, where string_a is a factor level of the classes? Or what is the right way to do that? Or can I just pass in weights for a single tree?

In addition, how can I know which features are used for each tree (RandomForestClassifier.estimators_)? Or does RandomForestClassifier use all features for each tree? For example, given a DataFrame with features f = [Age, Job, Title, ...], will each tree use all the features in f when fit() is called? Or is there any way to know which features are used by a single tree?

Thanks

[1]. http://stackoverflow.com/questions/17688147/how-to-weight-classes-in-a-randomforest-implementation 
Gilles Louppe
2013-08-28 11:10:25 UTC
Hi Aaron,

Assume that X is your data and y is the labels for X. If classes in y
are not balanced and you want to fix that, you can indeed use sample
weights to simulate class weights. Basically you can simply do:

forest.fit(X, y, sample_weight=balance_weights(y))
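
To make this concrete, here is a minimal, self-contained sketch of the
suggestion above. It assumes a scikit-learn version that still ships
preprocessing.balance_weights (as noted later in this thread, the
function was deprecated and slated for removal in 0.16); the toy data
is invented for illustration.

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.preprocessing import balance_weights

  # Toy imbalanced data: 90 samples of class 0, 10 samples of class 1.
  rng = np.random.RandomState(0)
  X = rng.rand(100, 4)
  y = np.array([0] * 90 + [1] * 10)

  # balance_weights(y) returns one weight per sample, chosen so that
  # each class contributes equally to the total sample weight.
  forest = RandomForestClassifier(n_estimators=50, random_state=0)
  forest.fit(X, y, sample_weight=balance_weights(y))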
Post by Aaron Jacques
In addition, how can I know which features are used for each tree
(RandomForestClassifier.estimators_)? Or does RandomForestClassifier
use all features for each tree?
Both random forests and single decision trees are built on *all* the
features provided in X.

If you want to know which ones were the most helpful/important to
build the forest, then you can check the `feature_importances_`
attribute which will give you a score for each feature (the higher,
the more important).
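
As an illustration, here is a hypothetical sketch of inspecting that
attribute after fitting; the data and the feature names are made up
for the example.

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier

  # Synthetic data where only the first feature carries signal.
  rng = np.random.RandomState(0)
  X = rng.rand(100, 3)
  y = (X[:, 0] > 0.5).astype(int)
  feature_names = ["Age", "Job", "Title"]

  forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

  # One importance score per column of X; higher means more important.
  for idx in np.argsort(forest.feature_importances_)[::-1]:
      print("%-6s %.4f" % (feature_names[idx],
                           forest.feature_importances_[idx]))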

Hope this helps,

Gilles
Sergey Feldman
2013-08-28 14:40:56 UTC
Huh, cool, I didn't know about balance_weights before. I'm also having a
hard time finding documentation on it besides:

"DEPRECATED: balance_weights is an internal function and will be removed in
0.16"

What will it be replaced by in 0.16?

Thanks,
sf
Aaron Jacques
2013-08-29 08:02:08 UTC
Some more questions.

Is it possible to know which features are selected for building a tree?
Aaron Jacques
2013-08-29 12:07:54 UTC
I came across https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py. Does this Tree object hold the randomly sampled features used for building a single tree? Reading its description, the attributes 'feature' and 'value' seem to be what I am looking for. But in a test the result looks like the following, which does not match the features in the training set header. Is there any place where I can find which features are sampled for building a tree?

feature: [        88          0        194         23         80          0
        119         32         90        151         64         64
         73         23        208         66         23        124 ...]

value: [[[ 1757.   952.]]
 [[ 1418.   874.]]

 [[  206.   184.]]

 ...
 [[    0.     1.]]]    

Thanks


Olivier Grisel
2013-08-29 14:02:07 UTC
In general, all the features are used by the decision tree algorithm. The
max_features parameter is just a way to control the amount of
randomization injected at each split during the learning process of
the trees used by the ExtraTrees* or RandomForest* classes. But on
average all features end up being selected at one point or another: hence
there is no such thing as "selected features".
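
As a small, self-contained sketch of what that parameter does (the
random data here is invented for illustration):

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier

  rng = np.random.RandomState(0)
  X, y = rng.rand(100, 4), rng.randint(0, 2, 100)

  # max_features limits how many features are considered as split
  # candidates at each node; it does not restrict which features a
  # whole tree may end up using. Here every split draws a fresh random
  # subset of 2 of the 4 features and picks the best split among them.
  forest = RandomForestClassifier(n_estimators=50, max_features=2,
                                  random_state=0).fit(X, y)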

Ensembles of trees like ExtraTrees* or RandomForest* can however tell
you which were the most important features. See the examples mentioned
in this section of the documentation:

http://scikit-learn.org/dev/modules/ensemble.html#feature-importance-evaluation

Finally you can output a representation of individual tree as a graph, see:

http://scikit-learn.org/dev/modules/tree.html#classification
http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
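
For instance, a minimal sketch of exporting the first tree of a fitted
forest to Graphviz .dot format (the data is synthetic; render the
output with e.g. `dot -Tpdf tree0.dot -o tree0.pdf`):

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.tree import export_graphviz

  rng = np.random.RandomState(0)
  X, y = rng.rand(100, 4), rng.randint(0, 2, 100)
  forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

  # Write the structure of one individual tree as a graph description.
  with open("tree0.dot", "w") as f:
      export_graphviz(forest.estimators_[0], out_file=f)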

The source code of that function might be a good example of how to walk
down the tree: for instance, to mine the frequencies of consecutive
decision rules in a forest:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/export.py
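
In the same spirit, here is a hedged sketch that lists which feature
indices a single fitted tree actually splits on, using the public
tree_ arrays (leaf nodes store a negative sentinel in tree_.feature,
so they are filtered out); the data is synthetic:

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier

  rng = np.random.RandomState(0)
  X, y = rng.rand(100, 4), rng.randint(0, 2, 100)
  forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

  # tree_.feature holds, for each node, the index of the split feature.
  tree = forest.estimators_[0].tree_
  used = np.unique(tree.feature[tree.feature >= 0])
  print("Feature indices this tree splits on:", used)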
Aaron Jacques
2013-08-30 05:25:21 UTC
I read Berkeley's doc[1], which states that for each tree features are randomly sampled from all the input features. So I am curious whether those randomly sampled features are preserved anywhere.

Thanks for the explanation.


[1]. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

