Discussion:
[Scikit-learn-general] Random Forest Custom Label
Mamun Rashid
2016-03-01 23:11:47 UTC
Hi All,

This is my understanding of the Random Forest algorithm:
The Random Forest algorithm builds a number of trees, each from a randomly selected subset of the samples and features. At each node of a tree it uses the decrease in Gini impurity to find the best feature-threshold pair (several thresholds are tested for each feature), i.e. the split that best separates the positive and the negative class.
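
To make the criterion concrete, here is a small sketch of the weighted Gini impurity a candidate split is scored with (my own toy code, not scikit-learn's actual implementation):

import numpy as np

def gini(y):
    """Gini impurity of a set of 0/1 labels: 1 - sum_k p_k^2."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)  # fraction of positive labels
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def split_impurity(y_left, y_right):
    """Size-weighted Gini impurity of a split; lower means a cleaner separation."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)

# Example: testing the threshold 2.5 on a single feature.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
mask = x <= 2.5
print(split_impurity(y[mask], y[~mask]))  # 0.0 -> a perfect separation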

Question 1 :
I have a two-class classification problem where the positive labels reside in clusters. A traditional cross-validation approach is not aware of this and splits data points from the same cluster between the training and test sets, giving rise to inflated classification performance. I wrote a custom cross-validation loop to address this issue. However, the bootstrapping inside the Random Forest algorithm randomly selects samples and features to control overfitting.

When it applies the fit method to the randomly selected samples, does it do an internal cross-validation to prevent overfitting? I did not find this in the GitHub code.
If yes, can I specify my groupings to the Random Forest?
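
For reference, my custom loop is roughly equivalent to the following sketch using GroupKFold (available in recent scikit-learn versions; the group ids below are made up, one id per cluster, so no cluster is split across folds):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

X = np.random.RandomState(0).rand(12, 4)                 # toy features
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0])       # labels
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4])  # cluster ids

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# GroupKFold keeps all samples of a group in the same fold.
scores = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=3))
print(scores)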

Question 2 :
The Gini impurity at each node tries to find the best separation between the two classes. I care more about obtaining a cleaner separation for my positive class. Is there any way to give more importance to one class during the partitioning?

Thanks in advance.

Mamun
Jacob Schreiber
2016-03-02 00:39:40 UTC
Question 1: It does not do an internal cross-validation to prevent
overfitting.
Question 2: Yes, you can put a higher weight on your positive class. Look
at the class_weight parameter in the documentation here:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
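
For example, a minimal sketch (the 5:1 weight on the positive class is arbitrary, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# class_weight scales the sample weights that enter the Gini computation,
# so impurity at a node is reduced more by purifying the up-weighted class.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight={0: 1, 1: 5},
                             random_state=0)
clf.fit(X, y)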