Mamun Rashid
2016-03-01 23:11:47 UTC
Hi All,
This is my understanding of the Random Forest Algorithm :
The Random Forest algorithm builds a number of trees, each from a randomly selected subset of samples and features. At each node of a tree it uses the decrease in Gini impurity
to find the best feature-threshold pair (several thresholds are tested for each feature) to obtain the best separation between the positive and the negative class.
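To make sure I have this right, here is how I understand the knobs map onto scikit-learn's RandomForestClassifier (a minimal sketch on toy data; the parameter values are just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy two-class data standing in for my real problem.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Each tree sees a bootstrap sample of rows and, at each split,
# a random subset of the features; splits are chosen by Gini impurity.
clf = RandomForestClassifier(
    n_estimators=100,
    criterion="gini",     # Gini impurity at each node
    max_features="sqrt",  # random feature subset per split
    bootstrap=True,       # random sample (with replacement) per tree
    random_state=0,
)
clf.fit(X, y)
```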
Question 1:
I have a two-class classification problem where the positive samples reside in clusters. A traditional cross-validation approach is not aware of this and splits data
points from the same cluster between the training and test sets, giving rise to artificially strong classification performance. I wrote a custom cross-validation loop to address this issue. However,
the bootstrapping inside the Random Forest algorithm randomly selects samples and features, which helps control overfitting.
When it applies the fit method to the randomly selected samples, does it do an internal cross-validation to prevent overfitting? I did not find this in the GitHub code.
If yes, can I specify my groupings to Random Forest?
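For reference, my custom cross-validation loop does roughly the following (a sketch using scikit-learn's GroupKFold; the `groups` array of cluster labels is a stand-in for my real cluster assignments):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Hypothetical cluster labels: samples sharing a group id must never be
# split between the training and the test fold.
groups = np.repeat(np.arange(12), 10)

scores = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
```

This keeps whole clusters on one side of each split, which is the grouping constraint I would like the forest's internal sampling to respect as well.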
Question 2:
Gini impurity at each node tries to find the best separation between the two classes. I care more about obtaining a clean separation for my positive class. Is there
any way to give more importance to one class during the partitioning?
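Concretely, scikit-learn's class_weight parameter looks like the kind of control I have in mind (a sketch; the weights here are arbitrary, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy data: the minority class 1 plays the role of my positives.
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# class_weight scales each sample's contribution to the impurity computation,
# so splits that purify the up-weighted (positive) class are preferred.
clf = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 1, 1: 5},  # hypothetical 5x emphasis on the positive class
    random_state=0,
)
clf.fit(X, y)
```

Is this the right mechanism, or is there a way to bias the split criterion itself?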
Thanks in advance.
Mamun