[Scikit-learn-general] Random forest probability calculation

Eve V. E. Kovacs

2016-05-20 22:51:34 UTC

Dear Scikit-learn gurus,
Sorry to spam the whole list but I couldn't find a better email for my
question regarding the results of the predict_proba method in the Random
Forest classifier.

I tried to reproduce the output of this method by following the description
given in the documentation. That is, I averaged over the class probabilities
for each tree in the forest. I computed the class probability
for each tree, for each object in my test data, by
first determining in which leaf of the tree my test datum landed. Then I set
the class probabilities equal to the fraction of objects in each class in the
training data that also landed in the same leaf.

For example, if my test datum landed in node 55 of tree #0,
and supposing that 10 objects from my training data also landed in node 55 of
tree #0, with 4 objects in the first class and 6 in the second, then the
probabilities for that tree would be [0.4, 0.6]. (And then I average these
probabilities over all the trees in the forest.)
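
The procedure described above can be sketched roughly as follows. This is only
an illustration under stated assumptions, not the code actually used for the
numbers in this message: the helper name leaf_fraction_proba, the synthetic
data from make_classification, and the settings n_estimators=10,
min_samples_leaf=5 and bootstrap=False are all made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def leaf_fraction_proba(forest, X_train, y_train, X_test):
    """Hypothetical helper: for every tree, take the class fractions of the
    training points that share a leaf with each test point, then average
    those fractions over all trees."""
    classes = forest.classes_
    train_leaves = forest.apply(X_train)  # leaf index per sample and tree
    test_leaves = forest.apply(X_test)
    n_trees = train_leaves.shape[1]
    proba = np.zeros((X_test.shape[0], len(classes)))
    for t in range(n_trees):
        for i, leaf in enumerate(test_leaves[:, t]):
            # Training points that landed in the same leaf of tree t.
            in_leaf = train_leaves[:, t] == leaf
            counts = np.array([(y_train[in_leaf] == c).sum() for c in classes])
            proba[i] += counts / counts.sum()
    return proba / n_trees


# Synthetic stand-in data (assumption: the real data set is not shown here).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

# Assumption: bootstrap=False, so every tree is fitted on the full training set.
forest = RandomForestClassifier(
    n_estimators=10, min_samples_leaf=5, bootstrap=False, random_state=0
).fit(X_train, y_train)

manual = leaf_fraction_proba(forest, X_train, y_train, X_test)
print(np.allclose(manual, forest.predict_proba(X_test)))
```

Note the bootstrap=False assumption. With the default bootstrap=True, each tree
is fitted on a resampled copy of the training data, so fractions counted over
the full training set are not guaranteed to match what the tree stores in its
leaves.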

Unfortunately, the answers that I get for the probabilities from the above
algorithm and the results of predict_proba don't agree.
For example, for 4 objects in my test data I get the following probabilities:
[ 0.99718369 0.00281631]
[ 0.99711619 0.00288381]
[ 0.99680974 0.00319026]
[ 0.55153962 0.44846038]

but predict_proba gives

[1.0 0.0]
[1.0 0.0]
[1.0 0.0]
[0.4 0.6]

Can anyone please tell me what I am doing wrong? I have checked the source code
and the averaging step seems to be correct. I must be misinterpreting how to
compute the class probabilities.
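
One way to check the averaging step in isolation is to average the
predict_proba output of the individual trees in forest.estimators_ and compare
it with the forest's own predict_proba. The sketch below does this on
synthetic data; the data set and the settings (n_estimators=10, default
bootstrap) are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: the real data set is not shown here).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

# Per-tree class probabilities, stacked to shape (n_trees, n_test, n_classes).
per_tree = np.stack([tree.predict_proba(X_test) for tree in forest.estimators_])

# Plain mean over the tree axis, compared with the forest's own output.
averaged = per_tree.mean(axis=0)
averaging_matches = np.allclose(averaged, forest.predict_proba(X_test))
print(averaging_matches)
```

If this check passes while the leaf-fraction numbers still differ, the
difference would have to come from how the per-tree probabilities are obtained
rather than from the averaging.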

Thanks
Eve

***************************************************************
Eve Kovacs
Argonne National Laboratory,
Room L-177, Bldg. 360, HEP
9700 S. Cass Ave.
Argonne, IL 60439 USA
Phone: (630)-252-6208
Fax: (630)-252-5047
email: ***@anl.gov
***************************************************************