Oh, I see.
I think random forest is just a different approach … I would say that xgboost is kind of a hybrid algorithm borrowing ideas from random forests and boosting. Random forests, AdaBoost, xgboost, etc. are just different algorithms (like logistic regression, SVMs, and multi-layer perceptrons are different). What I was trying to say is that I wouldn’t fundamentally change the random forest algorithm in scikit-learn using ideas from xgboost, since it wouldn’t be a random forest anymore. Please don’t get me wrong, I’d also like to see a more efficient implementation (in terms of predictive and/or computational performance), but I think that it should be a separate estimator, not a modification of the random forest itself.
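To make the "separate estimator" point a bit more concrete, here is a minimal sketch, assuming synthetic data from make_classification and arbitrary hyperparameters (none of this comes from the benchmark under discussion). scikit-learn already exposes a bagging-based RandomForestClassifier and a boosting-based GradientBoostingClassifier as distinct estimators behind the same fit/predict_proba interface, so their AUC can be compared side by side without touching the random forest itself:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption, for illustration only).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Two different algorithms, two different estimators, one shared interface.
estimators = {
    "random forest (bagging)": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=100, random_state=0
    ),
}

auc_scores = {}
for name, estimator in estimators.items():
    estimator.fit(X_train, y_train)
    # Probability of the positive class is what roc_auc_score expects.
    positive_proba = estimator.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, positive_proba)
    print(f"{name}: AUC = {auc_scores[name]:.3f}")
```

Which of the two scores higher depends on the data and the settings, so a toy run like this says nothing about the benchmark itself; it only shows that a boosting variant can sit next to the random forest as its own estimator.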
Post by Sebastian Raschka
Post by Raphael C
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
Do you mean in terms of predictive performance (not computational efficiency)? Not sure what others think, but I wouldn't change the core algorithm, since otherwise it's not really a "random forest" anymore as it is described in the literature -- and that would be very confusing for users and researchers.
I really meant just to ask the question: what is preventing the scikit-learn random forest implementation from a) scaling as well as xgboost and H2O and b) getting as good an AUC?
If the answer is that this is fundamentally the limit of bagging random forests (and that xgboost and H2O both implement boosting or something else that scales and performs better), then that is already very interesting.
Raphael
Post by Raphael C
Post by Gael Varoquaux
- In tree-based: Not handling categorical variables as such hurts us a lot
https://github.com/scikit-learn/scikit-learn/pull/4899
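As a rough illustration of the categorical-variable point quoted above: the tree estimators in scikit-learn expect numeric input, so a categorical column has to be encoded before the forest sees it. The sketch below assumes a made-up DataFrame (the column names carrier and distance and the target rule are hypothetical) and uses one-hot encoding as a workaround; it is not native categorical splitting, which is what the quoted comment is about.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
n_rows = 500

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame(
    {
        "carrier": rng.choice(["AA", "DL", "UA", "WN"], size=n_rows),
        "distance": rng.uniform(100, 3000, size=n_rows),
    }
)
# Hypothetical target that depends on both columns.
y = (df["carrier"].isin(["AA", "WN"]) & (df["distance"] > 1000)).astype(int)

# One-hot encode the categorical column, pass the numeric one through as-is.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["carrier"])],
    remainder="passthrough",
)
model = make_pipeline(
    preprocess, RandomForestClassifier(n_estimators=50, random_state=0)
)
model.fit(df, y)

# Four one-hot columns plus one numeric column reach the forest.
n_encoded = model.named_steps["columntransformer"].transform(df).shape[1]
train_accuracy = model.score(df, y)
print(n_encoded, train_accuracy)
```

With one-hot encoding, each split can only separate one category from the rest, and the number of columns grows with the number of levels, which is one reason this workaround can cost both accuracy and speed on high-cardinality data.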
This is a conversation moved from
https://github.com/scikit-learn/scikit-learn/pull/4899 .
In the light of the comment above and comments in the PR, I was
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
http://datascience.la/benchmarking-random-forest-implementations/ .
Raphael
------------------------------------------------------------------------------
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general