Discussion:
[Scikit-learn-general] Comparisons of classifiers
Raphael C
2015-11-05 07:05:11 UTC
I don't know if this has been widely seen, but there is an interesting
comparison of classifiers from different machine learning libraries
at:

https://github.com/szilard/benchm-ml

The upshot is that in some cases it seems that the scikit-learn
versions have room for improvement. I don't know how many of these
issues have already been fixed since the benchmarks were run. The web
page also offers the benchmark code to download at

https://github.com/szilard/benchm-ml/tree/master/z-other-tools

which is potentially helpful.

Raphael

------------------------------------------------------------------------------
Gael Varoquaux
2015-11-05 13:38:58 UTC
Post by Raphael C
https://github.com/szilard/benchm-ml
The upshot is that in some cases it seems that the scikit-learn
versions have room for improvement.
The main lessons that I can see from those results are:

* Linear models (aka LogisticRegression) don't scale very well:

  - The page benchmarks the default solver, which is liblinear.
    I would be very curious to see how the other solvers (Newton and
    SAG) fare on this dataset; a minimal sketch of such a comparison
    follows below this list.
    It would be useful to introduce a 'solver="auto"' for logistic
    regression, based on heavy benchmarks and heuristics.
    I have created an issue about this, to discuss whether we want to do it:
    https://github.com/scikit-learn/scikit-learn/issues/5736

  - Having fused types to avoid increased memory usage would be useful.
    For this we first need to finish adding Cython as a build dependency:
    https://github.com/scikit-learn/scikit-learn/pull/5492

* In tree-based models, not handling categorical variables as such
  hurts us a lot.
  There's a PR to fix that; it still needs a bit of love:
  https://github.com/scikit-learn/scikit-learn/pull/4899
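
For anyone who wants to try this locally, a minimal sketch of such a
solver comparison (on synthetic data rather than the benchmark's
airline dataset, and assuming a scikit-learn recent enough to ship the
'sag' and 'newton-cg' solvers):

    from time import time

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in for a large benchmark dataset.
    X, y = make_classification(n_samples=100000, n_features=100,
                               random_state=0)

    for solver in ("liblinear", "newton-cg", "sag"):
        clf = LogisticRegression(solver=solver, max_iter=200)
        t0 = time()
        clf.fit(X, y)
        print("%-10s %.2fs  train accuracy: %.4f"
              % (solver, time() - t0, clf.score(X, y)))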

Gaël

------------------------------------------------------------------------------
Andreas Mueller
2015-11-05 16:14:42 UTC
Comparing VW and liblinear seems pretty meaningless (and calling
liblinear "Python" is also odd).

It's clear that there are faster GBM packages (and, it seems, faster
random forests for some settings of the parameters).
We recently had some improvements to the trees and it would be
interesting to benchmark again.
Still, this is only a single dataset, and not really a serious attempt
at a benchmark (which is really hard).
It would be awesome to have real benchmarks on something like openml.org
Post by Gael Varoquaux
Post by Raphael C
https://github.com/szilard/benchm-ml
The upshot is that in some cases it seems that the scikit-learn
versions have room for improvement.
- The page benchmarks the default solver, which is liblinear.
I would be very curious to see how the other solvers (Newton and
SAG) fare on this dataset.
It would be useful to introduce a 'solver="auto"' for logistic
regression, based on heavy benchmarks and heuristics.
https://github.com/scikit-learn/scikit-learn/issues/5736
- Having fused types to avoid increased memory would be useful.
https://github.com/scikit-learn/scikit-learn/pull/5492
* In tree-based models, not handling categorical variables as such hurts us a lot.
https://github.com/scikit-learn/scikit-learn/pull/4899
Gaël
------------------------------------------------------------------------------
Raphael C
2015-11-08 16:32:43 UTC
On 5 November 2015 at 13:38, Gael Varoquaux wrote:
Post by Gael Varoquaux
Post by Raphael C
https://github.com/szilard/benchm-ml
The upshot is that in some cases it seems that the scikit-learn
versions have room for improvement.
- The page benchmarks the default solver, which is liblinear.
I would be very curious to see how the other solvers (Newton and
SAG) fare on this dataset.
It would be useful to introduce a 'solver="auto"' for logistic
regression, based on heavy benchmarks and heuristics.
https://github.com/scikit-learn/scikit-learn/issues/5736
- Having fused types to avoid increased memory would be useful.
https://github.com/scikit-learn/scikit-learn/pull/5492
* In tree-based models, not handling categorical variables as such hurts us a lot.
https://github.com/scikit-learn/scikit-learn/pull/4899
Thank you for this very helpful reply. One perhaps naive question:
why does not handling categorical variables hurt us a lot?

In terms of computational efficiency, one-hot encoding combined with
the support for sparse feature vectors seems to work well, at least
for me. I assume the problem must therefore be in terms of
classification accuracy. Is that right, and if so, why?
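
For concreteness, the pattern I mean, as a minimal sketch using
DictVectorizer (which one-hot encodes string-valued features and
returns a scipy sparse matrix):

    from sklearn.feature_extraction import DictVectorizer

    # Each sample is a dict; string values are treated as categories.
    rows = [{"color": "red",   "size": 3.0},
            {"color": "green", "size": 1.5},
            {"color": "blue",  "size": 2.0}]

    vec = DictVectorizer(sparse=True)
    X = vec.fit_transform(rows)   # sparse matrix, one column per category
    print(vec.get_feature_names())
    # ['color=blue', 'color=green', 'color=red', 'size']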

Raphael

------------------------------------------------------------------------------
Sebastian Raschka
2015-11-08 17:50:59 UTC
Post by Raphael C
In terms of computational efficiency, one-hot encoding combined with
the support for sparse feature vectors seems to work well, at least
for me. I assume the problem must therefore be in terms of
classification accuracy.
One thing comes to mind regarding the different solvers for the linear
models: e.g., Newton's method is O(n * d^2), and even gradient descent
is O(n * d).

For decision trees, I don't see a substantial difference in terms of
computational complexity if a categorical feature, let's say one that
can take 4 values, is split into 4 binary questions (i.e., using
one-hot encoding). On the other hand, I think the problem is that the
decision algorithm does not know that these 4 binary questions
"belong" to one feature, which could make the decision tree grow much
larger in depth and width; this is bad for computational efficiency
and would more likely produce trees with higher variance.

I'd be curious how to handle categorical feature columns
implementation-wise, though. I think additional parameters in the
method call would be necessary (e.g., .fit(categorical=(1, 4, 19),
nominal=(1, 4))) to distinguish ordinal from nominal variables?
Or, alternatively, I think this would be a good use-case for numpy's
structured arrays?
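
Just to make the structured-array idea concrete, a minimal sketch of
how column dtypes could carry that information (note the
.fit(categorical=..., nominal=...) signature above is hypothetical,
not an existing scikit-learn API):

    import numpy as np

    # One string-valued (nominal) column and two numeric columns.
    X = np.array([("red",   1.5, 3),
                  ("green", 2.0, 1),
                  ("blue",  0.5, 2)],
                 dtype=[("color", "U10"), ("size", "f8"), ("rank", "i4")])

    # An estimator could inspect the dtype to decide which columns to
    # treat as categorical (string-kind columns) vs. numeric ones.
    categorical = [name for name in X.dtype.names
                   if X.dtype[name].kind in ("U", "S")]
    print(categorical)   # ['color']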



------------------------------------------------------------------------------
Gael Varoquaux
2015-11-08 18:37:39 UTC
Newton is never d**2 in practice, because everybody uses a truncated
Newton: each Newton step is solved only approximately with conjugate
gradient, using Hessian-vector products that each cost O(n * d). So it
is in effect linear in d.

Gaël

Sent from my phone. Please forgive brevity and misspellings.
Post by Sebastian Raschka
Post by Raphael C
In terms of computational efficiency, one-hot encoding combined with
the support for sparse feature vectors seems to work well, at least
for me. I assume the problem must therefore be in terms of
classification accuracy.
One thing comes to mind regarding the different solvers for the linear
models: e.g., Newton's method is O(n * d^2), and even gradient descent
is O(n * d).
For decision trees, I don't see a substantial difference in terms of
computational complexity if a categorical feature, let's say one that
can take 4 values, is split into 4 binary questions (i.e., using
one-hot encoding). On the other hand, I think the problem is that the
decision algorithm does not know that these 4 binary questions
"belong" to one feature, which could make the decision tree grow much
larger in depth and width; this is bad for computational efficiency
and would more likely produce trees with higher variance.
I'd be curious how to handle categorical feature columns
implementation-wise, though. I think additional parameters in the
method call would be necessary (e.g., .fit(categorical=(1, 4, 19),
nominal=(1, 4))) to distinguish ordinal from nominal variables?
Or, alternatively, I think this would be a good use-case for numpy's
structured arrays?
------------------------------------------------------------------------------
Raphael C
2015-11-08 19:13:22 UTC
Post by Sebastian Raschka
Post by Raphael C
In terms of computational efficiency, one-hot encoding combined with
the support for sparse feature vectors seems to work well, at least
for me. I assume the problem must therefore be in terms of
classification accuracy.
One thing comes to mind regarding the different solvers for the linear models: e.g., Newton's method is O(n * d^2), and even gradient descent is O(n * d).
For decision trees, I don't see a substantial difference in terms of computational complexity if a categorical feature, let's say one that can take 4 values, is split into 4 binary questions (i.e., using one-hot encoding). On the other hand, I think the problem is that the decision algorithm does not know that these 4 binary questions "belong" to one feature, which could make the decision tree grow much larger in depth and width; this is bad for computational efficiency and would more likely produce trees with higher variance.
I am unclear what difference it makes for decision trees myself. I am
no expert on the construction algorithms, but I assume that they would
never split on a feature which depends 100% on a parent node, as one
branch would just be empty. If that is right, it seems the decision
tree should not grow much larger. It might take more time, I suppose,
for the construction algorithm to work this out, of course.

It would be great if anyone had a concrete example where it made a
difference for a decision tree (or any classifier which uses decision
trees).

Raphael

------------------------------------------------------------------------------
Sebastian Raschka
2015-11-08 20:42:02 UTC
Hm, I have to think about this more. But another case where I think that the handling of categorical features could be useful is in non-binary trees; not necessarily while learning but in making predictions more efficiently. E.g., assuming 3 classes that are perfectly separable by a "color" attribute:



         color
        /  |  \
     red green blue


vs.


    color=red?
     /      \
   yes    color=green?
            /      \
          yes    color=blue?
                   /    \
                 yes     ...


Also, I think one other problem with one-hot encoding is random forests. Let's say you have a dataset of 5 features: 4 numerical features and 1 categorical feature, and the categorical variable has, let's say, 30 possible values. After one-hot encoding you have 34 features, and the majority of the features each tree gets to see are just the different "flavors" of the categorical variable -- so you basically build a random forest that effectively only "considers" one of the variables in the training set, if I am not missing anything here.
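
A quick back-of-the-envelope simulation of that dilution effect
(assuming max_features=sqrt(n_features), scikit-learn's default for
classification forests):

    import numpy as np

    rng = np.random.RandomState(0)
    n_numeric, n_onehot = 4, 30
    n_features = n_numeric + n_onehot            # 34 after encoding
    max_features = int(np.sqrt(n_features))      # 5 candidates per split

    # Fraction of splits where no numeric feature is even a candidate.
    trials = 20000
    draws = np.array([rng.choice(n_features, max_features, replace=False)
                      for _ in range(trials)])
    only_onehot = np.mean((draws >= n_numeric).all(axis=1))
    print("splits with only one-hot candidates: %.1f%%"
          % (100 * only_onehot))
    # ~51%, i.e. C(30,5)/C(34,5): about half the splits never even
    # consider any of the 4 numerical features.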
Post by Raphael C
Post by Sebastian Raschka
Post by Raphael C
In terms of computational efficiency, one-hot encoding combined with
the support for sparse feature vectors seems to work well, at least
for me. I assume therefore
the problem must be in terms of classification accuracy.
One thing comes to mind regarding the different solvers for the linear models: e.g., Newton's method is O(n * d^2), and even gradient descent is O(n * d).
For decision trees, I don't see a substantial difference in terms of computational complexity if a categorical feature, let's say one that can take 4 values, is split into 4 binary questions (i.e., using one-hot encoding). On the other hand, I think the problem is that the decision algorithm does not know that these 4 binary questions "belong" to one feature, which could make the decision tree grow much larger in depth and width; this is bad for computational efficiency and would more likely produce trees with higher variance.
I am unclear what difference it makes for decision trees myself. I am
no expert on the construction algorithms, but I assume that they would
never split on a feature which depends 100% on a parent node, as one
branch would just be empty. If that is right, it seems the decision
tree should not grow much larger. It might take more time, I suppose,
for the construction algorithm to work this out, of course.
It would be great if anyone had a concrete example where it made a
difference for a decision tree (or any classifier which uses decision
trees).
Raphael
------------------------------------------------------------------------------
Raphael C
2015-11-09 07:30:53 UTC
Post by Sebastian Raschka
         color
        /  |  \
     red green blue

vs.

    color=red?
     /      \
   yes    color=green?
            /      \
          yes    color=blue?
                   /    \
                 yes     ...
Also, I think one other problem with one-hot encoding is random forests. Let's say you have a dataset of 5 features: 4 numerical features and 1 categorical feature, and the categorical variable has, let's say, 30 possible values. After one-hot encoding you have 34 features, and the majority of the features each tree gets to see are just the different "flavors" of the categorical variable -- so you basically build a random forest that effectively only "considers" one of the variables in the training set, if I am not missing anything here.
Your second point is particularly strong. You are right that one-hot
encoding could massively overemphasise the importance of categorical
features with many categories under all sorts of regularisation
schemes (including the method used by random forests).

I look forward to https://github.com/scikit-learn/scikit-learn/pull/4899 now :)

Raphael
Raphael C
2016-03-22 11:52:42 UTC
Post by Gael Varoquaux
* In tree-based models, not handling categorical variables as such hurts us a lot.
https://github.com/scikit-learn/scikit-learn/pull/4899
This is a conversation moved from
https://github.com/scikit-learn/scikit-learn/pull/4899 .

In light of the comment above and the comments in the PR, I was
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
http://datascience.la/benchmarking-random-forest-implementations/ .
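
(For anyone who wants to poke at this locally: a minimal timing sketch
on synthetic data, assuming xgboost and its scikit-learn wrapper are
installed; the benchmark above uses the airline dataset and also
measures AUC on held-out data:)

    from time import time

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=100000, n_features=20,
                               random_state=0)

    models = [("sklearn RF", RandomForestClassifier(n_estimators=100,
                                                    n_jobs=-1,
                                                    random_state=0)),
              ("xgboost", XGBClassifier(n_estimators=100))]

    for name, clf in models:
        t0 = time()
        clf.fit(X, y)
        print("%-10s %.1fs" % (name, time() - t0))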

Raphael
Sebastian Raschka
2016-03-25 21:25:27 UTC
Post by Raphael C
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
Do you mean in terms of predictive performance (not computational efficiency)? Not sure what others think, but I wouldn't change the core algorithm, since otherwise it's not really a "random forest" anymore as it is described in the literature -- and that would be very confusing for users and researchers.
Post by Raphael C
Post by Gael Varoquaux
* In tree-based models, not handling categorical variables as such hurts us a lot.
https://github.com/scikit-learn/scikit-learn/pull/4899
This is a conversation moved from
https://github.com/scikit-learn/scikit-learn/pull/4899 .
In light of the comment above and the comments in the PR, I was
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
http://datascience.la/benchmarking-random-forest-implementations/ .
Raphael
------------------------------------------------------------------------------
Sebastian Raschka
2016-03-25 21:30:28 UTC
PS: What I meant by "core algorithm" was something like changing random forests from bagging to boosting (as in xgboost).
Post by Sebastian Raschka
Post by Raphael C
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H20 at
Do you mean in terms of predictive performance (not computational efficiency)? Not sure what others think, but I wouldn't change the core algorithm, since otherwise it's not really a "random forest" anymore as it is described in the literature -- and that would be very confusing for users and researchers.
Post by Raphael C
Post by Gael Varoquaux
* In tree-based models, not handling categorical variables as such hurts us a lot.
https://github.com/scikit-learn/scikit-learn/pull/4899
This is a conversation moved from
https://github.com/scikit-learn/scikit-learn/pull/4899 .
In light of the comment above and the comments in the PR, I was
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
http://datascience.la/benchmarking-random-forest-implementations/ .
Raphael
Raphael C
2016-03-26 20:52:47 UTC
Post by Sebastian Raschka
Post by Raphael C
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
Do you mean in terms of predictive performance (not computational
efficiency)? Not sure what others think, but I wouldn't change the
core algorithm, since otherwise it's not really a "random forest"
anymore as it is described in the literature -- and that would be very
confusing for users and researchers.
I really meant just to ask the question: what is preventing the
scikit-learn random forest implementation from a) scaling as well as
xgboost and H2O, and b) getting as good an AUC?

If the answer is that this is fundamentally the limit of bagging
random forests (and that xgboost and H2O both implement boosting or
something else that scales and performs better), then that is already
very interesting.
Raphael
Post by Sebastian Raschka
Post by Raphael C
Post by Gael Varoquaux
* In tree-based models, not handling categorical variables as such
hurts us a lot.
https://github.com/scikit-learn/scikit-learn/pull/4899
This is a conversation moved from
https://github.com/scikit-learn/scikit-learn/pull/4899 .
In the light of the comment above and comments in the PR, I was
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
http://datascience.la/benchmarking-random-forest-implementations/ .
Raphael
------------------------------------------------------------------------------
Sebastian Raschka
2016-03-26 21:31:36 UTC
Oh, I see.

I think random forest is just a different approach. I would say that xgboost is kind of a hybrid algorithm borrowing ideas from random forests and boosting. Random forests, AdaBoost, xgboost, etc. are just different algorithms (like logistic regression, SVMs, and multi-layer perceptrons are different). What I was trying to say is that I wouldn't fundamentally change the random forest algorithm in scikit-learn using ideas from xgboost, since it wouldn't be a random forest anymore. Please don't get me wrong, I'd also like to see something more efficient (in predictive and/or computational performance), but I think that it should be a separate estimator, not a modification of the random forest itself.
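
(To illustrate the "separate estimator" point: scikit-learn already
ships boosting as its own estimator alongside the random forest, e.g.:)

    from sklearn.ensemble import (GradientBoostingClassifier,
                                  RandomForestClassifier)

    # Bagging of randomized trees: the classic random forest.
    rf = RandomForestClassifier(n_estimators=100)

    # Stage-wise boosting of shallow trees: a different algorithm,
    # exposed as a separate estimator rather than an RF option.
    gbt = GradientBoostingClassifier(n_estimators=100)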
Post by Sebastian Raschka
Post by Raphael C
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
Do you mean in terms of predictive performance (not computational efficiency)? Not sure what others think, but I wouldn't change the core algorithm, since otherwise it's not really a "random forest" anymore as it is described in the literature -- and that would be very confusing for users and researchers.
I really meant just to ask the question: what is preventing the scikit-learn random forest implementation from a) scaling as well as xgboost and H2O, and b) getting as good an AUC?
If the answer is that this is fundamentally the limit of bagging random forests (and that xgboost and H2O both implement boosting or something else that scales and performs better), then that is already very interesting.
Raphael
Post by Raphael C
Post by Gael Varoquaux
* In tree-based models, not handling categorical variables as such hurts us a lot.
https://github.com/scikit-learn/scikit-learn/pull/4899
This is a conversation moved from
https://github.com/scikit-learn/scikit-learn/pull/4899 .
In light of the comment above and the comments in the PR, I was
wondering what changes are needed to make
RandomForestClassifier competitive with xgboost and H2O at
http://datascience.la/benchmarking-random-forest-implementations/ .
Raphael
------------------------------------------------------------------------------
Gael Varoquaux
2016-04-13 06:09:44 UTC
Post by Sebastian Raschka
I wouldn't fundamentally change the random forest algorithm in scikit-learn using ideas from xgboost, since it wouldn't be a random forest anymore. Please don't get me wrong, I'd also like to see something more efficient (in predictive and/or computational performance), but I think that it should be a separate estimator, not a modification of the random forest itself.
+1
