Discussion:
Default value n_estimators for GBRT
(too old to reply)
Olivier Grisel
2012-03-25 01:49:39 UTC
Permalink
Hi all,

I have been playing a bit with GradientBoostingClassifier and
AdaBoostClassifier and ExtraTrees and while extra trees and adaboost
are reasonably fast to fit with there default params (n_estimators=10)
on a non toy dataset such as the olivetti faces dataset, the
GradientBoostingClassifier was taking ages (I killed it).

The current default value is n_estimators=100 for
GradientBoostingClassifier. Maybe it should be aligned to
n_estimators=10 as in the other ensemble methods of the scikit?

Or was I doing something very stupid by naively running it with the
default params on a dataset with size n_samples=400, n_features=4096
and n_classes=40 without any kind of preprocessing?

Another way to rephrase that question: what is the typical sweet spot
for the dataset shape when doing classification Gradient Boosted
Trees? What are reasonable values for the number of estimators in
various application domains?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gilles Louppe
2012-03-25 10:16:25 UTC
Permalink
Hi Olivier,

The higher the number of estimators, the better. The more random the
trees (e.g., the lower max_features), the more important it usually is
to have a large forest to decrease the variance. To me, 10 is actually
a very low default value. In my daily research, I deal with hundreds
of trees. But yeah, it also takes longer.

By the way I am curious, what kind of dataset are you testing those
methods on? :)

Gilles
Post by Olivier Grisel
Hi all,
I have been playing a bit with GradientBoostingClassifier and
AdaBoostClassifier and ExtraTrees and while extra trees and adaboost
are reasonably fast to fit with there default params (n_estimators=10)
on a non toy dataset such as the olivetti faces dataset, the
GradientBoostingClassifier was taking ages (I killed it).
The current default value is n_estimators=100 for
GradientBoostingClassifier. Maybe it should be aligned to
n_estimators=10 as in the other ensemble methods of the scikit?
Or was I doing something very stupid by naively running it with the
default params on a dataset with size n_samples=400, n_features=4096
and n_classes=40 without any kind of preprocessing?
Another way to rephrase that question: what is the typical sweet spot
for the dataset shape when doing classification Gradient Boosted
Trees? What are reasonable values for the number of estimators in
various application domains?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Peter Prettenhofer
2012-03-25 10:44:11 UTC
Permalink
Olivier,

In my experience GBRT usually requires more base learners than random
forests to get the same level of accuracy. I hardly use less than 100.
Regarding the poor performance of GBRT on the olivetti dataset:
multi-class GBRT fits ``k`` trees at each stage, thus, if you have
``n_estimators`` this means you have to grow ``k * n_estimators``
trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
used multi-class GBRT much (part of the reason is that GBM does not
support it) - I know that the learning to rank folks use multi-class
GBRT for ordinal scaled output values (e.g. "not-relevant",
"relevant", "highly relevant") but these involve usually less than 5
classes.

That said, the major drawback of GBRT is computational complexity: we
could speed up multi-class GBRT by training the ``k`` trees at each
stage in parallel but still it is much less efficient than random
forests. Maybe Scott (on CC) can comment on this as well - he has
worked on the multi-class support in GBRT and knows much more about
it.

@Olivier: it would be great if you could send me your benchmark script
so that I can look into the issue in more detail.

thanks,
Peter
Post by Gilles Louppe
Hi Olivier,
The higher the number of estimators, the better. The more random the
trees (e.g., the lower max_features), the more important it usually is
to have a large forest to decrease the variance. To me, 10 is actually
a very low default value. In my daily research, I deal with hundreds
of trees. But yeah, it also takes longer.
By the way I am curious, what kind of dataset are you testing those
methods on? :)
Gilles
Post by Olivier Grisel
Hi all,
I have been playing a bit with GradientBoostingClassifier and
AdaBoostClassifier and ExtraTrees and while extra trees and adaboost
are reasonably fast to fit with there default params (n_estimators=10)
on a non toy dataset such as the olivetti faces dataset, the
GradientBoostingClassifier was taking ages (I killed it).
The current default value is n_estimators=100 for
GradientBoostingClassifier. Maybe it should be aligned to
n_estimators=10 as in the other ensemble methods of the scikit?
Or was I doing something very stupid by naively running it with the
default params on a dataset with size n_samples=400, n_features=4096
and n_classes=40 without any kind of preprocessing?
Another way to rephrase that question: what is the typical sweet spot
for the dataset shape when doing classification Gradient Boosted
Trees? What are reasonable values for the number of estimators in
various application domains?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Peter Prettenhofer
Peter Prettenhofer
2012-03-25 10:55:30 UTC
Permalink
Post by Peter Prettenhofer
Post by Olivier Grisel
Post by Olivier Grisel
[..]
Another way to rephrase that question: what is the typical sweet spot
for the dataset shape when doing classification Gradient Boosted
Trees? What are reasonable values for the number of estimators in
various application domains?
GBRT is relatively robust w.r.t. overfitting if you use shrinkage (
see the docs). The choice of the number of estimators is mainly
governed by the available computational resources. As usual you have
to balance between training vs. test time performance - if you don't
care about test time performance you should use as much base learners
as you can afford**. If test time performance matters (e.g. latency
constraints in a web app) things look different :-)

** I remember that a student of mine once used more than 10K base
learners for the yahoo learning to rank competition.
Post by Peter Prettenhofer
Post by Olivier Grisel
Post by Olivier Grisel
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Peter Prettenhofer
--
Peter Prettenhofer
Olivier Grisel
2012-03-25 13:40:52 UTC
Permalink
Le 25 mars 2012 12:44, Peter Prettenhofer
Post by Gilles Louppe
Olivier,
In my experience GBRT usually requires more base learners than random
forests to get the same level of accuracy. I hardly use less than 100.
multi-class GBRT fits ``k`` trees at each stage, thus, if you have
``n_estimators`` this means you have to grow ``k * n_estimators``
trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
used multi-class GBRT much (part of the reason is that GBM does not
support it) - I know that the learning to rank folks use multi-class
GBRT for ordinal scaled output values (e.g. "not-relevant",
"relevant", "highly relevant") but these involve usually less than 5
classes.
Interesting I think this kind of practical considerations should be
added to the docs.
Post by Gilles Louppe
That said, the major drawback of GBRT is computational complexity: we
could speed up multi-class GBRT by training the ``k`` trees at each
stage in parallel but still it is much less efficient than random
forests. Maybe Scott (on CC) can comment on this as well - he has
worked on the multi-class support in GBRT and knows much more about
it.
@Olivier: it would be great if you could send me your benchmark script
so that I can look into the issue in more detail.
from sklearn.ensemble import *
from sklearn.datasts import fetch_olivetti_faces
from sklearn.cross_validation import cross_val_score
olivetti = fetch_olivetti_faces
cross_val_score(ExtraTreesClassifier(), olivetti.data, olivetti.target)
# should yield 3 scores around 0.85 in 10s
Post by Gilles Louppe
cross_val_score(GradientBoostingClassifier(), olivetti.data, olivetti.target)
# was too long to complete on my laptop

But indeed using GBRT on a 40 classes dataset is stupid in lights of
what you explained.

BTW: this paper on a workshop about the results of the Yahoo Learning
to Rank challenge is comparing GBRT and RF and their computational
complexity: very interesting (I read it after sending my question on
the mailing list...):

http://jmlr.csail.mit.edu/proceedings/papers/v14/mohan11a/mohan11a.pdf

BTW Learning-to-Rank seems to be a very important application domain
that we do not cover well in scikit-learn. I think it would be great
to provide a dataset loader + maybe some sample feature extractor or
example script for point-wise & pair-wise setups (and maybe list-wise
too). I wonder if it would be possible to make a small dataset
excerpt or generator suitable as a short example with fast execution
when building the doc with a CLI switch to use a larger chunk the
Yahoo or MSLR datasets for running the example in a more realistic
setting.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2012-03-25 13:47:04 UTC
Permalink
Post by Olivier Grisel
BTW Learning-to-Rank seems to be a very important application domain
that we do not cover well in scikit-learn.
Yes, @fabianp is working on things in this flavour currently. I think
that he needs a bit of cheering to integrate these in the scikit :).
Olivier Grisel
2012-03-25 13:51:18 UTC
Permalink
Post by Gael Varoquaux
Post by Olivier Grisel
BTW Learning-to-Rank seems to be a very important application domain
that we do not cover well in scikit-learn.
that he needs a bit of cheering to integrate these in the scikit :).
BTW, I forgot to mention that @pprett already provided the following
script for the Yahoo data in the comments of his PR:

https://gist.github.com/2027241
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Peter Prettenhofer
2012-03-25 15:09:48 UTC
Permalink
Post by Olivier Grisel
Le 25 mars 2012 12:44, Peter Prettenhofer
Post by Gilles Louppe
Olivier,
In my experience GBRT usually requires more base learners than random
forests to get the same level of accuracy. I hardly use less than 100.
multi-class GBRT fits ``k`` trees at each stage, thus, if you have
``n_estimators`` this means you have to grow ``k * n_estimators``
trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
used multi-class GBRT much (part of the reason is that GBM does not
support it) - I know that the learning to rank folks use multi-class
GBRT for ordinal scaled output values (e.g. "not-relevant",
"relevant", "highly relevant") but these involve usually less than 5
classes.
Interesting I think this kind of practical considerations should be
added to the docs.
Absolutely - I'll add them immediately.
Post by Olivier Grisel
Post by Gilles Louppe
That said, the major drawback of GBRT is computational complexity: we
could speed up multi-class GBRT by training the ``k`` trees at each
stage in parallel but still it is much less efficient than random
forests. Maybe Scott (on CC) can comment on this as well - he has
worked on the multi-class support in GBRT and knows much more about
it.
@Olivier: it would be great if you could send me your benchmark script
so that I can look into the issue in more detail.
from sklearn.ensemble import *
from sklearn.datasts import fetch_olivetti_faces
from sklearn.cross_validation import cross_val_score
olivetti = fetch_olivetti_faces
cross_val_score(ExtraTreesClassifier(), olivetti.data, olivetti.target)
# should yield 3 scores around 0.85 in 10s
Post by Gilles Louppe
cross_val_score(GradientBoostingClassifier(), olivetti.data, olivetti.target)
# was too long to complete on my laptop
But indeed using GBRT on a 40 classes dataset is stupid in lights of
what you explained.
Unfortunately. As soon as I have some spare time I'll write a PR that
uses Joblib to parallelize tree induction in the case of multi-class
classification.
Post by Olivier Grisel
BTW: this paper on a workshop about the results of the Yahoo Learning
to Rank challenge is comparing GBRT and RF and their computational
complexity: very interesting (I read it after sending my question on
 http://jmlr.csail.mit.edu/proceedings/papers/v14/mohan11a/mohan11a.pdf
Thx - I remember the paper but I haven't read it in depth - The master
thesis of Ananth Mohan provides some implementation detail for
regression tree induction and gradient boosting - highly recommended!
Post by Olivier Grisel
BTW Learning-to-Rank seems to be a very important application domain
that we do not cover well in scikit-learn. I think it would be great
to provide a dataset loader + maybe some sample feature extractor or
example script for point-wise & pair-wise setups (and maybe list-wise
too).  I wonder if it would be possible to make a small dataset
excerpt or generator suitable as a short example with fast execution
when building the doc with a CLI switch to use a larger chunk the
Yahoo or MSLR datasets for running the example in a more realistic
setting.
Yeah - unfortunately one needs to register in order to download the
Yahoo LTRC dataset but I think we could use LETOR as a prove of
concept dataset. There are a number of issues though:

1. We need to support the query id (=``qid``) field in
``svmlight_loader``; Pair-wise approaches such as RankingSVMs need
this information to form example pairs. My personal experience is that
RankingSVMs do surprisingly poor on learning-to-rank problems - but
they are great for binary classification problems with highly skewed
class distributions [1] (in this case we also don't need qids) - so
they would be a great addition to sklearn. Point-wise approaches don't
take query affiliation into account so we don't need to expose the
qids.

2. We need ranking evaluation metrics such as (N)DCG, average
precision and precision at k - but these metrics need to take the
query ids into account. Thus, we need do modify our some of our
existing metrics.


BTW: We might implement a SGD-based ranking svm simply by creating a
new ``sklearn.utils.seq_dataset.Dataset`` subclass that implements
``next`` such that it samples a positive and a negative example from
the training set and creates a new feature vector on the fly - so we
don't need to create the cross-product of examples explicitly -
something along the lines of D. Sculleys work [2].


[1] D. Sculley et al. Detecting Adversarial Advertisements in the Wild, KDD 2011
[2] D. Sculley, Large-scale learning to rank, NIPS 2009,
http://www.eecs.tufts.edu/~dsculley/papers/large-scale-rank.pdf
--
Peter Prettenhofer
Olivier Grisel
2012-03-25 15:14:54 UTC
Permalink
Le 25 mars 2012 17:09, Peter Prettenhofer
Post by Peter Prettenhofer
Post by Olivier Grisel
Le 25 mars 2012 12:44, Peter Prettenhofer
Post by Gilles Louppe
Olivier,
In my experience GBRT usually requires more base learners than random
forests to get the same level of accuracy. I hardly use less than 100.
multi-class GBRT fits ``k`` trees at each stage, thus, if you have
``n_estimators`` this means you have to grow ``k * n_estimators``
trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
used multi-class GBRT much (part of the reason is that GBM does not
support it) - I know that the learning to rank folks use multi-class
GBRT for ordinal scaled output values (e.g. "not-relevant",
"relevant", "highly relevant") but these involve usually less than 5
classes.
Interesting I think this kind of practical considerations should be
added to the docs.
Absolutely - I'll add them immediately.
Great, thanks. Please use `n_classes` instead of `k` in the docstrings
or narrative doc.

And thanks for the other comments and references.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2012-03-25 15:28:33 UTC
Permalink
On Mon, Mar 26, 2012 at 12:09 AM, Peter Prettenhofer
 1. We need to support the query id (=``qid``) field in
``svmlight_loader``; Pair-wise approaches such as RankingSVMs need
this information to form example pairs. My personal experience is that
RankingSVMs do surprisingly poor on learning-to-rank problems - but
they are great for binary classification problems with highly skewed
class distributions [1] (in this case we also don't need qids) - so
they would be a great addition to sklearn. Point-wise approaches don't
take query affiliation into account so we don't need to expose the
qids.
I think that the qids are not a strict requirement but are useful to
form pairs belonging to the same qid only.
Considering all the possible pairs would be too expensive. In his
papers, D. Sculley use the term "query sharding" to refer to this kind
of heuristic.
2. We need ranking evaluation metrics such as (N)DCG,  average
precision and precision at k - but these metrics need to take the
query ids into account. Thus, we need do modify our some of our
existing metrics.
Since "query" is more of an informational retrieval term, I was
thinking we could use the word "group" instead.
BTW: We might implement a SGD-based ranking svm simply by creating a
new ``sklearn.utils.seq_dataset.Dataset`` subclass that implements
``next`` such that it samples a positive and a negative example from
the training set and creates a new feature vector on the fly - so we
don't need to create the cross-product of examples explicitly -
something along the lines of D. Sculleys work [2].
+1!

Along the same lines, sampling a positive and negative example every
two SGD steps optimizes AUC.

Mathieu
Scott White
2012-03-25 17:35:43 UTC
Permalink
Btw, another thing I think we should add is the ability to monitor the
out-of-bag estimates after each iteration and allow the fitting to be
terminated early. It's usually hard to guess the right number of
iterations required and if one can terminate the fitting early based
on good oob results that can help speed things up a lot.

cheers,
Scott
I have noticed also that GBRT can be quite slow even for small k.
Since GBRT fits n*k decision trees, by design, it's crucial that the
decision tree code be highly optimized. I did some quick performance
profiling the other day, which showed that the performance bottleneck
was spent in the method  _find_best_split.
Looking into that method I found what I think is a major inefficiency
that needs to be addressed. Nested within a for loop is a 'while True'
that iterates over the splitting points for a given feature to test
for the best split. One problem I see is that the splitting points are
not indexed ahead of time so a lot of time is wasted finding the same
splitting points over and over again. By allowing the splitting points
to be indexed ahead of time, I believe the performance bottleneck of
finding the splits would would be reduced by factor of O(m) to O(u).
This should have a marked improvement on datasets that have a lot of
repeating values. Of course, this would come at the additional cost of
storing those indexed splitting points.
n = # iterations
k = # classes
m = # samples
u = max # of unique values across all features
This is from a cursory inspection as I have not had enough time to
study this code as I would like so feel free to correct me if I'm
wrong here.
cheers,
Scott
On Sun, Mar 25, 2012 at 3:44 AM, Peter Prettenhofer
Post by Gilles Louppe
Olivier,
In my experience GBRT usually requires more base learners than random
forests to get the same level of accuracy. I hardly use less than 100.
multi-class GBRT fits ``k`` trees at each stage, thus, if you have
``n_estimators`` this means you have to grow ``k * n_estimators``
trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
used multi-class GBRT much (part of the reason is that GBM does not
support it) - I know that the learning to rank folks use multi-class
GBRT for ordinal scaled output values (e.g. "not-relevant",
"relevant", "highly relevant") but these involve usually less than 5
classes.
That said, the major drawback of GBRT is computational complexity: we
could speed up multi-class GBRT by training the ``k`` trees at each
stage in parallel but still it is much less efficient than random
forests. Maybe Scott (on CC) can comment on this as well - he has
worked on the multi-class support in GBRT and knows much more about
it.
@Olivier: it would be great if you could send me your benchmark script
so that I can look into the issue in more detail.
thanks,
 Peter
Post by Gilles Louppe
Hi Olivier,
The higher the number of estimators, the better. The more random the
trees (e.g., the lower max_features), the more important it usually is
to have a large forest to decrease the variance. To me, 10 is actually
a very low default value. In my daily research, I deal with hundreds
of trees. But yeah, it also takes longer.
By the way I am curious, what kind of dataset are you testing those
methods on? :)
Gilles
Post by Olivier Grisel
Hi all,
I have been playing a bit with GradientBoostingClassifier and
AdaBoostClassifier and ExtraTrees and while extra trees and adaboost
are reasonably fast to fit with there default params (n_estimators=10)
on a non toy dataset such as the olivetti faces dataset, the
GradientBoostingClassifier was taking ages (I killed it).
The current default value is n_estimators=100 for
GradientBoostingClassifier. Maybe it should be aligned to
n_estimators=10 as in the other ensemble methods of the scikit?
Or was I doing something very stupid by naively running it with the
default params on a dataset with size n_samples=400, n_features=4096
and n_classes=40 without any kind of preprocessing?
Another way to rephrase that question: what is the typical sweet spot
for the dataset shape when doing classification Gradient Boosted
Trees? What are reasonable values for the number of estimators in
various application domains?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Peter Prettenhofer
Peter Prettenhofer
2012-03-25 17:40:08 UTC
Permalink
Post by Scott White
Btw, another thing I think we should add is the ability to monitor the
out-of-bag estimates after each iteration and allow the fitting to be
terminated early. It's usually hard to guess the right number of
iterations required and if one can terminate the fitting early based
on good oob results that can help speed things up a lot.
cheers,
Scott
+1 - we should really start sketching the API for monitoring the
progress of our models during training.

``oob_scores_`` only allows you to do "early-stopping" after the fact :-)
--
Peter Prettenhofer
Scott White
2012-03-25 18:07:47 UTC
Permalink
Yes, and with oob_estimates and monitor some of the basic building blocks are in place but needs fleshing out.

Also would be nice to resume a model from where you left off if terminated too early.

Sent from my iPhone
Post by Peter Prettenhofer
Post by Scott White
Btw, another thing I think we should add is the ability to monitor the
out-of-bag estimates after each iteration and allow the fitting to be
terminated early. It's usually hard to guess the right number of
iterations required and if one can terminate the fitting early based
on good oob results that can help speed things up a lot.
cheers,
Scott
+1 - we should really start sketching the API for monitoring the
progress of our models during training.
``oob_scores_`` only allows you to do "early-stopping" after the fact :-)
--
Peter Prettenhofer
Andreas
2012-03-25 19:55:52 UTC
Permalink
Post by Peter Prettenhofer
Post by Scott White
Btw, another thing I think we should add is the ability to monitor the
out-of-bag estimates after each iteration and allow the fitting to be
terminated early. It's usually hard to guess the right number of
iterations required and if one can terminate the fitting early based
on good oob results that can help speed things up a lot.
cheers,
Scott
+1 - we should really start sketching the API for monitoring the
progress of our models during training.
``oob_scores_`` only allows you to do "early-stopping" after the fact :-)
+1 on that. Didn't want to bloat the code without having
a real design when I wrote this. Should be pretty useful!
Peter Prettenhofer
2012-03-25 17:37:20 UTC
Permalink
I have noticed also that GBRT can be quite slow even for small k.
Since GBRT fits n*k decision trees, by design, it's crucial that the
decision tree code be highly optimized. I did some quick performance
profiling the other day, which showed that the performance bottleneck
was spent in the method  _find_best_split.
Looking into that method I found what I think is a major inefficiency
that needs to be addressed. Nested within a for loop is a 'while True'
that iterates over the splitting points for a given feature to test
for the best split. One problem I see is that the splitting points are
not indexed ahead of time so a lot of time is wasted finding the same
splitting points over and over again. By allowing the splitting points
to be indexed ahead of time, I believe the performance bottleneck of
finding the splits would would be reduced by factor of O(m) to O(u).
This should have a marked improvement on datasets that have a lot of
repeating values. Of course, this would come at the additional cost of
storing those indexed splitting points.
I totally agree - the current implementation is tailored towards data
sets with numerical features (and lots of potential split points) -
the ``sample_mask`` approach indeed does poorly on data sets with
categorical features. On the other hand, the indexing approach has
significant pre-processing costs and the cost of updating the data
structure is much higher too - so I assume that there is a break-even
somewhere and at the end of the day it really depends on the data you
are working with.

I'm currently using the GBRT classifier in a setting which involves
both numerical features and indicator features - so I would really
appreciate a performance enhancement here! On the other hand, when
speaking about performance issues which should not focus solely on
training time performance! In fact, currently I'm much more concerned
about test time performance which limits the applicability of sklearn
for certain scenarios (e.g. low latency scenarios such as web apps or
real-time CV or using sklearn classifiers as building blocks in greedy
decoders for NLP applications).
Scott White
2012-03-25 18:16:01 UTC
Permalink
Post by Peter Prettenhofer
I have noticed also that GBRT can be quite slow even for small k.
Since GBRT fits n*k decision trees, by design, it's crucial that the
decision tree code be highly optimized. I did some quick performance
profiling the other day, which showed that the performance bottleneck
was spent in the method _find_best_split.
Looking into that method I found what I think is a major inefficiency
that needs to be addressed. Nested within a for loop is a 'while True'
that iterates over the splitting points for a given feature to test
for the best split. One problem I see is that the splitting points are
not indexed ahead of time so a lot of time is wasted finding the same
splitting points over and over again. By allowing the splitting points
to be indexed ahead of time, I believe the performance bottleneck of
finding the splits would would be reduced by factor of O(m) to O(u).
This should have a marked improvement on datasets that have a lot of
repeating values. Of course, this would come at the additional cost of
storing those indexed splitting points.
I totally agree - the current implementation is tailored towards data
sets with numerical features (and lots of potential split points) -
the ``sample_mask`` approach indeed does poorly on data sets with
categorical features. On the other hand, the indexing approach has
significant pre-processing costs and the cost of updating the data
structure is much higher too - so I assume that there is a break-even
somewhere and at the end of the day it really depends on the data you
are working with.
Agreed but maybe the thing to do is add indexing as an option so if an index is provided to the DT, it can take advantage of it? And similarly if a flag is provided to GBT it can build it? For data with a lot of sparse features, I think it would make a big difference.
Post by Peter Prettenhofer
I'm currently using the GBRT classifier in a setting which involves
both numerical features and indicator features - so I would really
appreciate a performance enhancement here! On the other hand, when
speaking about performance issues which should not focus solely on
training time performance!
Yes agreed!
Post by Peter Prettenhofer
In fact, currently I'm much more concerned
about test time performance which limits the applicability of sklearn
for certain scenarios (e.g. low latency scenarios such as web apps or
real-time CV or using sklearn classifiers as building blocks in greedy
decoders for NLP applications).
Scott White
2012-03-25 17:10:30 UTC
Permalink
I have noticed also that GBRT can be quite slow even for small k.
Since GBRT fits n*k decision trees, by design, it's crucial that the
decision tree code be highly optimized. I did some quick performance
profiling the other day, which showed that the performance bottleneck
was spent in the method _find_best_split.

Looking into that method I found what I think is a major inefficiency
that needs to be addressed. Nested within a for loop is a 'while True'
that iterates over the splitting points for a given feature to test
for the best split. One problem I see is that the splitting points are
not indexed ahead of time so a lot of time is wasted finding the same
splitting points over and over again. By allowing the splitting points
to be indexed ahead of time, I believe the performance bottleneck of
finding the splits would would be reduced by factor of O(m) to O(u).
This should have a marked improvement on datasets that have a lot of
repeating values. Of course, this would come at the additional cost of
storing those indexed splitting points.

Notation I am using:
n = # iterations
k = # classes
m = # samples
u = max # of unique values across all features

This is from a cursory inspection as I have not had enough time to
study this code as I would like so feel free to correct me if I'm
wrong here.

cheers,
Scott

On Sun, Mar 25, 2012 at 3:44 AM, Peter Prettenhofer
Post by Gilles Louppe
Olivier,
In my experience GBRT usually requires more base learners than random
forests to get the same level of accuracy. I hardly use less than 100.
multi-class GBRT fits ``k`` trees at each stage, thus, if you have
``n_estimators`` this means you have to grow ``k * n_estimators``
trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
used multi-class GBRT much (part of the reason is that GBM does not
support it) - I know that the learning to rank folks use multi-class
GBRT for ordinal scaled output values (e.g. "not-relevant",
"relevant", "highly relevant") but these involve usually less than 5
classes.
That said, the major drawback of GBRT is computational complexity: we
could speed up multi-class GBRT by training the ``k`` trees at each
stage in parallel but still it is much less efficient than random
forests. Maybe Scott (on CC) can comment on this as well - he has
worked on the multi-class support in GBRT and knows much more about
it.
@Olivier: it would be great if you could send me your benchmark script
so that I can look into the issue in more detail.
thanks,
 Peter
Post by Gilles Louppe
Hi Olivier,
The higher the number of estimators, the better. The more random the
trees (e.g., the lower max_features), the more important it usually is
to have a large forest to decrease the variance. To me, 10 is actually
a very low default value. In my daily research, I deal with hundreds
of trees. But yeah, it also takes longer.
By the way I am curious, what kind of dataset are you testing those
methods on? :)
Gilles
Post by Olivier Grisel
Hi all,
I have been playing a bit with GradientBoostingClassifier and
AdaBoostClassifier and ExtraTrees and while extra trees and adaboost
are reasonably fast to fit with there default params (n_estimators=10)
on a non toy dataset such as the olivetti faces dataset, the
GradientBoostingClassifier was taking ages (I killed it).
The current default value is n_estimators=100 for
GradientBoostingClassifier. Maybe it should be aligned to
n_estimators=10 as in the other ensemble methods of the scikit?
Or was I doing something very stupid by naively running it with the
default params on a dataset with size n_samples=400, n_features=4096
and n_classes=40 without any kind of preprocessing?
Another way to rephrase that question: what is the typical sweet spot
for the dataset shape when doing classification Gradient Boosted
Trees? What are reasonable values for the number of estimators in
various application domains?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Peter Prettenhofer
Olivier Grisel
2012-03-25 13:48:37 UTC
Permalink
Post by Gilles Louppe
Hi Olivier,
The higher the number of estimators, the better. The more random the
trees (e.g., the lower max_features), the more important it usually is
to have a large forest to decrease the variance. To me, 10 is actually
a very low default value. In my daily research, I deal with hundreds
of trees. But yeah, it also takes longer.
Indeed. I think we should put some practical scales somewhere in the
doc, maybe something along the lines:

Depending on the max_depth of the trees and the size of the dataset:

- 10+ trees: a couple of seconds or minutes of sequential CPU time,
suitable for debugging
- 500+ trees: a couple of minutes or hours of sequential CPU time,
suitable for getting interesting results (requires multi-core
computation in practice)
- 5000+ trees: a couple of hours or days of sequential CPU time,
suitable for getting appearing in the leaderboards of machine learning
challenges (requires distributed computation in practice)
Post by Gilles Louppe
By the way I am curious, what kind of dataset are you testing those
methods on? :)
I used http://scikit-learn.org/dev/datasets/index.html#the-olivetti-faces-dataset
which is stupid because of the large number of classes as explained by
Peter.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Loading...