Post by Olivier Grisel
On 25 March 2012 12:44, Peter Prettenhofer
Post by Gilles Louppe
Olivier,
In my experience GBRT usually requires more base learners than random
forests to get the same level of accuracy. I hardly use less than 100.
Multi-class GBRT fits ``k`` trees at each stage, so with
``n_estimators`` stages you have to grow ``k * n_estimators``
trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
used multi-class GBRT much (part of the reason is that GBM does not
support it) - I know that the learning-to-rank folks use multi-class
GBRT for ordinal scaled output values (e.g. "not-relevant",
"relevant", "highly relevant") but these usually involve fewer than 5
classes.
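To make the tree count concrete, here is a minimal check (a sketch against the current scikit-learn API, where the ``estimators_`` attribute stores one regression tree per class and per stage; attribute layout may differ in older versions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
clf = GradientBoostingClassifier(n_estimators=10).fit(X, y)
# one regression tree per class and per boosting stage
print(clf.estimators_.shape)  # (10, 3) -> 30 trees in total
```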
Interesting - I think these kinds of practical considerations should be
added to the docs.
Absolutely - I'll add them immediately.
Post by Olivier Grisel
Post by Gilles Louppe
That said, the major drawback of GBRT is computational complexity: we
could speed up multi-class GBRT by training the ``k`` trees at each
stage in parallel but still it is much less efficient than random
forests. Maybe Scott (on CC) can comment on this as well - he has
worked on the multi-class support in GBRT and knows much more about
it.
@Olivier: it would be great if you could send me your benchmark script
so that I can look into the issue in more detail.
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.datasets import fetch_olivetti_faces
from sklearn.cross_validation import cross_val_score

olivetti = fetch_olivetti_faces()
cross_val_score(ExtraTreesClassifier(), olivetti.data, olivetti.target)
# should yield 3 scores around 0.85 in 10s
Post by Gilles Louppe
cross_val_score(GradientBoostingClassifier(), olivetti.data, olivetti.target)
# was too long to complete on my laptop
But indeed, using GBRT on a 40-class dataset is stupid in light of
what you explained.
Unfortunately. As soon as I have some spare time I'll write a PR that
uses Joblib to parallelize tree induction in the case of multi-class
classification.
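As a rough sketch of what such a PR could exploit (a hypothetical helper, not the actual ``GradientBoostingClassifier`` internals): within one boosting stage, the ``k`` per-class trees are fit on independent pseudo-residuals, so joblib can dispatch them in parallel:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
k = 3  # e.g. a 3-class problem
# hypothetical per-class pseudo-residuals for a single boosting stage
residuals = [rng.randn(100) for _ in range(k)]

def fit_stage_tree(X, residual):
    # the k trees within one stage are independent of each other,
    # so they can be grown in parallel
    return DecisionTreeRegressor(max_depth=3).fit(X, residual)

trees = Parallel(n_jobs=2)(delayed(fit_stage_tree)(X, r) for r in residuals)
print(len(trees))  # 3
```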
Post by Olivier Grisel
BTW: this paper from the workshop on the results of the Yahoo Learning
to Rank challenge compares GBRT and RF and their computational
complexity: very interesting (I read it after sending my question):
http://jmlr.csail.mit.edu/proceedings/papers/v14/mohan11a/mohan11a.pdf
Thx - I remember the paper but I haven't read it in depth. The master's
thesis of Ananth Mohan provides some implementation details for
regression tree induction and gradient boosting - highly recommended!
Post by Olivier Grisel
BTW Learning-to-Rank seems to be a very important application domain
that we do not cover well in scikit-learn. I think it would be great
to provide a dataset loader + maybe some sample feature extractor or
example script for point-wise & pair-wise setups (and maybe list-wise
too). I wonder if it would be possible to make a small dataset
excerpt or generator suitable as a short example with fast execution
when building the doc, with a CLI switch to use a larger chunk of the
Yahoo or MSLR datasets for running the example in a more realistic
setting.
Yeah - unfortunately one needs to register in order to download the
Yahoo LTRC dataset, but I think we could use LETOR as a
proof-of-concept dataset. There are a number of issues though:
1. We need to support the query id (=``qid``) field in
``svmlight_loader``; pair-wise approaches such as RankingSVMs need
this information to form example pairs. My personal experience is that
RankingSVMs perform surprisingly poorly on learning-to-rank problems -
but they are great for binary classification problems with highly
skewed class distributions [1] (in this case we also don't need qids)
- so they would be a great addition to sklearn. Point-wise approaches
don't take query affiliation into account, so we don't need to expose
the qids.
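To illustrate why pair-wise methods need the ``qid`` field, here is a sketch (with made-up qids and relevance labels) of forming preference pairs only within a query:

```python
import numpy as np

# hypothetical qid and relevance arrays, as parsed from an svmlight file
qid = np.array([1, 1, 1, 2, 2])
y = np.array([2, 1, 0, 1, 0])

# pair-wise approaches only compare documents that share a query
pairs = [(int(i), int(j))
         for q in np.unique(qid)
         for i in np.where(qid == q)[0]
         for j in np.where(qid == q)[0]
         if y[i] > y[j]]
print(pairs)  # [(0, 1), (0, 2), (1, 2), (3, 4)]
```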
2. We need ranking evaluation metrics such as (N)DCG, average
precision and precision at k - but these metrics need to take the
query ids into account. Thus, we need to modify some of our
existing metrics.
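For reference, (N)DCG for a single query can be sketched like this (one common gain/discount variant; a scikit-learn metric would additionally have to group predictions by ``qid``):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of one query's ranked result list."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([2, 1, 0], 3))  # 1.0 (a perfect ranking)
```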
BTW: We might implement an SGD-based ranking SVM simply by creating a
new ``sklearn.utils.seq_dataset.Dataset`` subclass that implements
``next`` such that it samples a positive and a negative example from
the training set and creates a new feature vector on the fly - so we
don't need to create the cross-product of examples explicitly -
something along the lines of D. Sculley's work [2].
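A rough sketch of that idea (using a plain ``SGDClassifier`` on sampled difference vectors rather than the proposed ``seq_dataset.Dataset`` subclass, with toy data in place of a real ranking set):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
X = rng.rand(200, 10)
y = (X[:, 0] > 0.5).astype(int)  # toy relevance judgements

def sample_pairs(X, y, n_pairs, rng):
    # sample (relevant, non-relevant) pairs and train on the difference
    # vectors, so the full cross-product is never materialized
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    a = rng.choice(pos, n_pairs)
    b = rng.choice(neg, n_pairs)
    signs = rng.choice([-1, 1], n_pairs)  # balance the two pair classes
    return (X[a] - X[b]) * signs[:, None], signs

X_pair, y_pair = sample_pairs(X, y, 1000, rng)
ranker = SGDClassifier(loss="hinge", random_state=0).fit(X_pair, y_pair)
```

The learned weight vector induces a ranking: scoring documents with ``ranker.decision_function`` and sorting orders them by predicted relevance.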
[1] D. Sculley et al. Detecting Adversarial Advertisements in the Wild, KDD 2011
[2] D. Sculley, Large-scale learning to rank, NIPS 2009,
http://www.eecs.tufts.edu/~dsculley/papers/large-scale-rank.pdf
--
Peter Prettenhofer