Discussion:
[Scikit-learn-general] lightning vs. scikit-learn benchmark
Mathieu Blondel
2013-08-20 17:41:22 UTC
Hi,

I was curious how LinearSVC as implemented in lightning [*] compares to
LinearSVC in scikit-learn so I benchmarked them on the MNIST and News20
datasets.

Here are the results for L1-loss SVM with L2 penalty.
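(That is, the usual primal problem: minimize 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w'x_i), where the hinge term is the "L1 loss" and the squared norm is the L2 penalty.)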

MNIST (class 8 vs. others)
------------------------------------

lightning
Training time (s): 18.0813570023
Acc: 0.953066666667

scikit-learn
Training time (s): 17.2167401314
Acc: 0.953183333333

News20 (all classes)
----------------------------

lightning
Training time (s): 13.0210821629
Acc: 0.966571155683

scikit-learn
Training time (s): 11.1561429501
Acc: 0.966571155683

So, lightning is slightly slower. This is probably due to the virtual method
calls needed for the dataset abstraction lightning uses. However, the
main advantage of lightning is that it works directly on the NumPy array or
SciPy sparse matrix *without* a memory copy (scikit-learn converts the data
to liblinear's sparse data structure).

I'm attaching the script I used.
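
In outline, the comparison looks like the simplified sketch below (not the
attached script: the dataset loader is a hypothetical stand-in, and the
constructor arguments may differ between versions of either library):

import time

import numpy as np
from lightning.classification import LinearSVC as LightningLinearSVC
from sklearn.svm import LinearSVC as SKLinearSVC

def bench(clf, X_tr, y_tr, X_te, y_te, name):
    t0 = time.time()
    clf.fit(X_tr, y_tr)
    print(name)
    print("Training time (s): %s" % (time.time() - t0))
    print("Acc: %s" % np.mean(clf.predict(X_te) == y_te))

# load_mnist_8_vs_rest() is a hypothetical helper that should return
# MNIST with y = +1 for digit 8 and -1 for every other digit.
X_tr, y_tr, X_te, y_te = load_mnist_8_vs_rest()

# L1-loss (hinge) SVM with L2 penalty in both libraries; note that older
# scikit-learn releases spell the hinge loss "l1" rather than "hinge".
bench(LightningLinearSVC(loss="hinge", C=1.0), X_tr, y_tr, X_te, y_te, "lightning")
bench(SKLinearSVC(loss="hinge", C=1.0), X_tr, y_tr, X_te, y_te, "scikit-learn")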

Mathieu

[*] https://github.com/mblondel/lightning
Olivier Grisel
2013-08-20 18:04:51 UTC
Thanks for sharing. It might indeed be a path toward getting rid of our
liblinear bindings and moving to a pure-Cython implementation for
LinearSVC and LogisticRegression.

The ongoing work by Fabian and Gael on alternative optimizers for
LogisticRegression, to add support for warm restarts and regularization
paths, would probably also benefit from a non-memory-copy, pure-Cython
implementation of the liblinear algorithms.
--
Olivier
Andreas Mueller
2013-08-20 18:42:40 UTC
Post by Olivier Grisel
Thanks for sharing. It might indeed be a path toward getting rid of our
liblinear bindings and moving to a pure-Cython implementation for
LinearSVC and LogisticRegression.
The ongoing work by Fabian and Gael on alternative optimizers for
LogisticRegression, to add support for warm restarts and regularization
paths, would probably also benefit from a non-memory-copy, pure-Cython
implementation of the liblinear algorithms.
I'd very much like to get rid of liblinear, but we really have to be careful
in the analysis. I'm pretty sure the liblinear authors benchmarked with
plenty of sparse and dense data, across lots of different amounts of noise,
regularization, n_features, and n_samples.

Also, thanks for sharing your results Mathieu, that looks really promising!
Mathieu Blondel
2013-08-20 19:59:12 UTC
On Wed, Aug 21, 2013 at 3:42 AM, Andreas Mueller
Post by Andreas Mueller
I'd very much like to get rid of liblinear, but we really have to be careful
in the analysis. I'm pretty sure the liblinear authors benchmarked with
plenty of sparse and dense data, across lots of different amounts of noise,
regularization, n_features, and n_samples.
Actually, my implementation of the dual coordinate descent solver is a
straight Cython port of liblinear's C++ code with some modifications made
over time to support warm start ;-)
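
For the curious, the core of that algorithm is short. A minimal dense NumPy
sketch of dual coordinate descent for the L1-loss SVM (following Hsieh et
al., 2008), leaving out liblinear's shrinking, random permutation, and
stopping-criterion details, looks roughly like this:

import numpy as np

def dual_cd_l1_svm(X, y, C=1.0, n_iter=50):
    # Solve: min_alpha 0.5 * alpha' Q alpha - e' alpha,  0 <= alpha_i <= C,
    # with Q_ij = y_i y_j x_i' x_j and labels y_i in {-1, +1}.
    n_samples, n_features = X.shape
    alpha = np.zeros(n_samples)
    w = np.zeros(n_features)             # maintained as sum_i alpha_i y_i x_i
    Qii = np.einsum("ij,ij->i", X, X)    # diagonal of Q, i.e. ||x_i||^2
    for _ in range(n_iter):
        for i in range(n_samples):
            if Qii[i] == 0.0:
                continue
            G = y[i] * np.dot(w, X[i]) - 1.0   # gradient of the dual objective
            # Projected gradient: nothing to do if alpha_i sits at a bound
            # and the gradient points outside the box constraint.
            if (alpha[i] == 0.0 and G >= 0.0) or (alpha[i] == C and G <= 0.0):
                continue
            old = alpha[i]
            alpha[i] = min(max(old - G / Qii[i], 0.0), C)
            w += (alpha[i] - old) * y[i] * X[i]  # keep w in sync with alpha
    return w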

Replacing selected solvers like the dual CD one seems feasible, but
lightning doesn't implement several of the solvers in liblinear. So getting
rid of liblinear entirely doesn't seem possible.

Mathieu

PS: some authors use the term "dual coordinate ascent". The liblinear team
just uses "coordinate descent" since they minimize the dual with flipped
signs.

PPS: It would be nice if the liblinear team could open-source their test
suite ;-)
Gael Varoquaux
2013-08-20 20:02:15 UTC
Post by Olivier Grisel
Thanks for sharing. It might indeed be a path toward getting rid of our
liblinear bindings and moving to a pure-Cython implementation for
LinearSVC and LogisticRegression.
Yes, indeed. Thanks Mathieu. This is interesting. I am more and more
inclined to think that, in the long run, we might want our own
implementations of the SVM and logistic solvers. It's just a matter of
finding time to do them.

G
Olivier Grisel
2013-08-20 18:05:45 UTC
BTW, do you use the same stopping criterion as liblinear?
Sean Violante
2013-08-22 12:01:46 UTC
I'm sure you will hate this suggestion, but what about creating a text
file / command-line "interface" to existing machine learning executables?
advantages:
a) no problem with data copy: the executable loads the data from a file (you
don't need to keep it in sklearn)
b) most ML algos are available from the command line with text-file input.
c) sklearn is great for the added extras (cross-validation, metrics, grid
search, feature selection, etc.)
d) less time to integrate new algos; in fact, algo development is left to
the original authors.

--
sean


Olivier Grisel
2013-08-22 12:10:40 UTC
The goal of scikit-learn is to work well in the numpy / scipy
ecosystem, typically in an interactive IPython shell session where the
user is responsible for loading the data into memory as a numpy array and
never touching the disk again after that (assuming the data is small
enough to fit in memory).

We don't want to wrap external libraries written in C++ or anything
else. Quite the opposite: we would like to move away from that
paradigm and have fine control over the memory layout of the data. What
you describe is reasonable but at the opposite end of the current
interests of the scikit-learn developer team, which means it's a
good opportunity to start your own project ;)
--
Olivier
Mathieu Blondel
2013-08-22 12:20:55 UTC
I agree with Olivier's remarks.

lightning supports a rudimentary command-line interface [*], but that's
because I want to make it easy for non-Python users to try my algorithm on
their data.

Mathieu

[*] http://www.mblondel.org/code/mlj2013/
Post by Olivier Grisel
Post by Sean Violante
I'm sure you will hate this suggestion, but what about creating a text
file / command-line "interface" to existing machine learning executables?
a) no problem with data copy: the executable loads the data from a file (you
don't need to keep it in sklearn)
b) most ML algos are available from the command line with text-file input.
c) sklearn is great for the added extras (cross-validation, metrics, grid
search, feature selection, etc.)
d) less time to integrate new algos; in fact, algo development is left to
the original authors.
The goal of scikit-learn is to work well in the numpy / scipy
ecosystem, typically in an interactive IPython shell session where the
user is responsible for loading the data into memory as a numpy array and
never touching the disk again after that (assuming the data is small
enough to fit in memory).
We don't want to wrap external libraries written in C++ or anything
else. Quite the opposite: we would like to move away from that
paradigm and have fine control over the memory layout of the data. What
you describe is reasonable but at the opposite end of the current
interests of the scikit-learn developer team, which means it's a
good opportunity to start your own project ;)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2013-08-22 12:32:28 UTC
BTW, the advantages of scikit-learn's approach over text-file-based
programs are also briefly discussed in our recent paper:

http://staff.science.uva.nl/~buitinck/papers/scikit-learn-api.pdf

Mathieu
Lars Buitinck
2013-08-22 12:58:04 UTC
Post by Sean Violante
a) no problem with data copy: the executable loads the data from a file (you
don't need to keep it in sklearn)
Quite the contrary. What if only raw data (text files, JSON, etc.) is
on disk, and you still need to do feature extraction on it? Then you
need a pipeline of a feature extraction script and a learner, so
you're copying the raw data from disk into the feature extraction
script, then into kernel buffers, and finally into the learning
program. What about feature selection? Is that an extra script with
two additional copies?
Post by Sean Violante
b) most ML algos are available from the command line with text-file input.
Python is a great tool for controlling external programs, but it's
still a hard problem because the CLI interfaces to those programs are
usually poorly defined. Error handling in particular can be very
difficult, and installation, deployment, and testing code must be
rewritten for each program.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Sean Violante
2013-08-23 11:50:27 UTC
Thanks Lars - I would really like to clarify the problems with my
suggestion, in particular if/how a CLI interface would break the
scikit-learn interface. You obviously can immediately identify the problems.

The kind of thing I would like to do is run vowpal-wabbit from within
scikit-learn. There are lots of programs out there implementing a single
algorithm. What would be nice is an easy way of investigating them
[doing preprocessing, cross-validation, metrics, etc. in scikit-learn].
I am just suggesting that a CLI might be a useful additional interface to
enable quick incorporation of new algorithms. The number of algos/bugs
grows each year - the number of scikit-learn developers doesn't! This is
different from R, where essentially each algorithm is maintained by its own
developer. Clearly a CLI interface will be less polished - but it is better
than nothing!
Post by Lars Buitinck
Post by Sean Violante
b) most ML algos are available from the command line with text-file input.
Python is a great tool for controlling external programs, but it's
still a hard problem because the CLI interfaces to those programs are
usually poorly defined. Error handling in particular can be very
difficult, and installation, deployment, and testing code must be
rewritten for each program.

I agree the CLIs are not going to be consistent. What I imagined was
scikit-learn developers providing some generic utility functions/interface.
Then anyone wanting to use some new algo would write the relevant scripts
mapping parameters to the CLI / text file. How hard could it be? ;)


Post by Lars Buitinck
Post by Sean Violante
a) no problem with data copy: the executable loads the data from a file (you
don't need to keep it in sklearn)
Quite the contrary. What if only raw data (text files, JSON, etc.) is
on disk, and you still need to do feature extraction on it? Then you
need a pipeline of a feature extraction script and a learner, so
you're copying the raw data from disk into the feature extraction
script, then into kernel buffers, and finally into the learning
program. What about feature selection? Is that an extra script with
two additional copies?


Whenever you package external algos you are likely to have the problem that
the internal data format is not the same as a numpy array. My issue with
data copy is one of memory limits: i.e., if you keep two copies in memory,
you halve the maximum possible data size you can handle.

For me this seems to be the main problem with my suggestion: clearly the
.fit(X, y) interface doesn't allow me to free the training data within
sklearn.

sean
Eustache DIEMERT
2013-08-23 12:01:24 UTC
Post by Sean Violante
The kind of thing I would like to do is run vowpal-wabbit from within
scikit-learn.
I know VW has a C interface now, so it is theoretically possible to develop
a Python binding (hunch.net seems down as of now, but John Langford wrote
about it on the blog).

However, VW's memory structures may not map cleanly onto numpy arrays
(which are, I believe, the recommended API choice for sklearn), so it could
become a bit hard.

Also, I would be inclined to think that feeding VW a different pattern of
data input could hurt its performance, as the VW folks have put a good
amount of effort into optimizing I/O, I believe.

HTH
Mathieu Blondel
2013-08-23 13:51:24 UTC
A poor-man's scikit-learn-compatible wrapper around VW would be to call the
command-line tool via popen and feed it data through stdin.
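
Roughly like the sketch below (untested; it assumes a `vw` binary on the
PATH and uses the standard VW flags -f/-i/-t/-p/--quiet, and the class is of
course hypothetical, not a full scikit-learn estimator):

import subprocess
import tempfile

import numpy as np

def to_vw_lines(X, y=None):
    # One VW-format text line per row; labels should be in {-1, +1} for
    # logistic loss.  Only non-zero features are written.
    for i, row in enumerate(X):
        label = "%g " % y[i] if y is not None else ""
        feats = " ".join("%d:%g" % (j, v) for j, v in enumerate(row) if v != 0)
        yield "%s| %s\n" % (label, feats)

class VWClassifier(object):
    def fit(self, X, y):
        self.model_ = tempfile.NamedTemporaryFile(suffix=".vw", delete=False).name
        proc = subprocess.Popen(
            ["vw", "--loss_function", "logistic", "-f", self.model_, "--quiet"],
            stdin=subprocess.PIPE)
        proc.communicate("".join(to_vw_lines(X, y)).encode())
        return self

    def predict(self, X):
        # "-p /dev/stdout" is Unix-only; each output line starts with the
        # raw (unsquashed) score, which we threshold at zero.
        proc = subprocess.Popen(
            ["vw", "-t", "-i", self.model_, "-p", "/dev/stdout", "--quiet"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        out, _ = proc.communicate("".join(to_vw_lines(X)).encode())
        scores = np.array([float(l.split()[0]) for l in out.decode().splitlines()])
        return np.where(scores > 0, 1, -1)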

If you do that, create a gist and add it to the third-party snippet list in
https://github.com/scikit-learn/scikit-learn/wiki/Useful-Snippets

Mathieu
Post by Eustache DIEMERT
Post by Sean Violante
The kind of thing I would like to do is run vowpal-wabbit from within
scikit-learn.
I know VW has a C interface now, so it is theoretically possible to develop
a Python binding (hunch.net seems down as of now, but John Langford wrote
about it on the blog).
However, VW's memory structures may not map cleanly onto numpy arrays
(which are, I believe, the recommended API choice for sklearn), so it could
become a bit hard.
Also, I would be inclined to think that feeding VW a different pattern of
data input could hurt its performance, as the VW folks have put a good
amount of effort into optimizing I/O, I believe.
HTH