Discussion:
[Scikit-learn-general] Multi Layer Perceptron / Neural Network in Sklearn
Andreas Müller
2011-11-04 13:17:30 UTC
Permalink
Hi everybody.
I was thinking about putting some work into making a multi layer
perceptron implementation
for sklearn. I think it would be a good addition to the other, mostly
linear, classifiers
in sklearn. Together with the decision trees / boosting that many people
are working
on at the moment, I think sklearn would cover most of the classifiers
used today.

My question is: has anyone started with a mlp implementation yet? Or is
there any
code lying around that people think is already pretty good?
I would try to keep it simple with support only for one hidden layer and do
a pure python implementation to start with.

I'm also open for any suggestions.

My feature list would be:
- online, minibatch and batch learning
- vanilla gradient descent and rprop
- l2 weight decay optional
- tanh nonlinearities
- a class for regression and one for classification
- MSE and cross entropy (for classification only) loss functions

I think that would be a reasonable amount of features and should
be pretty easy to maintain.
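
To make this concrete, here is a very rough sketch of what such an
estimator could look like. Class and parameter names are made up, and I
used a softmax output with cross-entropy for the classification case;
treat it as pseudocode for the API rather than a final design:

    import numpy as np

    class MLPClassifier:
        # one hidden layer, tanh units, softmax output, minibatch SGD
        def __init__(self, n_hidden=50, lr=0.01, alpha=1e-4,
                     batch_size=32, n_epochs=20, seed=0):
            # lr: learning rate, alpha: l2 weight decay strength
            self.n_hidden, self.lr, self.alpha = n_hidden, lr, alpha
            self.batch_size, self.n_epochs, self.seed = batch_size, n_epochs, seed

        def _forward(self, X):
            H = np.tanh(np.dot(X, self.W1_) + self.b1_)   # hidden activations
            Z = np.dot(H, self.W2_) + self.b2_             # output scores
            Z -= Z.max(axis=1, keepdims=True)              # numerically stable softmax
            P = np.exp(Z)
            P /= P.sum(axis=1, keepdims=True)
            return H, P

        def fit(self, X, y):
            rng = np.random.RandomState(self.seed)
            X = np.asarray(X, dtype=float)
            self.classes_, y_idx = np.unique(y, return_inverse=True)
            n, d, k = X.shape[0], X.shape[1], len(self.classes_)
            Y = np.eye(k)[y_idx]                           # one-hot targets
            self.W1_ = rng.uniform(-0.1, 0.1, (d, self.n_hidden))
            self.b1_ = np.zeros(self.n_hidden)
            self.W2_ = rng.uniform(-0.1, 0.1, (self.n_hidden, k))
            self.b2_ = np.zeros(k)
            for _ in range(self.n_epochs):
                order = rng.permutation(n)
                for start in range(0, n, self.batch_size):
                    i = order[start:start + self.batch_size]
                    H, P = self._forward(X[i])
                    dZ = (P - Y[i]) / len(i)                      # softmax + cross-entropy gradient
                    dH = np.dot(dZ, self.W2_.T) * (1.0 - H ** 2)  # tanh' = 1 - tanh**2
                    self.W2_ -= self.lr * (np.dot(H.T, dZ) + self.alpha * self.W2_)
                    self.b2_ -= self.lr * dZ.sum(axis=0)
                    self.W1_ -= self.lr * (np.dot(X[i].T, dH) + self.alpha * self.W1_)
                    self.b1_ -= self.lr * dH.sum(axis=0)
            return self

        def predict(self, X):
            _, P = self._forward(np.asarray(X, dtype=float))
            return self.classes_[P.argmax(axis=1)]

A regressor class would share everything except the output layer (identity
units and MSE instead of softmax and cross-entropy).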

Cheers,
Andy
Olivier Grisel
2011-11-04 13:49:15 UTC
Permalink
Post by Andreas Müller
Hi everybody.
I was thinking about putting some work into making a multi layer
perceptron implementation
for sklearn. I think it would be a good addition to the other, mostly
linear, classifiers
in sklearn. Together with the decision trees / boosting that many people
are working
on at the moment, I think sklearn would cover most of the classifiers
used today
My question is: has anyone started with a mlp implementation yet? Or is
there any
code lying around that people think is already pretty good?
I would try to keep it simple with support only for one hidden layer and do
a pure python implementation to start with.
In the past (before getting involved in scikit-learn) I had started an
unfinished library in pure C + python ctypes bindings for MLP and
stacked autoencoders. This is basically the same datastructure and
algorithms but one is supervised and the other is unsupervised.

https://bitbucket.org/ogrisel/libsgd/wiki/Home

I think it should be pretty straightforward to rewrite this in cython
directly. The important trick is to pre-allocate the memory buffer of
the minibatch size for both the hidden and output layers.
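
In numpy terms the idea is simply the following (made-up shapes; a cython
version would follow the same pattern with typed buffers):

    import numpy as np

    batch_size, n_features, n_hidden, n_out = 64, 100, 50, 10
    W1 = np.random.randn(n_features, n_hidden) * 0.1
    W2 = np.random.randn(n_hidden, n_out) * 0.1

    # buffers allocated once, reused for every minibatch
    hidden = np.empty((batch_size, n_hidden))
    output = np.empty((batch_size, n_out))

    def forward(X_batch):
        np.dot(X_batch, W1, out=hidden)   # write into the pre-allocated buffer
        np.tanh(hidden, out=hidden)
        np.dot(hidden, W2, out=output)
        return output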
Post by Andreas Müller
I'm also open for any suggestions.
- online, minibatch and batch learning
I would start with minibatch (pure online with one sample at a time is
useless with python because of the interpreter overhead IMHO). Batch
learning seems less interesting than minibatch.
Post by Andreas Müller
- vanilla gradient descent and rprop
- l2 weight decay optional
l2 weight decay is equivalent to an l2 regularizer. I would add l1 and
elastic net too (or projection-based regularization).
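
In the update rule that just amounts to something like (sketch only):

    import numpy as np

    def sgd_step(W, grad, lr, l2=0.0, l1=0.0):
        W -= lr * (grad + l2 * W)         # l2 weight decay shrinks towards zero
        if l1:
            W -= lr * l1 * np.sign(W)     # simple subgradient step for l1
        return W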
Post by Andreas Müller
- tanh nonlinearities
Also momentum seems important (and averaging might work too even
though the objective function is non convex in general).
Post by Andreas Müller
- a class for regression and one for classification
- MSE and cross entropy (for classification only) loss functions
We need several loss functions and their gradients in Cython (we cannot
reuse the loss functions from the SGD module since the output of an
MLP can be multivariate). For classification we will need the hinge loss
and squared hinge loss (and Huber for regression). See the source of
libsgd for a list of useful loss functions.
Post by Andreas Müller
I think that would be a reasonable amount of features and should
be pretty easy to maintain.
I think we are several developers with a good understanding of SGD so
I don't think it would be a big maintenance burden.

In any case, before embarking in this please read or re-read:

http://yann.lecun.com/exdb/publis/#lecun-98b
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Andreas Müller
2011-11-04 17:13:13 UTC
Permalink
Post by Olivier Grisel
Post by Andreas Müller
Hi everybody.
I was thinking about putting some work into making a multi layer
perceptron implementation
for sklearn. I think it would be a good addition to the other, mostly
linear, classifiers
in sklearn. Together with the decision trees / boosting that many people
are working
on at the moment, I think sklearn would cover most of the classifiers
used today
My question is: has anyone started with a mlp implementation yet? Or is
there any
code lying around that people think is already pretty good?
I would try to keep it simple with support only for one hidden layer and do
a pure python implementation to start with.
In the past (before getting involved in scikit-learn) I had started an
unfinished library in pure C + python ctypes bindings for MLP and
stacked autoencoders. This is basically the same datastructure and
algorithms but one is supervised and the other is unsupervised.
https://bitbucket.org/ogrisel/libsgd/wiki/Home
I think it should be pretty straightforward to rewrite this in cython
directly. The important trick is to pre-allocate the memory buffer of
the minibatch size for both the hidden and output layers.
Why not wrap your C in cython? Then we could take advantage
of your SSE code.
Olivier Grisel
2011-11-04 17:18:42 UTC
Permalink
Post by Andreas Müller
Post by Olivier Grisel
Post by Andreas Müller
Hi everybody.
I was thinking about putting some work into making a multi layer
perceptron implementation
for sklearn. I think it would be a good addition to the other, mostly
linear, classifiers
in sklearn. Together with the decision trees / boosting that many people
are working
on at the moment, I think sklearn would cover most of the classifiers
used today
My question is: has anyone started with a mlp implementation yet? Or is
there any
code lying around that people think is already pretty good?
I would try to keep it simple with support only for one hidden layer and do
a pure python implementation to start with.
In the past (before getting involved in scikit-learn) I had started an
unfinished library in pure C + python ctypes bindings for MLP and
stacked autoencoders.  This is basically the same datastructure and
algorithms but one is supervised and the other is unsupervised.
https://bitbucket.org/ogrisel/libsgd/wiki/Home
I think it should be pretty straightforward to rewrite this in cython
directly. The important trick is to pre-allocate the memory buffer of
the minibatch size for both the hidden and output layers.
Why not wrap your C in cython? Then we could take advantage
of your SSE code.
The code would be much simpler in Cython (I did not know about Cython
at that time). Also we don't want SSE-specific code in scikit-learn, to
keep it portable and easy to install. Debugging SSE-related
segmentation faults (because of memory alignment issues, for instance)
can be very tricky, and that is a huge maintenance burden.

People who want efficient vectorized code should use pylearn and theano instead.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
David Warde-Farley
2011-11-04 17:36:42 UTC
Permalink
Post by Olivier Grisel
Post by Andreas Müller
Post by Olivier Grisel
Post by Andreas Müller
Hi everybody.
I was thinking about putting some work into making a multi layer
perceptron implementation
for sklearn. I think it would be a good addition to the other, mostly
linear, classifiers
in sklearn. Together with the decision trees / boosting that many people
are working
on at the moment, I think sklearn would cover most of the classifiers
used today
My question is: has anyone started with a mlp implementation yet? Or is
there any
code lying around that people think is already pretty good?
I would try to keep it simple with support only for one hidden layer and do
a pure python implementation to start with.
In the past (before getting involved in scikit-learn) I had started an
unfinished library in pure C + python ctypes bindings for MLP and
stacked autoencoders.  This is basically the same datastructure and
algorithms but one is supervised and the other is unsupervised.
https://bitbucket.org/ogrisel/libsgd/wiki/Home
I think it should be pretty straightforward to rewrite this in cython
directly. The important trick is to pre-allocate the memory buffer of
the minibatch size for both the hidden and output layers.
Why not wrap your C in cython? Then we could take advantage
of your SSE code.
The code would be much simpler in Cython (I did not know about Cython
at that time). Also we don't want SSE-specific code in scikit-learn, to
keep it portable and easy to install. Debugging SSE-related
segmentation faults (because of memory alignment issues, for instance)
can be very tricky, and that is a huge maintenance burden.
https://github.com/dwf/backproppy/tree/master/backproppy

This stuff should be pretty simple to Cythonize/optimize a bit/directly call
BLAS, if anyone's interested in doing it, I don't really have the time
unfortunately.

David
Andreas Müller
2011-11-16 16:38:46 UTC
Permalink
Post by Olivier Grisel
Post by Andreas Müller
- a class for regression and one for classification
- MSE and cross entropy (for classification only) loss functions
We need several loss functions and their gradients in Cython (we cannot
reuse the loss functions from the SGD module since the output of an
MLP can be multivariate). For classification we will need the hinge loss
and squared hinge loss (and Huber for regression). See the source of
libsgd for a list of useful loss functions.
Can you explain how hinge-loss works for multiple classes?
Or would you train a separate mlp for each class?
Alexandre Passos
2011-11-16 22:51:52 UTC
Permalink
Post by Andreas Müller
Post by Olivier Grisel
Post by Andreas Müller
- a class for regression and one for classification
- MSE and cross entropy (for classification only) loss functions
We need several loss functions and their gradients in Cython (we cannot
reuse the loss functions from the SGD module since the output of an
MLP can be multivariate). For classification we will need the hinge loss
and squared hinge loss (and Huber for regression). See the source of
libsgd for a list of useful loss functions.
Can you explain how hinge-loss works for multiple classes?
Or would you train a separate mlp for each class?
Usually the multiclass hinge loss minimizes max(0, 1 + max_(c !=
correct_class)(score(c, x)) - score(correct class, x)). That is, the
correct class must have score 1 higher than any other classes (or,
equivalently, than the highest-scoring of all other classes).
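
In code, for a single sample (illustrative only):

    import numpy as np

    def multiclass_hinge(scores, correct):
        # scores: 1-d array of per-class scores, correct: index of the true class
        others = np.delete(scores, correct)
        return max(0.0, 1.0 + others.max() - scores[correct])

    # e.g. scores = [2.0, 0.5, 1.5], true class 0 -> max(0, 1 + 1.5 - 2.0) = 0.5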
--
 - Alexandre
Lars Buitinck
2011-11-04 13:59:24 UTC
Permalink
Post by Andreas Müller
My question is: has anyone started with a mlp implementation yet?
I was just working on one :)
I have the predict function for an arbitrary number of hidden layers
(classifier case) and some snippets of the RPROP algorithm. I've been
using weight vectors that come out of a Matlab implementation for now.

There used to be an MLP implementation in older versions (around 0.2,
I believe) but it was abandoned.
Post by Andreas Müller
- online, minibatch and batch learning
I only need batch learning and classification for now... shall we keep
it simple?
Post by Andreas Müller
- vanilla gradient descent and rprop
- l2 weight decay optional
- tanh nonlinearities
Logistic activation functions seem fashionable; that's what Bishop and
other textbooks use. I'm not sure if there's a big difference, but it
seems to me that gradient computations might be slightly more
efficient (guesswork, I admit). We can always add a steepness
parameter later.
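
For what it's worth, both derivatives can be written in terms of the unit's
output, so the per-unit backprop cost should be about the same either way:

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def dlogistic(y):      # y = logistic(z)
        return y * (1.0 - y)

    def dtanh(y):          # y = tanh(z)
        return 1.0 - y ** 2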

I've been reading the RPROP papers and it looks like IRPROP- is the
algorithm to go for; it's simple and not significantly worse than
RPROP+. We could look at the RPROP implementation in Wapiti (and maybe
even copy bits of it, it's MIT-licensed).
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Andreas Müller
2011-11-04 14:03:53 UTC
Permalink
Post by Lars Buitinck
Post by Andreas Müller
My question is: has anyone started with a mlp implementation yet?
I was just working on one :)
I have the predict function for an arbitrary number of hidden layers
(classifier case) and some snippets of the RPROP algorithm. I've been
using weight vectors that come out of a Matlab implementation for now.
There used to be an MLP implementation in older versions (around 0.2,
I believe) but it was abandoned.
Are you using pure Python at the moment?
Where can I find your code? And is the goal of your code to
be included in the scikits?
Post by Lars Buitinck
Post by Andreas Müller
- online, minibatch and batch learning
I only need batch learning and classification for now... shall we keep
it simple?
I think it is necessary to have minibatch learning and so I think
building that into the code from the beginning is good.
Post by Lars Buitinck
Post by Andreas Müller
- vanilla gradient descent and rprop
- l2 weight decay optional
- tanh nonlinearities
Logistic activation functions seem fashionable; that's what Bishop and
other textbooks use. I'm not sure if there's a big difference, but it
seems to me that gradient computations might be slightly more
efficient (guesswork, I admit). We can always add a steepness
parameter later.
In my personal experience, tanh works better. LeCun uses tanh ;)
Post by Lars Buitinck
I've been reading the RPROP papers and it looks like IRPROP- is the
algorithm to go for; it's simple and not significantly worse than
RPROP+. We could look at the RPROP implementation in Wapiti (and maybe
even copy bits of it, it's MIT-licensed).
RPROP is very easy to implement. I use it in my lab all the time.
I have no personal experience with IRPROP-. How is that different
from IRPROP? What is RPROP+? Can you give me references?

Cheers,
Andy
Peter Prettenhofer
2011-11-04 14:12:07 UTC
Permalink
I'd love to see mlp in the scikit!

best,
Peter
Post by Andreas Müller
Post by Lars Buitinck
Post by Andreas Müller
My question is: has anyone started with a mlp implementation yet?
I was just working on one :)
I have the predict function for an arbitrary number of hidden layers
(classifier case) and some snippets of the RPROP algorithm. I've been
using weight vectors that come out of a Matlab implementation for now.
There used to be an MLP implementation in older versions (around 0.2,
I believe) but it was abandoned.
Are you using pure Python at the moment?
Where can I find your code? And is the goal of your code to
be included in the scikits?
Post by Lars Buitinck
Post by Andreas Müller
- online, minibatch and batch learning
I only need batch learning and classification for now... shall we keep
it simple?
I think it is necessary to have minibatch learning and so I think
building that into the code from the beginning is good.
Post by Lars Buitinck
Post by Andreas Müller
- vanilla gradient descent and rprop
- l2 weight decay optional
- tanh nonlinearities
Logistic activation functions seem fashionable; that's what Bishop and
other textbooks use. I'm not sure if there's a big difference, but it
seems to me that gradient computations might be slightly more
efficient (guesswork, I admit). We can always add a steepness
parameter later.
In my personal experience, tanh works better. LeCun uses tanh ;)
Post by Lars Buitinck
I've been reading the RPROP papers and it looks like IRPROP- is the
algorithm to go for; it's simple and not significantly worse than
RPROP+. We could look at the RPROP implementation in Wapiti (and maybe
even copy bits of it, it's MIT-licensed).
RPROP is very easy to implement. I use it in my lab all the time.
I have no personal experience with IRPROP-. How is that different
from IRPROP? What is RPROP+? Can you give me references?
Cheers,
Andy
--
Peter Prettenhofer
Alexandre Passos
2011-11-04 14:30:55 UTC
Permalink
On Fri, Nov 4, 2011 at 10:12, Peter Prettenhofer
Post by Peter Prettenhofer
I'd love to see mlp in the scikit!
I have a question: why not just use Theano for this? I doubt that we
can write neural network code that's as fast as their automatically
generated code.
--
 - Alexandre
Lars Buitinck
2011-11-04 14:34:04 UTC
Permalink
Post by Alexandre Passos
I have a question: why not just use Theano for this? I doubt that we
can write neural network code that's as fast as their automatically
generated code.
Would that mean an extra run-time dependency?
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Alexandre Passos
2011-11-04 14:42:37 UTC
Permalink
Post by Lars Buitinck
Post by Alexandre Passos
I have a question: why not just use Theano for this? I doubt that we
can write neural network code that's as fast as their automatically
generated code.
Would that mean an extra run-time dependency?
Yes, as theano needs a compiler (gcc or nvcc if you want to use cuda)
available at run time, but even still it's faster even than most
hand-coded implementations of neural networks. James Bergstra reads
this list occasionally, and he's one of the main people behind theano,
so he can give more info here.
--
 - Alexandre
Andreas Müller
2011-11-04 14:49:35 UTC
Permalink
Post by Alexandre Passos
Post by Lars Buitinck
Post by Alexandre Passos
I have a question: why not just use Theano for this? I doubt that we
can write neural network code that's as fast as their automatically
generated code.
Would that mean an extra run-time dependency?
Yes, as theano needs a compiler (gcc or nvcc if you want to use cuda)
available at run time, but even still it's faster even than most
hand-coded implementations of neural networks. James Bergstra reads
this list occasionally, and he's one of the main people behind theano,
so he can give more info here.
I think sklearn does not aim at beating cuda implementations.
For using theano: that's a huge and imho unnecessary dependency.
For a simple mlp, I think theano will not beat a hand implemented version.
Afaik, torch7 is faster than theano for cnns and mlps and there
is no compilation of algorithms there. If you want to vary your
implementation
a lot and want to do fancy things, theano is probably faster.
But I thought more about an easy to use classifier.
Andreas Müller
2011-11-04 14:54:02 UTC
Permalink
Post by Alexandre Passos
Post by Lars Buitinck
Post by Alexandre Passos
I have a question: why not just use Theano for this? I doubt that we
can write neural network code that's as fast as their automatically
generated code.
Would that mean an extra run-time dependency?
Yes, as theano needs a compiler (gcc or nvcc if you want to use cuda)
available at run time, but even still it's faster even than most
hand-coded implementations of neural networks. James Bergstra reads
this list occasionally, and he's one of the main people behind theano,
so he can give more info here.
As an afterthought: you could use the same argument for SGD,
kmeans and many other algorithms inside sklearn.
Do you want all of them to be replaced by theano implementations?
Vlad Niculae
2011-11-04 15:00:45 UTC
Permalink
Post by Andreas Müller
Post by Alexandre Passos
Post by Lars Buitinck
Post by Alexandre Passos
I have a question: why not just use Theano for this? I doubt that we
can write neural network code that's as fast as their automatically
generated code.
Would that mean an extra run-time dependency?
Yes, as theano needs a compiler (gcc or nvcc if you want to use cuda)
available at run time, but even still it's faster even than most
hand-coded implementations of neural networks. James Bergstra reads
this list occasionally, and he's one of the main people behind theano,
so he can give more info here.
As an afterthought: you could use the same argument for SGD,
kmeans and many other algorithms inside sklearn.
Do you want all of them to be replaced by theano implementations?
This is a great point. Also, I'd like to cite Gael's recent remark
"Machine learning should be a commodity", so if we could have simpler
code with fewer dependencies that would do just well enough, then a
niche would be filled, in my opinion.

Vlad
Frédéric Bastien
2011-11-04 15:15:40 UTC
Permalink
On Fri, Nov 4, 2011 at 10:49 AM, Andreas Müller
Post by Andreas Müller
Post by Alexandre Passos
Post by Lars Buitinck
Post by Alexandre Passos
I have a question: why not just use Theano for this? I doubt that we
can write neural network code that's as fast as their automatically
generated code.
Would that mean an extra run-time dependency?
Yes, as theano needs a compiler (gcc or nvcc if you want to use cuda)
available at run time, but even still it's faster even than most
hand-coded implementations of neural networks. James Bergstra reads
this list occasionally, and he's one of the main people behind theano,
so he can give more info here.
I think sklearn does not aim at beating cuda implementations.
For using theano: that's a huge and imho unnecessary dependency.
For a simple mlp, I think theano will not beat a hand implemented version.
Afaik, torch7 is faster than theano for cnns and mlps and there
is no compilation of algorithms there. If you want to vary your
implementation
a lot and want to do fancy things, theano is probably faster.
But I thought more about an easy to use classifier.
Yes, Theano is a huge dependency. Yes, there is no MLP implementation in
Theano itself, but the Deep Learning Tutorial [1] has one. It also has
implementations of logistic regression, deep convolutional networks,
stacked denoising auto-encoders, restricted Boltzmann machines and deep
belief networks.

Those implementations are in a tutorial format. They are not in a
library with an easy-to-use interface like the one scikit.learn uses. But
many people have already modified them for their needs. We have wanted for
a long time to have an easier interface, but we don't know how to do the
hyper-parameter selection automatically.

I don't have a good understanding of scikit.learn, but I think that
all the hyper-parameter selection is a hot research topic for now. How
do you plan to include this in the current scikit.learn interface of
the fit method?


About torch7 being faster than Theano: I have heard that a few times,
but I have never seen the papers, numbers, code or anything substantial
behind it. I would love to have any numbers about that. Do you have some?
But don't forget that in the Theano framework we can just implement all
the tricks that other people used to beat Theano. So if torch7 is faster
in some cases, that will tell us where we can make Theano faster! Can you
tell us more about the comparison you refer to?

Just a side note: I don't mean to imply that the comparison you refer to
is biased, but benchmarking is VERY HARD, so I like to have information on
how a comparison was done. We tried to make the Theano comparison as fair
as we could at the time. We spent days compiling each application with the
same BLAS and other stuff like that. But since then torch has released new
versions.


Thanks, and I hope to get more info on the comparison people use to say
that torch7 is faster than Theano, and on how you plan to work around
the hyper-parameter selection problem. That would be very valuable to
everybody, I think.

Frédéric Bastien

[1] http://deeplearning.net/tutorial/

p.s. I'm one of the core Theano developer.
Andreas Müller
2011-11-04 15:33:39 UTC
Permalink
Hey Frederic.
Post by Frédéric Bastien
I don't have a good understanding of scikit.learn, but I think that
all the hyper-parameter selection is a hot research topic for now. How
do you plan to include this in the current scikit.learn interface of
the fit method?
Depends on what you think of when you say hyper parameters?
Things like learning rate, weight decay and size of the hidden
layer can be cross validated.

Of course there are many other possibilities like pretraining,
deeper networks, different learning rate schedules etc..
You are right, this is somewhat of an active research field.
Though I have not seen conclusive evidence that any
of these methods are consistently better than a vanilla mlp.

Does that answer your question?
Post by Frédéric Bastien
About torch7 being faster than Theano: I have heard that a few times,
but I have never seen the papers, numbers, code or anything substantial
behind it. I would love to have any numbers about that. Do you have some?
But don't forget that in the Theano framework we can just implement all
the tricks that other people used to beat Theano. So if torch7 is faster
in some cases, that will tell us where we can make Theano faster! Can you
tell us more about the comparison you refer to?
There will be a paper at this year's NIPS large-scale learning
workshop. They just give total times for running the networks,
so it is not clear why they are faster :( This is a bit
unhelpful for improving existing implementations, I think.
They claim to have pretty fast convolutions, I think.
Post by Frédéric Bastien
Just a side note: I don't mean to imply that the comparison you refer to
is biased, but benchmarking is VERY HARD, so I like to have information on
how a comparison was done. We tried to make the Theano comparison as fair
as we could at the time. We spent days compiling each application with the
same BLAS and other stuff like that. But since then torch has released new
versions.
I just trusted the torch people there, though I haven't seen
the benchmark code or anything. I know that you put a lot
of effort into this.
The point I was trying to make is that if one codes a simple
MLP, then I think there is a good chance of being as fast as
Theano, since it is pretty clear what is going on and
computation is dominated by the matrix products.
Post by Frédéric Bastien
Thanks, and I hope to get more info on the comparison people use to say
that torch7 is faster than Theano, and on how you plan to work around
the hyper-parameter selection problem. That would be very valuable to
everybody, I think.
Side note from me: Don't get me wrong, I really like your project
and I think you are making a great effort as a community.
In particular the deep learning tutorials are great!

I just think the goal is different from the sklearn one, and I don't
think it is a good idea to make sklearn dependent on Theano.

btw: I'm not a core sklearn developer, so consider that my opinion
and not the sklearn opinion ;)

Cheers,
Andy
David Warde-Farley
2011-11-04 17:31:42 UTC
Permalink
Post by Andreas Müller
Hey Frederic.
Post by Frédéric Bastien
I don't have a good understanding of scikit.learn, but I think that
all the hyper-parameter selection is a hot research topic for now. How
do you plan to include this in the current scikit.learn interface of
the fit method?
Depends on what you think of when you say hyper parameters?
Things like learning rate, weight decay and size of the hidden
layer can be cross validated.
Cross-validating one or two hyperparameters is fine, but once you get into
the regime of 5-10 hyperparameters (initial learning rate, momentum,
annealing schedule, batch size, activation function, initialization
distributions...), grid search becomes quite costly, and yet tuning these
things can be essential if you want to even equal the performance of an SVM
(you can, of course, do things like randomly sample your hyperparameters, but
this requires a bit of domain expertise in determining what constitutes
a"reasonable" distribution to draw each one from).

Really one of the best ways of avoiding overfitting is to do early stopping,
but in order to do this properly in the context of cross-validation, you need
two held-out sets, one validation set for monitoring when to stop and one to
estimate your test error for this CV fold. The rabbit hole just gets deeper
from there, I'm afraid.
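
Schematically, a single fold then looks something like this (arbitrary names
and split sizes, X and y being numpy arrays):

    import numpy as np

    def split_fold(X, y, rng, val_frac=0.15, test_frac=0.15):
        n = len(X)
        idx = rng.permutation(n)
        n_test, n_val = int(test_frac * n), int(val_frac * n)
        test = idx[:n_test]
        val = idx[n_test:n_test + n_val]
        train = idx[n_test + n_val:]
        return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

    # train on the first split, watch the validation error to decide when to
    # stop, and only then touch the test split to report the error for the fold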
Post by Andreas Müller
Of course there are many other possibilities like pretraining,
deeper networks, different learning rate schedules etc..
You are right, this is somewhat of an active research field
Though I have not seen conclusive evidence that any
of these methods are consistently better than a vanilla mlp.
http://www.dumitru.ca/files/publications/icml_07.pdf the table on page 7
makes a pretty compelling case, I'd say.

Now, there's also the results out of Juergen Schmidhuber's lab that show that
if you train for months on a GPU, add all kinds of prior knowledge into the
preprocessing pipeline, make careful choices about the learning rate
schedule, initialization, and activation function (some of this is pretty
easy and well-documented in that paper by Yann LeCun that Olivier sent around
earlier in the thread, other parts will take a lot of fiddling), then you
*can* make vanilla MLPs perform really well on MNIST, but this says more
about the devotion of the practitioners to this (rather artificial) task, and
the sorts of built-in prior knowledge they used, than it does about the
strength of the learning algorithm.

David
Andreas Müller
2011-11-04 18:22:15 UTC
Permalink
Post by David Warde-Farley
Post by Andreas Müller
Of course there are many other possibilities like pretraining,
deeper networks, different learning rate schedules etc..
You are right, this is somewhat of an active research field
Though I have not seen conclusive evidence that any
of these methods are consistently better than a vanilla mlp.
http://www.dumitru.ca/files/publications/icml_07.pdf the table on page 7
makes a pretty compelling case, I'd say.
These numbers are weird.
A basic grid search with an RBF SVM gives 1.4% error on MNIST.
Using a vanilla MLP with 500 hidden units and RPROP (no momentum or
weight decay) and early stopping or cross-validating a constant
learning rate in the same setup gives 2%, I think.
Post by David Warde-Farley
Now, there's also the results out of Juergen Schmidhuber's lab that show that
if you train for months on a GPU, add all kinds of prior knowledge into the
preprocessing pipeline, make careful choices about the learning rate
schedule, initialization, and activation function (some of this is pretty
easy and well-documented in that paper by Yann LeCun that Olivier sent around
earlier in the thread, other parts will take a lot of fiddling), then you
*can* make vanilla MLPs perform really well on MNIST, but this says more
about the devotion of the practitioners to this (rather artificial) task, and
the sorts of built-in prior knowledge they used, than it does about the
strength of the learning algorithm.
Don't get me wrong. I'm not a fan of the MNIST focused research.
One of the reasons I want an MLP in sklearn is so it is easier
to compare with other learning algorithms on a wide range of
tasks.
I am pretty sceptical about neural networks myself, but as
they scale very well, they definitely seem like an alternative
to linear classification.

Cheers,
Andy

ps: I would have never imagined that at some point in my life
I'll argue _for_ mlps... I think my advisor got to me.
Gael Varoquaux
2011-11-04 19:11:22 UTC
Permalink
Post by Andreas Müller
One of the reasons I want an MLP in sklearn is so it is easier
to compare with other learning algorithms on a wide range of
tasks.
I guess that this is one of the most compelling reasons to have them in. I
tend to believe that MLPs are not 'machine learning without learning the
machinery': they require a lot of domain knowledge and tweaking. It seems
to me that this is not the kind of method that we want to advertise in
the scikit: non-experts might lose a lot of time on them.

However, as they are definitely part of the state of the art, if we can
get an implementation that is readable, debuggable, and that performs
reasonably in terms of computational efficiency and prediction power, I
think that integrating them is an option _as a reference for comparison_.
We will need to point out clearly in the documentation the better
implementations that we will not attempt to beat because they are too
technical, either on the computation side (GPUs) or on the ML side (heaps
of domain knowledge embedded).

My 2 cents,

Gaël
David Warde-Farley
2011-11-04 20:26:37 UTC
Permalink
Post by Andreas Müller
Post by David Warde-Farley
Post by Andreas Müller
Of course there are many other possibilities like pretraining,
deeper networks, different learning rate schedules etc..
You are right, this is somewhat of an active research field
Though I have not seen conclusive evidence that any
of these methods are consistently better than a vanilla mlp.
http://www.dumitru.ca/files/publications/icml_07.pdf the table on page 7
makes a pretty compelling case, I'd say.
These numbers are weird.
A basic grid search with an RBF SVM gives 1.4% error on MNIST.
This was on only 10,000 examples from MNIST (1000 digits per class).
Back in 2007, SVM solvers weren't very fast, so they scaled back the problem
a bit.

David
Andreas Mueller
2011-11-04 20:37:40 UTC
Permalink
Post by David Warde-Farley
Post by Andreas Müller
Post by David Warde-Farley
Post by Andreas Müller
Of course there are many other possibilities like pretraining,
deeper networks, different learning rate schedules etc..
You are right, this is somewhat of an active research field
Though I have not seen conclusive evidence that any
of these methods are consistently better than a vanilla mlp.
http://www.dumitru.ca/files/publications/icml_07.pdf the table on page 7
makes a pretty compelling case, I'd say.
These numbers are weird.
A basic grid search with an RBF SVM gives 1.4% error on MNIST.
This was on only 10,000 examples from MNIST (1000 digits per class).
Back in 2007, SVM solvers weren't very fast, so they scaled back the problem
a bit.
Oh, sorry. Just skimmed the paper as you might have guessed ;)
David Warde-Farley
2011-11-04 15:33:11 UTC
Permalink
Post by Andreas Müller
I think sklearn does not aim at beating cuda implementations.
As Alex said, it is pretty darned competitive on the CPU as well.
Post by Andreas Müller
For using theano: that's a huge and imho unnecessary dependency.
Agreed.
Post by Andreas Müller
For a simple mlp, I think theano will not beat a hand implemented version.
I think you'd be in for a rather rude surprise, at least on your first
attempt. :)
Post by Andreas Müller
Afaik, torch7 is faster than theano for cnns and mlps and there
is no compilation of algorithms there.
Haven't looked at Torch7, though I know we beat Torch5 pretty painfully.
Post by Andreas Müller
But I thought more about an easy to use classifier.
That, I think, is the fundamental flaw in the plan. Neural networks are
anything but "easy to use", and getting good results out of them takes quite
a bit of work.

I say this (perhaps at my own peril) as a student in one of the larger labs
that still study this stuff, but there are really three regimes where neural
networks make sense over the stuff already in scikit-learn:

- The dataset is *gigantic*, online learning is essential, and simpler
algorithms don't cut it.

- The dataset is huge and the task complex enough that it requires multiple
layers of representation and/or sophisticated pre-training algorithms
(unsupervised feature learning).

- The dataset is slightly smaller, linear learning doesn't suffice,
but model compactness and speed/efficiency of evaluation is of great
importance, so kernel methods won't work.

In my experience, about 95% of the time, people trying to apply MLPs and
failing are not in any of these situations and would be better served with
methods that are easily "canned" for non-expert use.

I'm not outright against scikits-learn including a basic implementation, I
even have old Theano-free implementations of modular stackable layers that
could be fixed up and Cythonized if anyone wants it. But the documentation
should probably qualify MLPs as useful only if you're willing to do a lot of
work with preprocessing and hyperparameter optimization, and other methods
are, without exception, failing you. Otherwise I foresee a lot of frustrated
users and a lot of frustrated mailing list participants trying to answer the
same questions over and over again.

My $0.02,

David
Andreas Müller
2011-11-04 15:41:55 UTC
Permalink
Post by David Warde-Farley
Post by Andreas Müller
For a simple mlp, I think theano will not beat a hand implemented version.
I think you'd be in for a rather rude surprise, at least on your first
attempt. :)
It'll not be my first attempt, but I must confess I never benchmarked
my lab's GPU MLP against yours ;)
Post by David Warde-Farley
Post by Andreas Müller
Afaik, torch7 is faster than theano for cnns and mlps and there
is no compilation of algorithms there.
Haven't looked at Torch7, though I know we beat Torch5 pretty painfully.
Post by Andreas Müller
But I thought more about an easy to use classifier.
That, I think, is the fundamental flaw in the plan. Neural networks are
anything but "easy to use", and getting good results out of them takes quite
a bit of work.
I say this (perhaps at my own peril) as a student in one of the larger labs
that still study this stuff, but there are really three regimes where neural
- The dataset is *gigantic*, online learning is essential, and simpler
algorithms don't cut it.
- The dataset is huge and the task complex enough that it requires multiple
layers of representation and/or sophisticated pre-training algorithms
(unsupervised feature learning).
- The dataset is slightly smaller, linear learning doesn't suffice,
but model compactness and speed/efficiency of evaluation is of great
importance, so kernel methods won't work.
In my experience, about 95% of the time, people trying to apply MLPs and
failing are not in any of these situations and would be better served with
methods that are easily "canned" for non-expert use.
I am only part of a very small lab that still study this stuff, so I guess
you have more experience in these things.
I was mainly thinking about the first use case.
For example, in this paper:
http://www.cs.cornell.edu/~ainur/pubs/empirical.pdf
neural networks fare pretty well, it seems, without too much tuning.

In my experience, the hardest thing to find is a good learning rate.
Using RPROP, I always got pretty decent results on the first try.

What kind of datasets have you used? And what kind of tuning
did you have to do?

Cheers,
Andy
Olivier Grisel
2011-11-04 17:04:56 UTC
Permalink
Post by Andreas Müller
Post by David Warde-Farley
Post by Andreas Müller
For a simple mlp, I think theano will not beat a hand implemented version.
I think you'd be in for a rather rude surprise, at least on your first
attempt. :)
It'll not be my first attempt, but I must confess I never benchmarked
my lab's GPU MLP against yours ;)
Post by David Warde-Farley
Post by Andreas Müller
Afaik, torch7 is faster than theano for cnns and mlps and there
is no compilation of algorithms there.
Haven't looked at Torch7, though I know we beat Torch5 pretty painfully.
Post by Andreas Müller
But I thought more about an easy to use classifier.
That, I think, is the fundamental flaw in the plan. Neural networks are
anything but "easy to use", and getting good results out of them takes quite
a bit of work.
I say this (perhaps at my own peril) as a student in one of the larger labs
that still study this stuff, but there are really three regimes where neural
- The dataset is *gigantic*, online learning is essential, and simpler
  algorithms don't cut it.
- The dataset is huge and the task complex enough that it requires multiple
  layers of representation and/or sophisticated pre-training algorithms
  (unsupervised feature learning).
- The dataset is slightly smaller, linear learning doesn't suffice,
  but model compactness and speed/efficiency of evaluation is of great
  importance, so kernel methods won't work.
In my experience, about 95% of the time, people trying to apply MLPs and
failing are not in any of these situations and would be better served with
methods that are easily "canned" for non-expert use.
I am only part of a very small lab that still study this stuff, so I guess
you have more experience in these things.
I was mainly thinking about the first use case.
http://www.cs.cornell.edu/~ainur/pubs/empirical.pdf
neural networks fare pretty well, it seems, without too much tuning.
In my experience, the hardest thing to find is a good learning rate.
Using RPROP, I always got pretty decent results on the first try.
What kind of datasets have you used? And what kind of tuning
did you have to do?
In my case I don't use RPROP (I don't know what it is and I just use a
simple backprop) and I use Leon Bottou's trick to perform a burn-in on
the first 10k samples with a grid search of learning rate parameters
and then select the most effective learning rate and multiply it by 2
(it brings robustness). In my experiment it did work pretty well.

I used to use a 1/t style learning rate schedule but yesterday Francis
Bach convinced me to use 1/sqrt(t) and use averaging instead.
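
Roughly, the calibration looks like this (train_one_pass is a made-up helper
that does one pass over the burn-in set with a fixed learning rate and
returns the training loss; the real code is in the libsgd link below):

    import numpy as np

    def calibrate_lr(train_one_pass, X, y, candidates=(1e-4, 1e-3, 1e-2, 1e-1)):
        Xb, yb = X[:10000], y[:10000]            # burn-in on the first 10k samples
        losses = [train_one_pass(Xb, yb, lr) for lr in candidates]
        best = candidates[int(np.argmin(losses))]
        return 2.0 * best                        # multiply by 2 for robustness

    # the 1/sqrt(t) schedule with averaging would then be
    #   eta_t = eta_0 / np.sqrt(t)
    #   w_avg = ((t - 1) * w_avg + w) / t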

Here is the calibration stuff:
https://bitbucket.org/ogrisel/libsgd/src/0a232b053b5b/lib/architecture.c#cl-360
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Andreas Müller
2011-11-04 17:12:48 UTC
Permalink
Post by Olivier Grisel
In my case I don't use RPROP (I don't know what it is and I just use a
simple backprop) and I use Leon Bottou's trick to perform a burn-in on
the first 10k samples with a grid search of learning rate parameters
and then select the most effective learning rate and multiply it by 2
(it brings robustness). In my experiment it did work pretty well.
I only learned about this trick recently and haven't really used it yet.
We tried it on convolutional nets and it didn't work well :-/
Maybe I'll give it another shot.

RPROP maintains a dynamic learn rate for each parameter.
It only looks at the sign of the gradient. There are two parameters
but these are always set to the values described in the paper
A direct adaptive method for faster backpropagation learning: The RPROP
algorithm.
So actually there are no parameters at all, which is pretty convenient.
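
For reference, a minimal sketch of the update (the RPROP- flavour, without
backtracking, using the usual constants from the paper):

    import numpy as np

    def rprop_update(w, grad, prev_grad, step,
                     eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
        # prev_grad and step are per-parameter arrays kept between calls,
        # e.g. step initialized to np.full_like(w, 0.1)
        sign_change = grad * prev_grad
        grow = np.minimum(step * eta_plus, step_max)     # same sign: grow the step
        shrink = np.maximum(step * eta_minus, step_min)  # sign flip: shrink the step
        step = np.where(sign_change > 0, grow,
                        np.where(sign_change < 0, shrink, step))
        w = w - np.sign(grad) * step
        return w, step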
Post by Olivier Grisel
I used to use a 1/t style learning rate schedule but yesterday Francis
Bach convinced me to use 1/sqrt(t) and use averaging instead.
I think Leon Bottou also uses something different for averaging, but I
thought it was t^0.75 or something. Maybe I'll do it without averaging first.
Post by Olivier Grisel
https://bitbucket.org/ogrisel/libsgd/src/0a232b053b5b/lib/architecture.c#cl-360
Olivier Grisel
2011-11-04 17:28:59 UTC
Permalink
Post by Andreas Müller
Post by Olivier Grisel
In my case I don't use RPROP (I don't know what it is and I just use a
simple backprop) and I use Leon Bottou's trick to perform a burn-in on
the first 10k samples with a grid search of learning rate parameters
and then select the most effective learning rate and multiply it by 2
(it brings robustness). In my experiment it did work pretty well.
I only learned about this trick recently and haven't really used it yet.
We tried it on convolutional nets and it didn't work well :-/
Maybe I'll give it another shot.
RPROP maintains a dynamic learn rate for each parameter.
It only looks at the sign of the gradient. There are two parameters
but these are always set to the values described in the paper
A direct adaptive method for faster backpropagation learning: The RPROP
algorithm.
So actually there are no parameters at all, which is pretty convenient.
Sounds good.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Frédéric Bastien
2011-11-04 18:03:59 UTC
Permalink
Hi,

Just to be sure, I agree that there is room for both an MLP in
scikit.learn and Theano/Pylearn or similar. I was curious how you
planned to "solve" the hyper-parameter selection problem. That is not
an easy problem. The answer was interesting; I didn't know about the
Leon Bottou trick.

I was also interested in the benchmark that says Theano is slower. It
was not the first time that I had heard it, but I never got details. Thanks
for providing some. Over lunch, someone also told us about that
NIPS workshop paper. I will certainly read it. I was also told that
they agreed to let us wrap their GPU code in Theano, so the speedups should
come to Theano and we will be able to keep Theano's other advantages.


thanks for the interesting discussion.

Fred
David Warde-Farley
2011-11-04 17:33:39 UTC
Permalink
Post by Andreas Müller
Post by Olivier Grisel
In my case I don't use RPROP (I don't know what it is and I just use a
simple backprop) and I use Leon Bottou's trick to perform a burn-in on
the first 10k samples with a grid search of learning rate parameters
and then select the most effective learning rate and multiply it by 2
(it brings robustness). In my experiment it did work pretty well.
I only learned about this trick recently and haven't really used it yet.
We tried it on convolutional nets and it didn't work well :-/
Maybe I'll give it another shot.
RPROP maintains a dynamic learn rate for each parameter.
Sounds a bit like "delta-bar-delta".
Post by Andreas Müller
It only looks at the sign of the gradient.
Surely it can't work online then, can it?

David
Andreas Müller
2011-11-04 17:44:28 UTC
Permalink
Post by David Warde-Farley
Post by Andreas Müller
Post by Olivier Grisel
In my case I don't use RPROP (I don't know what it is and I just use a
simple backprop) and I use Leon Bottou's trick to perform a burn-in on
the first 10k samples with a grid search of learning rate parameters
and then select the most effective learning rate and multiply it by 2
(it brings robustness). In my experiment it did work pretty well.
I only learned about this trick recently and haven't really used it yet.
We tried it on convolutional nets and it didn't work well :-/
Maybe I'll give it another shot.
RPROP maintains a dynamic learn rate for each parameter.
Sounds a bit like "delta-bar-delta".
Don't know about that. RPROP is pretty old, 1993 I think:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=298623
Post by David Warde-Farley
Post by Andreas Müller
It only looks at the sign of the gradient.
Surely it can't work online then, can it?
It can work with mini-batches that are "large enough",
I think. But not really online, no.
Also you need twice as much memory, since you have
to keep gradient information and current learning rates
in memory.
These are definitely two downsides.
We still used it successfully on "big" datasets like NORB
jittered-cluttered and cifar.

If you can afford batch learning, I think it is worth a try
since there are no parameters to tune and it
often works well.
Olivier Grisel
2011-11-04 17:49:07 UTC
Permalink
Post by Andreas Müller
Post by David Warde-Farley
Post by Andreas Müller
Post by Olivier Grisel
In my case I don't use RPROP (I don't know what it is and I just use a
simple backprop) and I use Leon Bottou's trick to perform a burn-in on
the first 10k samples with a grid search of learning rate parameters
and then select the most effective learning rate and multiply it by 2
(it brings robustness). In my experiment it did work pretty well.
I only learned about this trick recently and haven't really used it yet.
We tried it on convolutional nets and it didn't work well :-/
Maybe I'll give it another shot.
RPROP maintains a dynamic learn rate for each parameter.
Sounds a bit like "delta-bar-delta".
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=298623
Post by David Warde-Farley
Post by Andreas Müller
It only looks at the sign of the gradient.
Surely it can't work online then, can it?
It can work with mini-batches that are "large enough",
I think. But not really online, no.
Even if the mini-batch is not large enough, you can remember past
information from a large window in constant memory size using
exponentially weighted moving averages:

https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
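
i.e. something like:

    def update_ema(avg_grad, grad, decay=0.9):
        # keep a running, exponentially weighted estimate of the gradient
        return decay * avg_grad + (1.0 - decay) * grad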
Post by Andreas Müller
Also you need twice as much memory, since you have
to keep gradient information and current learning rates
in memory.
These are definitely two downsides.
As long as it's pre-allocated before the main loop that should not be a problem.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
David Warde-Farley
2011-11-04 18:00:44 UTC
Permalink
Post by Andreas Müller
Post by David Warde-Farley
Post by Andreas Müller
Post by Olivier Grisel
In my case I don't use RPROP (I don't know what it is and I just use a
simple backprop) and I use Leon Bottou's trick to perform a burn-in on
the first 10k samples with a grid search of learning rate parameters
and then select the most effective learning rate and multiply it by 2
(it brings robustness). In my experiment it did work pretty well.
I only learned about this trick recently and haven't really used it yet.
We tried it on convolutional nets and it didn't work well :-/
Maybe I'll give it another shot.
RPROP maintains a dynamic learn rate for each parameter.
Sounds a bit like "delta-bar-delta".
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=298623
Hehe, DBD is even older: http://www.bcs.rochester.edu/people/robbie/jacobs.nn88.pdf
Post by Andreas Müller
Post by David Warde-Farley
Post by Andreas Müller
It only looks at the sign of the gradient.
Surely it can't work online then, can it?
It can work with mini-batches that are "large enough",
I think. But not really online, no.
Also you need twice as much memory, since you have
to keep gradient information and current learning rates
in memory.
I'd think "large enough" minibatches are an even bigger problem, though yes, DBD and its variants suffer from the same problem. You can probably reparameterize somewhat smartly and use less precision to store the "variable" parts of the learning rates, though.
Post by Andreas Müller
These are definitely two downsides.
We still used it successfully on "big" datasets like NORB
jittered-cluttered and cifar.
Interesting.
Post by Andreas Müller
If you can afford batch learning, I think it is worth a try
since there are no parameters to tune and it
often works well.
Indeed, this is the "holy grail" after all. :)

David
Olivier Grisel
2011-11-04 15:44:03 UTC
Permalink
I think it makes sense to have a pure Cython implementation in
scikit-learn, without a runtime dependency on a compiler or CUDA /
OpenCL, and to have advanced, Theano-based neural networks (with more
parameter auto-tuning and pluggable exotic objective functions) in
pylearn.

I think there is room for both.
--
Olivier
Mathieu Blondel
2011-11-04 16:25:22 UTC
Permalink
On Sat, Nov 5, 2011 at 12:44 AM, Olivier Grisel
Post by Olivier Grisel
I think it makes sense to have a pure Cython implementation in
scikit-learn, without a runtime dependency on a compiler or CUDA /
OpenCL, and to have advanced, Theano-based neural networks (with more
parameter auto-tuning and pluggable exotic objective functions) in
pylearn.
I'm +1 on having a Cython-based implementation in scikit-learn even
if it's a little bit behind a Theano-based implementation.

Another possibility is to host a Theano-based implementation as a
side project on github and make the API scikit-learn compatible.

# In general, I don't really buy the "why implement X if it already
exists in Y" argument because it can be said of pretty much every
module in scikit-learn. Since we came up with a quite rigorous review
process, even if we reimplement something that already exists
elsewhere, in the end we usually obtain a very high-quality module (in
code and documentation). Think of the tree module :)

Mathieu
Kenneth C. Arnold
2011-11-04 20:24:23 UTC
Permalink
Post by Mathieu Blondel
Another possibility is to host a Theano-based implementation as a
side project on github and make the API scikit-learn compatible.
# In general, I don't really buy the "why implement X if it already
exists in Y" argument because it can be said of pretty much every
module in scikit-learn. Since we came up with a quite rigorous review
process, even if we reimplement something that already exists
elsewhere, in the end we usually obtain a very high-quality module (in
code and documentation). Think of the tree module :)
+1 for the sklearn review process AND for cooperating with other
projects and sharing a good API. It would be great to be able to
prototype things in sklearn and then drop in something like Theano or
a map-reduce implementation or some experimental new algorithm to
improve speed or accuracy. The more the ecosystem converges on quality
code with similar APIs, the closer we are to that.

There's currently a wiki page for other libraries with compatible
APIs. It only had one project on it last I checked. Perhaps that could
be extended to "other projects that can help you get your machine
learning job done in Python, whom we're talking to about API
alignment" :)

-Ken
Frédéric Bastien
2011-11-04 20:45:18 UTC
Permalink
On Fri, Nov 4, 2011 at 4:24 PM, Kenneth C. Arnold
Post by Kenneth C. Arnold
+1 for the sklearn review process AND for cooperating with other
projects and sharing a good API. It would be great to be able to
prototype things in sklearn and then drop in something like Theano or
a map-reduce implementation or some experimental new algorithm to
improve speed or accuracy. The more the ecosystem converges on quality
code with similar APIs, the closer we are to that.
There's currently a wiki page for other libraries with compatible
APIs. It only had one project on it last I checked. Perhaps that could
be extended to "other projects that can help you get your machine
learning job done in Python, whom we're talking to about API
alignment" :)
I couldn't find that page. Can you find it again? When we have
something that supports the scikit.learn API, I will send you info about
it to add to that page. But I would be interested to see it first.

thanks

Fred
Kenneth C. Arnold
2011-11-04 21:10:36 UTC
Permalink
I had nothing to do with that page, but it's
https://github.com/scikit-learn/scikit-learn/wiki/Related-Projects.

-Ken
Post by Frédéric Bastien
On Fri, Nov 4, 2011 at 4:24 PM, Kenneth C. Arnold
Post by Kenneth C. Arnold
+1 for the sklearn review process AND for cooperating with other
projects and sharing a good API. It would be great to be able to
prototype things in sklearn and then drop in something like Theano or
a map-reduce implementation or some experimental new algorithm to
improve speed or accuracy. The more the ecosystem converges on quality
code with similar APIs, the closer we are to that.
There's currently a wiki page for other libraries with compatible
APIs. It only had one project on it last I checked. Perhaps that could
be extended to "other projects that can help you get your machine
learning job done in Python, whom we're talking to about API
alignment" :)
I couldn't find that page. Can you find it again? When we have
something that supports the scikit.learn API, I will send you info about
it to add to that page. But I would be interested to see it first.
thanks
Fred
Frédéric Bastien
2011-11-04 21:11:51 UTC
Permalink
thanks

Fred

On Fri, Nov 4, 2011 at 5:10 PM, Kenneth C. Arnold
Post by Kenneth C. Arnold
I had nothing to do with that page, but it's
https://github.com/scikit-learn/scikit-learn/wiki/Related-Projects.
-Ken
Post by Frédéric Bastien
On Fri, Nov 4, 2011 at 4:24 PM, Kenneth C. Arnold
Post by Kenneth C. Arnold
+1 for the sklearn review process AND for cooperating with other
projects and sharing a good API. It would be great to be able to
prototype things in sklearn and then drop in something like Theano or
a map-reduce implementation or some experimental new algorithm to
improve speed or accuracy. The more the ecosystem converges on quality
code with similar APIs, the closer we are to that.
There's currently a wiki page for other libraries with compatible
APIs. It only had one project on it last I checked. Perhaps that could
be extended to "other projects that can help you get your machine
learning job done in Python, whom we're talking to about API
alignment" :)
I couldn't find that page. Can you find it again? When we have
something that supports the scikit.learn API, I will send you info about
it to add to that page. But I would be interested to see it first.
thanks
Fred
Lars Buitinck
2011-11-04 14:52:13 UTC
Permalink
Post by Andreas Müller
Are you using pure Python at the moment?
Where can I find your code? And is the goal of your code to
be included in the scikits?
My goal is to improve on somebody else's result and get a paper
published ;), but if the sklearn community can peer review and adopt
the code I use to obtain that result, I'd be more than happy.

This is more or less what I used:

https://github.com/larsmans/scikit-learn/tree/mlperceptron
Again, with weight vectors loaded from a Matlab file by hand, so no fit yet.
Post by Andreas Müller
I think it is necessary to have minibatch learning and so I think
building that into the code from the beginning is good.
Alright.
Post by Andreas Müller
Post by Lars Buitinck
Logistic activation functions seem fashionable; that's what Bishop and
other textbooks use. I'm not sure if there's a big difference, but it
seems to me that gradient computations might be slightly more
efficient (guesswork, I admit). We can always add a steepness
parameter later.
In my personal experience, tanh works better. LeCun uses tanh ;)
That's always a good argument ;)
Post by Andreas Müller
RPROP is very easy to implement. I use it in my lab all the time.
I have no personal experience with IRPROP-? How is that different
than IRPROP? What is RPROP+? Can you give me references?
http://sci2s.ugr.es/keel/pdf/algorithm/articulo/2003-Neuro-Igel-IRprop+.pdf

The difference between RPROP+ and - is that + does backtracking, so it
needs more memory. In the Improved RPROP variant, + or - hardly makes
any difference.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam