Discussion:
[Scikit-learn-general] partial_fit in SGD module
Mathieu Blondel
2011-03-31 11:39:32 UTC
Permalink
As you may remember from a thread on the mailing-list a few months
back, there was agreement that online algorithms should implement a
partial_fit(X, y) method. The reason for adding a new method was
mainly a matter of semantics: partial_fit makes it clear that the
previous model is not erased when partial_fit is called again.

I started to look into adding partial_fit to the SGD module. My
original idea was to rename the fit method in BaseSGD to _fit, add a
partial=True|False option and initialize the model parameters only
when partial=False or the parameters are not present yet. This way,
fit and partial_fit could easily be implemented in terms of _fit.
However, it is more difficult than I thought and I found potential
issues.

The first issue is that the vector y may contain only a subset of the
classes (or, in the extreme case, a single class). This is a problem
since SGD pre-allocates the coef_ matrix (n_classes x n_features). The
obvious solution is to use a dictionary mapping each class to its
weight vector instead of a numpy 2d-array. For compatibility with
other classifiers, we can implement coef_ as a property.
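
To make the idea concrete, here is a minimal sketch (the class and
attribute names are hypothetical, not existing scikit-learn code):

import numpy as np

class DictWeightsSketch:
    # Minimal sketch of the dictionary idea: one weight vector per class,
    # allocated lazily so new classes can appear in later partial_fit calls.

    def __init__(self, n_features):
        self.n_features = n_features
        self._weights = {}  # class label -> 1-d weight vector

    def _weight_vector(self, label):
        # Allocate a zero vector the first time a class is seen.
        if label not in self._weights:
            self._weights[label] = np.zeros(self.n_features)
        return self._weights[label]

    @property
    def coef_(self):
        # Expose the familiar (n_classes, n_features) array on demand.
        labels = sorted(self._weights)
        return np.vstack([self._weights[l] for l in labels])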

The second potential problem concerns the learning schedules. The
routines written in Cython need an n_iter argument. If the user makes
several passes over the dataset (see below) and calls partial_fit
repeatedly, we would need to save the state of the learning rate
between calls.
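
For instance, a sketch of the state that would have to survive between
calls, assuming an inverse-scaling schedule eta_t = eta0 / t^power_t
(names hypothetical):

class LearningRateState:
    # Keep the SGD step counter across partial_fit calls so the schedule
    # resumes where the previous call stopped instead of restarting at t=1.

    def __init__(self, eta0=0.01, power_t=0.5):
        self.eta0 = eta0
        self.power_t = power_t
        self.t = 1  # total number of weight updates performed so far

    def next_eta(self):
        eta = self.eta0 / (self.t ** self.power_t)
        self.t += 1
        return eta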

Peter, which areas of the code do you think need to change, and do you
have ideas on how to factor out as much code as possible?

Another thing I was wondering: would it be possible to extract
reusable utils from the SGD module, such as dense-sparse dot product,
dense-sparse addition, etc.? (I suppose we would need a .pxd header
file?) I was wondering about that because of custom loss functions
too.

Also, to put partial_fit into more context: although partial_fit can
potentially be used in a pure online setting, the plan was mainly to
use it for large-scale datasets, i.e. to make several passes over the
dataset but load the data in blocks. The plan was to create an
iterator object which can be reset:

reader = SvmlightReader("file.txt", block_size=10000)
for n in range(n_iter):
    for X, y in reader:
        clf.partial_fit(X, y)
    reader.reset()

It could also be useful to have a method that draws a mini-batch
block at random:
X, y = reader.random_minibatch(block_size=1000)

A text-based file format like Svmlight's doesn't offer a direct way to
quickly retrieve a random line. We would need to build a "line => byte
offset" index (can be produced in memory when needed).

All in all, this made me think that if we want to start playing with
an online API, it would probably be easier to start with a good old
averaged perceptron than to try to modify the current SGD module.

Mathieu
Peter Prettenhofer
2011-03-31 12:36:31 UTC
Permalink
Hi Mathieu,
Post by Mathieu Blondel
[..]
The first issue is that the vector y may contain only a subset of the
classes (or, in the extreme case, a single class). This is a problem
since SGD pre-allocates the coef_ matrix (n_classes x n_features). The
obvious solution is to use a dictionary mapping each class to its
weight vector instead of a numpy 2d-array. For compatibility with
other classifiers, we can implement coef_ as a property.
I haven't thought about this... a quick and dirty way to solve it is
to specify the number of classes as a constructor argument (similar to
the TheanoSGD classifier in jaberg's image-patch branch).

Anyway, complete online multi-class classification requires a serious
refactoring of the current SGD code base!
Post by Mathieu Blondel
The second potential problem concerns the learning schedules. The
routines written in Cython need an n_iter argument. If the user makes
several passes over the dataset (see below) and calls partial_fit
repeatedly, we would need to save the state of the learning rate
between calls.
That's true - the learning rate has to be stored.
Post by Mathieu Blondel
Peter, what areas of the code do you think need to be changed and do
you have ideas how to factor as much code as possible?
When you look at the current Cython code you will notice that it
pretty much relies on numpy ndarrays or scipy's sparse matrices.
However, we could change the loop over the training examples from row
indices [1] to something which returns a pair of x and y [2], where x
may be an ndarray in the dense case, and a single-row sparse matrix or
a recarray (as in bolt) in the sparse case. This would require only
minor refactoring but would make the current code somewhat slower (a
factor of 2). A sketch of such a loop follows the links below.

[1] https://github.com/pprett/scikit-learn/blob/master/scikits/learn/linear_model/sgd_fast.pyx#L310

[2] https://github.com/pprett/bolt/blob/master/bolt/trainer/sgd.pyx#L428
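
To illustrate the proposed abstraction at the Python level (the real
loop would live in Cython; iter_examples is hypothetical):

import numpy as np
import scipy.sparse as sp

def iter_examples(X, y):
    # Yield one (x_i, y_i) pair per example, hiding whether X is a dense
    # ndarray or a scipy sparse matrix behind a single interface.
    if sp.issparse(X):
        X = X.tocsr()
        for i in range(X.shape[0]):
            yield X.getrow(i), y[i]  # single-row sparse matrix
    else:
        X = np.asarray(X)
        for i in range(X.shape[0]):
            yield X[i], y[i]  # 1-d ndarray
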
Post by Mathieu Blondel
Another thing I was wondering: would it be possible to extract
reusable utils from the SGD module, such as dense-sparse dot product,
dense-sparse addition, etc.? (I suppose we would need a .pxd header
file?) I was wondering about that because of custom loss functions
too.
Mathieu Blondel
2011-03-31 14:28:20 UTC
Permalink
Peter,

On Thu, Mar 31, 2011 at 9:36 PM, Peter Prettenhofer
Post by Peter Prettenhofer
I haven't thought about this... a quick and dirty way to solve it is
to specify the number of classes as a constructor argument (similar to
the TheanoSGD classifier in jaberg's image-patch branch).
I thought about this too but it implies that we have two distinct
classes: one for batch and one for online. It would make things
simpler if we could have fit and partial_fit in the same class (or
fit_iterable if we decide to go this way).
Alexandre Passos
2011-03-31 14:48:38 UTC
Permalink
Post by Mathieu Blondel
Peter,
On Thu, Mar 31, 2011 at 9:36 PM, Peter Prettenhofer
Post by Peter Prettenhofer
I haven't thought about this... a quick and dirty way to solve it is
to specify the number of classes as a constructor argument (similar to
the TheanoSGD classifier in jaberg's image-patch branch).
I thought about this too but it implies that we have two distinct
classes: one for batch and one for online. It would make things
simpler if we could have fit and partial_fit in the same class (or
fit_iterable if we decide to go this way).
Peter Prettenhofer
2011-03-31 15:20:34 UTC
Permalink
Wow - that sounds interesting indeed - now I know why Vowpal Wabbit
has such an odd formula for the weight update - multiplying the
gradient by the weight was too simple to be true...

But before you integrate it into the Cython code we should merge my
learningrate branch, which introduces constant and inverse-scaling
learning rates.

best,
Peter
[..]
I was thinking of user-defined losses, for example to address
cost-sensitive learning (e.g., incur a stronger loss for some classes
than others).
I think it's simpler to adapt all standard losses to be
cost-sensitive (with an optional parameter which is a vector with the
cost of getting each example wrong) than it is to handle general
losses. There is even a way to adapt common online/non-online SGD to
importance-weighted examples: http://arxiv.org/pdf/1011.1576 .
I can implement a variant of this for the batch SGD tomorrow at the
sprint if you're interested.
--
 - Alexandre
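
For concreteness, a minimal sketch of the per-example cost idea
(hypothetical function; not the update from the paper above):

import numpy as np

def weighted_hinge_gradient(w, x, y, cost):
    # Cost-sensitive hinge loss: scale the usual (sub)gradient by a
    # per-example cost, so costly mistakes push the weights harder.
    # Assumes y is +1 or -1.
    margin = y * np.dot(w, x)
    if margin >= 1.0:
        return np.zeros_like(w)
    return -cost * y * x
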
--
Peter Prettenhofer
Alexandre Passos
2011-03-31 16:10:27 UTC
Permalink
On Thu, Mar 31, 2011 at 12:20, Peter Prettenhofer
Post by Peter Prettenhofer
Wow - that sounds interesting indeed - now I know why vowpal wabit has
such an odd formular for the weight update - multiplying the gradient
by the weight was too simple to be true...
But before you integrate it into the cython code we should merge my
learningrate branch which introduces constant and inverse scaling
learning rates.
Or I can fork from your branch, whichever way is easier.
--
 - Alexandre
Peter Prettenhofer
2011-03-31 16:51:47 UTC
Permalink
Post by Alexandre Passos
[..]
Or I can fork from your branch, whichever way is easier.
This would be even better - then you could review the changes (API,
default settings, etc.) :-)

best,
Peter
--
Peter Prettenhofer
Mathieu Blondel
2011-04-02 10:02:12 UTC
Permalink
So what do people think about giving up on partial_fit and using
fit_iterable instead?

reader = SvmlightReader("file.txt", block_size=10000)
clf.fit_iterable(reader, n_iter=10)

To solve the problem that the label set cardinality is not known in
advance, we can require any reader object to implement an n_classes
property:

reader.n_classes
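
A minimal sketch of how the reader could satisfy that contract,
computing n_classes from a first scan over the labels (hypothetical;
ignores svmlight comment lines):

class SvmlightReader:
    # Sketch: n_classes is computed lazily from a first pass over the
    # labels (the first whitespace-separated token on each svmlight line).

    def __init__(self, path, block_size=10000):
        self.path = path
        self.block_size = block_size
        self._n_classes = None

    @property
    def n_classes(self):
        if self._n_classes is None:
            with open(self.path) as f:
                labels = set(line.split(None, 1)[0]
                             for line in f if line.strip())
            self._n_classes = len(labels)
        return self._n_classes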

BTW, ProbabilisticPCA can be fit with EM, which means that it could
implement a fit_iterable method too.

Mathieu
Gael Varoquaux
2011-04-02 10:10:45 UTC
Permalink
Post by Mathieu Blondel
So what do people think about giving up on partial_fit and use
fit_iterable instead?
reader = SvmlightReader("file.txt", block_size=10000)
clf.fit_iterable(reader, n_iter=10)
I would rather prefer partial_fit, because it means that in this case
it is clearly the role of the framework (not of the scikit) to
distribute the computation to the algorithm. It is an
inversion-of-control issue: where do we want the control to be? Why
would you prefer fit_iterable?

Iterables can also be nasty beasts, as Vincent discovered yesterday:
consuming them is a side effect that modifies them, so things can
happen behind your back. I guess that's why I prefer the code outside
the algorithm to be in charge of distributing the data.
Post by Mathieu Blondel
To solve the problem that the label set cardinality is not known in
advance, we can require any reader object to implement a n_classes
property.
reader.n_classes
That I would clearly frown upon, because it means that the contract with
the input object is no longer that it is a plain Python iterable, but
that it is a custom object that has a new interface that people must
learn and code to.
Post by Mathieu Blondel
BTW, ProbabilisticPCA can be fit with EM, which means that it could
implement a fit_iterable method too.
Yes, clearly, which would be great, as it would give us a data reduction
code that works with data not fitting in memory.

G
Mathieu Blondel
2011-04-02 10:29:09 UTC
Permalink
On Sat, Apr 2, 2011 at 7:10 PM, Gael Varoquaux
Post by Gael Varoquaux
I would rather prefer partial_fit, because it means that in this case
it is clearly the role of the framework (not of the scikit) to
distribute the computation to the algorithm. It is an
inversion-of-control issue: where do we want the control to be? Why
would you prefer fit_iterable?
Semantically, partial_fit targets a pure online setting while
fit_iterable targets a large-scale learning setting. With partial_fit,
you lose the notion of iteration over the entire dataset, which may be
a problem for algorithms which update the learning rate after each
iteration.

I would like to hear Peter's opinion as he has already thought about
those problems for bolt.
Post by Gael Varoquaux
  reader.n_classes
That I would clearly frown upon, because it means that the contract with
the input object is no longer that it is a plain Python iterable, but
that it is a custom object that has a new interface that people must
learn and code to.
It will indeed be a custom iterable: as mentioned earlier in the
thread, the plan is to have a reset method too.

But indeed reader.n_classes will be a problem for UNIX pipe or
network-based iterables...

Mathieu
Mathieu Blondel
2011-04-02 10:36:21 UTC
Permalink
Post by Mathieu Blondel
But indeed reader.n_classes will be a problem for UNIX pipe or
network-based iterables...
As Peter was saying, it seems hard to support both the pure-online
setting and the large-scale learning setting.

Mathieu
Alexandre Passos
2011-04-02 10:44:08 UTC
Permalink
Post by Mathieu Blondel
On Sat, Apr 2, 2011 at 7:10 PM, Gael Varoquaux
Post by Gael Varoquaux
I would rather prefer partial_fit, because it means that in this case
it is clearly the role of the framework (not of the scikit) to
distribute the computation to the algorithm. It is an
inversion-of-control issue: where do we want the control to be? Why
would you prefer fit_iterable?
Semantically, partial_fit targets a pure online setting while
fit_iterable targets a large-scale learning setting. With partial_fit,
you lose the notion of iteration over the entire dataset, which may be
a problem for algorithms which update the learning rate after each
iteration.
I'm partial towards partial_fit. I think it can be used in more
settings, especially as a building block. If you do want something
like fit_iterable, it can be implemented with partial_fit plus
something like a finish_iteration() method to update the learning
rates or similar.
Post by Mathieu Blondel
It will indeed be a custom iterable: as mentioned earlier in the
thread, the plan is to have a reset method too.
But indeed reader.n_classes will be a problem for UNIX pipe or
network-based iterables...
I think this is probably going to end up as something highly
confusing to the average scikit user: that you can't just say
classifier.fit_iterable(parse_line_as_feature_vector(l) for l in
file(...)).
--
 - Alexandre
Mathieu Blondel
2011-04-02 11:11:31 UTC
Permalink
Post by Alexandre Passos
I'm partial towards partial_fit. I think it can be used in more
settings, especially as a building block. If you do want something
like fit_iterable, it can be implemented with partial_fit plus
something like a finish_iteration() method to update the learning
rates or similar.
So to illustrate what it could look like:

class OnlineMixin:

    def fit_reader(self, reader, n_iter):
        for n in range(n_iter):
            for X, y in reader:
                self.partial_fit(X, y)
            self.finish_iteration()
            reader.reset()

reader = SvmlightReader("file.txt", block_size=10000)
clf.fit_reader(reader, n_iter=10)
Post by Alexandre Passos
I think this is probably going to end up as something highly
confusing to the average scikit user: that you can't just say
classifier.fit_iterable(parse_line_as_feature_vector(l) for l in
file(...)).
Indeed. We can call it fit_reader to prevent any confusion. The
motivation for having a reader object rather than a plain iterable is
that, by definition, iterables are for the pure online setting: they
make only one pass over the data.

To solve the problem that the label set cardinality is not known in
advance, we would need to use suitable data structures, such as a
dictionary of weight vectors instead of a 2d matrix. (That seems like
a lot of refactoring for the SGD module.)

Mathieu