Mathieu Blondel

2011-03-31 11:39:32 UTC

As you may remember from a thread on the mailing-list a few months
ago, there was an agreement that online algorithms should

implement a partial_fit(X, y) method. The reason for adding a new

method was mainly a matter of semantics: partial_fit makes it clear

that the previous model is not erased when partial_fit is called

again.

I started to look into adding partial_fit to the SGD module. My

original idea was to rename the fit method in BaseSGD to _fit, add a

partial=True|False option and initialize the model parameters only

when partial=False or the parameters are not present yet. This way,

fit and partial_fit could easily be implemented in terms of _fit.
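Something like this (just a runnable sketch; the parameter
initialization and the update loop here are stand-ins, not the
module's actual internals):

```python
import numpy as np

class BaseSGD:
    def _fit(self, X, y, partial=False):
        if not partial or not hasattr(self, "coef_"):
            # Plain fit, or the very first partial_fit call:
            # (re)allocate the model parameters.
            self.coef_ = np.zeros(X.shape[1])
            self.n_updates_ = 0
        # Stand-in for the real SGD training loop.
        self.n_updates_ += X.shape[0]
        return self

    def fit(self, X, y):
        return self._fit(X, y, partial=False)

    def partial_fit(self, X, y):
        return self._fit(X, y, partial=True)
```

fit erases the previous model, partial_fit keeps accumulating state.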

However, it is more difficult than I thought and I found potential

issues.

The first one is that the vector y may contain only a subset of the

classes (or in the extreme case, only one class). This is a problem

since SGD pre-allocates the coef_ matrix (n_classes x n_features). The

obvious solution is to use a dictionary to store the weight vectors of

each class instead of a numpy 2d-array. For compatibility with other

classifiers, we can implement coef_ as a property.
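For instance (a sketch, names made up; rows of the coef_ view ordered
by sorted class label the way other classifiers do):

```python
import numpy as np

class ClassWeights:
    def __init__(self, n_features):
        self.n_features = n_features
        self._w = {}  # class label -> weight vector

    def get(self, label):
        # Allocate the weight vector for a class lazily, the first
        # time that class shows up in a partial_fit call.
        if label not in self._w:
            self._w[label] = np.zeros(self.n_features)
        return self._w[label]

    @property
    def coef_(self):
        # Expose the usual (n_classes x n_features) 2d-array,
        # rows ordered by sorted class label.
        return np.vstack([self._w[l] for l in sorted(self._w)])
```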

The second potential problem concerns the learning schedules. The
routines written in Cython need an n_iter argument. If the user makes
several passes over the dataset (see below) and calls partial_fit
repeatedly, would we need to save the state of the learning rate
between calls?
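What I have in mind is keeping the update counter as estimator state,
so an eta = eta0 / t schedule keeps decaying across calls instead of
restarting at eta0 each time (a sketch, not the module's actual
schedule code):

```python
class InvScalingSchedule:
    # Persist the step counter t_ across partial_fit calls so the
    # eta = eta0 / t schedule continues where it left off.
    def __init__(self, eta0=1.0):
        self.eta0 = eta0
        self.t_ = 0  # total number of updates performed so far

    def next_eta(self):
        self.t_ += 1
        return self.eta0 / self.t_
```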

Peter, what areas of the code do you think need to be changed, and do
you have ideas on how to factor out as much code as possible?

Another thing I was wondering: is it possible to extract reusable
utils from the SGD module, such as the dense-sparse dot product,
dense-sparse addition, etc.? (I suppose we would need a .pxd header
file?) I was wondering about that because of custom loss functions
too.
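To make the idea concrete, here is the kind of util I mean, in plain
NumPy rather than Cython (the sparse vector is given as parallel
index/value arrays, CSR-row style; names are mine):

```python
import numpy as np

def sparse_dense_dot(indices, data, w):
    # Dot product between a sparse vector (index/value arrays)
    # and a dense vector w: only nonzero positions are touched.
    return float(np.dot(data, w[indices]))

def sparse_dense_add(indices, data, w, scale=1.0):
    # w += scale * sparse_vector, updating only the nonzero positions.
    w[indices] += scale * data
```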

Also to put partial_fit into more context: although partial_fit can

potentially be used in a pure online setting, the plan was mainly to

use it for large-scale datasets, i.e., make several passes over the
dataset but load the data in blocks. The plan was to create an
iterator object which can be reset:

    reader = SvmlightReader("file.txt", block_size=10000)

    for n in range(n_iter):
        for X, y in reader:
            clf.partial_fit(X, y)
        reader.reset()
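A minimal sketch of such a reader (SvmlightReader and block_size are
the hypothetical names from the snippet above; this version just
yields blocks of raw lines rather than parsed X, y pairs). If the
file is re-opened inside __iter__, we even get the reset behaviour
for free: starting a new for loop restarts from the top of the file.

```python
class BlockReader:
    # Resettable block iterator: each for-loop over the reader
    # re-opens the file and yields blocks of at most block_size lines.
    def __init__(self, path, block_size=10000):
        self.path = path
        self.block_size = block_size

    def __iter__(self):
        block = []
        with open(self.path) as f:
            for line in f:
                block.append(line)
                if len(block) == self.block_size:
                    yield block
                    block = []
        if block:  # final, possibly smaller, block
            yield block
```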

It could also be useful to have a method to generate a mini-batch

block randomly:

    X, y = reader.random_minibatch(block_size=1000)

A text-based file format like Svmlight's doesn't offer a direct way to

quickly retrieve a random line. We would need to build a "line => byte

offset" index (can be produced in memory when needed).
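Building that index is a single pass over the file; random retrieval
is then a seek plus a readline per sampled line (a sketch, function
names are mine):

```python
import random

def build_offset_index(path):
    # One pass over the file, recording the byte offset at which
    # each line starts.
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def random_lines(path, offsets, k, rng=random):
    # Seek straight to k randomly chosen line starts and read them.
    lines = []
    with open(path, "rb") as f:
        for off in rng.sample(offsets, k):
            f.seek(off)
            lines.append(f.readline())
    return lines
```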

All in all, this made me think that if we want to start playing with
an online API, it would probably be easier to start with a good old
averaged perceptron than to try to modify the current SGD module.

Mathieu
