Discussion:
[Scikit-learn-general] LogisticRegression versus SGDClassifier(loss="log")?
Fred Mailhot
2012-06-15 20:25:09 UTC
Dear all,

What are the advantages of choosing one of the Subject line classifiers
over the other? At a quick glance, I see the following:

- LogisticRegression implements predict_proba for the multiclass case,
while SGDClassifier doesn't
- SGDClassifier(loss="log") lets you specify multiple CPUs for the OVA
training, while LogisticRegression doesn't

Are there other obvious differences that might influence this decision?

Regards,
Fred.
Peter Prettenhofer
2012-06-15 20:53:14 UTC
Hi Fred,

The major difference is the optimization algorithm:
Liblinear/coordinate descent vs. stochastic gradient descent.

If your problem is high-dimensional (10K features or more) and you have
a large number of examples (100K or more), you should choose the
latter; otherwise, LogisticRegression should be fine.
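
For reference, a minimal instantiation of the two estimators under
discussion (the parameter values here are illustrative, not
recommendations)::

    from sklearn.linear_model import LogisticRegression, SGDClassifier

    lr = LogisticRegression(C=1.0)               # liblinear-based solver
    sgd = SGDClassifier(loss="log", alpha=1e-4)  # stochastic gradient descent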

Neither is a proper multinomial logistic regression model;
LogisticRegression does not care and simply computes the probability
estimate of each OvR classifier, then normalizes them so they sum to
one. You could do the same for SGDClassifier(loss='log'), but you
would have to implement it yourself. Also be aware that
SGDClassifier(n_jobs > 1) uses multiple processes, so if your
dataset (``X``) is too large (more than 50% of your RAM) you'll run
into trouble.
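
A minimal sketch of that normalization, assuming a fitted
SGDClassifier named ``clf`` (the sigmoid-plus-renormalize recipe is my
illustration, not a scikit-learn API)::

    import numpy as np

    scores = clf.decision_function(X)          # shape (n_samples, n_classes)
    probs = 1.0 / (1.0 + np.exp(-scores))      # per-OvR-classifier probabilities
    probs /= probs.sum(axis=1)[:, np.newaxis]  # rows now sum to one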

best,
Peter
Fred Mailhot
2012-06-15 21:22:05 UTC
Thanks for the prompt reply, Peter. I may be in a situation that will call
for SGDClassifier, so I have two follow-up questions:

1) I'd like to compute the class probs; are the probs for the individual
OvR classifiers (easily) accessible? My intuition is that I can compute
these from the values returned by decision_function(), then do the
normalization afterward...

2) How "online" is the SGD implementation? Specifically, would it be
possible do to something like continuous training from a "neverending"
stream of data (e.g. coming in over a network socket)?

Thanks again,
Fred.
Peter Prettenhofer
2012-06-18 06:49:09 UTC
Post by Fred Mailhot
1) I'd like to compute the class probs; are the probs for the
individual OvR classifiers (easily) accessible? My intuition is that I
can compute these from the values returned by decision_function(), then
do the normalization afterward...
Correct, you can get the class probability for each OvR classifier via
decision_function::

    # assuming a fitted classifier ``clf`` and ``import numpy as np``
    P = 1.0 / (1.0 + np.exp(-clf.decision_function(X)))
Post by Fred Mailhot
2) How "online" is the SGD implementation? Specifically, would it be
possible to do something like continuous training from a "neverending"
stream of data (e.g. coming in over a network socket)?
You can do "online" learning via SGDClassifier.partial_fit. I'm not
really familiar with "practical" online learning; the partial_fit
method mainly targets at "sequential learning" which is useful when
your training data does not fit into main memory. The major issue here
is again the learning rate. Currently, partial_fit records the
learning rate/schedule from previous calls to partial_fit which means
that at some point in time you hardy update your model based on new
examples because the learning rate became too small. If you need to
"adapt" to new data it might be better to "reset" the learning rate
before calling partial_fit or to train a new classifier on the new
data and combine the old model (i.e. parameter vector) with the new
(e.g. an exponential average). Again, I've no practical experience
with online learning so please take this with a grain of salt.
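
If you go the combine-models route, a hedged sketch of the exponential
average (``beta``, ``old_clf``, ``new_clf``, and the new data are
placeholders of mine)::

    beta = 0.1  # weight given to the newly trained model
    new_clf = SGDClassifier(loss="log").fit(X_new, y_new)
    old_clf.coef_ = (1 - beta) * old_clf.coef_ + beta * new_clf.coef_
    old_clf.intercept_ = (1 - beta) * old_clf.intercept_ + beta * new_clf.intercept_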

A practical note: make sure you buffer the stream before you call
``partial_fit``; calling ``partial_fit`` with a single example at a
time will be rather inefficient (housekeeping and function-call
overhead in Python).
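
For example, a minimal buffering loop (``stream``, the batch size, and
the class labels are placeholders, not scikit-learn API)::

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss="log")
    classes = np.array([0, 1, 2])     # partial_fit needs all classes up front

    batch_X, batch_y = [], []
    for x, y in stream:               # e.g. examples read from a socket
        batch_X.append(x)
        batch_y.append(y)
        if len(batch_X) >= 1000:      # buffer ~1000 examples per call
            clf.partial_fit(np.array(batch_X), np.array(batch_y),
                            classes=classes)
            batch_X, batch_y = [], []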

best,
Peter
Lars Buitinck
2012-06-15 22:02:29 UTC
Post by Peter Prettenhofer
Neither is a proper multinomial logistic regression model;
LogisticRegression does not care and simply computes the probability
estimate of each OvR classifier, then normalizes them so they sum to
one. You could do the same for SGDClassifier(loss='log'), but you
would have to implement it yourself.
Any reason why we haven't implemented "multiclass" (OvR) predict_proba
on SGDClassifier?
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Andreas Mueller
2012-06-17 11:19:04 UTC
Post by Lars Buitinck
Any reason why we haven't implemented "multiclass" (OvR) predict_proba
on SGDClassifier?
Maybe because it is not really meaningful? Would you do softmax there?
Lars Buitinck
2012-06-17 11:24:24 UTC
Post by Andreas Mueller
Maybe because it is not really meaningful? Would you do softmax there?
Yes. Isn't that what LogisticRegression does too?
Andreas Mueller
2012-06-17 11:27:56 UTC
Post by Lars Buitinck
Yes. Isn't that what LogisticRegression does too?
I'm a bit on the go right now and don't have time to look it up. I
would have expected LogisticRegression to use Platt scaling.
As you said, the two-class version can be generalized, but the
multi-class version is not trained in a way that would be consistent
with this probabilistic model, I think.
Peter Prettenhofer
2012-06-18 06:54:40 UTC
Post by Andreas Mueller
I would have expected LogisticRegression to use Platt scaling. As you
said, the two-class version can be generalized, but the multi-class
version is not trained in a way that would be consistent with this
probabilistic model, I think.
I'm not totally sure - I've looked at the code
(linear.cpp/predict_probability) and it seems that they just do the
softmax and no Platt scaling, but I might have missed something.
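
For reference, the softmax being discussed, applied to the OvR
decision values (my sketch, not liblinear's actual code)::

    import numpy as np

    def softmax(scores):
        # shift by the row max for numerical stability
        e = np.exp(scores - scores.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    # probs = softmax(clf.decision_function(X))  # rows sum to one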
--
Peter Prettenhofer