Discussion:
Platt's sigmoid probability estimation technique
Lucas Wiman
2011-05-18 00:04:40 UTC
Hello,

I'm new to the scikits.learn mailing list, but have been using the library
for several months. I'm interested in contributing an estimator using
Platt's method of generating a sigmoid function to learn probability
estimates from the outputs of SVMs. The original method is described here:
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639
and Lin et al's numerical improvement is described here:
http://www.csie.ntu.edu.tw/~htlin/paper/doc/plattprob.pdf

This method is also implemented in LibSVM (indeed, it is included in the
svm.cpp file shipped with scikits.learn, in the function sigmoid_train). I'm
thinking something along the lines of the following (where train_X/train_Y
are the training feature vectors and labels, and test_X/test_Y a held-out
calibration set):

from scikits.learn.svm import LinearSVC

svc = LinearSVC()
svc.fit(train_X, train_Y)

# Fit the sigmoid on held-out decision values, not on the training set,
# to avoid biased probability estimates
platt_estimator = SigmoidProbabilityEstimator()
platt_estimator.fit(svc.decision_function(test_X), test_Y)

# Outputs an array of estimated probabilities
platt_estimator.predict(svc.decision_function(X))


We could also add a method to the LinearSVC and SVC classes which takes such
an estimator as input and sets a prob_estimator field on the classifier. When
this field is set, predict_proba would return probabilities instead of its
current behavior of raising NotImplementedError.
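In sketch form (set_prob_estimator is a hypothetical name for the proposed
method, not existing scikits.learn API):

svc.set_prob_estimator(platt_estimator)
# predict_proba would now return Platt-calibrated probabilities
svc.predict_proba(X)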

Thoughts?

Additionally, I'm not particularly familiar with using Cython (or indeed
C/C++ in general), so any pointers about how to wrap C functionality in
scikits.learn would be greatly appreciated.

Thanks and best wishes,
Lucas Wiman
Lucas Wiman
2011-05-18 00:24:49 UTC
Scratch the "and SVC", as that seems to already be implemented by both
LibSVM and the scikits wrapper class. Curious that LibLinear doesn't just
implement this method as well.

- Lucas
Alexandre Gramfort
2011-05-18 01:16:39 UTC
Hi Lucas,

I think it would be great to have such a feature.
Do you already have code to do this? Is it pure C? How large is the code?
Could it be done in pure Python, or would it need to be written in Cython?

I guess it could also be used by the SGD module. @Peter: what do you think?

Alex
Lucas Wiman
2011-05-18 01:37:14 UTC
I have a pure Python implementation that just follows the code in the
Lin reference above. It performs OK, though faster is always better. :-)

But there is already a C++ implementation in LibSVM, and it would probably
make sense to use that. I don't see why this method couldn't be applied to
any linear classifier (perceptron, SGD, etc), though I'm not aware of any
reference doing so.

Best,
Lucas

Alexandre Gramfort
2011-05-18 01:45:15 UTC
Post by Lucas Wiman
I have a pure Python implementation that just follows the code in the
Lin reference above. It performs OK, though faster is always better. :-)
But there is already a C++ implementation in LibSVM, and it would probably
make sense to use that.
If you use something like:

svc = SVC(probability=True)

that's what is used.

Do you have in mind isolating the LibSVM code to make it usable by
other classifiers?
Personally I'd rather have a pure Cython version.
Post by Lucas Wiman
I don't see why this method couldn't be applied to
any linear classifier (perceptron, SGD, etc), though I'm not aware of any
reference doing so.
The method is not restricted to linear classifiers as far as I can tell.
Let's wait for Peter's reaction.

Alex
Paolo Losi
2011-05-18 08:12:47 UTC
I wonder if it's worthwhile implementing an ad hoc optimization method
considering that:

1) the function to be minimized is strictly convex and so easy to handle
2) scipy provides well-tested and efficient optimization methods
3) usually the problem is very "small"

Hal Daumé seems to confirm this [1] and suggests a nice (obvious :-) extension
to the multi-class case.

The real effort is to double-check the objective function and gradient
formulation in order to avoid numerical difficulties. But that is fairly easy
as well.

In my prototype I'm using BFGS (scipy.optimize.fmin_bfgs) with good results.

Paolo

[1] http://agbs.kyb.tuebingen.mpg.de/km/bb/showthread.php?tid=23&pid=49#pid49
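For concreteness, here is a minimal sketch of that approach under Lin et
al.'s parameterization P(y=1|f) = 1 / (1 + exp(A*f + B)), binary case only;
fit_sigmoid and predict_sigmoid are illustrative names, not existing
scikits.learn functions:

import numpy as np
from scipy.optimize import fmin_bfgs

def fit_sigmoid(scores, y):
    # Fit A, B of P(y=1|f) = 1 / (1 + exp(A*f + B)) by maximum likelihood.
    n_pos = np.sum(y > 0)
    n_neg = len(y) - n_pos
    # Lin et al.'s soft targets keep probabilities away from exactly 0 and 1.
    t = np.where(y > 0, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(ab):
        z = ab[0] * scores + ab[1]
        # Cross-entropy; np.logaddexp(0, z) = log(1 + exp(z)) computed stably.
        return np.sum(np.logaddexp(0.0, z) - (1.0 - t) * z)

    ab0 = np.array([0.0, np.log((n_neg + 1.0) / (n_pos + 1.0))])  # Platt's start
    return fmin_bfgs(nll, ab0, disp=False)

def predict_sigmoid(scores, A, B):
    return 1.0 / (1.0 + np.exp(A * scores + B))

fmin_bfgs approximates the gradient numerically here; passing the analytic
gradient via fprime would be the obvious next refinement.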
Mathieu Blondel
2011-05-18 08:32:40 UTC
Post by Paolo Losi
I wonder if it's worthwhile implementing an ad hoc optimization method
1) the function to be minimized is strictly convex and so easy to handle
2) scipy provides well-tested and efficient optimization methods
3) usually the problem is very "small"
Hal Daumé seems to confirm this [1] and suggests a nice (obvious :-) extension
to the multi-class case.
Interesting! It seems to me that the natural way to implement this in
the scikit would be to use a transformer. Also, since it's a
log-linear model, I wonder if we can't just use Peter's multinomial
logistic regression directly?

Mathieu
Gael Varoquaux
2011-05-18 09:05:10 UTC
Post by Mathieu Blondel
Interesting! It seems to me that the natural way to implement this in
the scikit would be to use a transformer. Also, since it's a
log-linear model, I wonder if we can't just use Peter's multinomial
logistic regression directly?
I'd much rather have a method 'predict_proba' added to objects that do not
have it (maybe using a mixin in the back end, if we want to do fancy
software engineering).

In terms of user interface, it would make things much simpler.

G
Mathieu Blondel
2011-05-18 10:26:33 UTC
Post by Gael Varoquaux
I'd much rather have a method 'predict_proba' added to objects that do not
have it (maybe using a mixin in the back end, if we want to do fancy
software engineering).
In terms of user interface, it would make things much simpler.
+1 but a Calibrator object would still be useful (and predict_proba
can be implemented using it).

# Currently, Paolo uses aggregation which Gael does not like IIRC :)

Mathieu
Paolo Losi
2011-05-18 10:38:12 UTC
Post by Mathieu Blondel
# Currently, Paolo uses aggregation which Gael does not like IIRC :)
What exactly do you mean by aggregation? 8-)

Paolo
Mathieu Blondel
2011-05-18 10:41:49 UTC
Post by Paolo Losi
Post by Mathieu Blondel
# Currently, Paolo uses aggregation which Gael does not like IIRC :)
What exactly do you mean by aggregation? 8-)
The fact that you require a classifier object in the constructor of
your calibrator (object composition).

Mathieu
Paolo Losi
2011-05-18 10:45:35 UTC
Post by Mathieu Blondel
The fact that you require a classifier object in the constructor of
your calibrator (object composition).
Clear, thanks.

@Gael, why do you dislike that solution?
It is used in at least a couple of places (I remember GridSearchCV, RFECV).
And for Platt scaling as well we need to do cross-validation ...

Paolo
Gael Varoquaux
2011-05-18 11:01:54 UTC
Post by Paolo Losi
@Gael, why do you dislike that solution?
It makes for more complex APIs.
Post by Paolo Losi
It is used in at least a couple of places (I remember GridSearchCV, RFECV).
Only when it is strictly necessary. This does not seem to be the case here.

Gael
Paolo Losi
2011-05-18 11:59:27 UTC
Post by Gael Varoquaux
Post by Paolo Losi
@Gael, why do you dislike that solution?
It makes for more complex APIs.
Post by Paolo Losi
It is used in at least a couple of places (I remember GridSearchCV, RFECV).
Only when it is strictly necessary. This does not seem to be the case here.
Since internal cross-validation is required for Platt scaling, I don't see
how to do it differently. Any suggestion?

Thanks!

Paolo
Gael Varoquaux
2011-05-18 12:12:56 UTC
Post by Paolo Losi
Since internal cross-validation is required for Platt scaling, I don't see
how to do it differently. Any suggestion?
OK, so first rule: simpler solutions should be preferred. So maybe I am
talking nonsense, and if the code really screams for OOP, we should do
it.

Now one of the big problems that I see with aggregation is the necessity
to have delegation patterns if we want uniform APIs: delegation of the
methods, and delegation of the parameter setting. On top of that, the more
objects users are required to instantiate, the more confused they get.

To answer your question, it seems to me that the most reasonable way to
do things would be to implement subclasses of models in which the fit
method calls the parent class's fit method in a resampling loop, and
which add a predict_proba method. This cannot really be implemented using
a mixin pattern, as there would be diamond inheritance in the fit method.

I can see only two such classes so far that need to be implemented
(subclassing LinearSVC and SGDClassifier).
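A rough sketch of that subclassing idea for the binary case, assuming the
current cross_val.StratifiedKFold(y, k) interface and reusing the
hypothetical fit_sigmoid/predict_sigmoid helpers sketched earlier
(ProbabilisticLinearSVC is an illustrative name):

import numpy as np
from scikits.learn.svm import LinearSVC
from scikits.learn.cross_val import StratifiedKFold

class ProbabilisticLinearSVC(LinearSVC):

    def fit(self, X, y):
        # Collect out-of-fold decision values so the sigmoid is never fit
        # on the same samples the SVM was trained on.
        scores, targets = [], []
        for train, test in StratifiedKFold(y, k=5):
            LinearSVC.fit(self, X[train], y[train])
            scores.append(self.decision_function(X[test]).ravel())
            targets.append(y[test])
        self.sigmoid_ = fit_sigmoid(np.concatenate(scores),
                                    np.concatenate(targets))
        # Refit on all the data for subsequent predictions.
        return LinearSVC.fit(self, X, y)

    def predict_proba(self, X):
        A, B = self.sigmoid_
        p1 = predict_sigmoid(self.decision_function(X).ravel(), A, B)
        return np.column_stack([1.0 - p1, p1])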

Does that seem reasonable to people?

Out of curiosity, would such an approach be possible on a OneClassSVM? I
haven't read the papers, sorry.

Cheers,

Gaël
Paolo Losi
2011-05-18 13:00:52 UTC
Post by Gael Varoquaux
OK, so first rule: simpler solutions should be preferred. So maybe I am
talking nonsense, and if the code really screams for OOP, we should do
it.
I feel subclassing is more OOP than composition, but YMMV ;-)
Post by Gael Varoquaux
Now one of the big problems that I see with aggregation is the necessity
to have delegation patterns if we want uniform APIs: delegation of the
methods, and delegation of the parameter setting. On top of that, the more
objects users are required to instantiate, the more confused they get.
I understand and agree.
Post by Gael Varoquaux
To answer your question, it seems to me that the most reasonable way to
do things would be to implement subclasses of models in which the fit
method calls the parent class's fit method in a resampling loop, and
which add a predict_proba method. This cannot really be implemented using
a mixin pattern, as there would be diamond inheritance in the fit method.
Clear.
Post by Gael Varoquaux
I can see only two such classes so far that need to be implemented
(subclassing LinearSVC and SGDClassifier).
Naive Bayes could be calibrated as well ...
Post by Gael Varoquaux
Does that seem reasonable to people?
As I said, since this is a very general calibration method, I would suggest
using composition.
Post by Gael Varoquaux
Out of curiosity, would such an approach be possible on a OneClassSVM? I
haven't read the papers, sorry.
I'm not familiar with OneClassSVM, but if decision_function can be
interpreted as a score for class membership, and the empirical distribution
of score vs. membership probability is sigmoid-like (that's the empirical
observation on which Platt's method is based), then the answer is yes, it
can be used for OneClassSVM.

Thanks for the clarifications

Paolo
Mathieu Blondel
2011-05-18 13:13:37 UTC
Post by Paolo Losi
As I said, since this is a very general calibration method, I would suggest
using composition.
I'm +1 for using composition in the general calibration utility object
and implementing the predict_proba mixin in terms of this object.
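A minimal sketch of that composition, under the same assumptions as the
earlier snippets (SigmoidCalibrator, fit_sigmoid and predict_sigmoid are
illustrative names, binary case):

import numpy as np
from scikits.learn.cross_val import StratifiedKFold

class SigmoidCalibrator(object):
    # Wraps any classifier exposing decision_function, adding predict_proba.

    def __init__(self, estimator, k=5):
        self.estimator = estimator
        self.k = k  # configurable, unlike libsvm's fixed 5-fold scheme

    def fit(self, X, y):
        scores, targets = [], []
        for train, test in StratifiedKFold(y, k=self.k):
            self.estimator.fit(X[train], y[train])
            scores.append(self.estimator.decision_function(X[test]).ravel())
            targets.append(y[test])
        self.sigmoid_ = fit_sigmoid(np.concatenate(scores),
                                    np.concatenate(targets))
        self.estimator.fit(X, y)  # final refit on all the data
        return self

    def predict_proba(self, X):
        A, B = self.sigmoid_
        p1 = predict_sigmoid(self.estimator.decision_function(X).ravel(),
                             A, B)
        return np.column_stack([1.0 - p1, p1])

A mixin's predict_proba could then simply delegate to an internal
SigmoidCalibrator.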

Mathieu
Paolo Losi
2011-05-18 12:11:23 UTC
Post by Mathieu Blondel
Interesting! It seems to me that the natural way to implement this in
the scikit would be to use a transformer.
Hmm, what would be the output of transform?
Post by Mathieu Blondel
Also, since it's a
log-linear model, I wonder if we can't just use Peter's multinomial
logistic regression directly?
Platt scaling and l1/l2-regularized logistic regression solve two
slightly different problems (the difference is more evident when
the number of samples is small or the class distribution is skewed).

Moreover, SGD is optimized to solve "large" problems (n_features * n_samples)
and doesn't guarantee an "exact" solution, while for Platt scaling the
problem is smaller (2 parameters, n_samples samples) and it is reasonable to
aim for an "exact" solution.

BTW, Platt scaling could also be used to calibrate the l1/l2 log-loss models
of SGD/liblinear, for cases where the Laplacian or Gaussian prior doesn't
hold in reality and yet the model still provides good classification
performance.

Paolo
Gael Varoquaux
2011-05-18 12:15:32 UTC
Post by Paolo Losi
BTW, Platt scaling could also be used to calibrate the l1/l2 log-loss models
of SGD/liblinear, for cases where the Laplacian or Gaussian prior doesn't
hold in reality and yet the model still provides good classification
performance.
That's an interesting remark. Do you believe that it would have more
chances of being accurate?

Basically you are suggesting that the use of Platt is more general than
complementing models that have no predict_proba.

Gael
Paolo Losi
2011-05-18 12:30:22 UTC
Post by Gael Varoquaux
That's an interesting remark. Do you believe that it would have more
chances of being accurate?
Exactly. More chances of providing a better probability estimate.
Post by Gael Varoquaux
Basically you are suggesting that the use of Platt is more general than
complementing models that have no predict_proba.
Exactly. That's one of the reasons why I'm pushing for the "aggregation"
solution (the other reason being that it requires internal cross-validation).

Paolo
Mathieu Blondel
2011-05-18 12:33:11 UTC
Post by Paolo Losi
Hmm, what would be the output of transform?
I was thinking of using "transform" to transform scores to
probabilities, i.e. you would pass decision_function's output to fit
and transform. But it works only if you do one fold. Sorry for the
misthought.
Post by Paolo Losi
Moreover, SGD is optimized to solve "large" problems (n_features * n_samples)
and doesn't guarantee an "exact" solution, while for Platt scaling the
problem is smaller (2 parameters, n_samples samples) and it is reasonable to
aim for an "exact" solution.
Isn't the number of parameters 2 in the binary case and 2 * n_classes
in the multi-class case (or 2 * (n_classes - 1) if you use the fact
that the softmax must sum to 1)?

My impression was that you could use a 1-dimensional X and then by
using fit_intercept=True, you would recover your 2 parameters per
class. That said, I didn't look into the problem thoroughly.
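Concretely, something like the following, assuming the scikit's
LogisticRegression run on the 1-d decision values (svc, X_held_out,
y_held_out and X_new are illustrative names; note that liblinear's
regularization makes this only an approximation of Platt's unregularized
maximum-likelihood fit):

from scikits.learn.linear_model import LogisticRegression

# 1-dimensional X: the decision values, reshaped to a single-column matrix.
scores = svc.decision_function(X_held_out).reshape(-1, 1)
lr = LogisticRegression(C=1e5)  # large C: nearly unregularized
lr.fit(scores, y_held_out)
proba = lr.predict_proba(svc.decision_function(X_new).reshape(-1, 1))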

Anyway +1 for not using SGD.

Mathieu
Paolo Losi
2011-05-18 12:39:46 UTC
Post by Mathieu Blondel
I was thinking of using "transform" to transform scores to
probabilities, i.e. you would pass decision_function's output to fit
and transform. But it works only if you do one fold. Sorry for the
misthought.
No prob. I feared I was missing something obvious :-)
Post by Mathieu Blondel
Isn't the number of parameters 2 in the binary case and 2 * n_classes
in the multi-class case (or 2 * (n_classes - 1) if you use the fact
that the softmax must sum to 1)?
You're right. So, unless you have a strangely high number of classes, the
problem is usually quite "small".
Post by Mathieu Blondel
My impression was that you could use a 1-dimensional X and then by
using fit_intercept=True, you would recover your 2 parameters per
class. That said, I didn't look into the problem thoroughly.
Clear.
Post by Mathieu Blondel
Anyway +1 for not using SGD.
Lucas Wiman
2011-05-18 21:28:11 UTC
I think using optimization routines from scipy makes a ton of sense. It's
pretty straightforward to implement, and the code is massively more readable
than either Platt's or Lin's pseudocode. I implemented a prototype of the
binary case here to check against or use, verified against the output of my
port of Lin's code to Python: http://pastebin.com/wJxLr9S7

FWIW, I don't have a strong opinion about subclassing vs aggregation, but I
do think that this technique should be at least /available/ as a standalone
estimator so you can easily choose which set of validation data it gets run
against. (I.e., it should be part of the public scikits.learn API somewhere,
even if it is also transparently included in the SGD/LinearSVC classes.)

Thanks,
Lucas
Paolo Losi
2011-05-18 08:16:53 UTC
I started implementing Platt scaling in order to use it with Peter's SGD.
But the calibration method can be effectively applied to any classification
method that produces a score.

Another problem with libsvm's Platt implementation is that it doesn't allow
modifying the cross-validation scheme (it's fixed to KFold with k=5). That is
not ideal if the number of training samples is small.

Paolo
Gael Varoquaux
2011-05-18 05:21:11 UTC
Post by Lucas Wiman
We could also add a method to the LinearSVC and SVC classes which takes such
an estimator as input and sets a prob_estimator field on the classifier.
Something like 'predict_proba'? :)

As you have concluded, for the libsvm-based objects, this is probably not
necessary. For liblinear-based objects and SGD, it would certainly be
great.

In terms of API, I would prefer to have either a mixin class or a few
helper functions to enable adding a predict_proba to classes that
do not have one, with exactly the same semantics as for libsvm-based
objects.

I would also tend to say that you shouldn't lose time with Cython for
now: first get the big picture right in pure Python. Then we'll see if
any performance gain can be expected from Cython.

Thanks!

Gaël
Paolo Losi
2011-05-18 05:56:27 UTC
Hi Lucas,

I've been working on probability calibration over the last few days as well.

I've implemented Platt's method and was planning to implement
isotonic regression and multi-class handling via a wrapper.

You can find the work on

https://github.com/paolo-losi/scikit-learn/tree/calibration

I would be very pleased if you could review the work and
contribute in any way you think useful.

Thanks!

Paolo
Gael Varoquaux
2011-05-18 15:25:40 UTC
Darn, I just lost a carefully crafted mail. This usually never happens to
me :(
Post by Paolo Losi
Post by Gael Varoquaux
Basically you are suggesting that the use of Platt is more general than
complementing models that have no predict_proba.
Exactly. That's one of the reasons why I'm pushing for the "aggregation"
solution (the other reason being that it requires internal cross-validation).
I am really torn: both solutions have their problems, and I don't
know of a way without them.

On the one hand, aggregation gives genericity and locally clean code: the
object that you are coding is well structured. However, it dilutes the
problem elsewhere: aggregated objects are not really the same objects, and
we need to cater for that either by putting intelligence in the code
that uses these objects (and in their users), or by putting intelligence in
the objects themselves. Such intelligence backfires (when it is available).
If you want examples of code with a lot of intelligence, look at Mayavi2 or
VTK. I've been there; I try to avoid it (being intelligent, of course).

On the other hand, specific (as opposed to generic) objects can be
implemented more simply and expose a more consistent interface, using basic
building blocks such as functions to factor out common functionality. You do
lose the ability to apply the code directly to a new problem out of the box,
but I tend to believe that this is an advanced feature, and that advanced
users should be able to easily cook up an object that answers their needs
with the basic building blocks. This may lead to a bit of redundancy in the
code, but I prefer simpler and slightly redundant code to code that is hard
to understand.

To summarize, I believe that it will be hard to make an aggregate that
behaves really like the estimator it wraps and thus can be used blindly.
I would suggest implementing the core functionality as functions, and using
them to build subclasses of the couple of classifiers for which this
functionality makes the most sense. These functions should also be usable to
build a composer object that takes an estimator as an input. I would
actually like us to make the decision on what goes in the scikit based on
the corresponding code: I think it is worth considering having both
approaches in the scikit, if the code speaks for itself.

Cheers,

G
Paolo Losi
2011-05-19 07:40:16 UTC
Post by Gael Varoquaux
I would suggest implementing the core functionality as functions, and using
them to build subclasses of the couple of classifiers for which this
functionality makes the most sense. These functions should also be usable to
build a composer object that takes an estimator as an input. I would
actually like us to make the decision on what goes in the scikit based on
the corresponding code: I think it is worth considering having both
approaches in the scikit, if the code speaks for itself.
I agree: implementing both approaches is the best way to make an effective
evaluation. I'll try to come up with a solution along the lines of what
you suggested.

Thanks for the useful insights.

Paolo
Peter Prettenhofer
2011-05-19 09:08:59 UTC
Hi Lucas, all,

thanks for your contribution and insights. I totally agree that a
generic calibration method would be very useful indeed - not only for
SGDClassifier.

When it comes to the interface I agree with Mathieu and would prefer a
calibration object which implements `predict_proba` instead of a mixin
class - this gives us more flexibility. As far as I can see, with the
mixin approach each classifier class has to choose one specific
approach to implement `predict_proba`. For some classes, however,
there might be multiple approaches to implement it (e.g. SGDClassifier
with loss='log' supports predict_proba out of the box). But maybe this
added flexibility adds an additional burden to the user - what do you
think?

thanks,
Peter
Olivier Grisel
2011-05-19 09:18:13 UTC
Post by Peter Prettenhofer
When it comes to the interface I agree with Mathieu and would prefer a
calibration object which implements `predict_proba` instead of a mixin
class - this gives us more flexibility. [...] But maybe this added
flexibility adds an additional burden to the user - what do you think?
I am +1 for implementing this in a pull request and then deciding, based on
the look of the code, the tests and the examples, whether the API is too
complicated or not.
Peter Prettenhofer
2011-05-19 09:22:57 UTC
I totally agree with Olivier, that's a good way to proceed.

best,
Peter