Discussion:
Understand liblinear multiclass decision function
Tobjan Brejicz
2010-12-29 19:16:58 UTC
Hello:

I want to better understand the liblinear multiclass decision function.
## TrF is an array of 30 training feature vectors with 10000 features
TrF.shape
(30,10000)
TrL
array([2, 2, 0, 1, 2, 1, 0, 0, 0, 2, 2, 2, 1, 2, 0, 2, 0, 1, 0, 2, 0, 0, 0,
2, 2, 2, 0, 2, 2, 0])
import scikits.learn.svm as svm
classifier = svm.LinearSVC(eps = 1e-5, C = 10**4)
classifier.fit(TrF,TrL)
TeF.shape
(80,10000)
prediction = classifier.predict(TeF)
prediction
array([1, 0, 0, 1, 0, 0, 2, 2, 2, 2, 0, 0, 1, 2, 0, 2, 0, 1, 0, 2, 0, 0, 0,
....
0, 0, 2, 2, 2, 1, 0])
coef = classifier.coef_
intercept = classifier.intercept_
I want to know the definition of the predict procedure in terms of this
data. In one way, maybe I already understand, because when using my own
OVA classifier, the prediction is the same for the "maximum margin" predictor.
OVA_classifier = my_ova_classifier(TrF,TrL)
## OVA_classifier.coef is the matrix of coefficients from the binary problem for each class
OVA_classifier.coef.shape
(3,10000)
## where OVA_classifier.coef[i,:] is the weight vector for the i-th class binary problem
## OVA_classifier.intercept is the array of intercepts, also from the binary problems
OVA_prediction = (dot(TeF, OVA_classifier.coef.T) +
OVA_classifier.intercept).argmax(axis=1)
(OVA_prediction == prediction).all()
True

This works for several problems and many splits, so I think it is true that
liblinear is using OVA. (Or please tell me if that is not true.) BUT: if I try the
same maximum margin with the original weights and intercept from the classifier, it is
not_prediction = (dot(TeF, classifier.coef_.T) +
classifier.intercept_).argmax(axis=1)
(not_prediction == prediction).all()
False

So, what is the relationship between the coef_ and intercept_ of the classifier
returned by LinearSVC and the prediction function? I looked at the liblinear
source code, but I don't know C++ so I didn't understand it. Also, the
documentation was not too explicit.
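The "maximum margin" rule discussed throughout this thread can be sketched in plain NumPy; the shapes are shrunk and the numbers are random, purely for illustration (none of this is the actual data above):

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-ins for the arrays above, with toy shapes:
# 3 classes, 5 features, 4 test samples.
coef = rng.randn(3, 5)       # one weight row per class: (n_classes, n_features)
intercept = rng.randn(3)     # one intercept per class
TeF = rng.randn(4, 5)        # test features: (n_samples, n_features)

# One column of decision values per class; the predicted label is
# the index of the column with the largest value.
scores = np.dot(TeF, coef.T) + intercept   # shape (4, 3)
prediction = scores.argmax(axis=1)         # shape (4,)
```

Note that `coef` must be transposed so that the (n_samples, n_features) test matrix multiplies against (n_features, n_classes).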

Thank you.

Tobjan
Alexandre Gramfort
2010-12-29 22:14:32 UTC
Hi Tobjan,

Mathieu (Blondel) or Fabian probably know better than me.

What I can just tell from the doc of LinearSVC is that it has
an option multi_class:

Linear Support Vector Classification.

Similar to SVC with parameter kernel='linear', but uses internally
liblinear rather than libsvm, so it has more flexibility in the
choice of penalties and loss functions and should be faster for
huge datasets.

Parameters
----------
...

multi_class: boolean, optional
perform multi-class SVM by Crammer and Singer. If active,
options loss, penalty and dual will be ignored.

I would suspect that it does an OVA when multi_class==False.

Alex
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Tobjan Brejicz
2010-12-29 23:54:47 UTC
Post by Alexandre Gramfort
What I can just tell from the doc of LinearSVC is that it has
I would suspect that it does an OVA when multi_class==False
Alex, thank you. Yes, I think your statement must be correct. My
experiments confirm this, since the result of prediction is the same as my hand-coded
OVA result. I use a "maximum margin" prediction function with the weight/intercept
data from the OVA binary problems.

But the question is: what is the interpretation of "coef_" and "intercept_",
the data attributes attached to the returned classifier, for use in the prediction
function?

The motivation for the question is that if I want to efficiently compute a
performance metric besides "score", I sometimes need to be able to compute
some intermediate step of the prediction function. Usually the prediction
function for SVM methods is in two parts:
1) compute a real-valued function of the features, like a dot product with the
weights, plus the intercept (this just rotates and scales the feature vectors
along the decision planes);
2) then apply a discretization criterion, like "argmax" (in the
multiclass case) or "> 0" (in the binary case), to reduce the real-valued function
to a label-valued one.
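The two steps above can be sketched in plain NumPy. The helper names are hypothetical; this mirrors the description in the email, not scikit-learn's actual internals:

```python
import numpy as np

def decision_function(X, coef, intercept):
    """Step 1: real-valued scores, one column per weight row in coef."""
    return np.dot(X, coef.T) + intercept

def predict(X, coef, intercept):
    """Step 2: discretize the scores into labels."""
    scores = decision_function(X, coef, intercept)
    if coef.shape[0] == 1:                      # binary: one hyperplane
        return (scores.ravel() > 0).astype(int)
    return scores.argmax(axis=1)                # multiclass: argmax

# Quick check with made-up numbers (2 samples, 2 features, 3 classes):
X = np.array([[1.0, 0.0], [0.0, 1.0]])
coef = np.array([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]])
intercept = np.zeros(3)
print(predict(X, coef, intercept))              # [0 1]
```

Keeping step 1 as its own function is exactly what makes intermediate quantities (margins, score matrices) available for custom metrics.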

Sometimes it is useful to compute just the first step ... and I wanted to make this
computation using the Classifier.coef_ and Classifier.intercept_ matrices. If I
use the usual "maximum margin" formula with the "coef_" and "intercept_" matrices I
DON'T get the expected result. For the case of multi_class == False, I COULD work
around this problem by using maximum margin with weights/intercepts coming
from a hand-coded OVA on top of regular LinearSVC binary classifiers. This
is stupid since it means training the SVM twice. So I would like to know how to
define the prediction function in terms of the data actually returned with the
classifier.

multi_class = True does the Crammer-Singer method, which can be better ... but my
question is really the same whether multi_class = True or False. For the case of
multi_class = True, I don't have a hand-coded Crammer-Singer implementation to
check against anyway. So I would like to know how to answer my question
for both multi_class = True and False.

Thank you.
-Tob
Alexandre Gramfort
2010-12-30 00:01:35 UTC
hi,

again I'm not the expert and the authors of liblinear are probably
the best people to answer this. However, could it be due to the fact
that the coef_ vectors are not normalized, which introduces a scaling
into the computation of the margin?

Alex
------------------------------------------------------------------------------
Learn how Oracle Real Application Clusters (RAC) One Node allows customers
to consolidate database storage, standardize their database environment,
and,
should the need arise, upgrade to a full multi-node Oracle RAC database
without downtime or disruption
http://p.sf.net/sfu/oracle-sfdevnl
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Tobjan Brejicz
2010-12-30 00:33:58 UTC
Alex, Thank you.

Perhaps I should write to a different list? Is there a liblinear mailing
list?

(In fact I did also try several other "formulas" for maximum margin,
including l2-normalizing the coef_ vectors, but this didn't solve the problem.)

-Tob

Alexandre Gramfort
2010-12-30 02:20:11 UTC
You can email the authors of liblinear directly. I've done it in the past.

Can you just try one more thing? In your tests, replace the intercept with
0.5 * intercept_ (or maybe 2 * intercept_). This intuition comes from line 422
in scikits.learn.svm.base.

Alex
Fabian Pedregosa
2011-01-02 13:19:28 UTC
Hi Tobjan!

The documentation on the svm module defines the decision function for
the libsvm case [0] and explains very briefly the meaning of
support_vectors_ and intercept_. Also, take a look at the test
test_decision_function() from the file scikits/learn/svm/tests/test_svm.py
[1], which reconstructs the decision function from the SVC parameters.
The code is not optimal and could probably be simplified, but it could
help get you started.

It would be awesome if you could do the same for liblinear, but some
remarks apply. We recently patched libsvm so that the columns in the
decision function are ordered by the arithmetical order of the classes, but
this has not (yet) been ported to liblinear; this might be the reason
why you don't get the expected results. Applying the fix should be
really straightforward, as the code from libsvm and liblinear for those
routines is very similar; I can give you pointers if you are
interested in working on it.
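To see why an unsorted internal class order breaks a naive argmax, here is a toy sketch (the particular internal order below is hypothetical, chosen only to make the point):

```python
import numpy as np

# Hypothetical internal class order: column i of the score matrix
# belongs to class order[i]. The remark above is that this need not
# be the sorted (arithmetical) order of the labels.
order = np.array([2, 0, 1])

rng = np.random.RandomState(0)
coef = rng.randn(3, 5)       # rows follow the internal order, not sorted labels
intercept = rng.randn(3)
X = rng.randn(4, 5)

scores = np.dot(X, coef.T) + intercept

# argmax returns a *column index*, not a class label; mapping the index
# through the internal order recovers the label actually predicted.
raw_index = scores.argmax(axis=1)
prediction = order[raw_index]
```

So if the columns are not sorted by class, `scores.argmax(axis=1)` alone will disagree with `predict` whenever `order` is not the identity.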

Hope this helps,

Fabian

[0] http://scikit-learn.sourceforge.net/modules/svm.html#mathematical-formulation
[1] https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/svm/tests/test_svm.py
Tobjan Brejicz
2011-01-12 22:44:03 UTC
Fabian:

Thank you very much for help.
Post by Fabian Pedregosa
The code is not optimal and could probably be simplified, but could
help get you started.
It seems the difference between this code in [1] and the liblinear case is
the difference between using dual_coef_ and coef_ (for OVA). Is that right? Is it
the same for the Crammer-Singer method?

Also one thing bothers me. In [0], it says that intercept_ is -1 *
rho, and that the decision function is

np.dot(dual_coef_, features) + rho

But in [1], test_decision_function seems to be using:

np.dot(dual_coef_, features) + intercept_

Is the sign different? (Sorry if my understanding is not right.)
Post by Fabian Pedregosa
It would be awesome if you could do the same for liblinear, but some
remarks apply
Yes, I am happy to work on this. It is very important for me.

Post by Fabian Pedregosa
We recently patched libsvm so that the columns in
decision function are ordered by arithmetical order in classes, but
this has not (yet) been ported to liblinear, this might be the reason
why you don't get the expected results.
So you mean that the order of the vectors in classifier.coef_, along
axis 1, is not the same as the numerical order of the classes as they appear in the
training data? How do I determine the order then?
Post by Fabian Pedregosa
Applying the fix should be
really straightforward as the code from libsvm and liblinear for those
routines are very similar, I can give you pointers if you are
interested in working on it.
Thank you, I would like to set up the test for liblinear. Please show me how I
can begin.

-Tob
Fabian Pedregosa
2011-01-13 19:26:19 UTC
Post by Tobjan Brejicz
Thank you very much for help.
Post by Fabian Pedregosa
The code is not optimal and could probably be simplified, but could
help get you started.
I seem the different between this code in [1] and case of liblinear is
different between use dual_coef_ and coef_.  (for OVA).   Is right?   Is it
the same for Cramer Singer method?
Not only does libsvm use dual_coef_ while liblinear uses coef_; the
strategy is also different. In libsvm, n_class*(n_class-1)/2 classifiers
are constructed (one-versus-one), while in liblinear n_class classifiers are
constructed (one-against-all). I don't know how it goes for the Crammer-Singer
method; the liblinear README file has some details, but they are
inscrutable to me.
Post by Tobjan Brejicz
Also one thing is bother for me.   In [0], says that intercept_ is   -1 *
rho. and that decision function is
np.dot(dual_coef_,features) + rho
np.dot(dual_coef_,features) + intercept_
Is sign different?   (Sorry if my understand thought is not right.)
The -rho is there because libsvm stores the negative of the intercept
instead of the intercept itself, but in that context it doesn't make
much sense and just makes the documentation fuzzier. I'll fix that
ASAP.
Post by Tobjan Brejicz
Post by Fabian Pedregosa
It would be awesome if you could do the same for liblinear, but some
remarks apply
Yes, I am happy to working on this.   It is very important for me.
We recently patched libsvm so that the columns in
Post by Fabian Pedregosa
decision function are ordered by arithmetical order in classes, but
this has not (yet) been ported to liblinear, this might be the reason
why you don't get the expected results.
So your meaning is that the order of the vector in classifier.coef_, along
axis 1, is not same as numerical order of  classes as it appear in training
data?     How do I determine order then?
Exactly. libsvm and liblinear just take the classes in the order in which
they appear in the data passed to the fit method. Take the following as an example:

In [12]: clf = svm.LinearSVC()

In [13]: clf.fit([[0], [2], [1]], [0, 2, 1]).coef_
Out[13]:
array([[-0.90907561],
[-0.04874084],
[-0.82927618]])

In [14]: clf.fit([[0], [1], [2]], [0, 1, 2]).coef_
Out[14]:
array([[-0.90907944],
[ 0.63414653],
[-0.24390955]])

Simple permutations of the data change the value of coef_.

Our solution was to group the classes in arithmetical order in the method
svm_group_classes. The patch is here:

https://github.com/fabianp/scikit-learn/commit/bd6c8ed1e18253884441572cf7cf337aa5a843b7

It should be fairly straightforward to apply a similar patch to liblinear's
function group_classes.
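To make the appearance-order behavior concrete, here is a small numpy sketch (my own illustration, independent of liblinear itself; appearance_order and perm are names I made up) that recovers the class order as fit sees it, i.e. by first appearance in the label vector:

```python
import numpy as np

y = np.array([2, 2, 0, 1, 2, 1])

# index of the first occurrence of each distinct label
_, first_idx = np.unique(y, return_index=True)

# labels sorted by where they first appear in y
appearance_order = y[np.sort(first_idx)]
print(appearance_order)   # [2 0 1]

# permutation that maps appearance order back to numerical order
perm = np.argsort(appearance_order)
```

Before the grouping patch, row i of coef_ corresponds to appearance_order[i], so coef_[perm] would put the rows in numerical class order.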

Thanks,

Fabian
Fabian Pedregosa
2011-01-22 22:04:20 UTC
On Thu, Jan 13, 2011 at 8:26 PM, Fabian Pedregosa
Post by Fabian Pedregosa
Post by Tobjan Brejicz
Thank you very much for help.
Post by Fabian Pedregosa
The code is not optimal and could probably be simplified, but could
help get you started.
I seem the different between this code in [1] and case of liblinear is
different between use dual_coef_ and coef_.  (for OVA).   Is right?   Is it
the same for Cramer Singer method?
not only one libsvm uses dual_coef_, while liblinear uses coef_, but
strategy is also different. In libsvm, n_class*(n_class-1) classifiers
are constructed (one-versus-one), while in liblinear k classifiers are
constructed (one-against-all). I don't know how it goes for Cramer
Singer method, liblinear README file has some details, but they are
inscrutable to me.
Post by Tobjan Brejicz
Also one thing is bother for me.   In [0], says that intercept_ is   -1 *
rho. and that decision function is
np.dot(dual_coef_,features) + rho
np.dot(dual_coef_,features) + intercept_
Is sign different?   (Sorry if my understand thought is not right.)
The -rho stands because libsvm stores the negative of the intercept
instead of just the intercept, but in that context it doesn't make
much sense and just makes the documentation fuzzier, I'll fix that
ASAP.
Post by Tobjan Brejicz
Post by Fabian Pedregosa
It would be awesome if you could do the same for liblinear, but some
remarks apply
Yes, I am happy to working on this.   It is very important for me.
We recently patched libsvm so that the columns in
Post by Fabian Pedregosa
decision function are ordered by arithmetical order in classes, but
this has not (yet) been ported to liblinear, this might be the reason
why you don't get the expected results.
So your meaning is that the order of the vector in classifier.coef_, along
axis 1, is not same as numerical order of  classes as it appear in training
data?     How do I determine order then?
Exactly. libsvm and liblinear just take the order of classes as they
In [12]: clf = svm.LinearSVC()
In [13]: clf.fit([[0], [2], [1]], [0, 2, 1]).coef_
array([[-0.90907561],
[-0.04874084],
[-0.82927618]])
In [14]: clf.fit([[0], [1], [2]], [0, 1, 2]).coef_
array([[-0.90907944],
[ 0.63414653],
[-0.24390955]])
simple permutations in the data, change the value of coef_.
Our solution was to group the classes by arithmetical order in method
https://github.com/fabianp/scikit-learn/commit/bd6c8ed1e18253884441572cf7cf337aa5a843b7
It should be fairly immediate to apply a similar patch to liblinear's
function group_classes.
Hi Tobjan, any progress on this? Otherwise I'll implement it during
the coming week, as I need to have a consistent decision_function on
the liblinear-derived classes.

Best,

Fabian.
Tobjan Brejicz
2011-01-25 16:09:34 UTC
Post by Fabian Pedregosa
Hi Tobjan, any progress on this ? Otherwise I'll implement it during
the following week as I need to have a consistent decision_function on
liblinear derived classes.
Hello Fabian:

Please excuse the delay of my reply. I am working on this, and hopefully
will finish by the coming weekend. Is that OK?

Sorry!

-Tob
Post by Fabian Pedregosa
Best,
Fabian.
Fabian Pedregosa
2011-01-26 08:17:44 UTC
Post by Fabian Pedregosa
Hi Tobjan, any progress on this ? Otherwise I'll implement it during
the following week as I need to have a consistent decision_function on
liblinear derived classes.
Please to excuse the delay of my reply.  I am work on this, hopefull will
finish by the new weekend.    So is that OK?
No hurry, I was just checking this was still active.

Fabian.
Dan Yamins
2011-02-01 00:44:10 UTC
Hi Fabian (and Tobjan):

I've added the patch for ordering labels arithmetically in
liblinear/linear.cpp. It's the identical patch to what you already used for
libsvm. See
https://github.com/yamins81/scikit-learn/commit/e038a8f632b3bf7ce6edf24d29fd01073d2e3c8a.

This is not the whole story though for understanding the liblinear testing
function. I was able to figure out what it is, but it is not the
straightforward "maximum" procedure that one might expect. I've
encapsulated this prediction function as a test in test_svm, called
test_liblinear_predict. See
https://github.com/yamins81/scikit-learn/commit/ab4a200f3d6f3603f3f027a2c569753585b96584

I've submitted pull requests - but let me know if I need to sharpen anything

best,
Dan
Post by Fabian Pedregosa
Post by Tobjan Brejicz
Post by Fabian Pedregosa
Hi Tobjan, any progress on this ? Otherwise I'll implement it during
the following week as I need to have a consistent decision_function on
liblinear derived classes.
Please to excuse the delay of my reply. I am work on this, hopefull will
finish by the new weekend. So is that OK?
No hurry, I was just checking this was still active.
Fabian.
Fabian Pedregosa
2011-02-01 08:16:17 UTC
Post by Dan Yamins
I've added the patch for ordering labels arithmetically in
liblinear/linear.cpp.  It's the identical patch to what you already used for
libsvm.
See https://github.com/yamins81/scikit-learn/commit/e038a8f632b3bf7ce6edf24d29fd01073d2e3c8a
.
This is not the whole story though for understanding the liblinear testing
function.   I was able to figure out what it is, but it is not the
straightforward "maximum" procedure that one might expect.   I've
encapsulated this prediction function as a test in test_svm,  called
test_liblinear_predict.
See https://github.com/yamins81/scikit-learn/commit/ab4a200f3d6f3603f3f027a2c569753585b96584
I've submitted pull requests - but let me know if I need to sharpen anything
Thanks a lot Dan. The first patch is fine, but with the others I get
test failures in test_liblinear_predict. Please make sure you are in
sync with the latest master from
https://github.com/scikit-learn/scikit-learn (my master branch is
often out of date, sorry for that).

It would be great if you could take a look and submit a reworked patch.

Best,

Fabian.

PS: the other failures in test_svm.py are known failures.
Dan Yamins
2011-02-01 13:49:50 UTC
Post by Fabian Pedregosa
Please make sure you are in
sync with latest master from
https://github.com/scikit-learn/scikit-learn (my master branch is
often out of date, sorry for that).
Ah, that was the problem ... I had been using your branch. Have now
redone things starting with the scikit-learn branch, which
necessitated a small change to the testing code, and now things again
look like they're working on my side.

Dan
Post by Fabian Pedregosa
It would be great if you could take a look and submit a reworked patch.
Best,
Fabian.
PS: the other failures in test_svm.py are known failures.
Fabian Pedregosa
2011-02-01 14:49:24 UTC
Post by Fabian Pedregosa
Please make sure you are in
sync with latest master from
https://github.com/scikit-learn/scikit-learn (my master branch is
often out of date, sorry for that).
Ah, that was the problem ... I had been using your branch.   Have now
redone things starting with the scikit-learn branch, which
necessitated a small change to the testing code, and now things again
look like they're working on my side.
Great, please open a pull request when you are done.

Best,

Fabian.
Dan Yamins
2011-02-01 14:52:04 UTC
Post by Fabian Pedregosa
you are done.
I opened one on the scikit-learn master ... I think it's under review?
It's now pep8-compliant!

D
Fabian Pedregosa
2011-02-01 15:00:49 UTC
awesome work, it's in.

Fabian.
Post by Dan Yamins
Post by Fabian Pedregosa
you are done.
I opened one on the scikit-learn master ... I think it's under review?
It's now pep8-compliant!
D
Dan Yamins
2011-02-01 18:20:40 UTC
Thanks Fabian.

I was able to simplify the liblinear prediction function. (I
submitted a pull request). But more importantly, I think I
understand what's "really going on", and this generates some
questions.

So the new liblinear_prediction_function is:

import numpy as np

def liblinear_prediction_function(farray, clas):
    # liblinear's buffer is Fortran-ordered; read back with swapped
    # dimensions, this reshape yields the transpose of the true weight
    # matrix (a plain .T would NOT do the same thing here)
    weights = clas.raw_coef_
    (a, b) = weights.shape
    weights = weights.reshape((b, a))

    # append a column of ones so the intercept column of raw_coef_
    # is picked up by the dot product
    D = np.column_stack([farray, np.ones(farray.shape[0])])
    H = np.dot(D, weights)
    predict = H.argmax(1)
    return predict

Notice that the key things here are: 1) we reshape the weights matrix
in a nontrivial way (it's not just the transpose). 2) we add a column
of 1s to the data.
After that, the "normal" thing happens, i.e. the dot-product / argmax procedure.

The addition of the column of 1s can be explained by looking at
"coef_" and "intercept_". It turns out that basically

raw_coef_ = np.column_stack([coef_, intercept_])

So what's happening in terms of 2) is that basically, the bias is
getting stuck on via that column of 1s. OK, so that makes sense. But
this still leaves the question about the nontrivial reshaping.
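The nontrivial reshape has a simple explanation in terms of memory order (my own toy sketch, not from the scikits code): if the true weight matrix is written to memory in Fortran (column-major) order but read back with C (row-major) strides, then re-reading the same buffer with swapped dimensions yields exactly the transpose of the true matrix:

```python
import numpy as np

W = np.arange(6).reshape(2, 3)   # the "true" (n_classes, n_features) matrix
buf = W.flatten('F')             # memory as a column-major library stores it

mangled = buf.reshape(2, 3)      # naive C-order read: rows are scrambled
fixed = buf.reshape(3, 2)        # swapped-dims C-order read: this is W.T

assert not (mangled == W).all()
assert (fixed == W.T).all()
```

That is why reshape((b, a)) in the prediction function recovers a matrix you can dot the data with directly, while a plain .T of the mangled array would not.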

Somehow, the way the raw_coef_ matrix (and the coef_ matrix) is bound
via the cython bindings to the c++ code gets the weights matrix order
wrong -- or at least, wrong for using the usual dot-product approach
to defining the prediction function. We should consider changing this
in the cython code so that the scikits liblinear API provides a more
directly useful object. Is there a motivation for the way it's done now?

Also: I noticed that sometimes the raw_coef_ values differ from the
coef_ and intercept_ values by small numerical differences. These
differences don't change any of the prediction results. But in fact,
liblinear.predict is using the raw_coef_ values (as you can determine
by sticking some printf's into the linear.cpp source code). Why are
there these differences and should they be fixed?

Dan

On Tue, Feb 1, 2011 at 10:00 AM, Fabian Pedregosa
Post by Fabian Pedregosa
awesome work, it's in.
Fabian.
Post by Dan Yamins
Post by Fabian Pedregosa
you are done.
I opened one on the scikit-learn master ... I think it's under review?
It's now pep8-compliant!
D
Paolo Losi
2011-02-02 10:39:24 UTC
Hi Dan,

reply inline (sorry for being terse, but I'm in a hurry).
Post by Dan Yamins
weights = clas.raw_coef_
(a,b) = weights.shape
weights = weights.reshape((b,a))
D = np.column_stack([farray,np.ones(farray.shape[0])])
here it should be

D = np.column_stack([farray, clas.intercept_scaling *
np.ones(farray.shape[0])])
Post by Dan Yamins
H = np.dot(D,weights)
predict = H.argmax(1)
return predict
cut...
Post by Dan Yamins
The addition of the column of 1s can be explained by looking at
"coef_" and "intercept_". It turns out that basically
raw_coef_ = np.column_stack([coef_, intercept_])
That's correct only when intercept_scaling is set to one.

You can gather more context looking at:

http://sourceforge.net/mailarchive/message.php?msg_id=26732380
https://github.com/scikit-learn/scikit-learn/commit/a7047ca22bd3245ec41e3ab4bc24db7e3835fa6b

Ciao
--
Paolo Losi
e-mail: ***@enuan.com
mob: +39 348 7705261

ENUAN Srl
Via XX Settembre, 12 - 29100 Piacenza
Dan Yamins
2011-02-02 12:49:56 UTC
Post by Paolo Losi
here it should be
D = np.column_stack([farray, clas.intercept_scaling *
np.ones(farray.shape[0])])
Paolo, much obliged ... thanks!

D
Alexandre Gramfort
2011-02-02 13:00:55 UTC
just a quick remark: I prefer to have coef_ of shape [n_classes, n_features],
as it then allows one to inverse_transform it like any sample. The second
dimension is always n_features.

thanks guys for taking this closer look at these issues.

Alex
Post by Dan Yamins
Post by Paolo Losi
here it should be
D = np.column_stack([farray, clas.intercept_scaling *
np.ones(farray.shape[0])])
Paolo, much obliged ... thanks!
D
Dan Yamins
2011-02-02 13:10:40 UTC
On Wed, Feb 2, 2011 at 8:00 AM, Alexandre Gramfort
Post by Alexandre Gramfort
just a quick remark. I prefer to have coef_ of shape [n_classes, n_features]
as it then allows to inverse_transform them as any sample. The second
dimension is always n_features.
Sure ... that makes sense. But my point is quite independent of
whether the coef_ shape is [n_classes, n_features] or the transpose.
Instead, I'm saying that the matrix is currently laid out at the
cython layer so that NEITHER it nor its transpose represents the weight
matrix properly. Instead, you have to shuffle the elements in a
nontrivial way first. Do you see what I mean?

D
Post by Alexandre Gramfort
thanks guys to this closer look at these issues.
Alex
Post by Dan Yamins
Post by Paolo Losi
here it should be
D = np.column_stack([farray, clas.intercept_scaling *
np.ones(farray.shape[0])])
Paolo, much obliged ... thanks!
D
Alexandre Gramfort
2011-02-02 14:21:58 UTC
Sure .. that makes sense.   But my point is quite independent of
whether the coef_ shape is [n_classes, n_features] or the transpose.
I understand. I would bet the coefs are stored in fortran order for
efficiency of the internal coordinate descent but read as a C array
like any other array taken out of libsvm or liblinear.

note that LassoCV and ElasticNetCV also convert X to fortran
order for the same reason.

Alex
Dan Yamins
2011-02-02 14:31:06 UTC
Post by Alexandre Gramfort
I understand. I would bet the coefs are stored in fortran order for
efficiency of the internal coordinate descent but read as a C array
like any other array taken out of libsvm or liblinear.
Yes, I suspected that this is exactly what was happening ... But
surely by the time the object is read at the cython level, this should
be corrected? You're not saying that this order mismatch is intended
to be exposed at the scikits API level, right?

D
Post by Alexandre Gramfort
note that the LassoCV and ElasticNetCD also update X to fortran
order for the same reason.
Alex
Alexandre Gramfort
2011-02-02 14:32:39 UTC
Yes, I suspected that this is exactly what was happening ...  But
surely b the time the object is read at the cython level, this should
be corrected?   You're not saying that this order mismatch is intended
to be exposed at the scikits API level, right?
yes there is a bug in the cython binding

would you fix it?

Alex
Dan Yamins
2011-02-02 14:48:32 UTC
Post by Alexandre Gramfort
yes there is a bug in the cython binding
would you fix it?
Sure ... I would have done so with my second commit yesterday, but
hesitated because doing this will end up making a semantically
noticeable change in the scikits liblinear API. Anybody else who's
noticed this problem and corrected for it will have their code break
after the fix is applied ... Shall I also put a comment somewhere?

D
Olivier Grisel
2011-02-02 14:51:44 UTC
Post by Dan Yamins
Post by Alexandre Gramfort
yes there is a bug in the cython binding
would you fix it?
Sure ... I would have done so with my second commit yesterday, but
hesitated because doing this will end up making a semantically
noticeable change in the scikits liblinear API.  Anybody else who's
noticed this problem and corrected for it will have their code break
after the fix is applied ... Shall I also put a comment somewhere
AFAIK, when you do a pull request on github it automatically creates
an issue in the github tracker. Make sure to set it to the 0.7 milestone
so that Fabian can point it out in the release notes.
--
Olivier
Fabian Pedregosa
2011-02-02 20:31:56 UTC
Thanks for the explanation, it is now much clearer.

Fabian.
Post by Dan Yamins
Post by Alexandre Gramfort
yes there is a bug in the cython binding
would you fix it?
Sure ... I would have done so with my second commit yesterday, but
hesitated because doing this will end up making a semantically
noticeable change in the scikits liblinear API.  Anybody else who's
noticed this problem and corrected for it will have their code break
after the fix is applied ... Shall I also put a comment somewhere
D
Dan Yamins
2011-02-03 19:48:40 UTC
On Wed, Feb 2, 2011 at 3:31 PM, Fabian Pedregosa
Post by Fabian Pedregosa
Thanks for the explanation, it is now much clearer.
Fabian.
Post by Alexandre Gramfort
yes there is a bug in the cython binding
would you fix it?
So there are two ways of doing this fix.

1) Actually fix it in the cython layer, e.g. do bidirectional Fortran
<-> C conversion in cython on the raw_coef_ array every time it passes
between liblinear and python.

2) Or, one could just as well NOT do this conversion, and let the
raw_coef_ stay in Fortran order. This won't harm anything: as long
as the Fortran order is preserved between situations where the
model is a liblinear output (e.g. in training) and a liblinear input
(e.g. in prediction), things will work fine. Instead, one would
simply (re)define the LinearSVC.coef_ getter method so that it
computes the correct thing from the raw_coef_ array, by doing the
Fortran -> C conversion right there in the python code (using numpy
for speed).
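Option 2 might look roughly like the following sketch (attribute names follow the thread; LinearModelSketch and its constructor are hypothetical, and intercept handling is simplified to intercept_scaling == 1):

```python
import numpy as np

class LinearModelSketch:
    """Toy stand-in for a liblinear-backed model (option 2)."""

    def __init__(self, raw_coef, fit_intercept=True):
        # raw_coef: liblinear's Fortran-ordered buffer, viewed with
        # C strides and shape (n_classes, n_features + 1)
        self.raw_coef_ = raw_coef
        self.fit_intercept = fit_intercept

    @property
    def coef_(self):
        # undo the Fortran/C mismatch lazily, only when the user asks
        a, b = self.raw_coef_.shape
        w = self.raw_coef_.reshape(b, a).T   # true (n_classes, n_features + 1)
        return w[:, :-1] if self.fit_intercept else w

# sanity check: build a raw buffer the way liblinear would lay it out
W = np.arange(6.0).reshape(2, 3)        # true weights plus intercept column
raw = W.flatten('F').reshape(W.shape)   # mangled C view of the Fortran buffer
model = LinearModelSketch(raw)
assert (model.coef_ == W[:, :-1]).all()
```

The conversion cost is then paid only when coef_ is actually read, not on every round trip through liblinear.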

On first glance it might sound as if 1) would be better, since all the
code is in C. However, it involves doing back-and-forth reordering
-- which scales as O(nr_class*nr_feature) -- every time one wants to
do something. In the second scenario, the raw_coef_ would always
preserve the Fortran order used by the liblinear library, and the only
real penalty accrues to people who need to do the conversion JUST on
outputs, e.g. when using the coef_ object directly for further model
evaluation outside of scikits code. This includes me and, I assume,
people like Tobjan, &c. In that case the only difference is between
a Cython loop and a numpy reshape (which isn't so bad anyway).

What do you guys think? (I've actually implemented both already, so ...)

D
Fabian Pedregosa
2011-02-04 10:49:17 UTC
Post by Dan Yamins
On Wed, Feb 2, 2011 at 3:31 PM, Fabian Pedregosa
Post by Fabian Pedregosa
Thanks for the explanation, it is now much clearer.
Fabian.
Post by Alexandre Gramfort
yes there is a bug in the cython binding
would you fix it?
So there are two ways of doing this fix.
1) Actually fix it in the cython layer, e.g. do bidirectional Fortran
<-> C conversion in cython on the raw_coef_ array every time it passes
between liblinear and python.
2) Or, one could just as well NOT do this conversion, and  let the
raw_coef_ stay in Fortran order.   This won't harm anything, since as
long as the Fortran order is preserved between situations where the
model is a liblinear output (e.g. in training) and a liblinear input
(e.g. in prediction), things will work fine.  Instead, one would
simply (re)define the LinearSVC.coef_() getter  method so that it
computes the correct thing from the raw_coef_ array, by doing the
Fortran -> C conversion right there in the python code (using numpy
for speed).
Would it be possible to declare your numpy array raw_coef_ to be
fortran ordered in the Cython layer (order='F'), so that coef_ has the
expected shape and no reordering is required?

Fabian.
Olivier Grisel
2011-02-04 11:02:53 UTC
Post by Fabian Pedregosa
Post by Dan Yamins
On Wed, Feb 2, 2011 at 3:31 PM, Fabian Pedregosa
Post by Fabian Pedregosa
Thanks for the explanation, it is now much clearer.
Fabian.
Post by Alexandre Gramfort
yes there is a bug in the cython binding
would you fix it?
So there are two ways of doing this fix.
1) Actually fix it in the cython layer, e.g. do bidirectional Fortran
<-> C conversion in cython on the raw_coef_ array every time it passes
between liblinear and python.
2) Or, one could just as well NOT do this conversion, and  let the
raw_coef_ stay in Fortran order.   This won't harm anything, since as
long as the Fortran order is preserved between situations where the
model is a liblinear output (e.g. in training) and a liblinear input
(e.g. in prediction), things will work fine.  Instead, one would
simply (re)define the LinearSVC.coef_() getter  method so that it
computes the correct thing from the raw_coef_ array, by doing the
Fortran -> C conversion right there in the python code (using numpy
for speed).
Would it be possible to declare your numpy array raw_coef_ to be
fortran ordered in the Cython layer (order='F') so that coef_ has the
expected shape and no ordering is required ?
+1, I think it is possible to type-declare fortran-ordered arrays in
cython with something like:

cdef myf_funct(np.ndarray[np.double_t, ndim=2, mode="fortran"] a):
...
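On the numpy side of such a binding, the matching buffer would be allocated Fortran-ordered, e.g. (a generic sketch, not the actual scikits code):

```python
import numpy as np

# allocate the coefficient buffer column-major, as liblinear expects
w = np.zeros((3, 5), order='F')
assert w.flags['F_CONTIGUOUS']

# an existing C-ordered array can be converted with asfortranarray
# (which copies only when necessary)
w2 = np.asfortranarray(np.ones((3, 5)))
assert w2.flags['F_CONTIGUOUS']
```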
--
Olivier
Dan Yamins
2011-02-04 13:20:13 UTC
Post by Fabian Pedregosa
Would it be possible to declare your numpy array raw_coef_ to be
fortran ordered in the Cython layer (order='F') so that coef_ has the
expected shape and no ordering is required ?
Sorry, I must have been unclear ... this is exactly how I've
implemented #2. ... One ends up not having to do any operations on
coef_ on the output either, since the ordering is preserved by the
python layer (which I hadn't expected but is good).

D
Dan Yamins
2011-02-04 13:21:36 UTC
And given the drift of your comments I assume you like option 2
better, so I'll push that change.

Dan
Post by Dan Yamins
Post by Fabian Pedregosa
Would it be possible to declare your numpy array raw_coef_ to be
fortran ordered in the Cython layer (order='F') so that coef_ has the
expected shape and no ordering is required ?
Sorry, I must have been unclear ... this is exactly how I've
implemented #2.   ... One ends up not having to do any operations on
coef_ on the output either, since the ordering is preserved by the
python layer (which I hadn't expected but is good).
D
Fabian Pedregosa
2011-02-05 16:17:25 UTC
Post by Dan Yamins
And given the drift of your comments I assume you like option 2
better, so I'll push that change.
Excellent, have you opened a pull request ?

Fabian.
Fabian Pedregosa
2011-02-05 16:19:43 UTC
On Sat, Feb 5, 2011 at 5:17 PM, Fabian Pedregosa
Post by Fabian Pedregosa
Post by Dan Yamins
And given the drift of your comments I assume you like option 2
better, so I'll push that change.
Excellent, have you opened a pull request ?
Never mind, I found it.

Fabian.

Fabian Pedregosa
2011-02-02 05:58:43 UTC
So I was able to simplify the liblinear prediction function.    (I
submitted a pull request).   But more importantly, I think I
understand what's "really going on", and this generates some
questions.
weights = clas.raw_coef_
(a,b) = weights.shape
weights = weights.reshape((b,a))
D = np.column_stack([farray,np.ones(farray.shape[0])])
H = np.dot(D,weights)
predict = H.argmax(1)
return predict
Notice that the key things here are: 1) we reshape the weights matrix
in a nontrivial way (it's not just the transpose).  2) we add a column
of 1s to the data.
After that, the "normal" thing happens, e.g. the dot-product / argmax procedure.
The addition of the column of 1s can be explained by looking at
"coef_" and "intercept_".   It turns out that basically
raw_coef_ = np.column_stack([coef_, intercept_])
So what's happening in terms of 2) is that basically, the bias is
getting stuck on via that column of 1s.  OK, so that makes sense.  But
this still leaves the question about the nontrivial reshaping.
Somehow, the way the raw_coef_ matrix (and the coef_ matrix) is bound
via the cython bindings to the c++ code gets the weights matrix order
wrong -- or at least, wrong for using the usual dot-product approach
to defining the prediction function.    It should be considered to
change this in the cython code so that the scikits liblinear API
provides a more easily useful object.   Is there a motivation for the
way it's done now?
Yes, from the source code it can be seen that coef_ and intercept_ are
just a compatibility layer over raw_coef_:

    @property
    def intercept_(self):
        if self.fit_intercept:
            return self.intercept_scaling * self.raw_coef_[:, -1]
        return 0.0

    @property
    def coef_(self):
        if self.fit_intercept:
            return self.raw_coef_[:, :-1]
        return self.raw_coef_

There's no particular motivation for having things this way, but note
that in your decision function the column_stack could be avoided by
using coef_ and intercept_ instead of raw_coef_. This way, the decision
function would be something like this:

np.dot(data, clf.coef_.T) + clf.intercept_

But I'd love to hear other proposals for the shape of these arrays (in
particular, I don't know why the transpose is needed).
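For what it's worth, the algebra of the two formulations does agree once coef_ is laid out correctly; a quick self-contained check (random toy data, assuming intercept_scaling == 1 and raw_coef_ = column_stack([coef_, intercept_])):

```python
import numpy as np

rng = np.random.RandomState(0)
coef = rng.randn(3, 5)        # (n_classes, n_features)
intercept = rng.randn(3)
raw_coef = np.column_stack([coef, intercept])

data = rng.randn(4, 5)        # 4 samples

# intercept folded into the weights via a column of ones...
d1 = np.dot(np.column_stack([data, np.ones(4)]), raw_coef.T)
# ...versus kept as a separate additive term
d2 = np.dot(data, coef.T) + intercept

assert np.allclose(d1, d2)
```

So the disagreement Dan describes below is purely about how coef_ is laid out in memory, not about the decision-function formula itself.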

I've pushed your changes together with some enhancements: I've bound
liblinear's decision_function, so that LinearSVC and
LogisticRegression now have a decision_function method that returns
the decision function as computed by liblinear.

I haven't tested that against your implementation of
decision_function, although it would be great if you could do that.
Also: I noticed that sometimes the raw_coef_ values differ from the
coef_ and intercept_ values by small numerical differences.   These
differences don't change any of the prediction results.   But in fact,
liblinear.predict is using the raw_coef_ values (as you can determine
by sticking some printf's into the linear.cpp source code).   Why are
there these differences and should they be fixed?
The only explanation I can give (and it's not a very good one) is that
the coordinate descent algorithm uses C's random function to select
the order in which features get computed, so different runs give
slightly different results (however, raw_coef_ and [coef_, intercept_]
should be the same if the model is not re-fitted).

Best,

Fabian.
Dan Yamins
2011-02-02 12:48:58 UTC
Hi Fabian:

I think the main point of my previous email was not 100%
appreciated. The reshaping that is necessary in the testing
function is NOT just a transpose; it is a genuinely nontrivial
reshaping. I'm suggesting that there's a kind of "bug" in the
cython layer, in which the mapping order is simply "wrong" --
and although it doesn't cause an exception or segfault, and can be
recovered from by the reshaping, it does cause very unexpected behavior.
That was the motivation for my doing this work in the first place.

Specifically, let's say raw_coef_ is

[a00 a01 a02 a03]
[a10 a11 a12 a13]
[a20 a21 a22 a23]

Then under the reshaping it becomes:

[a00 a01 a02]
[a03 a10 a11]
[a12 a13 a20]
[a21 a22 a23]

This is most definitely NOT just taking the transpose. As a result,
Post by Fabian Pedregosa
np.dot(data, clf.coef_.T) + clf.intercept_
is NOT true. The correct formula is in fact:

np.dot(data, clf.coef_.reshape((b, a))) + clf.intercept_

where (a, b) is the shape of the coef_ matrix. It's as if
the values of the weights have been laid out in the wrong order. So
as it stands, the "nth row" of the object called .coef_ by the scikits
liblinear wrapper is not the weight vector for any one particular
label, but instead contains linear stretches of one or more weight
vectors.
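Dan's 3x4 example can be replayed numerically (integers standing in for the a_ij):

```python
import numpy as np

raw = np.arange(12).reshape(3, 4)   # a00..a23, row by row
reshaped = raw.reshape(4, 3)

print(raw)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
print(reshaped)
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]]

# the transpose is a different matrix entirely
assert not (reshaped == raw.T).all()
```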

This problem caused me -- and I think Tobjan, for instance, and
perhaps others -- quite a bit of consternation, because it clearly
violates expectations.

Note, however, that nothing about the liblinear source code violates
expectations -- the matrix structure of coef_ is (I think, from my
reading of the scikits code) created in the scikits cython layer. So I
think it's worth investigating changing that code so that it outputs
the reshaped version of coef_, as opposed to what it currently does.