Luca Cerone

2013-09-24 16:42:12 UTC

Dear all,

I am practising with scikit-learn to solve multiclass classification

problems.

As an exercise I am trying to build a model to predict the digits dataset

available with scikit-learn.

Ideally I would like to solve this using logistic regression, building a

predictor for each digit (one vs all approach).

When a new "digit" comes in, I predict the output with each of the trained

classifiers and choose the prediction with the maximum value

(as you can see I am not doing anything special; I think it is the

most naive approach you can follow).

So far I have performed most of these steps manually, but I guess there

might be a faster/smarter approach.

For example, here is my approach that classifies a digit as 0, 1 or Other.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
import numpy as np

digits = load_digits()
data = digits.data
target = digits.target

# shuffle the sample indices
idx = np.random.permutation(data.shape[0])

# split the dataset
n_train_sample = 1000
idx_train = idx[:n_train_sample]
idx_test = idx[n_train_sample:]
data_train = data[idx_train, :]
target_train = target[idx_train]
data_test = data[idx_test, :]
target_test = target[idx_test]

# build the classifier that recognizes 0:
tar_tr_0 = np.array([1 if x == 0 else 0 for x in target_train])
cfr_0 = LogisticRegression()
cfr_0.fit(data_train, tar_tr_0)

# build the classifier that recognizes 1:
tar_tr_1 = np.array([1 if x == 1 else 0 for x in target_train])
cfr_1 = LogisticRegression()
cfr_1.fit(data_train, tar_tr_1)

# build the classifier that recognizes "other":
tar_tr_other = np.array([1 if x > 1 else 0 for x in target_train])
cfr_other = LogisticRegression()
cfr_other.fit(data_train, tar_tr_other)


Next, of course, there is some code that takes as input the various trained

classifiers, makes predictions on the test set, etc.
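Roughly, the prediction step I have in mind looks like this (a self-contained sketch of the same one-vs-rest idea; for brevity it fits and scores on the whole dataset rather than a train/test split):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X, y = digits.data, digits.target

# three one-vs-rest problems: "is 0", "is 1", "is other"
targets = [(y == 0).astype(int), (y == 1).astype(int), (y > 1).astype(int)]
classifiers = [LogisticRegression(max_iter=1000).fit(X, t) for t in targets]

# score every sample with each classifier and take the arg-max
scores = np.column_stack([c.decision_function(X) for c in classifiers])
pred = scores.argmax(axis=1)  # 0 -> digit 0, 1 -> digit 1, 2 -> "other"
```

This is the "choose the prediction with the maximum value" step done with `decision_function` instead of comparing the binary `predict` outputs by hand.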

I did this partly for educational purposes (although I know in theory how

multiclass classification can be performed, I had never done the steps

written above, which are useful to learn), and partly because I got a bit

lost when reading the documentation (http://scikit-learn.org/stable/modules/multiclass.html).

For One-versus-Rest I think I can

use sklearn.multiclass.OneVsRestClassifier (and I am trying to do

this now).
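What I am trying is along these lines (a minimal sketch; again fitting on the whole dataset just to keep it short):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

digits = load_digits()
X, y = digits.data, digits.target

# OneVsRestClassifier fits one binary LogisticRegression per digit
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
pred = ovr.predict(X)
```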

What I couldn't understand, however, is how to access the internal

classifiers, to check their scores, etc.

I also couldn't understand how to set up a criterion to choose the output.

What if, for example, the classifier is very good at discriminating all the

digits except 4 and 1?

Also, I wanted to build the classifier using some form of cross-validation,

but again I got a bit lost.

Sorry if my questions are quite silly!

Thanks a lot in advance for the help!

Cheers,

Luca

P.S. What if I want to "expand" the list of features to perform logistic

regression with quadratic terms? Is there an easy way to do this?
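By "expand" I mean building the quadratic terms by hand with numpy, something like this tiny two-feature example:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# append the squares and the pairwise product of the original features
squares = X ** 2
cross = X[:, [0]] * X[:, [1]]  # x1 * x2 for the two-feature case
X_quad = np.hstack([X, squares, cross])
print(X_quad.shape)  # (2, 5)
```

I was wondering whether scikit-learn offers a ready-made transformer for this instead of assembling the columns manually.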
