2013-09-24 16:42:12 UTC
I am practising with scikit-learn to solve a multiclass classification problem.
As an exercise I am trying to build a model to predict the digits dataset
available with scikit-learn.
Ideally I would like to solve this using logistic regression, building a
predictor for each digit (one vs all approach).
When a new "digit" comes in, I predict the output of each of the trained
classifiers and choose the prediction with the maximum value
(as you can see I am not doing anything special; I think it is the
naivest approach you can follow).
So far I have performed most of these steps manually, but I guess there
might be some faster/smarter approach.
For example, here is my approach, which classifies a digit as 0, 1 or "other".
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
import numpy as np
digits = load_digits()
data = digits.data
target = digits.target
#shuffle the sample indices before splitting
idx = np.random.permutation(data.shape[0])
#split the dataset
n_train_sample = 1000
idx_train = idx[0:n_train_sample]
idx_test = idx[n_train_sample:]
data_train = data[idx_train, :]
target_train = target[idx_train]  # target is 1-D
data_test = data[idx_test, :]
target_test = target[idx_test]
#build the classifier that recognizes 0:
tar_tr_0 = (target_train == 0).astype(int)
cfr_0 = LogisticRegression()
cfr_0.fit(data_train, tar_tr_0)
#build the classifier that recognizes 1:
tar_tr_1 = (target_train == 1).astype(int)
cfr_1 = LogisticRegression()
cfr_1.fit(data_train, tar_tr_1)
#build the classifier that recognizes "other":
tar_tr_other = (target_train > 1).astype(int)
cfr_other = LogisticRegression()
cfr_other.fit(data_train, tar_tr_other)
Next, of course, there is some code that takes as input the various trained
classifiers, makes predictions on the test set, and so on.
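To make the question concrete, here is a minimal, self-contained sketch of what I mean by that prediction step: train one binary logistic regression per class, take `predict_proba` from each, and pick the class whose classifier gives the highest probability (the variable names are just mine, not a scikit-learn API):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
data, target = digits.data, digits.target

rng = np.random.RandomState(0)
idx = rng.permutation(data.shape[0])
train, test = idx[:1000], idx[1000:]

# one binary classifier per class: 0, 1 and "other"
label_fns = [lambda y: y == 0, lambda y: y == 1, lambda y: y > 1]
clfs = [LogisticRegression(max_iter=1000).fit(data[train],
                                              f(target[train]).astype(int))
        for f in label_fns]

# for each test sample, take P(class = 1) from every binary classifier
# and choose the classifier with the highest probability
probs = np.column_stack([c.predict_proba(data[test])[:, 1] for c in clfs])
pred = probs.argmax(axis=1)  # 0 -> digit 0, 1 -> digit 1, 2 -> "other"

true = np.where(target[test] > 1, 2, target[test])
accuracy = (pred == true).mean()
```

This is exactly the manual one-vs-all loop; the question is whether scikit-learn can do this bookkeeping for me.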
I did this partly for educational purposes (although I know in theory how
multiclass classification can be performed, I had never done these earlier
steps myself, which is useful for learning), and partly because I got a bit
lost when reading the documentation (http://scikit-learn.org/stable/modules/multiclass.html).
For the One versus Rest approach I think I can
use sklearn.multiclass.OneVsRestClassifier (and I am trying to do that now).
What I couldn't understand, however, is how to get access to the internal
classifiers, to check their scores and so on.
I also couldn't understand how to set up a criterion to choose the output.
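From what I can tell from the docs, the fitted binary classifiers are exposed on the `estimators_` attribute (one per class, in the order of `classes_`), and `decision_function` returns the per-class scores that `predict` takes the argmax over — a small sketch of what I have tried:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

digits = load_digits()
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(digits.data[:1000], digits.target[:1000])

# one fitted LogisticRegression per class, ordered like ovr.classes_
print(len(ovr.estimators_))  # 10
print(ovr.classes_)          # [0 1 2 3 4 5 6 7 8 9]

# per-class scores on the held-out part; predict() is the argmax over columns
scores = ovr.decision_function(digits.data[1000:])
print(scores.shape)          # (797, 10)
```

Is inspecting `estimators_` like this the intended way, or is there a better one?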
What if for example the classifier is very good at discriminating all the
digits but 4 and 1?
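To see whether that is actually happening, the only thing I found is sklearn.metrics.confusion_matrix, which shows which pairs of digits get mixed up — a quick sketch:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.multiclass import OneVsRestClassifier

digits = load_digits()
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(digits.data[:1000], digits.target[:1000])

pred = clf.predict(digits.data[1000:])
cm = confusion_matrix(digits.target[1000:], pred)
# cm[i, j] counts samples of true digit i predicted as digit j,
# so large off-diagonal entries reveal which pairs get confused
```

But that only diagnoses the problem after the fact; it doesn't tell me how to change the decision criterion.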
Also, I wanted to build a classifier using some form of cross-validation,
but again I got a bit lost.
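The closest thing I found is cross_val_score, which seems to do the splitting and scoring in one call (I'm using the sklearn.model_selection location; I believe older versions have it under sklearn.cross_validation):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

digits = load_digits()
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy; one score per fold
scores = cross_val_score(clf, digits.data, digits.target, cv=5)
print(scores.mean())
```

Is this the right tool, or should I be splitting the folds manually as in my code above?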
Sorry if my questions are quite silly!
Thanks a lot in advance for the help!
P.s. what if I want to "expand" the list of features to perform logistic
regression with quadratic terms? Is there an easy way to do this?
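The one candidate I found for this is sklearn.preprocessing.PolynomialFeatures, which expands the feature matrix with all degree-2 terms and can be chained with the regression in a Pipeline — a sketch of what I mean:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
X2 = poly.fit_transform(X)
# columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(X2)  # [[1. 2. 3. 4. 6. 9.]]

# chained with logistic regression in one estimator:
model = make_pipeline(PolynomialFeatures(degree=2),
                      LogisticRegression(max_iter=1000))
```

Is this the idiomatic way, or is there something more direct?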