Discussion:
[Scikit-learn-general] LinearSVC sometimes returns no label
Sicco van Sas
2012-07-20 15:08:06 UTC
Permalink
Hi all,

I use LinearSVC for multi-class multi-label text classification, but the
learned classifier doesn't always output a label when I try to classify
a test sample.

Here is the code:

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(train_txt, train_labels)
print str(classifier.predict(example_txt))

Sometimes I get a fine result, e.g. [(u'dogs',)], while at other times it
returns nothing: [()]
It seems that the more samples I train the classifier on, the less often
it outputs nothing, but even training on 20k samples still sometimes
results in no labels as output. E.g., training on 10k samples results
in approx. 80 classifiers. I tested on 5k samples and approx. 30% of the
samples were given no label, while the ones that did get one or more
labels performed quite well.

Is there a way to force the classifier to always predict at least 1 label?

Cheers,
Sicco
Andreas Müller
2012-07-20 15:38:44 UTC
Permalink
Hi Sicco.
This is desired behavior.
If you want to always get a label, you could have a look at the decision_function
and just predict the label with the highest score if no label was predicted.

Cheers,
Andy

Lars Buitinck
2012-07-20 16:15:22 UTC
Permalink
Post by Andreas Müller
Hi Sicco.
Indeed, hi, and nice to see you've picked scikit-learn :)
Post by Andreas Müller
This is desired behavior.
Then again, we could introduce a min_classes parameter to determine
how many classes should be returned at least. This is commonly what
you want when predicting multiple tags (think StackOverflow questions,
where at least one tag is required).
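
No such min_classes parameter exists in scikit-learn, but as a sketch, the behaviour could be implemented as a small post-processing step over the per-class decision values (the function and parameter names here are hypothetical):

```python
import numpy as np

def predict_min_classes(scores, classes, min_classes=1):
    """Return every class with a positive score, padded with the
    next-best classes until at least min_classes labels are returned.

    scores: 1-D array of per-class decision values for one sample.
    """
    order = np.argsort(scores)[::-1]            # class indices, best first
    positive = [i for i in order if scores[i] > 0]
    k = max(len(positive), min_classes)
    return [classes[i] for i in order[:k]]

# No score is positive, so the single best class is returned.
print(predict_min_classes(np.array([-0.3, -0.1, -0.8]),
                          ["cats", "dogs", "fish"]))   # -> ['dogs']
```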
Post by Andreas Müller
If you want to always get a label, you could have a look at the decision_function
and just predict the label with the highest score if no label was predicted.
In some more detail, you can find out which class gets the highest
score for a sample vector x using

clf.label_binarizer_.classes_[numpy.argmax(
    [e.decision_function(x) for e in clf.estimators_])]

This is arguably a hack; the OvR estimator is a bit rough around the
edges. It doesn't play well with the Pipeline either, since you have
to vectorize the document yourself. Without a Pipeline, the training
procedure would be

# TfidfVectorizer combines CountVectorizer and TfidfTransformer
# (it was called Vectorizer in older versions)
vect = TfidfVectorizer()
clf = OneVsRestClassifier(LinearSVC())
X = vect.fit_transform(train_txt)
clf.fit(X, train_labels)

And prediction would become (showing the procedure for one document at
a time now)

x = vect.transform([one_document])
[labels] = clf.predict(x)
if len(labels) == 0:
    # apply the trick I described above
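
Putting the training and prediction snippets together, here is a self-contained sketch on toy data. Note it uses the current scikit-learn API, which differs from the 2012 code above: labels go in as a binary indicator matrix via MultiLabelBinarizer (tuples of labels are no longer accepted), and decision_function is called on the OvR estimator itself rather than on the individual clf.estimators_.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-ins for train_txt / train_labels.
train_txt = ["the dog barks", "my dog runs", "the cat meows",
             "a cat sleeps", "dog and cat play"]
train_labels = [("dogs",), ("dogs",), ("cats",), ("cats",), ("cats", "dogs")]

# Current scikit-learn expects a binary indicator matrix, not tuples.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_labels)

vect = TfidfVectorizer()
clf = OneVsRestClassifier(LinearSVC())
clf.fit(vect.fit_transform(train_txt), Y)

def predict_at_least_one(doc):
    """Predict a label set, falling back to the top-scoring class."""
    x = vect.transform([doc])
    labels = mlb.inverse_transform(clf.predict(x))[0]
    if not labels:
        # No per-class score cleared the threshold: take the single
        # highest-scoring class instead of returning the empty set.
        labels = (mlb.classes_[np.argmax(clf.decision_function(x))],)
    return labels
```

With this wrapper, even a document whose words were never seen at training time still gets exactly one (best-guess) label instead of none.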

Good luck,
Lars
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Mathieu Blondel
2012-07-20 15:54:41 UTC
Permalink
The fact that your classifier often predicts the empty set is a sign that
your data may suffer from the class imbalance problem. If that's the case,
to correct for the imbalance, you could try to play with LinearSVC's
class_weight option.
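
As a minimal illustration of that suggestion, on synthetic numeric data rather than text (class_weight="balanced" is the current spelling; older scikit-learn versions called it "auto"):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Imbalanced toy data: 90 samples of class 0, 10 of class 1.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (90, 2)),
               rng.normal(1.5, 1.0, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

plain = LinearSVC().fit(X, y)
# "balanced" weights each class inversely to its frequency, so errors
# on the rare class cost as much in total as errors on the common one.
weighted = LinearSVC(class_weight="balanced").fit(X, y)

# How often each model predicts the minority class on its own samples:
plain_minority = int((plain.predict(X)[90:] == 1).sum())
weighted_minority = int((weighted.predict(X)[90:] == 1).sum())
```

The reweighted model shifts the decision boundary toward the majority class, so the minority class is predicted at least as often, which in the multi-label setting makes empty predictions less likely.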

HTH,
Mathieu