Discussion:
[Scikit-learn-general] Regarding content classification using HashingVectorizer
Kartik Kumar Perisetla
2014-07-24 02:35:46 UTC
Hello,

I am creating a content classifier using scikit-learn with HashingVectorizer (using this as a reference:
http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html).

The training dataset I am using is Wikipedia. For example, for the "management"
category I am training it with a few articles related to management, i.e. each
entire article related to management is one training instance.

I trained with 50 categories and a total of ~4000 training instances.
But when I test the prediction on a new sentence or text, it gives the wrong
prediction.

So my question is: do I need to break each Wikipedia article into sentences and
label each sentence with a category name to make it work correctly? Since I
am using HashingVectorizer, my intuition is that it creates a hash for the
entire training instance and not for the tokens in it. Is that true?

Also, could someone please shed some light on how HashingVectorizer works?

Thanks,
Kartik
--
Regards,

Kartik Perisetla
Eustache DIEMERT
2014-07-24 08:51:14 UTC
Post by Kartik Kumar Perisetla
But when I test the prediction on a new sentence or text, it gives the wrong
prediction.

How do you measure that?

Having a few badly classified instances does not necessarily mean the
learning has failed.

A good accuracy for text classification is typically > 80%; what is yours?

Also, HashingVectorizer is not really involved in classification accuracy
here - IMHO.

The main factor would probably be how close your new examples are to the
training set. E.g. in the out-of-core example we keep the first 1000
instances for testing. If you just ask for predictions on texts taken from
other sources, the classification would probably be worse...
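
For instance, a minimal way to measure it on a held-out split (just a sketch;
docs and labels here stand for your own article texts and category names):

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# docs / labels: your article texts and their category names (placeholders)
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=42)

vec = HashingVectorizer(n_features=2 ** 18)  # stateless, so no fitting needed
clf = SGDClassifier().fit(vec.transform(docs_train), y_train)
pred = clf.predict(vec.transform(docs_test))
print(accuracy_score(y_test, pred))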

HTH

Eustache
Kartik Kumar Perisetla
2014-07-24 14:43:39 UTC
I actually used part of the text of one Wikipedia article that was used in
training. I was expecting it to detect the category for which it was used
as a training instance. But it predicted some other category, and thus I
thought it did not give an accurate prediction.

Please correct my understanding if it's wrong here.

Thanks,
Kartik
--
Regards,

Kartik Perisetla
Olivier Grisel
2014-07-24 22:30:59 UTC
Post by Kartik Kumar Perisetla
I actually used part of the text of one Wikipedia article that was used in
training. I was expecting it to detect the category for which it was used as a
training instance. But it predicted some other category, and thus I
thought it did not give an accurate prediction.
Please correct my understanding if it's wrong here.
Models can underfit, that is, fail to give perfect predictions even on
the training set.

For text classification, as for other tasks, underfitting can be caused by
problems at the feature-extraction level, inadequate model parameter
settings (e.g. the strength of the model regularization), an inadequate
model class, or label noise (bad quality of the class labels themselves).

A good way to understand model underfitting and overfitting (in
relation to the training set size) is to plot learning curves, both
for the score on the training set and on the validation set; see for
instance:

http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
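
For example, a rough sketch (docs and labels are placeholders for your raw
texts and their categories; the import moved to sklearn.model_selection in
newer releases):

import numpy as np
from sklearn.learning_curve import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([('vec', HashingVectorizer()), ('clf', SGDClassifier())])
sizes, train_scores, valid_scores = learning_curve(
    pipeline, docs, labels, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
print(train_scores.mean(axis=1))  # mean score on the training folds
print(valid_scores.mean(axis=1))  # mean score on the validation folds
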
--
Olivier
Olivier Grisel
2014-07-24 08:54:01 UTC
Post by Kartik Kumar Perisetla
Hello,
I am creating a content classifier using scikit-learn with HashingVectorizer
(using this as a reference:
http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html).
The training dataset I am using is Wikipedia. For example, for the "management"
category I am training it with a few articles related to management, i.e. each
entire article related to management is one training instance.
I trained with 50 categories and a total of ~4000 training instances.
That's less than 100 instances per category (assuming balanced
classes): this might not be enough labeled data to get good results.
Also, you should remove categories for which you have fewer than 10
examples, or better, collapse them all into a category called "other" (to
act as a neutral class for random topics not sufficiently covered by
your training set).
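
Something along these lines (sketch; labels is the list of category names,
one per article, and the threshold of 10 is the one mentioned above):

from collections import Counter

counts = Counter(labels)
labels = ['other' if counts[c] < 10 else c for c in labels]  # collapse rare categories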

Also, this dataset is small enough that you don't need to mess with a complex
out-of-core setup. Just load all your documents into memory at once;
this will make it easier to evaluate your models.
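
For example, if the articles are stored as one sub-folder per category (a
hypothetical layout and path), load_files reads everything into memory in
one go:

from sklearn.datasets import load_files

# hypothetical layout: wikipedia_articles/<category>/<article>.txt
bunch = load_files('wikipedia_articles', encoding='utf-8')
docs, labels = bunch.data, bunch.target  # raw texts and category ids
print(len(docs), bunch.target_names[:5])
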
Post by Kartik Kumar Perisetla
But when I test the prediction on a new sentence or text, it gives the wrong
prediction.
Well, this is machine learning; it will never be 100% correct. Make
sure to properly use cross-validation to quantify the quality of a
model, see:

http://scikit-learn.org/stable/modules/cross_validation.html
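
For example (sketch; docs and labels as above, and the imports live in
sklearn.model_selection in newer releases):

from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([('vec', HashingVectorizer()), ('clf', SGDClassifier())])
scores = cross_val_score(pipeline, docs, labels, cv=5)  # 5-fold accuracy
print(scores.mean(), scores.std())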

Also, you need to select the best model parameters via hyperparameter
(grid|random)-search. See:

http://scikit-learn.org/stable/modules/grid_search.html
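
Continuing the sketch above, a small grid over the vectorizer and classifier
parameters could look like this (the parameter values are only examples):

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

param_grid = {
    'vec__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    'clf__alpha': [1e-6, 1e-5, 1e-4],      # regularization strength
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)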

You can find models that tend to work well on text in this example:

http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html

You might also want to have a look at this tutorial:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Post by Kartik Kumar Perisetla
So my question is: do I need to break each Wikipedia article into sentences and
label each sentence with a category name to make it work correctly?
Not necessarily. You could try, but first do what I suggested above.
Post by Kartik Kumar Perisetla
Since I am using HashingVectorizer, my intuition is that it creates a hash for
the entire training instance and not for the tokens in it. Is that true?
No, each token (e.g. word) is hashed and mapped to a specific
feature. See the documentation:

http://scikit-learn.org/stable/modules/feature_extraction.html#hashing-vectorizer
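
A tiny illustration (sketch): even a one-sentence document yields one
non-zero feature per distinct token, not a single hash for the whole text:

from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2 ** 20)
X = vec.transform(["management is the administration of an organization"])
print(X.shape)  # (1, 1048576): one row for the document
print(X.nnz)    # 7: one non-zero column per distinct token (barring hash collisions)
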
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Lars Buitinck
2014-07-24 09:48:10 UTC
Post by Kartik Kumar Perisetla
Also, could someone please shed some light on how HashingVectorizer works?
https://larsmans.github.io/ilps-hashing-trick/
https://en.wikipedia.org/wiki/Feature_hashing
http://metaoptimize.com/qa/questions/6943/what-is-the-hashing-trick