Discussion:
Improving the accuracy of classifier
adnan rajper
2012-02-02 08:45:04 UTC
Hi everybody,

I am using the multinomial naive Bayes and LinearSVC classifiers with default parameters to classify Twitter messages into two classes (positive or negative). I followed the tutorial at http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html. I tried "parameter tuning using grid search", but it gets too slow. Both classifiers (MultinomialNB and LinearSVC) give 75% accuracy. My problem is that I want to improve the accuracy, for instance to more than 80%. Is there any way to do it through scikit-learn?


thanks
Adnan
Gael Varoquaux
2012-02-02 08:49:19 UTC
Post by adnan rajper
I tried "parameter tuning using grid search", but it gets too slow. Both
classifiers (MultinomialNB and LinearSVC) give 75% accuracy. My problem is
that I want to improve the accuracy, for instance to more than 80%. Is
there any way to do it through scikit-learn?
Did you normalize your features?

Gael
Olivier Grisel
2012-02-02 09:12:57 UTC
Post by Gael Varoquaux
Did you normalize your features?
In the tutorial, TF-IDF normalization is applied automatically when
extracting the features, so that should be fine.

Adnan, you should try linear_model.Perceptron (on master only),
naive_bayes.MultinomialNB or linear_model.SGDClassifier instead of the
LinearSVC model. They should be faster to train and hence allow you to
perform a finer grid search on their parameters (read the documentation
and examples to understand how the parameters work for each of them).

In your case I would try to extract bigrams, use the elasticnet penalty
of SGDClassifier and do a grid search on alpha (and maybe rho too), as
sketched below.
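Something like the following sketch would do it (texts and labels are
placeholders for your tweets and sentiment labels; I am using current
module paths, whereas in 2012-era releases GridSearchCV lived in
sklearn.grid_search and l1_ratio was called rho):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# texts: list of tweet strings, labels: 0/1 sentiment labels (placeholders)
pipeline = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 2))),   # unigrams + bigrams
    ("clf", SGDClassifier(penalty="elasticnet")),   # mixed L1/L2 penalty
])
params = {
    "clf__alpha": [1e-6, 1e-5, 1e-4, 1e-3],
    "clf__l1_ratio": [0.15, 0.5, 0.85],             # the parameter called "rho" back then
}
search = GridSearchCV(pipeline, params, n_jobs=-1)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)

The same pipeline also lets you swap the "clf" step for Perceptron or
MultinomialNB to compare the faster models mentioned above.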

Then, if you still can't reach 80%, I would advise you to try to find
more training data. That's probably the easiest way to improve your
classification accuracy.

If you have more negative than positive examples, you can also try
setting class_weight="auto" for classifiers that support it.
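For instance (a sketch; note that newer scikit-learn releases spell
this value "balanced" rather than "auto"):

from sklearn.linear_model import SGDClassifier

# reweights classes inversely proportional to their frequencies
clf = SGDClassifier(class_weight="balanced")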

Also, you should have a look at the text of some badly classified
samples to gain some insight into why the classifier is failing on
those examples. That can tell you what kind of manually extracted
features would be beneficial to add to your feature extraction layer.
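A minimal way to do that, reusing the pipeline sketched above with the
placeholder texts / labels, is to predict on a held-out split and print
the mistakes:

import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)
pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)

# print a few misclassified tweets for manual inspection
errors = np.flatnonzero(pred != np.asarray(y_test))
for i in errors[:20]:
    print(y_test[i], pred[i], X_test[i])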
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
adnan rajper
2012-02-03 05:16:49 UTC
Hi,

Actually, I followed this tutorial: http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html. It uses TF-IDF normalization, so I have incorporated the same after removing URLs, user names and stop words.

Adnan


Peter Prettenhofer
2012-02-02 09:20:14 UTC
Hi Adnan,

can you give us some more specific information about your learning
task / dataset, including:

- number of samples

- number of features

- class distribution

- features (normalization, preprocessing)

best,
Peter
--
Peter Prettenhofer
adnan rajper
2012-02-02 09:51:32 UTC
Hi Peter,

number of samples: 1 million tweets
number of features: I use the bag-of-words model; in fact, I have followed this example: http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html. It uses TF-IDF normalization.
class distribution: equal number of positive and negative tweets
features: I removed the stop words, punctuation, URLs and user names.

Adnan

________________________________
From: Peter Prettenhofer <***@gmail.com>
To: scikit-learn-***@lists.sourceforge.net
Sent: Thursday, February 2, 2012 2:20 PM
Subject: Re: [Scikit-learn-general] Improving the accuracy of classifier

Hi Adnan,

can you give use some more specific information about your learning
task / dataset including:

- number of samples

- number of features

- class distribution

- features (normalization, preprocessing)

best,
Peter
Post by adnan rajper
hi everybody,
I am using multinomial and LinearSVC classifier with default parameters to
classify twitter messages into two classes (positive or negative). I
followed the tutorial
on http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html.
I tried "parameter tuning using grid search",  but it gets too slow. Both
classifiers (multinomial and LinearSVC) give 75% accuracy. My problem is
that I want to improve the accuracy, for instance I want to make it more
than 80%. Is there anyway to do it through scikit.
thanks
Adnan
------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Peter Prettenhofer
Peter Prettenhofer
2012-02-02 15:44:20 UTC
Ok, so I assume you are doing sentiment classification?

For millions of examples I definitely recommend using either naive
Bayes or SGDClassifier. I'd start with a Bernoulli NB as a baseline.
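A minimal version of that baseline (a sketch; texts and labels are
again placeholders for your tweets and sentiment labels) would binarize
the token counts and cross-validate BernoulliNB:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# binary=True records token presence/absence, which matches the Bernoulli model
baseline = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
print(cross_val_score(baseline, texts, labels).mean())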

Personally, I hardly ever use IDF weighting for sentiment classification;
words with low document frequency are usually proper nouns, which are
not that indicative of sentiment. Furthermore, typos have low document
frequency too...

I strongly recommend further token normalization (contractions,
negations, smileys, repeated chars), which lets you tackle the problem
of data sparseness (how many features do you have?); a sketch follows
after the next paragraph.

For sentiment classification, don't be too aggressive with punctuation
(e.g. repeated ! and ? are valuable indicators).
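Here is a sketch of such a normalizer; the regexes and placeholder
tokens (URL, USER, SMILEY_POS, ...) are illustrative assumptions, not a
tested rule set, and it deliberately keeps repeated ! and ? as tokens:

import re

def normalize_tweet(text):
    text = re.sub(r"https?://\S+", " URL ", text)        # URLs
    text = re.sub(r"@\w+", " USER ", text)               # user names
    text = re.sub(r"[:;]-?[)D]", " SMILEY_POS ", text)   # :) :-) ;) :D
    text = re.sub(r"[:;]-?\(", " SMILEY_NEG ", text)     # :( :-(
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # cooool -> cool
    text = re.sub(r"\bcan't\b", "can not", text)         # one example contraction
    text = re.sub(r"\b(not|no|never)\s+(\w+)", r"\1_\2", text)  # attach negations
    text = re.sub(r"(!{2,}|\?{2,})", r" \1 ", text)      # pad repeated ! and ? as tokens
    return text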

To further improve performance you can try SGDClassifier and tune
alpha via grid search (usually a coarse search will do); you don't
need more than a handful of epochs for a dataset of this size.
Personally, I prefer the modified Huber loss over the hinge loss (the
default), but that's more of a subjective choice.
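Concretely, something like this sketch (in 2012-era releases the epoch
count was the n_iter parameter, now max_iter; X and y stand for the
vectorized tweets and labels):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

clf = SGDClassifier(loss="modified_huber", max_iter=5)  # a handful of epochs
coarse = {"alpha": np.logspace(-7, -3, 5)}              # coarse grid on alpha
search = GridSearchCV(clf, coarse, n_jobs=-1)
search.fit(X, y)                                        # X, y are placeholders
print(search.best_params_)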

As Olivier suggested, bigrams may help, but they make the data
sparseness problem even worse - so try to counter that with more
aggressive regularization.

Hope this helps,

Peter
--
Peter Prettenhofer
adnan rajper
2012-02-03 05:15:06 UTC
Yes Peter, indeed I am doing sentiment classification.


Your suggestions are highly appreciated. Sorry, but I am not able to understand your question: "how many features do you have?". Would you care to elaborate?

Thanks a million, again.

Adnan


Olivier Grisel
2012-02-03 08:34:17 UTC
Post by adnan rajper
Yes Peter, indeed I am doing sentiment classification.
Your suggestions are highly appreciated. Sorry, but I am not able to
understand your question: "how many features do you have?". Would you
care to elaborate?
In scikit-learn parlance, if you have a 2D data matrix / array X, its
shape is (n_samples, n_features). In other words, the number of
features extracted by your vectorizer should be X.shape[1].
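For example (texts is a placeholder list of tweet strings):

from sklearn.feature_extraction.text import TfidfVectorizer

X = TfidfVectorizer().fit_transform(texts)
print(X.shape)      # (n_samples, n_features)
print(X.shape[1])   # the number of extracted features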
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
adnan rajper
2012-02-03 09:33:31 UTC
It extracted 155646 features.

