Discussion:
Incorporation of extra training examples
Philipp Singer
2012-07-09 10:42:39 UTC
Permalink
Hey!

I am currently doing text classification. I have the following setup:

78 classes
max. 1500 training examples per class
overall around 90.000 training examples
about the same number of test examples

I am pretty happy with the classification results (~52% F1 score), which
is fine for my task.

But now I have another scenario. I have around 2.000.000 extra training
examples available which are produced by a certain set of users who do
not correspond _directly_ to the classes, but I still know the labels of
this data. If I train the classifier on this extra data alone (without
the original training set) I can achieve an F1 score of ~25%. So this
tells me that there is useful information in it that I want to
incorporate into my existing data. For a few classes this extra data
even works slightly better, or at least similarly.

I have simply tried to combine both datasets (90.000 + 2.000.000), but
this makes the results worse (the test data always stays the same). This
is not surprising, because a lot of noise is added and I think the huge
amount of extra data overwhelms the existing set.

My question now is how I can best incorporate this data in order to
achieve better classification results than with my first dataset alone.
Maybe someone has an idea, or there are known techniques for that.

Just for the record: I use tf-idf with an SVC, which works best. I have
also tried a different approach using topic models.

Thanks and many regards,
Philipp
Gael Varoquaux
2012-07-09 11:34:26 UTC
Permalink
Hi,

You can try setting this up as a semi-supervised learning problem and
using label propagation:

http://scikit-learn.org/stable/modules/label_propagation.html#label-propagation
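
A minimal sketch of that route with scikit-learn's LabelSpreading (the
variable names below are placeholders; unlabeled points are marked with
the label -1, and for millions of documents you would typically reduce
the dimensionality first, since the graph construction is expensive)::

    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    # X_small: a dense, reduced representation (e.g. after an SVD step) of the
    # labeled and the extra documents stacked together; -1 marks unlabeled rows
    y_all = np.concatenate([y_labeled, -np.ones(n_extra, dtype=int)])

    lp = LabelSpreading(kernel='knn', n_neighbors=10)
    lp.fit(X_small, y_all)
    inferred = lp.transduction_[len(y_labeled):]  # labels propagated to the extra data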

HTH,

G
Peter Prettenhofer
2012-07-09 11:47:25 UTC
Permalink
Hi,

some quick thoughts:

- if you use a multinomial Naive Bayes classifier (aka a language
model) you can fit a background model on the large dataset and use
that to smooth the model fitted on the smaller dataset.

- you should look at the domain adaptation / multi-task learning
literature - this might fit your setting better than traditional
semi-supervised learning.

best,
Peter
Post by Gael Varoquaux
Hi,
You can try setting this as a semi-supervised learning problem and using
http://scikit-learn.org/stable/modules/label_propagation.html#label-propagation
HTH,
G
--
Peter Prettenhofer
Philipp Singer
2012-07-09 12:34:34 UTC
Permalink
Post by Peter Prettenhofer
Hi,
Hey!
Post by Peter Prettenhofer
- if you use a multinomial Naive Bayes classifier (aka a language
model) you can fit a background model on the large dataset and use
that to smooth the model fitted on the smaller dataset.
That's a nice idea. Is there a simple way to try this out fast in
scikit-learn?
Post by Peter Prettenhofer
- you should look at the domain adaptation / multi-task learning
literature - this might fit your setting better than traditional
semi-supervised learning.
Thanks, I will look into that.
Post by Peter Prettenhofer
best,
Peter
Regards,
Philipp
Post by Peter Prettenhofer
Post by Gael Varoquaux
Hi,
You can try setting this as a semi-supervised learning problem and using
http://scikit-learn.org/stable/modules/label_propagation.html#label-propagation
HTH,
G
Peter Prettenhofer
2012-07-09 12:44:47 UTC
Permalink
Post by Philipp Singer
Post by Peter Prettenhofer
Hi,
Hey!
Post by Peter Prettenhofer
- if you use a multinomial Naive Bayes classifier (aka a language
model) you can fit a background model on the large dataset and use
that to smooth the model fitted on the smaller dataset.
That's a nice idea. Is there a simple way to try this out fast in
scikit-learn?
not really, you would need to write your own estimator that simply
takes the two MultinomialNB models as arguments and does the
smoothing in predict(_proba); shouldn't be too much code though. If
you do a simple linear interpolation the following should suffice
(note: ``lambda`` is a reserved word in Python, hence ``lambda_``)::

    def predict_proba(self, X):
        return (self.lambda_ * self.foreground.predict_proba(X)
                + (1.0 - self.lambda_) * self.background.predict_proba(X))


You could estimate lambda via EM but I'd rather tune using CV.
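
For reference, a minimal sketch of such a wrapper (class and attribute
names are made up; it assumes both MultinomialNB models were fitted on
features built with the same vocabulary and on the same set of classes,
so that their ``classes_`` line up)::

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin

    class InterpolatedNB(BaseEstimator, ClassifierMixin):
        """Linear interpolation of a foreground and a background NB model."""

        def __init__(self, foreground, background, lambda_=0.9):
            self.foreground = foreground  # fitted MultinomialNB on the small set
            self.background = background  # fitted MultinomialNB on the large set
            self.lambda_ = lambda_

        def fit(self, X=None, y=None):
            # nothing to do: both sub-models are assumed to be fitted already
            self.classes_ = self.foreground.classes_
            return self

        def predict_proba(self, X):
            return (self.lambda_ * self.foreground.predict_proba(X)
                    + (1.0 - self.lambda_) * self.background.predict_proba(X))

        def predict(self, X):
            return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

``lambda_`` can then be tuned with a simple grid over a held-out split, as
suggested above.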
Post by Philipp Singer
Post by Peter Prettenhofer
- you should look at the domain adaptation / multi-task learning
literature - this might fit your setting better than traditional
semi-supervised learning.
Thanks, I will look into that.
Post by Peter Prettenhofer
best,
Peter
Regards,
Philipp
Post by Peter Prettenhofer
Post by Gael Varoquaux
Hi,
You can try setting this as a semi-supervised learning problem and using
http://scikit-learn.org/stable/modules/label_propagation.html#label-propagation
HTH,
G
--
Peter Prettenhofer
Philipp Singer
2012-07-18 12:37:31 UTC
Permalink
Post by Peter Prettenhofer
Post by Philipp Singer
Post by Peter Prettenhofer
Hi,
Hey!
Post by Peter Prettenhofer
- if you use a multinomial Naive Bayes classifier (aka a language
model) you can fit a background model on the large dataset and use
that to smooth the model fitted on the smaller dataset.
That's a nice idea. Is there a simple way to try this out fast in
scikit-learn?
not really, you would need to write your own estimator that simply
takes the two MultinomialNB models as arguments and does the
smoothing in predict(_proba); shouldn't be too much code though. If
you do a simple linear interpolation it boils down to
return self.lambda_ * self.foreground.predict_proba(X) + (1.0 - self.lambda_) * self.background.predict_proba(X)
You could estimate lambda via EM but I'd rather tune using CV.
Hey!

I finally found some time to implement this. But I have stumbled upon
this problem:

In this case I would fit one MultinomialNB for the foreground model and
one for the background model. But how would I do the feature extraction
(I have text documents) in this case? Would I fit the vectorizer (e.g.,
tf-idf) on the whole corpus (foreground + background) and then transform
both datasets, as well as the test dataset, with the fitted vectorizer?

Regards,
Philipp
Olivier Grisel
2012-07-18 12:41:39 UTC
Permalink
Post by Philipp Singer
Post by Peter Prettenhofer
Post by Philipp Singer
Post by Peter Prettenhofer
Hi,
Hey!
Post by Peter Prettenhofer
- if you use a multinomial Naive Bayes classifier (aka a language
model) you can fit a background model on the large dataset and use
that to smooth the model fitted on the smaller dataset.
That's a nice idea. Is there a simple way to try this out fast in
scikit-learn?
not really, you would need to write your own estimator that simply
takes the two MultinomialNB models as arguments and does the
smoothing in predict(_proba); shouldn't be too much code though. If
you do a simple linear interpolation it boils down to
return self.lambda_ * self.foreground.predict_proba(X) + (1.0 - self.lambda_) * self.background.predict_proba(X)
You could estimate lambda via EM but I'd rather tune using CV.
Hey!
I finally found some time to implement this. But I have stumbled upon
In this case I would fit one MultinomialNB for the foreground model and
one for the background model. But how would I do the feature extraction
(I have text documents) in this case? Would I fit (e.g., tfidf) on the
whole corpus (foreground + background) and then transform both datasets
on the fitted infos and the test dataset as well?
Sounds reasonable.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Peter Prettenhofer
2012-07-18 13:32:49 UTC
Permalink
Post by Philipp Singer
In this case I would fit one MultinomialNB for the foreground model and
one for the background model. But how would I do the feature extraction
(I have text documents) in this case? Would I fit (e.g., tfidf) on the
whole corpus (foreground + background) and then transform both datasets
on the fitted infos and the test dataset as well?
Personally, I'd start without using IDF; otherwise, wrap both
estimators in a Pipeline and add a TfidfTransformer (see [1]).

best,
Peter

[1] http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html
--
Peter Prettenhofer
Philipp Singer
2012-07-18 13:35:59 UTC
Permalink
Post by Peter Prettenhofer
Post by Philipp Singer
In this case I would fit one MultinomialNB for the foreground model and
one for the background model. But how would I do the feature extraction
(I have text documents) in this case? Would I fit (e.g., tfidf) on the
whole corpus (foreground + background) and then transform both datasets
on the fitted infos and the test dataset as well?
Personally, I'd start without using IDF; Otherwise, wrap both
estimators using a Pipeline and add a TfidfTransformer (see [1]).
best,
Peter
[1] http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html
Yes, I am currently experimenting with tf only, but the vocabulary
still depends on the corpus.

Philipp
Peter Prettenhofer
2012-07-18 13:45:32 UTC
Permalink
Post by Philipp Singer
Post by Peter Prettenhofer
Post by Philipp Singer
In this case I would fit one MultinomialNB for the foreground model and
one for the background model. But how would I do the feature extraction
(I have text documents) in this case? Would I fit (e.g., tfidf) on the
whole corpus (foreground + background) and then transform both datasets
on the fitted infos and the test dataset as well?
Personally, I'd start without using IDF; Otherwise, wrap both
estimators using a Pipeline and add a TfidfTransformer (see [1]).
best,
Peter
[1] http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html
Yes, I am currently experimenting with tf only, but the vocabulary
still depends on the corpus.
I would fit the vectorizer on both datasets (such that the vocabulary
covers the union) and then fit the IDF transformers on each dataset
individually.

Disclaimer: I hardly use sklearn's text utilities
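
A minimal sketch of that setup (the variable names are placeholders and
assume raw text lists for the foreground, background and test documents)::

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # one shared vocabulary, learned from the union of both corpora
    vectorizer = CountVectorizer()
    vectorizer.fit(docs_foreground + docs_background)

    X_fg_counts = vectorizer.transform(docs_foreground)
    X_bg_counts = vectorizer.transform(docs_background)
    X_test_counts = vectorizer.transform(docs_test)

    # optionally one IDF transformer per dataset, as suggested above; for a
    # plain MultinomialNB you could also stay with the raw counts
    X_fg = TfidfTransformer().fit_transform(X_fg_counts)
    X_bg = TfidfTransformer().fit_transform(X_bg_counts)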
--
Peter Prettenhofer
Lars Buitinck
2012-07-18 13:49:53 UTC
Permalink
Post by Peter Prettenhofer
Post by Philipp Singer
Yes, I am currently experimenting with tf only, but the vocabulary
still depends on the corpus.
I would fit the vectorizer on both datasets (such that the vocabulary
covers the union) and then fit the IDF transformers on each dataset
individually.
Disclaimer: I hardly use sklearn's text utilities
You could determine the vocabulary, then pass it to CountVectorizer or
TfidfVectorizer in the constructor.
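
For instance (a sketch, with made-up variable names)::

    from sklearn.feature_extraction.text import TfidfVectorizer

    # build the shared vocabulary once, from a vectorizer fitted on everything
    vocab = TfidfVectorizer().fit(docs_foreground + docs_background).vocabulary_

    # each dataset then gets its own vectorizer, but with the fixed vocabulary,
    # so the IDF weights are estimated per dataset while the columns line up
    X_fg = TfidfVectorizer(vocabulary=vocab).fit_transform(docs_foreground)
    X_bg = TfidfVectorizer(vocabulary=vocab).fit_transform(docs_background)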

Also, I have a PR for a hashing vectorizer that does not need a
vocabulary at https://github.com/scikit-learn/scikit-learn/pull/909.
It's not ready for merging yet (and I hardly have time to work on it),
but it does work.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Philipp Singer
2012-07-20 07:59:27 UTC
Permalink
Post by Peter Prettenhofer
Post by Philipp Singer
In this case I would fit one MultinomialNB for the foreground model and
one for the background model. But how would I do the feature extraction
(I have text documents) in this case? Would I fit (e.g., tfidf) on the
whole corpus (foreground + background) and then transform both datasets
on the fitted infos and the test dataset as well?
Personally, I'd start without using IDF; Otherwise, wrap both
estimators using a Pipeline and add a TfidfTransformer (see [1]).
best,
Peter
[1] http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html
Everything works fine now. The sad thing, though, is that I still can't
really improve the classification results. The only thing I can achieve
is a higher recall for the classes that work well in the background
model, but the precision drops at the same time. Overall I stay at about
the same average score when incorporating the background model.

If anyone has any further ideas, please let me know ;)

Regards,
Philipp
Lars Buitinck
2012-07-20 09:47:14 UTC
Permalink
Post by Philipp Singer
Everything works fine now. The sad thing though is that I still can't
really improve the classification results. The only thing I can achieve
is to get a higher recall for the classes working well in the background
model, but the precision sinks at the same time. Overall I am staying at
about the same average score when incorporating the background model.
If anyone has any further ideas, please let me know ;)
Well, since Gael already mentioned semi-supervised training using
label propagation: I have an old PR which has still not been merged,
mostly because of API reasons, that implements semi-supervised
training of Naive Bayes using an EM algorithm:

https://github.com/scikit-learn/scikit-learn/pull/430

I've seen improvements in F1 score when doing text classification with
this algorithm. It may take some work to get this up to speed with the
latest scikit-learn, though.
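
(For reference, a much-simplified hard-label self-training loop in the
same spirit; the PR itself implements proper soft-label EM, and the
variable names below are placeholders.)::

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.naive_bayes import MultinomialNB

    clf = MultinomialNB().fit(X_labeled, y_labeled)
    for _ in range(5):
        # "E-step" (hardened): pseudo-label the unlabeled documents
        y_pseudo = clf.predict(X_unlabeled)
        # "M-step": refit on everything, down-weighting the pseudo-labeled part
        X_all = vstack([X_labeled, X_unlabeled])
        y_all = np.concatenate([y_labeled, y_pseudo])
        w = np.concatenate([np.ones(X_labeled.shape[0]),
                            0.1 * np.ones(X_unlabeled.shape[0])])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w)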

(Just out of curiosity, which topic models did you try? I'm looking
into these for my own projects.)
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Philipp Singer
2012-07-20 11:07:01 UTC
Permalink
Post by Lars Buitinck
Post by Philipp Singer
Everything works fine now. The sad thing though is that I still can't
really improve the classification results. The only thing I can achieve
is to get a higher recall for the classes working well in the background
model, but the precision sinks at the same time. Overall I am staying at
about the same average score when incorporating the background model.
If anyone has any further ideas, please let me know ;)
Well, since Gael already mentioned semi-supervised training using
label propagation: I have an old PR which has still not been merged,
mostly because of API reasons, that implements semi-supervised
https://github.com/scikit-learn/scikit-learn/pull/430
I've seen improvements in F1 score when doing text classification with
this algorithm. It may take some work to get this up to speed with the
latest scikit-learn, though.
Hey Lars,

Thanks, this looks awesome. I will try it out. The reason why I haven't
used label propagation techniques yet is that I could not achieve a fast
enough runtime, because I have huge amounts of unlabeled/background data
available.
Post by Lars Buitinck
(Just out of curiosity, which topic models did you try? I'm looking
into these for my own projects.)
We have been using Mallet's LDA-based Parallel Topic Model.

Philipp
Olivier Grisel
2012-07-20 12:48:55 UTC
Permalink
Post by Philipp Singer
Post by Lars Buitinck
Post by Philipp Singer
Everything works fine now. The sad thing though is that I still can't
really improve the classification results. The only thing I can achieve
is to get a higher recall for the classes working well in the background
model, but the precision sinks at the same time. Overall I am staying at
about the same average score when incorporating the background model.
If anyone has any further ideas, please let me know ;)
Well, since Gael already mentioned semi-supervised training using
label propagation: I have an old PR which has still not been merged,
mostly because of API reasons, that implements semi-supervised
https://github.com/scikit-learn/scikit-learn/pull/430
I've seen improvements in F1 score when doing text classification with
this algorithm. It may take some work to get this up to speed with the
latest scikit-learn, though.
Hey Lars,
Thanks, this looks awesome. I will try it out. The reason why I haven't
used label propagation techniques yet is, that I could not achieve a
fast runtime yet, because I have huge amounts of unlabeled/background
data available.
Post by Lars Buitinck
(Just out of curiosity, which topic models did you try? I'm looking
into these for my own projects.)
We have been using Mallet's LDA based Parallel Topic Model.
You could also try to extract the top 100 singular vectors using
sklearn.decomposition.RandomizedPCA or gensim.
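A sketch of that route; in current scikit-learn versions TruncatedSVD is
the component to use for a truncated SVD / LSA of a sparse term matrix
(RandomizedPCA, mentioned above, has since been removed). Variable names
are placeholders::

    from sklearn.decomposition import TruncatedSVD

    # reduce the sparse tf-idf matrices to 100 dense "topic-like" dimensions
    svd = TruncatedSVD(n_components=100)
    X_train_lsa = svd.fit_transform(X_train_tfidf)
    X_test_lsa = svd.transform(X_test_tfidf)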
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Philipp Singer
2012-07-20 13:29:52 UTC
Permalink
Post by Lars Buitinck
Well, since Gael already mentioned semi-supervised training using
label propagation: I have an old PR which has still not been merged,
mostly because of API reasons, that implements semi-supervised
https://github.com/scikit-learn/scikit-learn/pull/430
I've seen improvements in F1 score when doing text classification with
this algorithm. It may take some work to get this up to speed with the
latest scikit-learn, though.
Hey again!

I just tried out your implementation of semi-supervised MultinomialNB.
The code works flawlessly, but unfortunately the performance of the
algorithm drops sharply when I try to incorporate my additional data.

I am starting to think that my additional data is useless :/

Just for the record:

Training on my 96000 labeled examples with MultinomialNB gets me an F1
score of 0.47. Adding around 2.000.000 unlabeled examples with your
semi-supervised code gives an F1 score of 0.39.

Regards,
philipp
Lars Buitinck
2012-07-20 13:34:25 UTC
Permalink
Post by Philipp Singer
I just tried out your implementation of semi-supervised MultinomialNB.
The code works flawlessly, but unfortunately the performance of the
algorithm drops sharply when I try to incorporate my additional data.
I am starting to think that my additional data is useless :/
training on my 96000 labeled data with MultinomialNB gets me a f1-score
of 0.47. Using around 2.000.000 unlabeled additional data using your
semi-supervised code achieves a f1-score of 0.39
Hmm, too bad. Is the extra data from a very different source?
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Philipp Singer
2012-07-20 13:51:17 UTC
Permalink
Post by Lars Buitinck
Post by Philipp Singer
I just tried out your implementation of semi-supervised MultinomialNB.
The code works flawlessly, but unfortunately the performance of the
algorithm drops sharply when I try to incorporate my additional data.
I am starting to think that my additional data is useless :/
training on my 96000 labeled data with MultinomialNB gets me a f1-score
of 0.47. Using around 2.000.000 unlabeled additional data using your
semi-supervised code achieves a f1-score of 0.39
Hmm, too bad. Is the extra data from a very different source?
Not very different, but the documents are produced by a different kind of user.

I really thought that this data could somehow improve the whole
classification process, because fitting a model on the extra data alone
leads to an F1 score of 0.27, which is pretty good for that data.
Vlad Niculae
2012-07-09 11:59:58 UTC
Permalink
Another (hackish) idea to try would be to keep the labels of the extra
data but give it a sample_weight low enough not to override your good
training data.
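
A minimal sketch of that idea, assuming tf-idf matrices X_main / X_extra
with labels y_main / y_extra (placeholder names); SVC, SGDClassifier and
MultinomialNB all accept a ``sample_weight`` argument in ``fit``::

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.svm import SVC

    X = vstack([X_main, X_extra])
    y = np.concatenate([y_main, y_extra])
    # full weight for the trusted examples, a small weight for the noisy extras
    weights = np.concatenate([np.ones(X_main.shape[0]),
                              0.01 * np.ones(X_extra.shape[0])])

    clf = SVC(kernel="linear").fit(X, y, sample_weight=weights)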
Post by Philipp Singer
Hey!
78 classes
max 1500 train examples per class
overall around 90.000 train examples
same amount of test examples
I am pretty happy with the classification results (~52% f1 score) which
is fine for my task.
But now I have another scenario. I have around 2.000.000 extra training
examples available which are produced by a certain amount of users not
_directly_ corresponding for the classes but I still know the labels of
this data. If I train the classifier simply on this extra data (without
the correct one) I can achieve a F1-score of ~25%. So this somehow tells
me that there is information available that I now somehow want to
incorporate to my existing data. For some few classes this data even
works slightly better or at least similar.
I have simply tried to combine both datasets (90.000 + 2.000.000) but
this makes the results worse (test data amount always stays the same).
This is not surprising because a lot of noise is added to the data and I
think that the huge amount of extra data somehow overrules the existing one.
My question now is, how I can incorporate this data the best in order to
achieve better classification results than with my first dataset. Maybe
someone has an idea or there are some techniques for that.
Just for the record: I use Tf-Idf with a SVC which works best. I have
also tried a different approach using topic models.
Thanks and many regards,
Philipp
Philipp Singer
2012-07-09 12:32:12 UTC
Permalink
Post by Vlad Niculae
Another (hackish) idea to try would be to keep the labels of the extra
data but give it a sample_weight low enough not to override your good
training data.
That's actually a great and simple idea. Would I do that similarly to
this example:
http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html

So, as a starting point, something like a 10 times higher weight for the
corresponding samples?

I see that the fit method of LinearSVC doesn't have a sample_weight
parameter, so I guess I would have to switch to another method. SVC
unfortunately has a very long runtime compared to LinearSVC, but maybe
an SGDClassifier would work.

Regards,
Philipp
Post by Vlad Niculae
Post by Philipp Singer
Hey!
78 classes
max 1500 train examples per class
overall around 90.000 train examples
same amount of test examples
I am pretty happy with the classification results (~52% f1 score) which
is fine for my task.
But now I have another scenario. I have around 2.000.000 extra training
examples available which are produced by a certain amount of users not
_directly_ corresponding for the classes but I still know the labels of
this data. If I train the classifier simply on this extra data (without
the correct one) I can achieve a F1-score of ~25%. So this somehow tells
me that there is information available that I now somehow want to
incorporate to my existing data. For some few classes this data even
works slightly better or at least similar.
I have simply tried to combine both datasets (90.000 + 2.000.000) but
this makes the results worse (test data amount always stays the same).
This is not surprising because a lot of noise is added to the data and I
think that the huge amount of extra data somehow overrules the existing one.
My question now is, how I can incorporate this data the best in order to
achieve better classification results than with my first dataset. Maybe
someone has an idea or there are some techniques for that.
Just for the record: I use Tf-Idf with a SVC which works best. I have
also tried a different approach using topic models.
Thanks and many regards,
Philipp
Andreas Mueller
2012-07-10 20:57:39 UTC
Permalink
Post by Philipp Singer
Post by Vlad Niculae
Another (hackish) idea to try would be to keep the labels of the extra
data but give it a sample_weight low enough not to override your good
training data.
That's actually a great and simple idea. Would I do that similar to that
http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html
So like using a 10 times higher weight for the corresponding samples for
example as a starting point?
I see that the fit method of LinearSVC doesn't have a sample_weight
parameter, so I guess I would have to switch to another method. SVC
unfortunately has a very long runtime compared to LinearSVC, but maybe
an SGDClassifier would work.
You can use SVC with kernel="linear". That shouldn't be much slower than
LinearSVC.
Philipp Singer
2012-07-11 06:59:54 UTC
Permalink
Post by Andreas Mueller
You can use SVC with kernel="linear". That shouldn't be much slower than
LinearSVC.
Thanks for the hint. Unfortunately, the LinearSVC implementation is much
faster than the SVC implementation with a linear kernel.
Olivier Grisel
2012-07-11 08:02:00 UTC
Permalink
Post by Philipp Singer
Post by Andreas Mueller
You can use SVC with kernel="linear". That shouldn't be much slower than
LinearSVC.
Thanks for the hint. Unfortunately, the LinearSVC implementation is much
faster than the SVC implementation with a linear kernel.
It mostly depends on the number of samples and classes. For a low number
of classes and a medium number of samples (e.g. a couple of thousand),
SVC on dense data can be much faster (and more memory efficient too).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Philipp Singer
2012-07-11 08:05:27 UTC
Permalink
Post by Olivier Grisel
Post by Philipp Singer
Post by Andreas Mueller
You can use SVC with kernel="linear". That shouldn't be much slower than
LinearSVC.
Thanks for the hint. Unfortunately, the LinearSVC implementation is much
faster than the SVC implementation with a linear kernel.
It mostly depends on the number of samples and classes. For low number
of classes and medium number of samples (e.g. couple of thousands),
SVC on dense data can be much faster (and more memory efficient too).
I see! The thing is that I do text classification (so I have a huge
number of features) and I also have a large number of training examples,
which seems to slow down the SVC implementation. On the other hand, the
LinearSVC implementation works pretty fast.

I guess it should not be hard to implement sample weighting for
LinearSVC as well? I will take a look into it.
Olivier Grisel
2012-07-11 08:11:05 UTC
Permalink
Post by Philipp Singer
Post by Olivier Grisel
Post by Philipp Singer
Post by Andreas Mueller
You can use SVC with kernel="linear". That shouldn't be much slower than
LinearSVC.
Thanks for the hint. Unfortunately, the LinearSVC implementation is much
faster than the SVC implementation with a linear kernel.
It mostly depends on the number of samples and classes. For low number
of classes and medium number of samples (e.g. couple of thousands),
SVC on dense data can be much faster (and more memory efficient too).
I see! The thing is that I do text classification (so I have a huge
amount of features) and I also have a large number of training examples,
which seems to slow down the SVC implementation. On the other hand, the
LinearSVC implementation works pretty fast.
I guess, it should not be a hard task to implement sample weighting for
LinearSVC as well? I will take a look into it.
LinearSVC is based on the liblinear C++ library, which AFAIK does not
support sample weights. Have a look at SGDClassifier instead:

http://scikit-learn.org/stable/modules/sgd.html
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Philipp Singer
2012-07-11 08:14:24 UTC
Permalink
Post by Olivier Grisel
LinearSVC is based on the liblinear C++ library which AFAIK does not
support sample weight.
Well, that's true.
Post by Olivier Grisel
http://scikit-learn.org/stable/modules/sgd.html
I have already tried approaches like SGDClassifier or multinomial Naive
Bayes. I can improve these two classifiers with sample weighting, but
the thing is that LinearSVC without the incorporated data still
outperforms the other algorithms.

But I guess I will play around a bit more ;)
Vlad Niculae
2012-07-11 08:17:56 UTC
Permalink
Post by Philipp Singer
Post by Olivier Grisel
LinearSVC is based on the liblinear C++ library which AFAIK does not
support sample weight.
Well, that's true.
Post by Olivier Grisel
http://scikit-learn.org/stable/modules/sgd.html
I have already tried approaches like SGDC or Multinomial Naive Bayes. I
can improve these two classifiers with sample weighting, but the thing
is that LinearSVC without the incorporated data still outperforms the
other algorithms.
But I guess I will play around a bit more ;)
How did you set the sample_weight? I found this to be very difficult; specifically, 'auto' rarely improves anything.
------------------
Vlad N.
http://vene.ro
Gael Varoquaux
2012-07-11 08:20:51 UTC
Permalink
Post by Vlad Niculae
How did you set the sample_weight? I found this to be very difficult, specifically, 'auto' rarely improves anything.
I use grad students for this. They are slightly prone to overfitting, so
you need to bag them.

G

:)
Philipp Singer
2012-07-11 08:23:30 UTC
Permalink
Post by Vlad Niculae
Post by Philipp Singer
I have already tried approaches like SGDC or Multinomial Naive Bayes. I
can improve these two classifiers with sample weighting, but the thing
is that LinearSVC without the incorporated data still outperforms the
other algorithms.
But I guess I will play around a bit more ;)
How did you set the sample_weight? I found this to be very difficult, specifically, 'auto' rarely improves anything.
That's another problem, where I don't know exactly how to proceed. At
the moment I have weighted my 90.000 well-working examples with 1.0 and
the 2.000.000 incorporated examples with 0.01.

But I don't know if this is a legit approach. With Naive Bayes I can
improve my classification from 0.48 to 0.50 ;)

If someone can tell us some strategies for sample weighting I would be
happy.
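
One common strategy is to treat the down-weighting factor as just another
hyperparameter and tune it on a held-out split. A sketch (placeholder
variable names)::

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import f1_score

    best_w, best_f1 = None, -1.0
    for w_extra in [0.001, 0.01, 0.05, 0.1, 0.5, 1.0]:
        weights = np.concatenate([np.ones(X_main.shape[0]),
                                  w_extra * np.ones(X_extra.shape[0])])
        clf = MultinomialNB().fit(vstack([X_main, X_extra]),
                                  np.concatenate([y_main, y_extra]),
                                  sample_weight=weights)
        score = f1_score(y_valid, clf.predict(X_valid), average="macro")
        if score > best_f1:
            best_w, best_f1 = w_extra, score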