Discussion: High training error for small datasets

Kai Kuehne
2012-06-22 08:42:40 UTC
Hi,
I posted this question a few days ago on IRC, shortly before my
internet connection broke down, so apologies if you have already read it.

I'm currently building a simple classification system and am trying to
use learning curves to check whether my model suffers from high bias
or high variance.
I (think I) followed the instructions on this page:
http://jakevdp.github.com/tutorial/astronomy/practical.html
So, if I understood this correctly, the training error should be small
for small training sets.
But, in my implementation and for my corpus, the training error starts
high: http://i.imgur.com/j4MNx.png
I calculate the error for every m like this: http://dpaste.com/761794/
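In essence it is a loop like this (a simplified sketch, not the exact
paste; make_classifier is a hypothetical stand-in for however the model
is built):

import numpy as np

def training_error_curve(make_classifier, X, y, sizes):
    # For each training set size m, fit a fresh model on the first m
    # examples and measure the error rate on those same m examples.
    errors = []
    for m in sizes:
        clf = make_classifier()  # hypothetical factory for the model
        clf.fit(X[:m], y[:m])
        errors.append(1.0 - clf.score(X[:m], y[:m]))
    return np.array(errors)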

Does any of you have tips on what I did wrong here?
Thank you!
Andreas Mueller
2012-06-22 10:07:17 UTC
Hi Kai.
What kind of classifier are you using and with which parameters?

Cheers,
Andy
Olivier Grisel
2012-06-22 10:15:39 UTC
Maybe the machine learning algorithm stops before reaching actual
convergence? What kind of data are you using, and with what dimensions?
What type of model and which parameters?

Here is an alternative implementation of the learning curves:

https://gist.github.com/1540431

They behave as expected in this case.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Kai Kuehne
2012-06-22 10:34:02 UTC
Hi,
I'm using a multinomial naive Bayes estimator for binary classification.
The texts are rather short. In this case the dataset size is 1000, but I
see the same kind of problem with datasets of 20000 (or 80000) texts.

The Pipeline I'm using looks like this (all parameters at their defaults):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
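For reference, I fit and score it like this (made-up strings here, not
my real corpus):

docs = ["spam spam spam", "ham and eggs", "more spam", "just ham"]
labels = [1, 0, 1, 0]  # made-up binary labels
clf.fit(docs, labels)
print(1.0 - clf.score(docs, labels))  # error on the training texts themselves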

I'm trying the code in the gist, thanks!
Kai Kuehne
2012-06-22 14:16:33 UTC
Hi again,

I looked at Olivier's diagrams. I plot the error, while Olivier plots the
classification score, which starts low (i.e. high error). So it is the
same problem. Or is this not actually a problem? I'm getting confused.

I'm trying to grasp the concept based on
http://jakevdp.github.com/_images/plot_bias_variance_examples_4.png
where, in both cases, the training error starts small.

But then I looked at this picture:
http://jakevdp.github.com/_images/plot_bias_variance_examples_3.png
There I see that for small degrees the classifier always has high bias
(all sets show high error rates). When I use the naive Bayes classifier,
what is the degree? Isn't it always 1, which would lead to the observed
behavior?

Thanks!
Kai Kuehne
2012-06-22 14:26:55 UTC
I didn't explain well what it is I don't understand.
Let me try again...

In this first picture:
http://jakevdp.github.com/_images/plot_bias_variance_examples_3.png

Both the training and the cross-validation error start high, so there is
high bias when the degree is small.

On the second picture:
http://jakevdp.github.com/_images/plot_bias_variance_examples_4.png

The left side shows d = 1, so a low degree.
The cross-validation error starts high, but, and this is the thing I both
don't understand and cannot reproduce, the training error starts small.
The first diagram suggests that both start high for small degrees...
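To make my confusion concrete, here is a toy check (plain numpy, nothing
to do with my corpus) of the training error of a high-bias model on its
own training points:

import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 5)  # only five training points
y = x ** 2                 # the target is truly quadratic

line = np.polyfit(x, y, 1)  # d = 1: a straight line, high bias
print(np.mean((np.polyval(line, x) - y) ** 2))  # clearly non-zero

quartic = np.polyfit(x, y, 4)  # d = 4: can interpolate all five points
print(np.mean((np.polyval(quartic, x) - y) ** 2))  # essentially zero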
Olivier Grisel
2012-06-22 14:47:04 UTC
I don't really know, but I think those curves should be recomputed to
display the mean across 10 runs of 10-fold CV, along with the standard
error of the mean as error bars, much like on my graphs (where I used
the standard deviation instead).
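Something along these lines (a plain numpy sketch; make_classifier is a
hypothetical stand-in that builds a fresh model, and X and y are assumed
to be numpy arrays):

import numpy as np

def cv_mean_and_sem(make_classifier, X, y, n_runs=10, n_folds=10, seed=0):
    # Repeat k-fold CV n_runs times with different shuffles; return the
    # mean test score and the standard error of that mean.
    rng = np.random.RandomState(seed)
    scores = []
    for run in range(n_runs):
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        for k in range(n_folds):
            train = np.concatenate(folds[:k] + folds[k + 1:])
            clf = make_classifier()  # hypothetical factory for the model
            clf.fit(X[train], y[train])
            scores.append(clf.score(X[folds[k]], y[folds[k]]))
    scores = np.asarray(scores)
    return scores.mean(), scores.std() / np.sqrt(len(scores))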
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2012-06-22 15:14:54 UTC
In my case, when you have few training examples, the training score is
good (unless you have very high bias, a strong regularizer, or a very
simple model) but the test score (as estimated by cross-validation) is bad.

This is exactly what is to be expected, as explained in Jake's tutorial
and in Andrew Ng's online videos.
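As a toy illustration of the "very high bias / regularizer" case (random
count features standing in for real text; alpha is MultinomialNB's
smoothing parameter):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(20, 50))  # 20 examples, 50 count features
y = rng.randint(0, 2, size=20)         # random binary labels

for alpha in (0.01, 1.0, 100.0):
    clf = MultinomialNB(alpha=alpha)
    clf.fit(X, y)
    # with heavy smoothing the model becomes too simple to fit even
    # its own training set, so the training score tends to drop
    print(alpha, clf.score(X, y))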


--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel