2012-07-09 10:42:39 UTC
I am currently doing text classification. I have the following setup:
at most 1,500 training examples per class
around 90,000 training examples overall
about the same number of test examples
I am pretty happy with the classification results (~52% F1 score), which
is fine for my task.
But now I have another scenario. I have around 2,000,000 extra training
examples available, produced by a number of users. They do not
_directly_ correspond to the classes, but I still know the labels of
this data. If I train the classifier on this extra data alone (without
the original set), I achieve an F1 score of ~25%. So there is clearly
some information in this data that I want to incorporate into my
existing setup. For a few classes this extra data even works slightly
better, or at least similarly.
I have tried simply combining both datasets (90,000 + 2,000,000,
roughly as sketched below), but this makes the results worse (the test
set always stays the same). This is not surprising: a lot of noise is
added, and I suspect the huge amount of extra data drowns out the
existing set.
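
In case it matters, the naive combination looked roughly like this
sketch (scikit-learn assumed; the variable names are placeholders for
my actual data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Stack the clean and the extra examples and train on everything at once.
    texts = clean_texts + extra_texts        # 90,000 + 2,000,000 documents
    labels = clean_labels + extra_labels

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    clf = LinearSVC().fit(X, labels)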
My question is: how can I best incorporate this extra data to achieve
better classification results than with my first dataset alone? Maybe
someone has an idea, or there are established techniques for this.
Just for the record: I use TF-IDF features with an SVC, which works
best for me (a minimal sketch follows). I have also tried a different
approach using topic models.
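
For reference, the baseline setup is roughly the following (again
scikit-learn assumed, with placeholder variable names; I show a linear
SVC and macro-averaged F1 for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)   # ~90,000 clean examples
    X_test = vectorizer.transform(test_texts)

    clf = LinearSVC().fit(X_train, train_labels)
    pred = clf.predict(X_test)
    print(f1_score(test_labels, pred, average="macro"))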
Thanks and many regards,