Discussion:
[Scikit-learn-general] Using TFxIDF with HashingVectorizer
Minkoo
12 years ago
Hi scikit-learn,

I have a question about using HashingVectorizer with TFxIDF. Currently, I'm
trying to build a model to predict classes for a large set of documents.

I'm using HashingVectorizer because my data is large. Since I can feed
batches of documents to HashingVectorizer, it's a perfect fit for my data,
which is about 6.8 GB.

TfidfVectorizer, on the other hand, does not support processing documents in
batches; it needs to load the entire feature matrix into memory.

What I'm thinking instead is to merge the two. Although the entire document
set is 6.8 GB, a CSV containing "(term, IDF)" pairs is only about 290 MB, so
I think I can load the IDF table into memory.

But I couldn't find a good way to use the IDF table when HashingVectorizer
builds the feature vectors. The normalization step seems like the place to
hook the IDF table into HashingVectorizer, but it's currently tied to the
'normalize' function.
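
The closest I've come is to apply the IDF table after hashing, roughly like
the sketch below (the file name is made up, and it assumes every term in the
CSV is a single token that survives the vectorizer's tokenization):

import csv
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

n_features = 2 ** 20
hv = HashingVectorizer(n_features=n_features, norm=None)

# Build an idf vector in the hashed feature space: hash each term to find
# its column, then store its idf there (colliding terms simply overwrite).
idf = np.ones(n_features)
with open("idf_table.csv") as f:              # my "(term, IDF)" CSV
    for term, term_idf in csv.reader(f):
        col = hv.transform([term]).indices[0]
        idf[col] = float(term_idf)

def transform_batch(docs):
    X = hv.transform(docs)                    # hashed term frequencies
    return X @ sp.diags(idf)                  # scale each column by its idf

But this feels clumsy, and it doesn't deal with hash collisions at all.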

Has anyone tried a similar approach? If not, what is the reason? Is it
because TFxIDF is not useful for document classification, or is there some
other reason?

Please advise.

Thanks
Minkoo
Olivier Grisel
12 years ago
You can use a Pipeline to combine a TfidfTransformer with the HashingVectorizer.
--
Olivier
Lars Buitinck
12 years ago
Post by Minkoo
I have a question about using HashingVectorizer with TFxIDF. Currently, I'm
trying to build a model to predict classes for a large set of documents.
TfidfVectorizer, on the other hand, does not support processing documents in
batches; it needs to load the entire feature matrix into memory.
That's because tf-idf needs two passes over the dataset, while
HashingVectorizer is intended as a memoryless, single-pass method.
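
Concretely, the two passes would look something like this if you did them
over batches yourself (just a sketch; iter_batches() stands in for whatever
yields your document batches):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(norm=None)
n_docs = 0
df = np.zeros(hv.n_features)

# Pass 1: count, for each hashed feature, in how many documents it occurs.
for batch in iter_batches():
    X = hv.transform(batch)
    n_docs += X.shape[0]
    df += np.bincount(X.nonzero()[1], minlength=hv.n_features)

# Smoothed idf, as TfidfTransformer computes it.
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Pass 2: hash each batch again and reweight it by the learned idf.
for batch in iter_batches():
    X_tfidf = hv.transform(batch).multiply(idf).tocsr()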
Post by Minkoo
But I couldn't find a good way to use the IDF table when HashingVectorizer
builds the feature vectors. The normalization step seems like the place to
hook the IDF table into HashingVectorizer, but it's currently tied to the
'normalize' function.
Normalization has little to do with tf-idf; it just means that the document
vectors are normalized so that cosine similarities work and learners don't
get overly extreme values as input (note that cosine similarity and tf-idf
are orthogonal concepts, even though IR textbooks commonly treat them as a
pair). The way to combine HashingVectorizer and TfidfTransformer is:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

hashing = HashingVectorizer(non_negative=True, norm=None)
tfidf = TfidfTransformer()
hashing_tfidf = Pipeline([("hashing", hashing), ("tfidf", tfidf)])
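
Fitting the pipeline then learns the idf weights from the hashed counts and
applies them in one go, e.g. (raw_documents being your corpus):

X = hashing_tfidf.fit_transform(raw_documents)

Note that the TfidfTransformer still needs the whole hashed matrix in memory
to fit, but that sparse matrix is typically much smaller than the raw text.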
Apu Mishra
11 years ago
Post by Lars Buitinck
The way to combine HashingVectorizer and TfidfTransformer is:
hashing = HashingVectorizer(non_negative=True, norm=None)
tfidf = TfidfTransformer()
hashing_tfidf = Pipeline([("hashing", hashing), ("tfidf", tfidf)])
I notice your use of the non_negative option in HashingVectorizer() when
following hashing with TF-IDF.

Since using non_negative discards some information, I am curious whether
there is any harm in allowing negative values as inputs to the TF-IDF
function. In the general case, feature values, whether positive or negative,
should simply be scaled up according to how document-infrequent they are, so
I don't see the harm in allowing negative values.

-Apu
Lars Buitinck
11 years ago
...
non_negative=True is a hack, and yes, it throws away information, and yes, I
think we could define tf-idf for negative values by computing the idf on the
absolute values. It's just that no one has done so. The first step would be
to work out the repercussions: if a feature has zero value everywhere, it may
still have been seen but cancelled out by the hasher's sign-based collision
handling, so the df statistic is no longer reliable. Is that acceptable? Can
we honestly call the output of this hack tf-idf?
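
To make "idf on the absolute values" concrete, the hack would be roughly the
following (not an existing scikit-learn option, just a sketch):

import numpy as np
import scipy.sparse as sp

def signed_hashing_tfidf(X):
    # X: output of HashingVectorizer(norm=None), possibly with negative
    # values from the signed hash. A feature counts as "present" in a
    # document whenever its value is nonzero, whatever the sign.
    X = sp.csr_matrix(X)
    n_docs, n_feats = X.shape
    df = np.bincount(abs(X).nonzero()[1], minlength=n_feats)
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0   # smoothed idf
    return X @ sp.diags(idf)                          # signs are preserved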
