Minkoo
12 years ago
Hi scikit-learn,
I have a question about using HashingVectorizer with TF-IDF. Currently, I'm
trying to build a model to predict classes for a large set of documents.
I'm using HashingVectorizer because my data is large: since I can feed
documents to it in batches, it's a perfect fit for my corpus, which is
about 6.8 GB.
On the other hand, TfidfVectorizer does not support processing documents in
batches; it needs to hold the entire feature matrix in memory.
What I'm thinking instead is to combine the two. Though the whole corpus is
6.8 GB, a CSV containing (term, IDF) pairs is only about 290 MB, so I think
I can load the IDF table into memory.
But I couldn't find a good way to apply the IDF table while HashingVectorizer
builds the feature vectors. The normalization step seems like the natural
place to extend HashingVectorizer with the IDF table, but it's currently
hard-wired to the 'normalize' function.
Has anyone tried a similar approach? If not, what is the reason? Is it
because TF-IDF is not useful for document classification, or is there some
other reason?
Please advise.
Thanks
Minkoo