Discussion:
[Scikit-learn-general] CountVectorizer followed by Binarizer doesn't work
Lars Buitinck
2011-06-01 15:35:35 UTC
Hi all,

I'm not sure I'm even supposed to try this, but I did it anyway:

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('bin', Binarizer()),
    ('clf', BernoulliNB()),
])

should, I thought, count term occurrences and then transform them to
binary features to be used in a Bernoulli naive Bayes classifier.[1]
However, fitting this pipeline fails:

Traceback (most recent call last):
  File "examples/bernoulli_naive_bayes.py", line 34, in <module>
    bnb.fit(docs_train, data_train.target)
  File "/scratch/apps/src/scikit-learn/scikits/learn/pipeline.py", line 141, in fit
    Xt = self._pre_transform(X, y, **params)
  File "/scratch/apps/src/scikit-learn/scikits/learn/pipeline.py", line 137, in _pre_transform
    Xt = transform.fit(Xt, y).transform(Xt)
  File "/scratch/apps/src/scikit-learn/scikits/learn/preprocessing/__init__.py", line 125, in transform
    X[cond] = 1
TypeError: 'coo_matrix' object does not support item assignment

So the question is: is this a bug in Binarizer, is this a bug in
CountVectorizer or did I do something immoral/illegal/invalid?
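
For readers who want to see the failure in isolation, here is a minimal sketch, assuming NumPy and SciPy are installed. The toy matrix and the variable names are illustrative and are not taken from the code in this thread. It probes item assignment on a COO matrix (which raised TypeError in the traceback above; other SciPy versions may behave differently), then binarizes through a CSR copy instead.

```python
import numpy as np
import scipy.sparse as sp

# Toy term-count matrix in COO format (values are illustrative only).
counts = sp.coo_matrix(np.array([[0, 2, 1], [3, 0, 0]]))

# Probe item assignment on a copy so the original counts stay untouched.
probe = counts.copy()
try:
    probe[0, 1] = 1
    print("this SciPy version allows item assignment on COO matrices")
except (TypeError, NotImplementedError) as exc:
    print("item assignment failed:", exc)

# Work around it: convert to CSR and binarize only the stored values.
binary = counts.tocsr(copy=True)
binary.data = (binary.data > 0).astype(binary.dtype)
```

Rewriting the `data` array keeps the matrix sparse, which matters for large vocabularies.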

Regards,
Lars


[1] https://github.com/larsmans/scikit-learn/commit/5f87e43cb462d4df4b982254746b4ce3dc79a1b4
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Vlad Niculae
2011-06-01 15:47:24 UTC
Hello!

It looks to me like the BernoulliNB does not support sparse matrices.
My approach when faced with something similar (I didn't know about the
Binarizer at the time) was to write my own vectorizer extending the
CountVectorizer as described on my blog[1]. Basically the
InfinitivesExtractor turns the sparse matrix into a dense one, and
truncates positive values to one.
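
The densify-and-truncate idea can be sketched roughly as follows, assuming NumPy and SciPy. The function name `densify_and_binarize` and the toy data are hypothetical; this is not the InfinitivesExtractor code from the blog post.

```python
import numpy as np
import scipy.sparse as sp


def densify_and_binarize(counts):
    """Return a dense 0/1 array from a sparse matrix of non-negative counts."""
    dense = counts.toarray()       # sparse -> dense ndarray
    return np.minimum(dense, 1)    # truncate every positive count to one


# Toy term-count matrix (values are illustrative only).
counts = sp.csr_matrix(np.array([[0, 2, 1], [3, 0, 0]]))
binary = densify_and_binarize(counts)
```

Note that densifying a document-term matrix with a large vocabulary can use a lot of memory, so this is mainly practical for small corpora.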

Maybe you could use preprocessing.Binarizer instead of
preprocessing.sparse.Binarizer?

Of course, other people can help you more than I can on this subject,
I've never used BernoulliNB. But maybe my 2c can be of use to you.

[1] http://venefrombucharest.wordpress.com/2011/04/29/customizing-scikits-learn-for-a-specific-text-analysis-task/

Best,
Vlad
Mathieu Blondel
2011-06-01 15:48:34 UTC
Post by Lars Buitinck
So the question is: is this a bug in Binarizer, is this a bug in
CountVectorizer or did I do something immoral/illegal/invalid?
It's a bug in Binarizer. Can you replace

if not sp.isspmatrix_csr(X) and not sp.isspmatrix_csc(X):

by

if not sp.isspmatrix_csr(X) and not sp.isspmatrix_csr(X):

and check that it works?

Also, if you are going to create a special class for BernoulliNB, I would
do the binarization directly in that class.
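
As a rough illustration of binarizing inside the classifier, here is a sketch that assumes a current scikit-learn release (the `sklearn` package rather than the `scikits.learn` used in this thread), where BernoulliNB takes a `binarize` threshold parameter. The documents and labels are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 1 = spam-like, 0 = not spam-like.
docs = [
    "spam spam offer",
    "cheap offer now",
    "meeting at noon",
    "lunch meeting today",
]
labels = [1, 1, 0, 0]

# No separate Binarizer step: the classifier thresholds the counts itself,
# mapping every value greater than `binarize` to 1 and the rest to 0.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", BernoulliNB(binarize=0.0)),
])
pipe.fit(docs, labels)
prediction = pipe.predict(["cheap spam offer"])
```

With this design the pipeline has one step fewer, and the sparse count matrix is passed straight to the classifier.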

Mathieu
Mathieu Blondel
2011-06-01 15:51:31 UTC
Ooops, forget about the comment above.

The matrix doesn't get converted to CSR as expected so there's
definitely a bug somewhere.

Mathieu
Lars Buitinck
2011-06-01 15:53:13 UTC
Post by Mathieu Blondel
Post by Lars Buitinck
So the question is: is this a bug in Binarizer, is this a bug in
CountVectorizer or did I do something immoral/illegal/invalid?
It's a bug in Binarizer. Can you replace
by
and check that it works?
My apologies, I included the wrong (dense) Binarizer. Never mind!
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Mathieu Blondel
2011-06-01 15:57:13 UTC
Post by Lars Buitinck
My apologies, I included the wrong (dense) Binarizer. Never mind!
Olivier is working on merging the dense and sparse preprocessing
modules into a single module. Yet another proof that it will be more
natural and less error-prone for the user.

Mathieu
Olivier Grisel
2011-06-01 16:04:01 UTC
Post by Mathieu Blondel
Post by Lars Buitinck
My apologies, I included the wrong (dense) Binarizer. Never mind!
Olivier is working on merging the dense and sparse preprocessing
modules into a single module. Yet another proof that it will be more
natural and less error-prone for the user.
I had put that work on standby. I will resume ASAP, maybe tonight.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel