Discussion:
Possible bug about RidgeClassifier and a question about Tree
SK Sn
2011-11-16 16:54:32 UTC
Hi there,

I experienced abnormal behavior of RidgeClassifier in the context of text
classification.

*Test setup:* ~800 documents, ~2500 features, 15 classes, scikit-learn dev
version (from a few days ago), classification with KFold.
*Problem:*
When RidgeClassifier is tested, different results (f1, precision, recall) are
generated depending on the format of X, i.e. scipy.sparse
vs. numpy.ndarray (via toarray) or numpy.matrixlib.defmatrix.matrix (via
todense).
The differences in results (f1/precision/recall) between the sparse X and
X.todense() or X.toarray() are about -0.5% to +1.0%.
*Tests:*
I tested the full-feature scenario, a feature-selection scenario, and a
subset-of-classes scenario; the difference occurs in all of them.
Other classifiers that can operate on scipy.sparse were also tested, and
none of them has this problem: kNN, Naive Bayes, LinearSVC, SGDClassifier.

So I reckon this may be a bug in Ridge itself. Does anyone know which
result, the sparse one or the toarray/todense one, is the correct one that I
should report?
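For concreteness, the comparison can be sketched as follows (a minimal
example on hypothetical synthetic data, not the original corpus; modern
scikit-learn versions may well not reproduce the discrepancy):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import RidgeClassifier

rng = np.random.RandomState(0)
X_dense = rng.rand(80, 50)
X_dense[X_dense < 0.8] = 0.0          # mostly zeros, like tf-idf features
y = rng.randint(0, 3, size=80)
X_sparse = sp.csr_matrix(X_dense)

# Fit on the same data once in sparse format, once densified:
pred_sparse = RidgeClassifier().fit(X_sparse, y).predict(X_sparse)
pred_dense = RidgeClassifier().fit(X_dense, y).predict(X_dense)

# Fraction of predictions on which the two formats agree:
agreement = np.mean(pred_sparse == pred_dense)
print(agreement)
```

In the runs described above, the metrics computed from the two prediction
sets differ by a fraction of a percent.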

Another question, about how to use the tree classifier: in the experimental
setting mentioned above, I get results, say f1 scores, around 83%-90% using
the various classifiers mentioned above with parameter tuning.
However, when I tried the tree classifier, my results were always below 65%.

I tried to tune different parameters but never got a substantial improvement.
I looked into a few textbooks and papers, but still could not figure out
what I should do in practice to get results from the tree classifier
comparable to those of the other classifiers.

Could you please shed some light on using trees with high-dimensional data,
or refer me to a practical guide on tree classifiers? Any help would be
appreciated!
Olivier Grisel
2011-11-16 22:39:54 UTC
Post by SK Sn
Hi there,
I experienced abnormal behaviors of RidgeClassifier in context of text
classification.
Test setup: ~800 documents, ~2500 features, 15 classes, scikit-learn dev
version (version few days ago), classification with KFold.
When RidgeClassifier is tested, different results (f1,precision,recall) are
generated when X is in different formats, i.e. scipy.sparse
vs ( numpy.ndarray(by toarray) or numpy.matrixlib.defmatrix.matrix(by
todense) ).
You should never use dense matrices: use either scipy.sparse matrices or
numpy arrays. For text data, you should probably stick to estimators that
work on scipy.sparse input.
Post by SK Sn
The difference of results (f1/precision/recall) between X sparse and
(X.todense() or X.toarray()) are about -0.5% to +1.0%.
Always use X.toarray() if you really need to materialize a dense
representation of a sparse dataset. X.todense() is a trap.
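A minimal sketch of why the trap matters: the two calls return the same
numbers but with very different semantics, since todense() gives back an
np.matrix, on which `*` means matrix product rather than element-wise
multiplication.

```python
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix(np.array([[1.0, 2.0], [3.0, 4.0]]))

A = X.toarray()   # plain numpy.ndarray
M = X.todense()   # numpy.matrix subclass

# Same values, different semantics for the very same expression:
elementwise = A * A          # element-wise product for ndarray
matmul = np.asarray(M * M)   # matrix product for np.matrix
print(elementwise)
print(matmul)
```

Code written for ndarrays can thus silently compute something else when
handed the result of todense().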
Post by SK Sn
Tested in full feature scenario, feature selection scenario, and parts of
classes scenario, this difference all occurs.
Other classifiers that can operate on scipy.sparse are tested, none of them
have this problem. Namely, kNN, Naive Bayes, LinearSVC, SGDClassifier.
Can you please provide a minimalistic reproduction script that
highlights the issue, as a gist (see http://gist.github.com )? Maybe
using the 20 newsgroups dataset, for instance.

As for decision trees, I think it's normal that a single tree gives
bad results. The future Random Forest implementation should improve
upon that but I don't think the current code base supports sparse data
as input (as is the case for dense data).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2011-11-16 22:40:59 UTC
Post by Olivier Grisel
I don't think the current code base supports sparse data
as input (as is the case for dense data).
Sorry I meant: "as is the case for *text* data".
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Lars Buitinck
2011-11-17 10:05:47 UTC
Post by Olivier Grisel
You should never use dense matrices: either scipy.sparse or numpy
arrays. For text data, you should probably stick to estimators that
work on scipy.sparse input.
In the current release.
Post by Olivier Grisel
Always use X.toarray() if you really need to materialize a dense
representation of a sparse dataset. X.todense() is a trap.
The next release will add support for samples in np.matrix objects, though.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Olivier Grisel
2011-11-17 15:33:23 UTC
Post by Lars Buitinck
Post by Olivier Grisel
You should never use dense matrices: either scipy.sparse or numpy
arrays. For text data, you should probably stick to estimators that
work on scipy.sparse input.
In the current release.
Post by Olivier Grisel
Always use X.toarray() if you really need to materialize a dense
representation of a sparse dataset. X.todense() is a trap.
The next release will add support for samples in np.matrix objects, though.
Even if scikit-learn becomes less broken w.r.t. the np.matrix object, I
would still advise everybody not to use the np.matrix data structure, as
its API is often more misleading than helpful.
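Two of the classic surprises, as a quick sketch:

```python
import numpy as np

a = np.arange(4)       # 1-D ndarray, shape (4,)
m = np.matrix(a)       # silently promoted to shape (1, 4)

# np.matrix never drops to 1-D, which breaks code written for arrays:
row = m[0]             # still shape (1, 4), not (4,)
flat = m.ravel()       # still a (1, 4) matrix, not a flat array
print(row.shape, flat.shape)
```

Combined with `*` meaning matrix product, these quirks make np.matrix easy
to misuse in code that otherwise expects ndarrays.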
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
SK Sn
2011-11-17 15:40:19 UTC
Thanks, guys, for the detailed explanation. It is clear to me now.

But, just to clarify the original problem: the results (f1 etc.) from
X.todense() and X.toarray() are the same, and both differ from those for
the sparse X (scipy.sparse).

Cheers.
Post by Olivier Grisel
Post by Lars Buitinck
Post by Olivier Grisel
You should never use dense matrices: either scipy.sparse or numpy
arrays. For text data, you should probably stick to estimators that
work on scipy.sparse input.
In the current release.
Post by Olivier Grisel
Always use X.toarray() if you really need to materialize a dense
representation of a sparse dataset. X.todense() is a trap.
The next release will add support for samples in np.matrix objects,
though.
Even if scikit-learn becomes less broken w.r.t. the np.matrix object, I
would still advise everybody not to use the np.matrix data structure, as
its API is often more misleading than helpful.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Mathieu Blondel
2011-11-17 06:33:31 UTC
Post by SK Sn
The difference of results (f1/precision/recall) between X sparse and
(X.todense() or X.toarray()) are about -0.5% to +1.0%.
The difference comes from the fact that different solvers are used for
sparse matrices and numpy arrays.
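As a sketch of the effect (the explicit solver names below are an
assumption based on later scikit-learn releases; the 2011 code chose the
code path automatically from the input type):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(60, 10)
y = rng.rand(60)

# Pin a direct and an iterative solver on the same data:
w_direct = Ridge(alpha=1.0, solver="cholesky").fit(X, y).coef_
w_iterative = Ridge(alpha=1.0, solver="sparse_cg").fit(X, y).coef_

# The two solutions agree only up to the iterative solver's tolerance,
# which is the kind of sub-percent metric difference reported here.
max_diff = np.max(np.abs(w_direct - w_iterative))
print(max_diff)
```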

Mathieu
SK Sn
2011-11-17 07:07:26 UTC
@Olivier, here is a quick reproduction of the error using 20 Newsgroups:
https://gist.github.com/1372557
Also, does this mean that, in practice, trees are used less often for text
classification problems?

@Mathieu, is this the case only for Ridge? kNN, NB and LinearSVC do not
show such behavior.
If different solvers are used for Ridge, which result should I report as
the Ridge result?

Thanks a lot for your kind help.
Post by Mathieu Blondel
Post by SK Sn
The difference of results (f1/precision/recall) between X sparse and
(X.todense() or X.toarray()) are about -0.5% to +1.0%.
The difference comes from the fact that different solvers are used for
sparse matrices and numpy arrays.
Mathieu
Mathieu Blondel
2011-11-17 08:26:11 UTC
Post by SK Sn
@Mathieu, is this the case only for Ridge? kNN, NB, linearSVC do not have
such a behavior.
If for Ridge, different solvers are used, which result should I refer to as
result from Ridge?
Since you're doing text classification, I would report the results for
the sparse one. (As Olivier pointed out, you can't hope to use a dense
representation with a larger dataset anyway.)

Mathieu
Olivier Grisel
2011-11-17 15:40:14 UTC
Post by SK Sn
@Olivier, the quick reproduction of the error using 20Newsgroups -
https://gist.github.com/1372557
Also, does it mean, actually, for text classification problems, trees are
used less often?
Probably yes, as simple linear models are often much faster to train
and more scalable, and most text classification problems are
approximately linearly separable (e.g. using non-linear models such as
Gaussian kernels results in potential over-fitting and much longer
training times).

It would be interesting to try the new Random Forest once it's merged,
though.
Post by SK Sn
@Mathieu, is this the case only for Ridge? kNN, NB, linearSVC do not have
such a behavior.
If for Ridge, different solvers are used, which result should I refer to as
result from Ridge?
OK, so if I understand correctly, the real issue is:

# with .toarray(), results: f1: 0.99634, precision: 0.99637
# only X (sparse), results: f1: 0.99524, precision: 0.99526
# All other classifiers (kNN, NB, etc.) have consistent results no
matter whether toarray() is used or not.

I wonder if this is not just rounding error. Still, an f1 score >
0.995 is excellent. I would not call that a bug :P
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
SK Sn
2011-11-17 16:20:27 UTC
The differences are normally about 0.1%-0.5%. The highest difference I
experienced is about 1%.
If different solvers are used, as Mathieu mentioned, this is quite
understandable.

What I was wondering is why it is only RidgeClassifier for which I
observed such abnormal behavior.
I would love to try out Random Forest once it is merged. ;)
Post by Olivier Grisel
Post by SK Sn
@Olivier, the quick reproduction of the error using 20Newsgroups -
https://gist.github.com/1372557
Also, does it mean, actually, for text classification problems, trees are
used less often?
Probably yes, as simple linear models are often much faster to train
and more scalable and most text classification problems are
approximately linearly separable (e.g. using non-linear models such as
gaussian kernels results in potential over-fitting and much longer
training times).
It would be interesting to try the new Random Forest once it's merged,
though.
Post by SK Sn
@Mathieu, is this the case only for Ridge? kNN, NB, linearSVC do not have
such a behavior.
If for Ridge, different solvers are used, which result should I refer to as
result from Ridge?
# with .toarray(), results: f1: 0.99634, precision: 0.99637
# only X (sparse), results: f1: 0.99524, precision: 0.99526
# All other classifiers (kNN, NB, etc.) have consistent results no
matter whether toarray() is used or not.
I wonder if this is not just rounding error. Still, an f1 score >
0.995 is excellent. I would not call that a bug :P
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel