2011-11-16 16:54:32 UTC
I have run into some odd behavior with RidgeClassifier in a text classification task.
*Test setup:* ~800 documents, ~2500 features, 15 classes, scikit-learn dev
version (checked out a few days ago), classification with KFold cross-validation.
When I test RidgeClassifier, I get different results (f1, precision, recall)
depending on the format of X: scipy.sparse versus numpy.ndarray (via
X.toarray()) or numpy.matrixlib.defmatrix.matrix (via X.todense()).
The difference in results (f1/precision/recall) between sparse X and the
dense versions is about -0.5% to +1.0%.
I tested the full-feature scenario, a feature-selection scenario, and a
subset-of-classes scenario; the difference shows up in all of them.
I also tested other classifiers that can operate on scipy.sparse input
(kNN, Naive Bayes, LinearSVC, SGDClassifier); none of them show this problem.
So I suspect this may be a bug in Ridge itself. Does anyone know which
result, the sparse one or the toarray/todense one, is the correct one that
I should report?
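For reference, the kind of comparison I am running looks roughly like the
sketch below (synthetic random data stands in for my corpus, and the fold
counts and scoring choices are placeholders, not my exact configuration):

```python
# Sketch: compare RidgeClassifier metrics on sparse vs. dense versions of
# the same feature matrix. Synthetic data; numbers are illustrative only.
import numpy as np
from scipy import sparse
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X_dense = rng.rand(800, 2500)
X_dense[X_dense < 0.95] = 0.0           # mostly zeros, like bag-of-words counts
X_sparse = sparse.csr_matrix(X_dense)   # same data, sparse format
y = rng.randint(15, size=800)           # 15 classes

results = {}
for name, X in [("sparse", X_sparse), ("dense", X_dense)]:
    scores = []
    for train, test in KFold(n_splits=5).split(X_dense):
        clf = RidgeClassifier().fit(X[train], y[train])
        scores.append(f1_score(y[test], clf.predict(X[test]), average="macro"))
    results[name] = float(np.mean(scores))

print(results)  # in my real runs the two entries differ by roughly -0.5% to +1.0%
```

(The API here is the current scikit-learn one; the 2011 cross-validation
module was organized differently, but the comparison is the same.)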
Another question, about the tree classifier: in the experimental setting
described above, I get f1 scores around 83%-90% with the different
classifiers mentioned above, after parameter tuning.
However, when I try the tree classifier, my results are always below 65%.
I tried tuning various parameters but never got a substantial improvement.
I looked into a few textbooks and papers, but still could not figure out
what I should do in practice to get results from a tree classifier that
are comparable to the other classifiers.
Could you please shed some light on using trees with high-dimensional data,
or refer me to a practical guide on tree classifiers? Any help would be
appreciated.