Thanks for the reply. I assume I need to add missing feature values as 0.0 (and also convert True to 1.0 and False to 0.0) in the scikits.learn feature representation; NLTK does not require this. I will probably write my own function and post it to the mailing list for feedback on efficiency and correctness.
I am aware of Jacob's work, but as of now it doesn't allow adding custom feature sets. I cloned the nltk-trainer git repository and tried to understand the code (sci.py under the classification folder), but I got lost at a function call I couldn't locate in the source.
I am also aware that LinearSVC is equivalent to logistic regression (called Maxent in NLTK) under the conditions you mention. However, I am not sure whether NLTK's Maxent implementation includes an L2 regularizer. I believe scikits.learn does, so I will try it out.
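As a starting point for the conversion function mentioned above, here is a minimal sketch of one possible approach (the function name and the returned feature-index mapping are my own invention, not part of NLTK or scikits.learn). It fills missing features with 0.0 and coerces booleans to 1.0/0.0, and keeps samples and labels in their original order:

```python
import numpy as np

def featuresets_to_arrays(featuresets):
    """Convert NLTK-style featuresets, i.e. a list of
    (feature_dict, label) 2-tuples, into a dense numpy feature
    matrix X, a label array y, and a feature-name -> column-index
    mapping that can be reused on held-out data.

    Missing features become 0.0; True/False become 1.0/0.0.
    """
    # Build a stable feature-name -> column-index mapping.
    names = sorted({name for feats, _ in featuresets for name in feats})
    index = {name: j for j, name in enumerate(names)}

    X = np.zeros((len(featuresets), len(names)), dtype=np.float64)
    y = []
    for i, (feats, label) in enumerate(featuresets):
        for name, value in feats.items():
            X[i, index[name]] = float(value)  # bools become 1.0 / 0.0
        y.append(label)
    return X, np.array(y), index

# Tiny usage example with hypothetical feature names:
data = [({"has_vowel": True, "count(a)": 2}, "pos"),
        ({"count(b)": 1}, "neg")]
X, y, index = featuresets_to_arrays(data)
```

For a very large number of features (word/character n-grams can explode the vocabulary), a sparse matrix would be preferable to the dense array used here.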
Post by Olivier Grisel
Post by Denzil Correa
I would like to convert an NLTK feature set (a list of data points,
each a 2-tuple whose first value is the feature set and whose second
value is the class label) to scikits.learn numpy array feature sets.
My NLTK feature sets consist of a combination of multiple feature
sets including word unigrams, word bigrams, word trigrams, character
unigrams, character bigrams, character trigrams, frequency of
punctuation, frequency of function words, frequency of letters,
frequency of special characters, and 80-100 more such features.
There are multiple issues, including index-feature mapping and order
preservation, since target labels need to be stored in a separate array.
I don't see the issue: just don't re-shuffle the samples and the labels.
Post by Denzil Correa
Is there a quick & efficient way to convert to the feature set
representation in scikits.learn? I moved over to scikits.learn to
test the accuracy of SVMs on my text classification task. It would
also be really helpful to the community to be able to move quickly
between these two frameworks/libraries.
Jacob Perkins started some work on using scikit-learn as a classifier for NLTK.
Note this should work with the latest stable release of scikit-learn
(0.7.1). In the current state of the scikit-learn master branch the
feature_extraction.text package has changed a bit, and this code
would need some adaptation.
As for the use of SVMs, you should use the sparse LinearSVC (and not
kernel SVC, which does not scale to problems with many samples and
many features, as in text classification, and would probably over-fit
anyway). Don't expect a miracle though: training linear models with
the SVM objective (hinge loss + L2 regularizer) or the logistic
regression objective (log loss + L2 regularizer) generally gives
comparable results for text classification.
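To illustrate the point about the two linear objectives, here is a small hedged sketch comparing them on toy text data. It uses the modern scikit-learn API (sklearn.svm.LinearSVC, sklearn.linear_model.LogisticRegression, TfidfVectorizer), which differs from the scikits.learn 0.7.1 release discussed in this thread; the documents and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy, trivially separable corpus (illustration only).
docs = ["good great fine", "bad awful poor", "great good",
        "poor bad", "fine great good", "awful poor bad"]
labels = [1, 0, 1, 0, 1, 0]

# Sparse feature matrix, as recommended for text classification.
X = TfidfVectorizer().fit_transform(docs)

# Hinge loss + L2 (LinearSVC) vs. log loss + L2 (LogisticRegression):
# both are L2-regularized linear models and tend to behave similarly
# on text data.
for clf in (LinearSVC(C=1.0), LogisticRegression(C=1.0)):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.score(X, labels))
```

On real data the two models would be compared on held-out accuracy, not training accuracy as in this toy sketch.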
http://twitter.com/ogrisel - http://github.com/ogrisel
Scikit-learn-general mailing list