Fred Mailhot
2016-03-23 03:45:16 UTC
Hello list,
Firstly, thanks for this incredible package; I use it daily at work. Now on
to the meat: I'm trying to subclass TfidfVectorizer and running into
issues. I want to specify an extra param for __init__() that points to a
file that gets used in build_analyzer(). Skipping irrelevant bits, I've got
the following:
#======================
from sklearn.feature_extraction.text import TfidfVectorizer

class WordCooccurrenceVectorizer(TfidfVectorizer):

    ### override __init__ to add w2v_clusters arg
    # see http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
    # for explanation of syntax
    def __init__(self, *args, **kwargs):
        try:
            self.w2v_cluster_path = kwargs.pop("w2v_clusters")
        except KeyError:
            pass
        super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)

    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stopwords = self.get_stop_words()
        w2v_clusters = self.load_w2v_clusters()
        tokenize = self.build_tokenizer()
        return lambda doc: self._nwise(
            tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)

    [...]
#======================
I can instantiate this, but when I try to inspect it, I get the following
(this is in IPython; in a script it just hangs):
#======================
In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2),
   ...:     stop_words="english", max_df=0.5, min_df=1, max_features=10000,
   ...:     w2v_clusters="clusters.20160322_1803.w2v", binary=True)
In [3]: vec
Out[3]:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    697                 type_pprinters=self.type_printers,
    698                 deferred_pprinters=self.deferred_printers)
--> 699             printer.pretty(obj)
    700             printer.flush()
    701             return stream.getvalue()

[...]

/Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc in _get_param_names(cls)
    193                                " %s with constructor %s doesn't "
    194                                " follow this convention."
--> 195                                % (cls, init_signature))
    196         # Extract and sort argument names excluding 'self'
    197         return sorted([p.name for p in parameters])

RuntimeError: scikit-learn estimators should always specify their
parameters in the signature of their __init__ (no varargs). <class
'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (<self>,
*args, **kwargs) doesn't follow this convention.
In [4]:
#======================
The error is clear enough -- I can't use *args and **kwargs in a sklearn
estimator's __init__() -- but I'm not sure what the correct way is to do
what I need to do. Do I literally need to specify all of the __init__
params in my subclass and then pass them on to the __init__ of super()? If
so, what's the reason for setting this up this way?
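
To make the question concrete, here is a rough, standard-library-only sketch of the kind of check the traceback points at. It is not scikit-learn's code: get_param_names, get_params, Base, VarargsChild and ExplicitChild are hypothetical names, and it assumes Python 3's inspect.signature. The idea it illustrates is that parameter names are read off the __init__ signature, which cannot work when the signature is just *args and **kwargs.

```python
import inspect


def get_param_names(cls):
    """Illustrative stand-in for a signature-based parameter lookup."""
    init_signature = inspect.signature(cls.__init__)
    # Ignore 'self' and **kwargs; only named parameters can be listed.
    parameters = [p for p in init_signature.parameters.values()
                  if p.name != "self" and p.kind != p.VAR_KEYWORD]
    for p in parameters:
        if p.kind == p.VAR_POSITIONAL:
            # *args carries no names, so nothing can be introspected.
            raise RuntimeError("%s with constructor %s uses varargs"
                               % (cls, init_signature))
    return sorted(p.name for p in parameters)


def get_params(obj):
    """Rebuild the constructor arguments from same-named attributes."""
    return {name: getattr(obj, name) for name in get_param_names(type(obj))}


class Base(object):
    # Hypothetical parent with an explicit signature.
    def __init__(self, max_df=1.0, min_df=1):
        self.max_df = max_df
        self.min_df = min_df


class VarargsChild(Base):
    # Same pattern as in my subclass: pop one key, forward the rest.
    def __init__(self, *args, **kwargs):
        self.w2v_clusters = kwargs.pop("w2v_clusters", None)
        super(VarargsChild, self).__init__(*args, **kwargs)


class ExplicitChild(Base):
    # Every parameter is spelled out and stored under its own name.
    def __init__(self, w2v_clusters=None, max_df=1.0, min_df=1):
        super(ExplicitChild, self).__init__(max_df=max_df, min_df=min_df)
        self.w2v_clusters = w2v_clusters
```

In this sketch, get_param_names(VarargsChild) raises RuntimeError, while get_params(ExplicitChild(w2v_clusters="clusters.w2v")) returns a dict of all three parameters, at the cost of repeating the parent's argument list in the subclass.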
Thanks for any pointers/guidance,
Fred.