[Scikit-learn-general] Subclassing vectorizers

Discussion:

Fred Mailhot

2016-03-23 03:45:16 UTC

Hello list,

Firstly, thanks for this incredible package; I use it daily at work. Now on
to the meat: I'm trying to subclass TfidfVectorizer and running into
issues. I want to specify an extra param for __init__() that points to a
file that gets used in build_analyzer(). Skipping irrelevant bits, I've got
the following:

#======================
class WordCooccurrenceVectorizer(TfidfVectorizer):

    ### override __init__ to add w2v_clusters arg
    # see http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
    # for explanation of syntax
    def __init__(self, *args, **kwargs):
        try:
            self.w2v_cluster_path = kwargs.pop("w2v_clusters")
        except KeyError:
            pass
        super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)

    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stopwords = self.get_stop_words()
        w2v_clusters = self.load_w2v_clusters()
        tokenize = self.build_tokenizer()
        return lambda doc: self._nwise(
            tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)
    [...]
#======================

I can instantiate this, but when I want to inspect it, I get the following
(this is in ipython, in a script it just hangs):

#======================
In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2),
    stop_words="english", max_df=0.5, min_df=1, max_features=10000,
    w2v_clusters="clusters.20160322_1803.w2v", binary=True)

In [3]: vec
Out[3]:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    697                 type_pprinters=self.type_printers,
    698                 deferred_pprinters=self.deferred_printers)
--> 699             printer.pretty(obj)
    700             printer.flush()
    701             return stream.getvalue()

[...]

/Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc in _get_param_names(cls)
    193                                    " %s with constructor %s doesn't "
    194                                    " follow this convention."
--> 195                                    % (cls, init_signature))
    196         # Extract and sort argument names excluding 'self'
    197         return sorted([p.name for p in parameters])

RuntimeError: scikit-learn estimators should always specify their
parameters in the signature of their __init__ (no varargs). <class
'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (<self>,
*args, **kwargs) doesn't follow this convention.

In [4]:
#======================

The error is clear enough -- I can't use *args and **kwargs in a sklearn
estimator's __init__() -- but I'm not sure what the correct way is to do
what I need to do. Do I literally need to specify all of the __init__
params in my subclass and then pass them on to the __init__ of super()? If
so, what's the reason for setting this up this way?

Thanks for any pointers/guidance,
Fred.

Joel Nothman

2016-03-23 04:01:44 UTC

Hi Fred,

We use the __init__ signature to get the list of parameters that (a) can be
set by grid search; (b) need to be copied to a cloned instance of the
estimator (with any fitted model discarded) in constructing ensembles,
cross validation, etc. While none of the scikit-learn library of estimators
do this, in practice you can overload get_params to define your own
parameter listing. See
http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params
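
To make the point about the __init__ signature concrete, here is a minimal sketch of signature-based parameter discovery using only the standard library. It is a simplified stand-in, not scikit-learn's own code: the helpers param_names and rebuild and the classes Explicit and Varargs are hypothetical names invented for this illustration. It shows why a constructor that takes only *args and **kwargs gives such a mechanism nothing to list, while an explicit signature lets an object be rebuilt from its parameters alone.

```python
import inspect


def param_names(cls):
    """List constructor parameter names; refuse *args/**kwargs (sketch only)."""
    sig = inspect.signature(cls.__init__)
    params = [p for name, p in sig.parameters.items() if name != "self"]
    for p in params:
        if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
            raise RuntimeError(
                "%s must name its parameters explicitly (no varargs)"
                % cls.__name__)
    return sorted(p.name for p in params)


def rebuild(obj):
    """Create a fresh copy of obj from its constructor parameters alone."""
    names = param_names(type(obj))
    return type(obj)(**{name: getattr(obj, name) for name in names})


class Explicit(object):
    # Every parameter is named and stored under the same attribute name.
    def __init__(self, max_df=1.0, w2v_clusters=None):
        self.max_df = max_df
        self.w2v_clusters = w2v_clusters


class Varargs(object):
    # The signature reveals no parameter names at all.
    def __init__(self, *args, **kwargs):
        self.w2v_clusters = kwargs.pop("w2v_clusters", None)


explicit_names = param_names(Explicit)
copy = rebuild(Explicit(max_df=0.5, w2v_clusters="clusters.w2v"))

try:
    param_names(Varargs)
    varargs_error = None
except RuntimeError as exc:
    varargs_error = str(exc)
```

The rebuild helper only works because each constructor argument is stored unchanged under an attribute of the same name, which is the same assumption that grid search and cloning rely on in the explanation above.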

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Joel Nothman

2016-03-23 04:04:43 UTC

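Something like the following may suffice:

def get_params(self, deep=True):
    # assumes __init__ stores the argument as self.w2v_clusters
    # (the class in the original post stores it as self.w2v_cluster_path)
    out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
    out['w2v_clusters'] = self.w2v_clusters
    return out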
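
An alternative that follows the convention named in the error message is to spell out the parameters in the subclass signature and forward them by keyword. The sketch below assumes a reasonably recent scikit-learn and exposes only the parameters used in the original post; that subset is a choice made for this illustration, and any TfidfVectorizer parameter not listed simply keeps its default and is not tunable on the subclass. The extra argument is stored as self.w2v_clusters so that the attribute name matches the parameter name.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer


class WordCooccurrenceVectorizer(TfidfVectorizer):
    """TfidfVectorizer subclass with one extra, explicitly named parameter."""

    def __init__(self, ngram_range=(1, 1), stop_words=None, max_df=1.0,
                 min_df=1, max_features=None, binary=False,
                 w2v_clusters=None):
        # Forward the inherited parameters by keyword, with no *args/**kwargs.
        super(WordCooccurrenceVectorizer, self).__init__(
            ngram_range=ngram_range, stop_words=stop_words, max_df=max_df,
            min_df=min_df, max_features=max_features, binary=binary)
        # Store the new parameter unchanged, under the argument's own name.
        self.w2v_clusters = w2v_clusters


vec = WordCooccurrenceVectorizer(ngram_range=(2, 2), stop_words="english",
                                 max_df=0.5, min_df=1, max_features=10000,
                                 w2v_clusters="clusters.20160322_1803.w2v",
                                 binary=True)

params = vec.get_params()   # includes 'w2v_clusters' without any override
copy = clone(vec)           # unfitted copy built from those parameters
```

With the parameters named in the signature, the inherited get_params already reports w2v_clusters, so no override is needed for this case.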


Joel Nothman

2016-03-23 04:34:53 UTC

And I lied that none of the scikit-learn estimators define their own
get_params. Of course the following do: VotingClassifier, Kernel (and
subclasses), Pipeline and FeatureUnion
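
As a small illustration of what a customised get_params looks like from the outside, the sketch below inspects a Pipeline. The step name "tfidf" is an arbitrary label chosen for this example. With deep=True the listing contains the pipeline's own parameters, the step itself, and the step's parameters under a `<step>__<param>` key, which is the form that grid search addresses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipe = Pipeline([("tfidf", TfidfVectorizer(max_df=0.5))])

# The listing includes 'steps', the 'tfidf' step object, and nested
# entries such as 'tfidf__max_df'.
params = pipe.get_params(deep=True)

# Nested parameters can be set through the same double-underscore keys.
pipe.set_params(tfidf__max_df=0.8)
new_max_df = pipe.named_steps["tfidf"].max_df
```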


Sebastian Raschka

2016-03-23 04:58:03 UTC

Hah, and I just wanted to write regarding the VotingClassifier. I remember my struggle quite well when I tried to make it pipeline and GridSearch compatible until I figured that one out :P


Fred Mailhot

2016-03-23 15:24:35 UTC

Thanks very much everyone; seems to be working now!
