Discussion:
[Scikit-learn-general] Pickling custom Transformers in a Pipeline
Fred Mailhot
2016-04-05 20:14:29 UTC
Hi all,

I've got a pipeline with some custom transformers that's not pickling, and
I'm not sure why. I've run into this previously when using custom preprocessors
& tokenizers with CountVectorizer, and I dealt with it then by defining the
custom bits at the module level.
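
For context, the reason that workaround helps: pickle serializes a function
by its importable module-level name, so a top-level def round-trips but an
inline lambda doesn't. A minimal sketch (my_tokenizer is just a made-up
example):

==============================
import pickle
from sklearn.feature_extraction.text import CountVectorizer

def my_tokenizer(text):
    # module-level: pickle records this by its qualified name
    return text.split()

pickle.dumps(CountVectorizer(tokenizer=my_tokenizer))  # fine

pickle.dumps(CountVectorizer(tokenizer=lambda t: t.split()))
# PicklingError: a lambda has no importable name for pickle to look up
==============================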

I assumed I could avoid that by creating custom transformers that directly
subclass TransformerMixin and importing them into the module where the
pipeline is defined.

The transformer is implemented like this:

==============================
[...imports...]
from text_preprocess import TextPreprocess

class CustomTransformer(TransformerMixin):

    def __init__(self, param_file="params.txt"):
        # store the constructor arg under the same name, so that
        # get_params/set_params (and cloning) work
        self.param_file = param_file
        self.custom = TextPreprocess(self.param_file)

    def transform(self, X, *_):
        # accept a single string as well as an iterable of strings
        if isinstance(X, basestring):
            X = [X]
        # append any "rewrite" strings found by TextPreprocess to each doc
        return ["%s %s" % (x, " ".join([item["rewrite"]
                                        for item in self.custom.match(x)["info"]
                                        if "rewrite" in item]))
                for x in X]

    def fit(self, *_):
        # stateless transformer: nothing to learn
        return self
==============================

The full pipeline looks like this:

==============================
cm = CustomTransformer()

vec = FeatureUnion([("char_ng",
                     CountVectorizer(analyzer="char_wb", tokenizer=string.split,
                                     ngram_range=(3, 5), max_features=None,
                                     min_df=1, max_df=0.5, stop_words=None,
                                     binary=False)),
                    ("word_ng",
                     CountVectorizer(analyzer="word", ngram_range=(2, 3),
                                     max_features=5000, min_df=1, max_df=0.5,
                                     stop_words="english", binary=False))])

pipeline = Pipeline([("custom", cm), ("vec", vec),
                     ("lr", LogisticRegressionCV(scoring="f1_macro"))])
==============================

And I get the following error when I fit & dump:

==============================
In [62]: pipeline.fit(docs, [0, 0, 0, 1])
Out[62]:
Pipeline(steps=[('custom', <cm_transformer.CustomTransformer object at
0x113dd2310>), ('vec', FeatureUnion(n_jobs=1, transformer_list=[('char_ng',
CountVectorizer(analyzer='char_wb', binary=False, decode_error=u'strict',
        ...None,
        refit=True, scoring='f1_macro', solver='lbfgs', tol=0.0001,
        verbose=0))])

In [63]: pickle.dump(pipeline, open("test_pl_dump.pkl", "wb"),
                     pickle.HIGHEST_PROTOCOL)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-63-99a63544716d> in <module>()
----> 1 pickle.dump(pipeline, open("test_pl_dump.pkl", "wb"), pickle.HIGHEST_PROTOCOL)

PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
==============================

Any pointers would be appreciated. There are hints here and there on SO,
but most point to the solution I referred to above...

Thanks!
Fred.
Andreas Mueller
2016-04-05 20:25:01 UTC
What's the type of self.custom?

Also, you can step into the debugger to see which function it is that
cannot be pickled.
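
Something along these lines (just a quick sketch, not part of scikit-learn)
will recursively try to pickle sub-objects and print the path of whatever
refuses to serialize:

==============================
import pickle

def find_unpicklable(obj, path="obj"):
    # try the object itself first
    try:
        pickle.dumps(obj)
        return  # this branch pickles fine
    except Exception as exc:
        print("%s fails: %s" % (path, exc))
    # then recurse into containers and instance attributes
    # (no cycle detection -- good enough for a quick hunt)
    if isinstance(obj, (list, tuple)):
        for i, item in enumerate(obj):
            find_unpicklable(item, "%s[%d]" % (path, i))
    elif isinstance(obj, dict):
        for key, value in obj.items():
            find_unpicklable(value, "%s[%r]" % (path, key))
    else:
        for name, value in getattr(obj, "__dict__", {}).items():
            find_unpicklable(value, "%s.%s" % (path, name))

find_unpicklable(pipeline, "pipeline")
==============================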
Post by Fred Mailhot
Hi all,
I've got a pipeline with some custom transformers that's not pickling,
and I'm not sure why. [...]
Fred Mailhot
2016-04-05 21:21:20 UTC
Thanks Andreas; I found the lambda buried deep in a class imported by my
custom transformer. As it turns out, the *dill* package appears to be able
to pickle lambdas without a hiccup, so I'm going with that for model
persistence.
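
For the record, dill is a third-party package (pip install dill) and is
nearly a drop-in replacement for pickle:

==============================
import dill  # handles lambdas/closures that the stdlib pickle chokes on

with open("test_pl_dump.pkl", "wb") as f:
    dill.dump(pipeline, f)

with open("test_pl_dump.pkl", "rb") as f:
    pipeline = dill.load(f)
==============================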

Thanks again,
FM.
Post by Andreas Mueller
What's the type of self.custom?
Also, you can step into the debugger to see which function it is that
cannot be pickled. [...]
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general