Fred Mailhot
2016-04-05 20:14:29 UTC
Hi all,
I've got a pipeline with some custom transformers that's not pickling, and
I'm not sure why. I've had this previously when using custom preprocessors
& tokenizers with CountVectorizers. I dealt with it then by defining the
custom bits at the module level.
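To be concrete, here's a minimal sketch (with made-up tokenizer names) of what I mean by "defining the custom bits at the module level" — as I understand it, pickle serializes functions by qualified name and re-imports them on load, so only functions reachable as a module attribute survive a dump:

```python
import pickle

# Defined at module level: pickle stores this function by its
# qualified name and looks it up again when loading, so it works.
def simple_tokenizer(text):
    return text.split()

def make_tokenizer():
    # Defined inside a function: there is no importable qualified
    # name for this object, so pickling it fails.
    def nested_tokenizer(text):
        return text.split()
    return nested_tokenizer

pickle.dumps(simple_tokenizer)      # succeeds

try:
    pickle.dumps(make_tokenizer())  # raises
except Exception as exc:
    print("can't pickle nested function:", exc)
```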
I assumed I could avoid that by creating custom transformers that directly
subclass TransformerMixin and importing them to the module where the
pipeline is defined.
The transformer is implemented like this:
==============================
[...imports...]
from text_preprocess import TextPreprocess

class CustomTransformer(TransformerMixin):
    def __init__(self, param_file="params.txt"):
        self.param_file = param_file
        self.custom = TextPreprocess(self.param_file)

    def transform(self, X, *_):
        # Accept a single string or an iterable of strings.
        if isinstance(X, basestring):
            X = [X]
        return ["%s %s" % (x, " ".join([item["rewrite"]
                                        for item in self.custom.match(x)["info"]
                                        if "rewrite" in item]))
                for x in X]

    def fit(self, *_):
        return self
==============================
The full pipeline looks like this:
==============================
cm = CustomTransformer()
vec = FeatureUnion([("char_ng",
                     CountVectorizer(analyzer="char_wb", tokenizer=string.split,
                                     ngram_range=(3, 5), max_features=None,
                                     min_df=1, max_df=0.5, stop_words=None,
                                     binary=False)),
                    ("word_ng",
                     CountVectorizer(analyzer="word", ngram_range=(2, 3),
                                     max_features=5000, min_df=1, max_df=0.5,
                                     stop_words="english", binary=False))])
pipeline = Pipeline([("custom", cm), ("vec", vec),
                     ("lr", LogisticRegressionCV(scoring="f1_macro"))])
==============================
And I get the following error when I fit & dump:
==============================
In [62]: pipeline.fit(docs, [0, 0, 0, 1])
Out[62]:
Pipeline(steps=[('custom', <cm_transformer.CustomTransformer object at 0x113dd2310>),
        ('vec', FeatureUnion(n_jobs=1, transformer_list=[('char_ng',
         CountVectorizer(analyzer='char_wb', binary=False, decode_error=u'strict',
        ...None,
        refit=True, scoring='f1_macro', solver='lbfgs', tol=0.0001,
        verbose=0))])

In [63]: pickle.dump(pipeline, open("test_pl_dump.pkl", "wb"),
                     pickle.HIGHEST_PROTOCOL)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-63-99a63544716d> in <module>()
----> 1 pickle.dump(pipeline, open("test_pl_dump.pkl", "wb"),
                    pickle.HIGHEST_PROTOCOL)

PicklingError: Can't pickle <type 'function'>: attribute lookup
__builtin__.function failed
==============================
Any pointers would be appreciated. There are hints here and there on SO,
but most point to the solution I referred to above...
Thanks!
Fred.