[Scikit-learn-general] Pipeline: string categorical data preprocessing

Discussion:

Алексей Драль

2016-03-19 18:17:54 UTC

Hi there,

I have a data set that contains string categorical variables (like
"category_A", "category_B"). I would like to generate dummy variables from
them, but I can't use OneHotEncoder because it expects a matrix of integers.
I can't use LabelEncoder either, because it does not let me specify which
columns to process. I wrote a simple class that applies DictVectorizer to
each column and stores the fitted vectorizers. This use case looks so
common that I would expect sklearn to provide something for it. Am I
missing a standard preprocessor that generates dummy variables from strings
for specified columns?

--
Yours sincerely,
Alexey A. Dral
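
The per-column approach described in the question might look roughly like the sketch below. This is not the poster's actual class: the name PerColumnDummies and the sample data are hypothetical, and the sketch assumes the input is a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer


class PerColumnDummies(BaseEstimator, TransformerMixin):
    """Hypothetical helper: one DictVectorizer per selected string column."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Fit and store a separate vectorizer for every requested column.
        self.vectorizers_ = {}
        for col in self.columns:
            vec = DictVectorizer(sparse=False)
            vec.fit([{col: value} for value in X[col]])
            self.vectorizers_[col] = vec
        return self

    def transform(self, X):
        # Encode each column with its stored vectorizer and stack the blocks.
        blocks = [
            self.vectorizers_[col].transform([{col: value} for value in X[col]])
            for col in self.columns
        ]
        return np.hstack(blocks)


df = pd.DataFrame({"kind": ["category_A", "category_B", "category_A"],
                   "size": ["small", "large", "small"]})
encoder = PerColumnDummies(columns=["kind", "size"]).fit(df)
encoded = encoder.transform(df)
```

Because the vectorizers are stored at fit time, the output columns stay fixed for later data. DictVectorizer silently ignores values it did not see during fit, so an unseen category becomes an all-zero block rather than a new column.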

Andreas Mueller

2016-03-25 15:57:51 UTC

This is very common but currently not that easy.
There is a fix here:
https://github.com/scikit-learn/scikit-learn/pull/6559

In the meantime, I think the easiest way is to use pandas' get_dummies
function.
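
As a rough illustration of the get_dummies suggestion, the sketch below encodes one string column of a small DataFrame. The column names and values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["category_A", "category_B", "category_A"],
    "value": [1.0, 2.5, 0.3],
})

# Encode only the listed string column; the numeric column is kept as it is.
dummies = pd.get_dummies(df, columns=["kind"])
print(list(dummies.columns))
```

The `columns` argument is what restricts the encoding to specific columns, which is the part the question asks about.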

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Andreas Mueller

2016-03-28 19:32:39 UTC

Hi.
In general, please stay on the mailing list.
We could make the check_array in FunctionTransformer optional via a
parameter.

Cheers,
Andy
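
For reference, FunctionTransformer exposes a `validate` flag that controls whether check_array runs on the input. The sketch below passes `validate=False` explicitly so that a DataFrame of strings reaches get_dummies untouched. The data and the choice of LogisticRegression are placeholders for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

X = pd.DataFrame({"kind": ["category_A", "category_B", "category_A", "category_B"]})
y = [0, 1, 0, 1]

# validate=False skips check_array, so string columns are not rejected
# before get_dummies sees them. dtype=float gives the classifier plain floats.
to_dummies = FunctionTransformer(pd.get_dummies, kw_args={"dtype": float},
                                 validate=False)
model = make_pipeline(to_dummies, LogisticRegression())
model.fit(X, y)
predictions = model.predict(X)
```

Passing the flag explicitly is harmless on releases where it is already the default.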

Post by Алексей Драль
Hi Andreas,
Nice, I didn't know about make_pipeline before, thank you. I have
exactly the situation that you pointed out: "categories are strings
that frequently show up only in the test split". I'll keep this
approach in mind for next time.
P.S. Testing revealed that FunctionTransformer calls check_array,
which can cause problems when the data are object-dtype strings.
P.P.S. At first I wondered whether it would be worth making a
pull request, but CategoricalEncoder should fix the problem.
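
A note for later readers: in scikit-learn 0.20 and newer, OneHotEncoder accepts string categories directly, and ColumnTransformer can restrict it to chosen columns. The sketch below assumes such a release is installed. The column names, the data, and LogisticRegression are placeholders chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "kind": ["category_A", "category_B", "category_A", "category_B"],
    "value": [0.1, 0.9, 0.2, 0.8],
})
y = [0, 1, 0, 1]

# One-hot encode only "kind"; pass the numeric column through unchanged.
# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocess = ColumnTransformer(
    [("dummies", OneHotEncoder(handle_unknown="ignore"), ["kind"])],
    remainder="passthrough",
    sparse_threshold=0,
)
model = make_pipeline(preprocess, LogisticRegression())
model.fit(train, y)

# "category_C" was never seen during fit; with handle_unknown="ignore"
# it is encoded as all zeros instead of raising an error.
test = pd.DataFrame({"kind": ["category_C", "category_A"], "value": [0.5, 0.1]})
encoded_test = model.named_steps["columntransformer"].transform(test)
test_predictions = model.predict(test)
```

The categories are learned once at fit time, so the encoded width is the same for training and test data.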

Post by Andreas Mueller
    make_pipeline(FunctionTransformer(lambda X: pd.get_dummies(X)),
                  SomeClassifier())
Giant caveat: that will only work if the categories are exactly the same
in every X that you pass. Otherwise weird stuff will happen.
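
The caveat above can be reproduced with a small sketch. The frames below are invented for the example, and the reindex step is only one possible workaround, not something proposed in the thread.

```python
import pandas as pd

train = pd.DataFrame({"kind": ["category_A", "category_B"]})
test = pd.DataFrame({"kind": ["category_A", "category_C"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# The two frames disagree on their columns, so a model fitted on one
# cannot safely consume the other.
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# Possible workaround: force the test frame onto the training columns.
# Unseen categories are dropped and missing ones are filled with 0.
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

This workaround needs the training columns to be kept around by hand, which is the bookkeeping a fitted transformer would otherwise do.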

Post by Алексей Драль
Hi Andreas,
Unfortunately, get_dummies cannot be used in pipelines. Thank you for
the link to the fix.

--
Yours sincerely,
Alexey A. Dral