Thomas Unterthiner
2013-10-03 13:56:51 UTC
Hi there!
The http://scikit-learn.org homepage recommends posting on this mailing
list before making major contributions, so here it goes:
sklearn.preprocessing currently offers both a scale() function and a
StandardScaler transformer, as well as a MinMaxScaler.
I'd like to add a `RobustScaler`, which works just like the
StandardScaler, but uses the median for centering and the interquartile
range for scaling, which are more robust statistics with regard to
outliers. In my personal work I often work with noisy data where such a
robust normalization typically gives better results. (Also, it can be
shown that e.g. the sample median is a better estimate of the
population mean than the sample mean if the data is Laplace
distributed, so there's that, too.) But I don't know if there's enough
general interest in this for me to add it.
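
A minimal NumPy sketch of the median/IQR scaling described above might
look like this. The function name `robust_scale_sketch` and the guard
for a zero interquartile range are illustrative assumptions, not
existing sklearn API, and only dense input is handled.

```python
import numpy as np


def robust_scale_sketch(X):
    """Center each column by its median and scale it by its IQR.

    Hypothetical helper for illustration only; dense 2-D input assumed.
    """
    X = np.asarray(X, dtype=float)
    center = np.median(X, axis=0)
    # 25th and 75th percentiles per column; their difference is the IQR.
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    scale = q75 - q25
    # Assumption: leave constant columns unscaled instead of dividing by 0.
    scale[scale == 0.0] = 1.0
    return (X - center) / scale, center, scale


# One feature with a single large outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
X_scaled, center, scale = robust_scale_sketch(X)
```

Here the median (3) and the IQR (2) are unaffected by the outlier at
1000, whereas the mean and standard deviation of the same column would
be dominated by it.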
I have a version of this already working in my private code, but while I
was browsing the sklearn source I noticed that StandardScaler and
MinMaxScaler (and the scale() function) contain a lot of
code duplication. Thus I'd like to introduce a common base class so that
Standard/MinMax/RobustScaler all share common code to deal with sparse
matrices and other parameters (with_mean, copy, ...). The classes
themselves would then only differ in how they estimate the
centering/scaling statistics. This would also get rid of the fact that
e.g. MinMaxScaler destroys sparsity when given sparse input, while
StandardScaler takes extra care not to.
However, the cleanest way to do all this would be to rename some of the
attributes and parameters, which are currently quite inconsistently
named. E.g. MinMaxScaler has an attribute `scale_`, while StandardScaler
uses `std_` to store its scaling statistics. I'd thus propose to
introduce a `BaseScaler` with options `with_centering` and
`with_scaling` and with attributes `center_` and `scale_`, and derive
the other scalers from this. It might also make sense to have another
option `axis` which allows choosing along which axis to scale/normalize
(similar to what the `scale()` function does). Of course, the old
attribute names would have to be deprecated and then removed a few
releases later.
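
One way the proposed class hierarchy could look is sketched below. This
is only an illustration under stated assumptions: the `*Sketch` class
names and the `_estimate_statistics` hook are hypothetical, the code
handles dense input only, and it leaves out the sparse-matrix handling,
the `axis` option, and MinMaxScaler's `feature_range`.

```python
import numpy as np


class BaseScaler:
    """Shared fit/transform logic; subclasses only estimate statistics."""

    def __init__(self, with_centering=True, with_scaling=True, copy=True):
        self.with_centering = with_centering
        self.with_scaling = with_scaling
        self.copy = copy

    def _estimate_statistics(self, X):
        # Hypothetical hook: return (center, scale), one value per column.
        raise NotImplementedError

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        center, scale = self._estimate_statistics(X)
        self.center_ = center if self.with_centering else None
        if self.with_scaling:
            scale = np.array(scale, dtype=float)
            scale[scale == 0.0] = 1.0  # assumption: skip constant columns
            self.scale_ = scale
        else:
            self.scale_ = None
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.copy:
            X = X.copy()
        if self.center_ is not None:
            X -= self.center_
        if self.scale_ is not None:
            X /= self.scale_
        return X


class StandardScalerSketch(BaseScaler):
    def _estimate_statistics(self, X):
        return X.mean(axis=0), X.std(axis=0)


class MinMaxScalerSketch(BaseScaler):
    def _estimate_statistics(self, X):
        # Maps each column to [0, 1]; no feature_range in this sketch.
        return X.min(axis=0), X.max(axis=0) - X.min(axis=0)


class RobustScalerSketch(BaseScaler):
    def _estimate_statistics(self, X):
        q25, q75 = np.percentile(X, [25, 75], axis=0)
        return np.median(X, axis=0), q75 - q25
```

With this layout each scaler is a few lines long, and `center_` and
`scale_` mean the same thing in every subclass.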
Additionally, I'd like to add a `robust_scale` function, analogous to the
`scale` function. Both of these should internally use the
Robust/StandardScaler classes, as right now there is a lot of duplicated
code between StandardScaler and scale for no good reason.
So, is there any interest in these modifications/enhancements?
Cheers
Thomas