[Scikit-learn-general] sklearn.preprocessing: robust scaling and general refactoring of scaling functionality

Discussion:

Thomas Unterthiner

2013-10-03 13:56:51 UTC

Hi there!

The http://scikit-learn.org homepage recommends posting on this mailing
list before making major contributions, so here it goes:

sklearn.preprocessing currently offers both a scale() function and a
StandardScaler transformer, as well as a MinMaxScaler.

I'd like to add a `RobustScaler`, which works just like the
StandardScaler, but uses the median for centering and the interquartile
range for scaling, which are more robust statistics with regard to
outliers. In my own work I often deal with noisy data where such a
robust normalization typically gives better results. (Also, it can be
shown that e.g. the sample median is a better estimate for the
population mean than the sample mean if the data is Laplacian
distributed, so there's that, too). But I don't know if there's enough
general interest in this for me to add it.
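
To make the idea concrete, here is a minimal NumPy sketch of
median/IQR scaling next to mean/std scaling. The helper name
`robust_scale_columns` and the handling of zero-IQR columns are
assumptions made for illustration, not existing scikit-learn API, and
the sketch only covers dense 2-D input.

```python
import numpy as np


def robust_scale_columns(X):
    """Center each column by its median and divide by its IQR.

    Hypothetical helper for illustration; expects a dense 2-D array.
    """
    X = np.asarray(X, dtype=float)
    center = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    iqr = q75 - q25
    # Assumption: constant columns (IQR == 0) are left unscaled.
    iqr[iqr == 0.0] = 1.0
    return (X - center) / iqr, center, iqr


# One feature column with a single extreme outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

X_robust, center, iqr = robust_scale_columns(X)

# Mean/std scaling of the same data, for comparison.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)
```

On this toy column the outlier drags the mean and inflates the standard
deviation, so the four inliers end up squeezed into a range narrower
than 0.01 after mean/std scaling, while median/IQR scaling keeps them
spread over a range of 1.5.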

I have a version of this already working in my private code, but while I
was browsing the sklearn source I noticed that StandardScaler and
MinMaxScaler (and the scale() function) contain a lot of
code-duplication. Thus I'd like to introduce a common base class so that
Standard/MinMax/RobustScaler all share common code to deal with sparse
matrices and other parameters (with_mean, copy, ...). The classes
themselves would then only differ in how they estimate the
centering/scaling statistics. This would also get rid of the fact that
e.g. MinMaxScaler destroys sparsity when given sparse input, while
StandardScaler takes extra care not to.

However, the cleanest way to do all this would be to rename some of the
attributes and parameters, which are currently quite inconsistently
named. E.g. MinMaxScaler has an attribute `scale_`, while StandardScaler
uses `std_` to store its scaling statistics. I'd thus propose to
introduce a `BaseScaler` with options `with_centering` and
`with_scaling` and with attributes `center_` and `scale_`, and derive
the other scalers from this. It might also make sense to have another
option `axis` which allows choosing the axis along which to
scale/normalize (similar to what the scale() function does). Of course,
the old attribute names would have to be deprecated and removed a few
releases later.
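
A rough sketch of how such a base class could look, assuming dense
input only. The option and attribute names (`with_centering`,
`with_scaling`, `center_`, `scale_`) are the ones proposed above; the
hook `_estimate_statistics` and the `*Sketch` class names are
hypothetical and chosen so they cannot be mistaken for the existing
scikit-learn classes. Sparse handling and the `axis` option are left
out.

```python
import numpy as np


class BaseScaler:
    """Shared fit/transform logic; subclasses only estimate statistics."""

    def __init__(self, with_centering=True, with_scaling=True):
        self.with_centering = with_centering
        self.with_scaling = with_scaling

    def _estimate_statistics(self, X):
        # Hypothetical hook: return (center, scale), one value per column.
        raise NotImplementedError

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        center, scale = self._estimate_statistics(X)
        self.center_ = center if self.with_centering else None
        if self.with_scaling:
            scale = np.array(scale, dtype=float)
            # Assumption: constant columns are left unscaled.
            scale[scale == 0.0] = 1.0
            self.scale_ = scale
        else:
            self.scale_ = None
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # always work on a copy
        if self.with_centering:
            X -= self.center_
        if self.with_scaling:
            X /= self.scale_
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


class StandardScalerSketch(BaseScaler):
    def _estimate_statistics(self, X):
        return X.mean(axis=0), X.std(axis=0)


class MinMaxScalerSketch(BaseScaler):
    def _estimate_statistics(self, X):
        # Maps each column to the range [0, 1].
        lo = X.min(axis=0)
        return lo, X.max(axis=0) - lo


class RobustScalerSketch(BaseScaler):
    def _estimate_statistics(self, X):
        q25, q75 = np.percentile(X, [25, 75], axis=0)
        return np.median(X, axis=0), q75 - q25
```

With this layout each concrete scaler is a few lines long, and anything
added to the base class later (sparse support, `axis`, `copy`) would be
picked up by all three at once.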

Additionally, I'd like to add a `robust_scale` function, analogous to
the `scale` function. Both of these should internally use the
Robust/StandardScaler classes, as right now there is a lot of duplicated
code between StandardScaler and scale for no good reason.
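
As an illustration of the delegation idea, the function can be a thin
wrapper that hands the work to the transformer class, with `axis`
handled by transposing. The signature of `robust_scale` shown here and
the private `_MedianIQRScaler` class are assumptions for this sketch
only; dense 2-D input is assumed.

```python
import numpy as np


class _MedianIQRScaler:
    """Minimal stand-in for the proposed transformer (hypothetical)."""

    def fit(self, X):
        q25, q75 = np.percentile(X, [25, 75], axis=0)
        self.center_ = np.median(X, axis=0)
        self.scale_ = q75 - q25
        # Assumption: constant columns are left unscaled.
        self.scale_[self.scale_ == 0.0] = 1.0
        return self

    def transform(self, X):
        return (X - self.center_) / self.scale_


def robust_scale(X, axis=0):
    """Function form that reuses the class instead of duplicating it."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")
    X = np.asarray(X, dtype=float)
    # Scaling along axis=1 is column-wise scaling of the transpose.
    Xt = X if axis == 0 else X.T
    out = _MedianIQRScaler().fit(Xt).transform(Xt)
    return out if axis == 0 else out.T
```

All statistics live in one place, so a fix to the class is
automatically a fix to the function.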

So, would there be any interest in these modifications/enhancements?

Cheers

Thomas

Olivier Grisel

2013-10-03 14:06:32 UTC

Sounds good. Please also add a minmax_scale function while you are at
it. I often miss that one too when doing interactive data exploration
in IPython.

To handle the parameter renamings please follow the standard
deprecation scheme for the public API ("git grep '@deprecated'" to
find examples in the current code base).

Stuff deprecated in the 0.15 release should be marked for removal in
the 0.17 release. Here are more details:

http://scikit-learn.org/stable/developers/index.html#deprecation
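
For the `std_` to `scale_` rename discussed in this thread, the
deprecation could look roughly like the sketch below. It uses only the
standard-library `warnings` module rather than the project's own
`@deprecated` helper, and the class name is hypothetical; the version
numbers in the message follow the 0.15/0.17 schedule mentioned above.

```python
import warnings


class ScalerWithRenamedAttribute:
    """Hypothetical scaler keeping the old attribute name as an alias."""

    def __init__(self):
        self.scale_ = None  # new, consistent attribute name

    @property
    def std_(self):
        # Old name: still readable, but warns on every access.
        warnings.warn(
            "Attribute std_ is deprecated in 0.15 and will be removed "
            "in 0.17; use scale_ instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.scale_
```

Existing code that reads `std_` keeps working during the deprecation
window and gets a pointer to the new name.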

--
Olivier

Juan Nunez-Iglesias

2013-10-06 07:21:43 UTC

@Olivier, you just blew my mind, as I did not know about git grep! =D

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general