Discussion:
SVM - Scaling data or not?
Gianni Iannelli
2013-05-25 15:56:46 UTC
Dear All,
I'm using the scikit-learn SVM tool (the one derived from LibSVM, I think). I have a training set and a dataset; the training set is taken from within the dataset and is around 10% of it.
Before training my SVM, it is suggested to scale the data so that it has zero mean and unit variance.
There are two options:
- scale the training set, train the SVM, scale the whole dataset, classify the dataset;
- scale the whole dataset, take the training set from it, train the SVM, classify the dataset.
The second seems more logical to me than the first, but it turns out that I get much better results using the first option than the second one!
Is this normal? It's probably a dumb question, but I don't have much experience with this!
To scale the data I use sklearn.preprocessing.scale(MyData).
Any suggestions or tests that I could run are really, really welcome!
Thanks,
Solimyr
Gael Varoquaux
2013-05-25 15:59:07 UTC
On Sat, May 25, 2013 at 05:56:46PM +0200, Gianni Iannelli wrote:
> There are two options:
> - scale the training set, train the SVM, scale the whole dataset, classify the
> dataset;
> - scale the whole dataset, take from it the training set, train the SVM,
> classify the dataset.

> The second seems to me more logic than the first but happens that I get
> extremely better result using the first option than the second one!!!

Think about it: if you were in a real predictive situation, you would not have the test set, so you would not be able to do the second option at all. For this reason the second option is usually not considered kosher.

Gaël
Lars Buitinck
2013-05-25 16:00:48 UTC
2013/5/25 Gianni Iannelli <***@msn.com>:
> Before train my SVM is suggested to scale the data in order to get zero mean
> and unit variance.
>
> There are two options:
> - scale the training set, train the SVM, scale the whole dataset, classify
> the dataset;
> - scale the whole dataset, take from it the training set, train the SVM,
> classify the dataset.

You should always scale both, using the same mean and variance:

scaler = Scaler()
training_set = scaler.fit_transform(training_set)
test_set = scaler.transform(test_set)

If you only scale the training set, then the test set may have
completely different ranges for its features, i.e. it doesn't live in
the same part of the feature space as the training set.
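In current scikit-learn releases that class is named StandardScaler, and the first call should be fit_transform (fit alone returns the scaler object, not the transformed data). A minimal self-contained sketch; the data and variable names here are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up data, just for illustration.
rng = np.random.RandomState(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_train = (X_train[:, 0] > 5.0).astype(int)
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training set
X_test_scaled = scaler.transform(X_test)        # reuse those statistics, do not refit

clf = SVC().fit(X_train_scaled, y_train)
predictions = clf.predict(X_test_scaled)
```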

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Gianni Iannelli
2013-05-25 16:32:42 UTC
What I usually do is scale the training set and the dataset separately, like this:

X_train = preprocessing.scale(X_train_noscaled)
X_class = preprocessing.scale(X_class_noscaled)

Looking at your suggestion, it seems I have to do it like this:

scaler = scaler()
X_train = scaler(X_train_noscaled)
X_class = scaler.transform(test_set)

Sorry, I have two questions:
- Is scaler different from scale? I'm asking because I've seen both in the scikit package;
- Is your suggestion different from what I do (preprocessing.scale on the two datasets)?
Thank you very much!
Solimyr
a***@ais.uni-bonn.de
2013-05-25 17:11:06 UTC
You should not use your method in any common setting. The difference when using the scaler is that it remembers the mean and variance of the training set and reuses them for the test set.
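A small numeric illustration of that difference, with made-up numbers and the standard library's statistics module instead of scikit-learn so it is self-contained:

```python
from statistics import mean, pstdev

train = [0.0, 2.0, 4.0]      # one feature; training mean 2, std ~1.63
test = [10.0, 12.0]          # clearly outside the training range

mu, sigma = mean(train), pstdev(train)

# Reusing the training statistics (what the scaler's transform does):
reused = [(x - mu) / sigma for x in test]      # ~[4.90, 6.12] -- still far away

# Rescaling the test set by its own statistics (preprocessing.scale on it):
separate = [(x - mean(test)) / pstdev(test) for x in test]  # [-1.0, 1.0]
```

With its own statistics the test set is silently recentred onto the training data, hiding the fact that it lies in a completely different region of feature space.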




--
This message was sent from my Android phone with K-9 Mail.
Gianni Iannelli
2013-05-27 13:42:48 UTC
Thanks for the answers!
I was trying scaler(), but Python tells me it doesn't exist. Could you please tell me where I went wrong? I also could not find the transform method.
Thanks in advance for your answer!
Solimyr

Lars Buitinck
2013-05-27 13:44:23 UTC
2013/5/27 Gianni Iannelli <***@msn.com>:
> I was trying the scaler() but Python say to me that doesn't exist. Could
> you please tell me where I wrong? And I could not find also the transform
> method.

from sklearn.preprocessing import Scaler

(with a capital S)

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Gianni Iannelli
2013-05-27 14:18:13 UTC
Found it! But now it has a different name: StandardScaler.
The documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
Furthermore, I was reading the documentation:
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.

Could you please explain what the point is of storing the mean and standard deviation? It's not so clear to me. And how is the transform done? Sorry for my low knowledge level about this stuff... I want to be sure I understand everything.
Thanks all,
Solimyr

Lars Buitinck
2013-05-27 14:38:12 UTC
2013/5/27 Gianni Iannelli <***@msn.com>:
> Found it! But now it has a different name: StandardScaler.

Ah, yes, excuse me.

> Could you please exlpain to me what its the point to store Mean and Standard
> Deviation? It's not so clear to me. And how the transform is made? Sorry for
> my lower knowledge level about this stuffs...I wanna be sure to understand
> everything.

The transform method centers and scales using the mean and stddev that
it has learned from the training set. This makes sure, as I explained
previously, that the test set is mapped to the same region of feature
space where the training set lives, and where the classifier has
learned its decision boundary.

Suppose you'd want to apply a classifier, trained on a scaled training
set, to a single sample. If you don't center and scale, it may live in
the wrong region of space wrt. the decision boundary, which will be
somewhere near the origin. If you'd center and scale the single sample
using its own mean and stddev, it would always end up at the origin
because the mean of one point is the point itself, and no meaningful
classification can be performed.
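The single-sample case in concrete numbers (a made-up sample, in plain Python so it runs on its own):

```python
from statistics import mean

sample_set = [[3.0, -7.0, 42.0]]   # a "test set" containing exactly one sample

# The per-feature mean over a one-sample set is the sample itself, so
# centering by the set's own statistics sends every feature to 0.
feature_means = [mean(col) for col in zip(*sample_set)]
centered = [[v - m for v, m in zip(row, feature_means)] for row in sample_set]
print(centered)   # [[0.0, 0.0, 0.0]] -- the sample collapses to the origin
```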

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Gianni Iannelli
2013-05-27 15:07:22 UTC
Thank you! It's clear! Please tell me if I understood correctly (or if I'm completely stupid):

1. take the training set and calculate the mean and the standard deviation for each feature (to scale, just subtract the mean and divide by the std, as in the Stack Overflow link that was posted);
2. transform the training set using these values;
3. train my SVM;
4. take the test set and apply the transformation without calculating the mean and the std again, just using the already-calculated ones;
5. classify each point.

Doing it the way I was (preprocessing.scale() for the training set and the test set), the difference is in the fourth point: it calculates the mean and the std again over the whole test set and then applies the transformation. In this case, looking at the mean, the two regions (training and test) could be shifted and, consequently, the classification could be wrong. I wrote could because I have tried three different datasets, and with the second method (the one you proposed) I got better results two times and worse results one time compared to the method I was using.

In conclusion, the method that I was using must be avoided because it is theoretically wrong.

Is this correct?

Thanks to all and thanks for your time!
Solimyr
Joel Nothman
2013-05-27 23:16:22 UTC
Sounds like you got it to me, but perhaps "because theoretically it's
wrong" needs another moment's explanation: your estimator is fitted to the
feature values as they are adjusted in training, so it's inappropriate to
adjust them differently at test time. It's also inappropriate because in
the real world you often don't have test sets, but lone test samples, so
finding the mean of all test samples before processing any is not very
helpful.


Joel Nothman
2013-05-27 14:39:11 UTC
A related question was recently asked on Stack Overflow. Does this
help?
http://stackoverflow.com/questions/16137816/scikit-learn-preprocessing-svm-with-multiple-classes-in-a-pipeline
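In the same spirit as that thread, the scaler and the SVM can be chained in a scikit-learn Pipeline, which calls fit_transform on the training data during fit and plain transform during predict, so the training statistics are reused automatically. A sketch with invented data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.normal(loc=3.0, size=(60, 2))
y_train = (X_train[:, 0] > 3.0).astype(int)
X_test = rng.normal(loc=3.0, size=(10, 2))

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)      # StandardScaler.fit_transform runs internally
pred = pipe.predict(X_test)     # StandardScaler.transform reuses train statistics
```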

