Discussion:
centering of sparse data for elastic net
James Jensen
2013-10-11 21:13:41 UTC
Permalink
I've been applying preprocessing.scale() to my data prior to using
scikit-learn's elastic net, with the understanding that elastic net will
not work correctly if the features do not each have zero mean and unit
variance. scale() both centers and normalizes the data. ElasticNet has
an option to normalize the input data but does not mention centering.
Olivier Grisel
2013-10-12 09:19:14 UTC
Permalink
Checking the implementation: this is apparently what is done in the
Cython implementation of the sparse coordinate descent used for
ElasticNet:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx#L227

the X_mean array is the mean of the training data precomputed by the
ElasticNet._pre_fit method. It is used internally to center the data
on the fly in the CD loop (instead of centering the data ahead of time
as done in the dense array case).
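The equivalence Olivier describes rests on the identity (X - 1·mean)·w = X·w - (mean·w)·1, so the centered prediction can be computed without ever densifying X. A minimal sketch of that identity, outside of scikit-learn:

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
X = sparse.random(20, 5, density=0.3, format="csc", random_state=rng)
w = rng.randn(5)

# Centering ahead of time destroys sparsity ...
X_dense_centered = X.toarray() - X.toarray().mean(axis=0)

# ... but the same result can be computed on the fly, keeping X sparse:
# (X - 1 mean^T) w == X w - (mean . w) 1
X_mean = np.asarray(X.mean(axis=0)).ravel()
on_the_fly = X @ w - X_mean @ w

assert np.allclose(X_dense_centered @ w, on_the_fly)
```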

You can control the centering with the `normalize=True` flag of the
ElasticNet class (or of any other linear regression model).
--
Olivier
James Jensen
2013-10-14 15:13:30 UTC
Permalink
Thank you, Olivier.

Just to clarify: you say

You can control the centering with the `normalize=True` flag of the
ElasticNet class (or of any other linear regression model).

I've noticed people use the term "normalize" in different ways. In the
case of the `normalize=True` flag of the linear models, does it mean
both scaling samples to have unit norm and centering them to have mean
zero? If so, this is inconsistent with the usage in, say, the
preprocessing module, where "normalization" refers only to scaling to
unit norm, and the word "standardization" is used to refer to doing both
(although the function to standardize is scale(), and "scale" seems more
naturally associated with normalization, in my mind). Because of this, I
had supposed that the `normalize=True` flag did not determine centering.
Lars Buitinck
2013-10-14 15:40:33 UTC
Permalink
I've noticed people use the term "normalize" in different ways. In the case
of the `normalize=True` flag of the linear models, does it mean both scaling
samples to have unit norm and centering them to have mean zero? If so, this
is inconsistent with the usage in, say, the preprocessing module, where
"normalization" refers only to scaling to unit norm, and the word
"standardization" is used to refer to doing both (although the function to
standardize is scale(), and "scale" seems more naturally associated with
normalization, in my mind). Because of this, I had supposed that the
`normalize=True` flag did not determine centering.
Yes, this is inconsistent with the preprocessing module. "normalize"
in linear_models is what preprocessing calls "standard scaling".
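The terminology difference is visible directly in the preprocessing module; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, scale

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# "Normalization" in sklearn.preprocessing: each *sample* (row) is
# scaled to unit norm; no centering happens.
X_norm = Normalizer(norm="l2").fit_transform(X)
assert np.allclose(np.linalg.norm(X_norm, axis=1), 1.0)

# "Standardization": each *feature* (column) is centered to zero mean
# and scaled to unit variance.
X_std = scale(X)
assert np.allclose(X_std.mean(axis=0), 0.0)
assert np.allclose(X_std.std(axis=0), 1.0)
```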
James Jensen
2013-10-14 15:34:17 UTC
Permalink
Thanks, Alex. That is helpful. Looks like the glmnet documentation says
that this is how they do it as well. What they don't explain is how to
find alpha_max in the first place. The only thing I've thought of is
doing something like a binary search until you find the smallest alpha
yielding the coef_ of zeros, with some limit on how many steps you do it
in. But is there a better way?

Also, how do you choose the smallest alpha value (or in other words, how
do you choose eps)? I came across an unofficial third-party description
of glmnet that said that if nobs < nvars, a higher value is chosen
(0.01, I think), whereas if nobs > nvars, a smaller value is chosen
(say, 0.0001). The basic idea makes sense, but it seems a bit ad hoc to
me, and it seems like it would be sensible to have more than two
possible values, based on the ratio of nobs to nvars. Any thoughts?
hi James,
for a given value of l1_ratio, the grid of alphas is chosen in log scale
starting from alpha_max to alpha_max / 10**eps. Any value of alpha
larger than alpha_max will lead to a coef_ full of zeros.
HTH
Alex
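A minimal sketch of such a log-scale grid (the helper name is hypothetical; here eps is taken as the ratio between the smallest and largest alpha, which is how ElasticNetCV's `eps` parameter is documented):

```python
import numpy as np

def alpha_grid(alpha_max, eps=1e-3, n_alphas=100):
    """Log-spaced grid of alphas from alpha_max down to alpha_max * eps."""
    return np.logspace(np.log10(alpha_max),
                       np.log10(alpha_max * eps),
                       num=n_alphas)

alphas = alpha_grid(1.0)
assert np.isclose(alphas[0], 1.0)    # largest alpha: all-zero coef_
assert np.isclose(alphas[-1], 1e-3)  # smallest alpha: weakest penalty
```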
Nicholas Dronen
2013-10-14 16:00:17 UTC
Permalink
Hi, James:

If by 'alpha' you mean what the lasso literature refers to as 'lambda', my
recollection is that the maximum lambda is determined simply by the L1 norm
of the coefficients of the ordinary least squares solution, because any
value greater than that provides no constraint for the lasso solution.
This was mentioned in a talk at ICML this year:


http://techtalks.tv/talks/the-lasso-persistence-and-cross-validation/58279/

Regards,

Nick
Post by James Jensen
Thanks, Alex. That is helpful. Looks like the glmnet documentation says
that this is how they do it as well. What they don't explain is how to
find alpha_max in the first place. The only thing I've thought of is
doing something like a binary search until you find the smallest alpha
yielding the coef_ of zeros, with some limit on how many steps you do it
in. But is there a better way?
Also, how do you choose the smallest alpha value (or in other words, how
do you choose eps)? I came across an unofficial third-party description
of glmnet that said that if nobs < nvars, a higher value is chosen
(0.01, I think), whereas if nobs > nvars, a smaller value is chosen
(say, 0.0001). The basic idea makes sense, but it seems a bit ad hoc to
me, and it seems like it would be sensible to have more than two
possible values, based on the ratio of nobs to nvars. Any thoughts?
hi James,
for a given value of l1_ratio, the grid of alphas is chosen in log scale
starting from alpha_max to alpha_max / 10**eps. Any value of alpha
larger than alpha_max will lead to a coef_ full of zeros.
HTH
Alex
------------------------------------------------------------------------------
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most
from the latest Intel processors and coprocessors. See abstracts and register:
http://pubads.g.doubleclick.net/gampad/clk?id=60134071&iu=/4140/ostg.clktrk
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
John Collins
2013-10-14 17:04:35 UTC
Permalink
Hi James,

In R speak:

The reason you see the advice to choose a higher alpha if nobs < nvars and
a lower alpha if the comparison is reversed is that alpha is the mixing
weight between the L1 and L2 penalties (whereas lambda is the regularization
level), and the L1 penalty tends to set more coefficients to zero than the
L2. Therefore, if nvars >> nobs, this seems like good advice, since you'll
end up with a more parsimonious and interpretable model. I would suggest
that the advice above is a good rule of thumb, but also a bit hand-wavy. In
practice, alpha is not nearly as sensitive as lambda (the level of
regularization). It may be reasonable to try some discrete set of alphas in
(0, 1] and a path of lambdas for each, and choose the best model from these.

Confusingly, sklearn uses l1_ratio to mean alpha and alpha to mean lambda.
Reading some of the previous thread, maybe this is responsible for some
confusion between the two sets of documentation?

-
John


Olivier Grisel
2013-10-14 17:20:12 UTC
Permalink
alpha is the strength of the regularizer and l1_ratio is the mixing
weight. "lambda" is a reserved keyword in Python, hence the use of
alpha instead. But this is very confusing and I wish we had used a
common English name like "penalty_strength" instead.

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet
James Jensen
2013-10-15 00:01:10 UTC
Permalink
John, you're right about the difference in nomenclature. I've been using
scikit-learn's names for the parameters, so the alpha I've referred to
is the regularization strength and corresponds to lambda in glmnet. The
mixing parameter, referred to in glmnet as alpha, is the L1-ratio in
scikit-learn.

Nick, thank you very much for the tip on how the L1 norm of an OLS
solution is used to determine the maximum regularization strength for
lasso. Thinking about how that would extend to elastic net: with an
L1-ratio of 1, alpha_max is the L1 norm of an OLS solution, because
elastic net reduces to lasso in this case. But with L1-ratios between
zero and one, couldn't alpha_max be greater than the L1 norm of an OLS
solution since alpha_max for the elastic net is not the L1
regularization strength, but rather the overall regularization strength,
distributed between L1 and L2? As the ElasticNet documentation says,
alpha = L1 strength + L2 strength, and L1-ratio= L1 strength / (L1
strength + L2 strength). It seems like the alpha_max for elastic net
with a given L1-ratio could be some function of both the L1 and L2 norms
of an OLS solution, and it might be a simple combination. But I haven't
found it browsing the literature, and I am unsure of how to derive it.

I did find the part in coordinate_descent.py where alpha_max is chosen,
but I don't fully understand the reasoning behind it:

alpha_max = np.abs(Xy).max() / (n_samples * l1_ratio)
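That formula can be checked empirically; a sketch with synthetic data (X and y are centered here, as the coordinate descent code assumes):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
n_samples, n_features = 50, 10
X = rng.randn(n_samples, n_features)
X -= X.mean(axis=0)
y = rng.randn(n_samples)
y -= y.mean()

l1_ratio = 0.5
alpha_max = np.abs(X.T @ y).max() / (n_samples * l1_ratio)

# At (or just above) alpha_max, every coefficient is exactly zero.
model = ElasticNet(alpha=alpha_max * 1.01, l1_ratio=l1_ratio).fit(X, y)
assert np.all(model.coef_ == 0.0)

# Just below alpha_max, at least one coefficient becomes nonzero.
model = ElasticNet(alpha=alpha_max * 0.5, l1_ratio=l1_ratio).fit(X, y)
assert np.any(model.coef_ != 0.0)
```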


Another concern: if the data does not have mean zero and/or unit
variance (I've been told this might be ok if, for example, I want to
preserve sparsity in the input), might this affect the magnitude of the
solution coefficients and hence the calculation of alpha_max?

And I'm still not sure how to pick the smallest value of alpha (or
rather "eps," the ratio between the largest and smallest values).

Now for the L1-ratio. The ElasticNetCV class does not automatically
choose a set of L1-ratios to test, as it does with the alphas; it's up
to the user to supply them. However, it does mention in the
documentation for ElasticNetCV:

Note that a good choice of list of values for l1_ratio is often to
put more values close to 1 (i.e. Lasso) and less close to 0 (i.e.
Ridge), as in [.1, .5, .7, .9, .95, .99, 1]

I understand John's reasoning that good L1-ratios are likely to be
higher the greater the proportion of variables to samples. If anyone
knows of other considerations that could go into choosing an appropriate
set of L1-ratios, let me know.
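For what it's worth, that suggested list can be passed straight to ElasticNetCV, which then cross-validates over both grids; a minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = X[:, :3] @ np.array([1.0, -2.0, 3.0]) + 0.1 * rng.randn(100)

# l1_ratio grid skewed toward 1, as the docstring suggests; the alpha
# grid is chosen automatically for each l1_ratio.
cv = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], cv=5)
cv.fit(X, y)

# cv.l1_ratio_ and cv.alpha_ hold the cross-validated choices.
assert cv.l1_ratio_ in [.1, .5, .7, .9, .95, .99, 1]
assert cv.alpha_ > 0
```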

Lastly: I was excited about the idea of trying first with a sparse grid
and then repeating the search in more detail in the area of parameter
values yielding high cross-validation scores. However, I notice in the
paper associated with Nick's link that it says "In practice, an upper
bound must be selected for any grid-search optimization [over values of
the L1 regularization parameter]. Note that more advanced optimization
techniques are generally not practical as the CV objective function
[...] is often noisy." Any thoughts on this?
Alexandre Gramfort
2013-10-15 19:37:08 UTC
Permalink
I did find the part in coordinate_descent.py where alpha_max is chosen, but
alpha_max = np.abs(Xy).max() / (n_samples * l1_ratio)
it can be derived from the KKT optimality conditions of the Lasso problem.

A
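For completeness, a sketch of that KKT derivation, using scikit-learn's objective and writing ρ for l1_ratio:

```latex
% Elastic net objective:
%   f(w) = \frac{1}{2n}\lVert y - Xw\rVert_2^2
%          + \alpha\rho\lVert w\rVert_1
%          + \frac{\alpha(1-\rho)}{2}\lVert w\rVert_2^2.
% The L2 term is smooth with zero gradient at w = 0, so w = 0 is
% optimal iff the subgradient condition holds:
%   \frac{1}{n} X^\top y \in \alpha\rho\,\partial\lVert w\rVert_1\big|_{w=0}
%   \iff \frac{\lvert x_j^\top y\rvert}{n} \le \alpha\rho \quad \forall j.
% The smallest alpha for which w = 0 is optimal is therefore
%   \alpha_{\max} = \frac{\max_j \lvert x_j^\top y\rvert}{n\,\rho},
% which matches alpha_max = np.abs(Xy).max() / (n_samples * l1_ratio).
```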
Olivier Grisel
2013-10-15 20:35:02 UTC
Permalink
Post by Alexandre Gramfort
I did find the part in coordinate_descent.py where alpha_max is chosen, but
alpha_max = np.abs(Xy).max() / (n_samples * l1_ratio)
it can be derived from the KKT optimality conditions of the Lasso problem.
Would be great to add a link to an online reference or the derivation
somewhere in the doc.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2013-10-15 20:36:45 UTC
Permalink
Post by Olivier Grisel
Post by Alexandre Gramfort
I did find the part in coordinate_descent.py where alpha_max is chosen, but
alpha_max = np.abs(Xy).max() / (n_samples * l1_ratio)
it can be derived from the KKT optimality conditions of the Lasso problem.
Would be great to add a link to an online reference or the derivation
somewhere in the doc.
Also, is it impacted by the lack of greedy data-centering in the sparse
case? It seems to me that it is.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Alexandre Gramfort
2013-10-16 08:28:31 UTC
Permalink
Post by Olivier Grisel
Also is it impacted by the lack of greedy data-centering in the sparse
case? It seems it does to me.
Indeed! I guess we have a bug there...

I'll take a look later today.

Alex
James Jensen
2013-10-16 23:55:32 UTC
Permalink
Thanks to everyone for their help with this.

Tadej Štajner
2013-10-15 05:35:25 UTC
Permalink
Hi James,

I had a similar problem - my approach was to wrap the sparse matrix in
another sparse matrix interface that applies the centering on the fly
when computing dot products. It builds on the same rationale as the
scipy.sparse.linalg.LinearOperator that's often used in optimization. I
used it for running CCA on large sparse matrices, and it happily gets the
job done. The profiler also doesn't show any nasty bottlenecks.

Here's the gist: https://gist.github.com/tadejs/6986951

There's likely some room for cleanup, as this was when I was still
learning scipy idioms... any suggestions on nicer style or avoiding
pitfalls are welcome :)

Not sure if it will work for elastic net - so far my centering wrapper
only supports dot product and transpose.
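The same idea can be sketched with scipy.sparse.linalg.LinearOperator (hypothetical names; Tadej's actual gist is linked above):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator

def centered_operator(X):
    """Act like the column-centered version of sparse X without densifying it."""
    mu = np.asarray(X.mean(axis=0)).ravel()
    ones = np.ones(X.shape[0])
    return LinearOperator(
        X.shape,
        matvec=lambda v: X @ v - ones * (mu @ v),       # (X - 1 mu^T) v
        rmatvec=lambda u: X.T @ u - mu * (ones @ u),    # (X - 1 mu^T)^T u
    )

rng = np.random.RandomState(0)
X = sparse.random(30, 8, density=0.2, format="csr", random_state=rng)
op = centered_operator(X)
Xc = X.toarray() - X.toarray().mean(axis=0)  # dense reference

v = rng.randn(8)
u = rng.randn(30)
assert np.allclose(op.matvec(v), Xc @ v)
assert np.allclose(op.rmatvec(u), Xc.T @ u)
```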

I also use the same trick for computing covariance and centered
covariance matrices on high-dimensional data, and it works nicely.

-- Tadej
Post by James Jensen
I've been applying preprocessing.scale() to my data prior to using
scikit-learn's elastic net, with the understanding that elastic net will
not work correctly if the features do not each have zero mean and unit
variance. scale() both centers and normalizes the data. ElasticNet has
an option to normalize the input data but does not mention centering.