Discussion:
Bayesian optimization for hyperparameter tuning
James Jensen
2014-01-30 19:23:28 UTC
I usually hesitate to suggest a new feature in a library like this
unless I am in a position to work on it myself. However, given the
number of people who seem eager to find something to contribute, and
given the recent discussion about improving the Gaussian process module,
I thought I'd venture an idea.

Bayesian optimization is an efficient method used especially for
functions that are expensive to evaluate. The basic idea is to fit the
function using Gaussian processes, using a surrogate function that
determines where to evaluate next in each iteration. The surrogate
strikes a balance between exploration (sampling intervals you haven't
tried before) and exploitation (if previous samples in a vicinity scored
well, then the likelihood of getting a high score in that area is high).
Some of the math behind it is beyond me, but the general idea is very
intuitive. Brochu, Cora, and de Freitas (2010) "A Tutorial on Bayesian
Optimization of Expensive Cost Functions," is a good introduction.
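A minimal sketch of the loop described above, using only the Python standard library. In the usual terminology the Gaussian process fitted to the evaluated points is called the surrogate, and the rule that picks the next point is called the acquisition function. Everything specific here is an assumption made for illustration and does not come from the thread: the toy `objective`, the kernel length scale, the candidate grid, and the upper-confidence-bound acquisition (other acquisition functions are common too).

```python
import math

LENGTH_SCALE = 0.15  # assumed kernel length scale for this toy problem
JITTER = 1e-6        # small diagonal term that keeps the kernel matrix well conditioned


def kernel(a, b):
    """Squared-exponential (RBF) covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / LENGTH_SCALE) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L z = b."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def solve_upper(L, z):
    """Back substitution: solve L^T x = z."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def fit_gp(xs, ys):
    """Factor the kernel matrix once per iteration; a zero prior mean is assumed."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (JITTER if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper(L, solve_lower(L, ys))
    return L, alpha


def predict(xs, L, alpha, x_new):
    """Posterior mean and standard deviation at one new input."""
    k_star = [kernel(x, x_new) for x in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(kernel(x_new, x_new) - sum(t * t for t in v), 0.0)
    return mean, math.sqrt(var)


def objective(x):
    """Hypothetical stand-in for an expensive cross-validation score (peak at 0.7)."""
    return -(x - 0.7) ** 2


def bayes_opt(n_iter=10, kappa=2.0):
    candidates = [i / 100 for i in range(101)]
    xs = [0.0, 1.0]                      # two initial evaluations
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        L, alpha = fit_gp(xs, ys)
        best_c, best_acq = None, -math.inf
        for c in candidates:
            if c in xs:
                continue                 # do not re-evaluate a known point
            mean, std = predict(xs, L, alpha, c)
            acq = mean + kappa * std     # upper confidence bound: exploit + explore
            if acq > best_acq:
                best_c, best_acq = c, acq
        xs.append(best_c)
        ys.append(objective(best_c))
    i_best = max(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best], xs


best_x, best_y, history = bayes_opt()
print(best_x, best_y, len(history))
```

The `mean` term rewards regions that already scored well (exploitation) and the `kappa * std` term rewards regions far from any evaluated point (exploration); `kappa` sets the balance between the two.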

One useful application of Bayesian optimization is hyperparameter
tuning. It can be used to optimize the cross-validation score, as an
alternative to, for example, grid search. Grid search is simple and
parallelizable, there is no overhead in choosing the hyperparameters to
try, and the nature of some estimators allows them to be used with it
very efficiently. Bayesian optimization is serial and has a small amount
of overhead in evaluating the surrogate. But it is generally much more
efficient in finding good solutions, and particularly shines when the
scoring function is costly or when there are more than 1 or 2
hyperparameters to tune; here grid search is less attractive and
sometimes completely impractical.
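To put rough numbers on why grid search becomes impractical as hyperparameters are added, here is a small sketch. The 10 candidate values per hyperparameter and the 5 cross-validation folds are assumptions chosen for illustration, not figures from this post.

```python
from itertools import product


def grid_fits(values_per_param, n_params, n_folds=5):
    """Number of model fits an exhaustive grid search needs under k-fold CV."""
    return values_per_param ** n_params * n_folds


# Enumerate a small two-parameter grid explicitly to confirm the count.
small_grid = list(product(range(10), repeat=2))
assert len(small_grid) == 10 ** 2

for n_params in (1, 2, 4):
    print(n_params, grid_fits(10, n_params))
# prints:
# 1 50
# 2 500
# 4 50000
```

The count grows exponentially in the number of hyperparameters, so a method that chooses each evaluation carefully can pay for its own overhead once each fit is expensive.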

In one of my own applications, involving 4 regularization parameters,
I've been using the BayesOpt library
(http://rmcantin.bitbucket.org/html/index.html), which offers it as a
general-purpose optimization technique that one can manually integrate
with one's cross-validation code. In general, it works quite well, but
there are some limitations to its design that can make its integration
inconvenient. Having this functionality directly integrated into
scikit-learn and specifically tailored to hyperparameter tuning would be
useful. I have been impressed with the ease of use of such convenience
classes as GridSearchCV, and dream of having a corresponding BayesOptCV,
etc.

As a general-use optimization method, Bayesian optimization would belong
elsewhere than in scikit-learn, e.g. in scipy.optimize. But specifically
as a method for hyperparameter tuning, it seems it would fit well in the
scope of scikit-learn, especially since I expect it would not be much
more than a layer or two of functionality on top of what scikit-learn's
GP module offers (or will offer once revised). And it would be of more
general utility than an additional estimator here or there.

I'm curious to hear what others think about the idea. Would this be a
good fit for scikit-learn? Do we have people with the interest,
expertise, and time to take this on at some point?
Dan Haiduc
2014-01-30 20:03:19 UTC
Actually, I wanted to create exactly this myself.
I was then discouraged by the fact that scikit-learn did not merge a pull
request from someone who implemented a Multi-Armed Bandit
<https://github.com/scikit-learn/scikit-learn/pull/906>, on the grounds
that scikit-learn doesn't do reinforcement learning.
I'm new here (everywhere, not just scikit), and I'm not sure how closely
related MAB is to Bayesian optimization, but I think something along
those lines should definitely be implemented for hyperparameters, since
evaluating them is expensive almost by definition.

Great idea! I certainly hope it gets implemented as well.
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Hadayat Seddiqi
2014-01-30 20:11:00 UTC
Hi,

So I was the one who volunteered to contribute my GP code for a revamp
of scikit-learn's GP module. I'm far from an expert, and off the top of my
head I can't say I understand how this would fit in, but if someone is
knowledgeable and willing to work on this then I'd be more than happy to
lend a hand as well. I've been kind of quiet on my own GP code so far, just
trying to get everything as ready and nice as I can before bugging people
again.

James, you mentioned that you might be hesitant to suggest things if you
don't have time to implement them. If I read that correctly, you're saying
you might not have the time, but in case you do, feel free to contact me
(this goes for anyone, of course).

-Had
Zach Dwiel
2014-01-30 20:15:56 UTC
It seems that with GridSearchCV and RandomizedSearchCV both already
included in scikit-learn, it would make sense to also include other common,
more efficient hyperparameter searchers as well.

zach
Sturla Molden
2014-01-30 22:21:32 UTC
As I understand it from reading about this a LONG time ago (apologies if my
memory is rusty), "Bayesian optimization" means maximizing the
log-likelihood using the Newton-Raphson method. The word "Bayesian" comes
from an obfuscated explanation of what really happens: if we assume a flat
or Gaussian prior and approximate the log-likelihood with a second-order
Taylor series expansion, the posterior is approximated with a Gaussian
distribution. We can then improve this iteratively by refitting the
polynomial around the mode. But only statisticians like to explain
optimization with Newton-Raphson in such a difficult way. There is no need
to involve Gaussian approximations to the Bayesian posterior here. "Bayesian
optimization" is merely a buzzword. This is no more "Bayesian" than maximum
likelihood using Fisher's scoring method; in fact, it is identical. And by
the way, Newton-Raphson is not about striking balances between exploitation
and exploration. That is also bullshitting. It is about quadratic
convergence, and if anything, it is famous for finding local optima and
sometimes just failing to converge by overshooting the target (which is why
quasi-Newton is often preferred).
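For readers unfamiliar with the Newton-Raphson behaviour referred to here, a small sketch follows. Both toy functions are assumptions picked for illustration and are not taken from this thread: a quartic with several stationary points, and `sqrt(1 + x**2)`, for which the Newton update simplifies to `x <- -x**3`.

```python
import math


def newton(grad, hess, x0, n_steps=50, tol=1e-12):
    """Newton-Raphson on the gradient: x <- x - f'(x) / f''(x). Returns all iterates."""
    xs = [x0]
    for _ in range(n_steps):
        g = grad(xs[-1])
        if abs(g) < tol:
            break
        xs.append(xs[-1] - g / hess(xs[-1]))
    return xs


# Toy function f(x) = x**4 - 3*x**2 + x, which has three stationary points.
def poly_grad(x):
    return 4 * x ** 3 - 6 * x + 1


def poly_hess(x):
    return 12 * x ** 2 - 6


# The starting point decides which stationary point is found.
right = newton(poly_grad, poly_hess, 2.0)[-1]    # a local minimum
left = newton(poly_grad, poly_hess, -2.0)[-1]    # the global minimum
middle = newton(poly_grad, poly_hess, 0.0)[-1]   # a local maximum


# Toy function f(x) = sqrt(1 + x**2), minimum at 0. Any start with |x| > 1
# overshoots further on every step; any start with |x| < 1 converges quickly.
def sqrt_grad(x):
    return x / math.sqrt(1 + x * x)


def sqrt_hess(x):
    return (1 + x * x) ** -1.5


converging = newton(sqrt_grad, sqrt_hess, 0.5, n_steps=4)
diverging = newton(sqrt_grad, sqrt_hess, 1.5, n_steps=4)
print(right, left, middle)
print(converging)
print(diverging)
```

Newton-Raphson only uses local derivative information, so it settles on whichever stationary point is reachable from the start (including a maximum), and it can diverge outright when the quadratic model is a poor fit.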

:)

Sturla
Ken Arnold
2014-01-31 03:22:21 UTC
Post by Sturla Molden
As I understand it from reading about this a LONG time ago (apologies if my
memory is rusty), "Bayesian optimization" means maximizing the
log-likelihood using the Newton-Raphson method.
Probably that was how the term was typically used at one time, but recently
"Bayesian optimization" has come to mean something different. In a setting
where the function to be optimized is expensive to evaluate (e.g., the
error of an estimator as a function of its hyperparameters), and especially
if samples of that function's value are noisy, it can be helpful to
estimate values of that function as the posterior of a Gaussian Process
prior and a Gaussian observation likelihood. Given that function estimate
(as predicted mean and variance), you can globally optimize an "expected
improvement" heuristic to find the best point(s) to request function
evaluations next.
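As a rough illustration of the "expected improvement" heuristic, here is the standard closed form for a Gaussian prediction when maximizing, written with only the standard library. The function and argument names are hypothetical, and the sketch treats `best_so_far` as a known, noise-free value, which is a simplification of the noisy setting mentioned above.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Closed-form EI for maximization under a prediction N(mean, std**2)."""
    if std <= 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * normal_cdf(z) + std * normal_pdf(z)


# A confident, slightly better prediction versus an uncertain, worse one.
confident = expected_improvement(mean=0.1, std=0.01, best_so_far=0.0)
uncertain = expected_improvement(mean=-1.0, std=2.0, best_so_far=0.0)
print(confident, uncertain)
```

With these made-up numbers the uncertain point scores higher even though its predicted mean is worse, which is how the heuristic trades exploration against exploitation.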

For details, see:

Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian
Optimization of Machine Learning Algorithms. arXiv preprint
arXiv:1206.2944, 1–9. Retrieved from http://arxiv.org/abs/1206.2944
or, as Patrick linked to, http://www.cs.toronto.edu/~jasper/bayesopt.pdf

-Ken
James Jensen
2014-01-30 21:25:18 UTC
Hi, Had,

It's true that I'd have limited time (working on a PhD). I imagine most
possible contributors are also quite busy. Mainly, I lack the expertise
necessary to do this properly; I understand Bayesian optimization at a
high level but don't have much of a foundation in the underlying math,
and am an amateur programmer not yet accustomed to writing code that
would meet scikit-learn standards. That being said, if there are ways I
can help make this happen, I'd be glad to do so.

-James
Post by Hadayat Seddiqi
Hi,
So I was the one who volunteered to do contribute my GP code for a
revamp of scikits module. I'm far from an expert, and I can't say I
understand how this would fit off the top of my head, but if someone
is knowledgeable and willing to work on this then I'd be more than
happy to lend a hand as well. I've been kind of quiet on my own GP
code so far.. just trying to get everything as ready and nice as I can
before bugging people again.
James you mentioned that you might be hesitant to suggest things if
you don't have time to implement. If I read that correctly, you're
saying you might not have the time, but in case you do, feel free to
contact (this goes for anyone, of course).
-Had
Actually, I wanted to create exactly this myself.
I was then discouraged by the fact that Scikit-learn did not pull
from a guy who implemented Multi-Armed Bandit
<https://github.com/scikit-learn/scikit-learn/pull/906> on the
reason that Scikit-learn doesn't do reinforcement learning.
I'm new here (everywhere, not just scikit), and I'm not sure how
closely related MAB is with Bayesian optimization, but I think
something along those lines should definitely be implemented for
hyperparameters, since they're expensive functions almost by
definition.
Great idea! I certainly wish it gets implemented as well.
Gael Varoquaux
2014-01-30 22:28:17 UTC
Permalink
Post by James Jensen
Bayesian optimization is an efficient method used especially for
functions that are expensive to evaluate. The basic idea is to fit the
function using Gaussian processes, using a surrogate function that
determines where to evaluate next in each iteration. The surrogate
strikes a balance between exploration (sampling intervals you haven't
tried before) and exploitation (if previous samples in a vicinity scored
well, then the likelihood of getting a high score in that area is high).
Some of the math behind it is beyond me, but the general idea is very
intuitive. Brochu, Cora, and de Freitas (2010) "A Tutorial on Bayesian
Optimization of Expensive Cost Functions," is a good introduction.
One useful application of Bayesian optimization is hyperparameter
tuning.
Thanks a lot for your enthusiasm and suggestion.

Indeed, many of the core developers would love to see simple Bayesian
optimization used for hyperparameter optimization, for instance taking
the gist of hyperopt https://github.com/hyperopt/hyperopt and making an
extended version of RandomizedSearchCV.

However there are a number of technical roadblocks to get there. In
particular the Gaussian process could be improved (to implement
partial_fit for online learning), and the parallel computing engine
(joblib) does not support a producer/consumer pattern well. None of
these problems are showstoppers, but they reduce the usefulness of a
hyper-parameter selection object using Bayesian optimization.

I would hope that we find time to implement these difficult core aspects
and eventually get to implementing a more advanced hyper-parameter
optimizer. But all the core developers are very busy and spending a lot
of time simply maintaining the library (have a look at the number of
open issues and of pull requests waiting for review to get an
idea).

If you want to help beyond reviewing and finishing pull requests and
closing issues, I suggest that you first prototype the code by
submitting an example that uses Gaussian processes to optimize a noisy
function. As a second step, once that example is merged, we could think
about how to build a BayesianSearchCV object.

Cheers,

Gaël
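
For readers who want to see what such an example might look like, here is
a minimal sketch of Gaussian-process optimization of a noisy 1-D function.
It is only an illustration under stated assumptions, not code from
scikit-learn or from anyone in this thread: the GP is hand-rolled in NumPy
with a fixed RBF kernel, the acquisition is expected improvement evaluated
on a grid, and every name in it (rbf_kernel, gp_posterior,
expected_improvement, noisy_objective, bayes_opt) and every constant
(length scale, noise level) is hypothetical.

```python
import math

import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=0.05):
    """Posterior mean and std of a zero-mean, unit-variance GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)  # (K + noise^2 I)^-1 k(train, query)
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_tq * solved, axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected improvement over `best` when maximizing."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in z]))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf


def noisy_objective(x, rng):
    """Toy stand-in for a cross-validation score; true maximum at x = 0.7."""
    return -4.0 * (x - 0.7) ** 2 + rng.normal(scale=0.05)


def bayes_opt(objective, n_init=3, n_iter=15, seed=0):
    """Serial loop: fit the GP, maximize the acquisition, evaluate, repeat."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)  # candidate points for the acquisition
    x = rng.uniform(0.0, 1.0, size=n_init)
    y = np.array([objective(v, rng) for v in x])
    for _ in range(n_iter):
        offset = y.mean()  # center the targets because the GP prior is zero-mean
        mean, std = gp_posterior(x, y - offset, grid)
        # Assumption: the best noisy observation serves as the incumbent.
        ei = expected_improvement(mean + offset, std, y.max())
        x_next = grid[np.argmax(ei)]
        x = np.append(x, x_next)
        y = np.append(y, objective(x_next, rng))
    offset = y.mean()
    mean, _ = gp_posterior(x, y - offset, grid)
    return grid[np.argmax(mean)], x, y  # recommend the posterior-mean maximizer


if __name__ == "__main__":
    best_x, xs, ys = bayes_opt(noisy_objective)
    print("recommended x:", best_x, "after", len(xs), "evaluations")
```

The kernel hyperparameters are fixed here to keep the sketch short; a real
example would fit them to the data, and would swap noisy_objective for an
actual cross-validation score.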
Frédéric Bastien
2014-01-31 00:53:16 UTC
Permalink
I have a question about this type of algorithm for hyperparameter
optimization. With a grid search, we can run all the jobs in parallel,
but I have the impression that these algorithms remove that possibility.
Is there a way to sample many starting configurations with them?
And the more interesting question: if we start many jobs in parallel and
they don't all finish at the same time, as frequently happens, can we
sample a new test point that maximizes the "coverage" given the
currently running jobs that don't have results yet?

Fred

Patrick Mineault
2014-01-31 01:28:38 UTC
Permalink
Sure you can:

http://www.cs.toronto.edu/~jasper/bayesopt.pdf

And some python code:

https://github.com/JasperSnoek/spearmint
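
To make the idea of proposing points while other jobs are still running
concrete, here is a rough sketch of one simple heuristic, often called the
"constant liar". It is not taken from the paper or from spearmint, and it
is a swapped-in, simpler technique: each pending job is treated as if it
had already returned a made-up score (here the mean of the finished
scores), so the surrogate's uncertainty shrinks around running jobs and
the next proposal lands elsewhere. The NumPy GP, the upper-confidence-bound
acquisition, and all names and constants (propose_next, propose_batch,
kappa, the length scale) are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=0.05):
    """Posterior mean and std of a zero-mean, unit-variance GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_tq * solved, axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))


def propose_next(x_done, y_done, x_pending, grid, kappa=2.0):
    """Propose one new point given finished results and still-running jobs."""
    x_pending = np.asarray(x_pending, dtype=float)
    lie = y_done.mean()  # the "lie": pretend each pending job scored the average
    x_all = np.concatenate([x_done, x_pending])
    y_all = np.concatenate([y_done, np.full(len(x_pending), lie)])
    offset = y_all.mean()  # center the targets because the GP prior is zero-mean
    mean, std = gp_posterior(x_all, y_all - offset, grid)
    ucb = mean + offset + kappa * std  # upper confidence bound, maximization
    return grid[np.argmax(ucb)]


def propose_batch(x_done, y_done, n_workers, grid):
    """Fill n_workers slots one at a time, lying about each earlier proposal."""
    pending = []
    for _ in range(n_workers):
        pending.append(propose_next(x_done, y_done, pending, grid))
    return np.array(pending)


if __name__ == "__main__":
    grid = np.linspace(0.0, 1.0, 101)
    x_done = np.array([0.1, 0.5, 0.9])
    y_done = np.array([0.2, 0.8, 0.3])
    print(propose_batch(x_done, y_done, 3, grid))
```

The same propose_next call also covers the asynchronous case: whenever a
worker frees up, pass the points that are still running as x_pending and
the finished ones as x_done.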
Frédéric Bastien
2014-01-31 19:22:03 UTC
Permalink
thanks.

Fred

James Bergstra
2014-02-02 15:43:51 UTC
Permalink
Glad to see this thread revived!

Sklearn-users who are interested in this stuff should check out Hyperopt's
sklearn interface:

https://github.com/hyperopt/hyperopt-sklearn

It's very much a work-in-progress. We're in the process of putting together
some examples / tutorial, and a tech report that describes how well it
works, how long it takes, etc. The results we have so far are encouraging...

And speaking of results: we want to make the case that hyperopt-on-sklearn
is awesome, which requires showing that it works for lots of data sets. We
can only do so much on our own. Real use cases are a lot more interesting
than old standard benchmarks. If someone has a dataset and they'd like to
try hyper-optimizing their sklearn estimators & pre-processing stages, get
in touch! Send me a private message and we can work together to make sure
hyperopt-sklearn has what it takes for your application.

Also, hyperopt's got some new algorithms on the way too... but that'll be
the subject for another writeup.

- James
James Bergstra
2014-02-02 15:44:54 UTC
Permalink
(Sorry about the comment about a revived thread, I was thinking of another
one!)
Joel Nothman
2014-02-02 22:26:52 UTC
Permalink
Nice. I've taken a look at what you've got there...
from hpsklearn.components import svc
from hyperopt.pyll.stochastic import sample
sample(svc(str))
SVC(C=0.471736065582, cache_size=100.0, class_weight=None,
coef0=-0.0579424882785, degree=3, gamma=0.0478025734971,
kernel='sigmoid', max_iter=1790, probability=False, random_state=None,
shrinking=False, tol=0.00028626647776, verbose=False)

This is pretty neat! (even if I tend to try a much wider range of C than
this prefers)

The source also marks some parameters as having an expected correlation
with increased performance (effectiveness) and training cost, which I
presume may be used by some minimisers in the future.

The meta-optimiser, while not idiomatic, is similar to the scikit-learn
grid search idea. One limitation is that it doesn't support different
cross-validation partitions (it always uses a single training and
validation set in 80:20 proportion). One thing it adds is the ability to
time out a fit, which I guess is important to the randomised optimisation
technique, but which isn't directly supported by joblib.Parallel
(although an estimator wrapper and
https://github.com/scikit-learn/scikit-learn/pull/2795 might suffice).

- Joel
Post by James Bergstra
Glad to see this thread revived!
Sklearn-users who are interested in this stuff should check out Hyperopt's
https://github.com/hyperopt/hyperopt-sklearn
It's very much a work-in-progress. We're in the process of putting
together some examples / tutorial, and a tech report that describes how
well it works, how long it takes, etc. The results we have so far are
encouraging...
And speaking of results: we want to make the case that hyperopt-on-sklearn
is awesome, which requires showing that it works for lots of data sets. We
can only do so much on our own. Real use cases are a lot more interesting
than old standard benchmarks. If someone has a dataset and they'd like to
try hyper-optimizing their sklearn estimators & pre-processing stages, get
in touch! Send me a private message and we can work together to make sure
hyperopt-sklearn has what it takes for your application.
Also, hyperopt's got some new algorithms on the way too... but that'll be
the subject for another writeup.
- James
Post by Frédéric Bastien
thanks.
Fred
On Thu, Jan 30, 2014 at 8:28 PM, Patrick Mineault
http://www.cs.toronto.edu/~jasper/bayesopt.pdf
https://github.com/JasperSnoek/spearmint
Post by Frédéric Bastien
I have a question about these types of algorithms for hyperparameter
optimization. With a grid search, we can run all jobs in parallel, but
I have the impression that these algorithms remove that possibility. Is
there a way to sample many starting configurations with them?
The more interesting question: if we start many jobs in parallel and,
as frequently happens, they don't finish at the same time, can we sample
new test points that maximize the "coverage" given the currently running
jobs that don't have results yet?
Fred
James Bergstra
2014-02-02 23:54:29 UTC
Permalink
Post by Joel Nothman
Nice. I've taken a look at what you've got there...
from hpsklearn.components import svc
from hyperopt.pyll.stochastic import sample
sample(svc(str))
SVC(C=0.471736065582, cache_size=100.0, class_weight=None,
coef0=-0.0579424882785, degree=3, gamma=0.0478025734971,
kernel='sigmoid', max_iter=1790, probability=False, random_state=None,
shrinking=False, tol=0.00028626647776, verbose=False)
This is pretty neat! (even if I tend to try a much wider range of C than
this prefers)
Glad you like it!

What range of C would you recommend?

Regarding the range of C, this is exactly the sort of thing that we're
trying to tune right now. We're trying to set up distributions that are
broad enough to get good performance on a range of data sets, but not so
broad that we overwhelm the search algorithms.

For those interested: it's fair game to use meta-features of the data sets
themselves (e.g. size, sparsity, dimensionality, content-type, etc.) to set
these distributions too. There's lots of research room here, and there's a
workshop at ICML 2014 that would be great for anyone who wants to do some
of it.
Post by Joel Nothman
The source also marks some parameters as having an expected correlation
with increased performance (effectiveness) and training cost, which I
presume may be used by some minimisers in the future.
Exactly, none of the optimization algorithms currently use that info,
but... they should.
Post by Joel Nothman
The meta-optimiser, while not idiomatic, is similar to the scikit-learn
grid search idea. One limitation is that it doesn't support different
cross-validation partitions (it always uses a single training and
validation set in 80:20 proportion). One thing it adds is the ability to
time out a fit, which I guess is important to the randomised optimisation
technique, but which isn't directly supported by joblib.Parallel
(although an estimator wrapper and
https://github.com/scikit-learn/scikit-learn/pull/2795 might suffice).
Thanks for the comments, I translated them into a few tickets on
hyperopt-sklearn.

You're right, that 80-20 split is a hack. It's not a bad default, but it
should certainly be possible to use specific train and validation sets.
(https://github.com/hyperopt/hyperopt-sklearn/issues/14)

It should also be possible to do a K-fold loop and abort after the first
few splits yield unpromising scores. There is previous work on how to do
this "right" so it's mainly a matter of putting that logic in place.
(https://github.com/hyperopt/hyperopt-sklearn/issues/15)
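As a rough illustration of the idea, here is a sketch of a K-fold loop that gives up early. It uses a naive fixed-margin rule, which is an assumption made for the example and not the previous work referred to above; `score_fold`, `min_folds` and `margin` are hypothetical names:

```python
def cross_validate_with_abort(score_fold, n_folds, best_so_far,
                              min_folds=3, margin=0.05):
    """Score folds one by one; stop early when the running mean looks hopeless.

    score_fold(i) returns the validation score of fold i (higher is better).
    Returns (mean_score, folds_evaluated).
    """
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    scores = []
    for i in range(n_folds):
        scores.append(score_fold(i))
        mean = sum(scores) / len(scores)
        # Only consider aborting once a few folds are in, so that a single
        # unlucky split does not kill a good configuration.
        if len(scores) >= min_folds and mean < best_so_far - margin:
            break
    return sum(scores) / len(scores), len(scores)


# A configuration near the incumbent runs all folds; a poor one stops early.
good = cross_validate_with_abort(lambda i: 0.90, n_folds=10, best_so_far=0.88)
bad = cross_validate_with_abort(lambda i: 0.60, n_folds=10, best_so_far=0.88)
```

A principled version would replace the fixed margin with a statistical test on the fold scores.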

Re: the time-out, you're right. Sometimes fit() calls take a really long
time, out of all proportion to the quality of the resulting model. The
timeout keeps the search going, which I've found to deliver better
performance than waiting for super-slow configurations. That said, it would
be even better to detect when an estimator has an iterative fitting
procedure so that a half-baked estimator can be tested instead of aborted
when the timeout triggers.
(https://github.com/hyperopt/hyperopt-sklearn/issues/16)
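A minimal sketch of the "half-baked estimator" idea, assuming the fitting procedure can be driven one iteration at a time. The function `fit_with_budget` and its `step` callback are hypothetical; nothing here is hyperopt-sklearn's or scikit-learn's API:

```python
import time


def fit_with_budget(step, max_iter, budget_seconds, clock=time.monotonic):
    """Run up to max_iter calls of step(); stop when the time budget is spent.

    step(i) performs one training iteration and returns the current model state.
    Returns (state, iterations_done, timed_out).
    """
    deadline = clock() + budget_seconds
    state = None
    done = 0
    for i in range(max_iter):
        # Always run at least one iteration so there is something to test.
        if done > 0 and clock() >= deadline:
            return state, done, True  # keep the partially trained state
        state = step(i)
        done += 1
    return state, done, False
```

The budget is only checked between iterations, so a single very slow step cannot be interrupted this way; that case would still need a hard timeout.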

- James

Joel Nothman
2014-01-31 03:07:20 UTC
Permalink
Post by Frédéric Bastien
With a grid search, we can run all jobs in parallel. But I have the
impression that these algorithms remove that possibility. ...

You can still run all folds of, say, 10-fold cross-validation in parallel.
Post by Frédéric Bastien
The more interesting question: if we start many jobs in parallel and,
as frequently happens, they don't finish at the same time, can we sample
new test points that maximize the "coverage" given the currently running
jobs that don't have results yet?

Not using joblib.Parallel (currently).


I had imagined that more nuanced hyperparameter optimisation would
essentially consist of evaluating a sequence of sets of hyperparameters,
i.e. a sequence of grid searches (although not necessarily a strict grid).
In the trivial case, where you evaluate a sequence of single parameter
settings, it is fairly easy to create a wrapper so that scikit-learn's grid
search facility can be the objective function in something like hyperopt or
even scipy.optimize.
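One possible reading of "a sequence of grid searches" is a coarse-to-fine refinement, sketched below for a single parameter. This is an assumption for illustration only: `refine_grid_search` is a hypothetical helper, and `score` stands in for whatever cross-validated score the real grid search would return.

```python
def refine_grid_search(score, low, high, n_points=5, n_rounds=3):
    """Evaluate a grid, then zoom in around the best point and repeat.

    score(x) returns a value to maximise; low/high bound the search interval.
    Returns (best_x, best_score).
    """
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    lower_bound, upper_bound = low, high
    best_x, best_score = None, float("-inf")
    for _ in range(n_rounds):
        step = (high - low) / (n_points - 1)
        grid = [low + i * step for i in range(n_points)]
        # Each round is an ordinary grid search; its points are independent
        # of each other and could be evaluated in parallel.
        for x in grid:
            s = score(x)
            if s > best_score:
                best_x, best_score = x, s
        # Next round: one grid step either side of the current best,
        # clamped to the original interval.
        low = max(lower_bound, best_x - step)
        high = min(upper_bound, best_x + step)
    return best_x, best_score


best_x, best_score = refine_grid_search(lambda x: -(x - 1.3) ** 2, 0.0, 4.0)
```

Only the rounds are sequential, so each round keeps the parallelism of a plain grid search.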

And no, the parallelisation wouldn't be optimal. But the current grid
search doesn't do anything like noticing that a particular parameter setting
has given really poor results on 5 folds of a 10-fold cross-validation and
concluding that it isn't worth running the other five. This could be
implemented privately, either by wrapping the estimator to contact a manager,
or by reinventing the search. But it's likely all too complex and custom to
include in the standard scikit-learn package.

- Joel
On Thu, Jan 30, 2014 at 5:28 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by James Jensen
Bayesian optimization is an efficient method used especially for
functions that are expensive to evaluate. The basic idea is to fit the
function using Gaussian processes, using a surrogate function that
determines where to evaluate next in each iteration. The surrogate
strikes a balance between exploration (sampling intervals you haven't
tried before) and exploitation (if previous samples in a vicinity scored
well, then the likelihood of getting a high score in that area is high).
Some of the math behind it is beyond me, but the general idea is very
intuitive. Brochu, Cora, and de Freitas (2010) "A Tutorial on Bayesian
Optimization of Expensive Cost Functions," is a good introduction.
One useful application of Bayesian optimization is hyperparameter
tuning.
Thanks a lot for your enthusiasm and suggestion.
Indeed, many of the core developers would love to see simple Bayesian
optimization used for hyperparameter optimization, for instance taking
the gist of hyperopt https://github.com/hyperopt/hyperopt and making an
extended version of RandomizedSearchCV.
However there are a number of technical roadblocks to get there. In
particular the Gaussian process could be improved (to implement
partial_fit for online learning), and the parallel computing engine
(joblib) does not support a producer/consumer pattern well. None of
these problems are showstoppers, but they reduce the usefulness of a
hyper-parameter selection object using Bayesian optimization.
I would hope that we find time to implement these difficult core aspects
and eventually get to implementing a more advanced hyper-parameter
optimizer. But all the core developers are very busy and spending a lot
of time simply maintaining the library (have a look at the number of
open issues and of pull requests waiting to be reviewed to get an
idea).
If you want to help beyond reviewing and finishing pull requests and
closing issues, I suggest that you first prototype the idea by submitting
an example that uses Gaussian processes to optimize a noisy function. As
a second step, once that example is merged, we could think about how to
build a BayesianSearchCV object.
Cheers,
Gaël
Gael Varoquaux
2014-01-31 08:38:52 UTC
Permalink
Post by Frédéric Bastien
I have a question about these types of algorithms for hyperparameter
optimization. With a grid search, we can run all jobs in parallel, but
I have the impression that these algorithms remove that possibility. Is
there a way to sample many starting configurations with them?
As others have answered, in theory it is possible. You need a
producer/consumer pattern in which you asynchronously spawn jobs that fit
and test a model, and when you retrieve the results, you update the
Bayesian optimizer which gives you another set of test points to try.
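The producer/consumer loop described above can be sketched with the standard library's concurrent.futures. This is an illustration under stated assumptions, not joblib or hyperopt code: `RandomSuggester` is a hypothetical stand-in that only does random search, where a real Bayesian optimizer would use the recorded history (and the points still running) inside `suggest()`.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class RandomSuggester:
    """Stand-in for a Bayesian optimizer: suggest() proposes, tell() records."""

    def __init__(self, low, high, seed=0):
        self._rng = random.Random(seed)
        self._low, self._high = low, high
        self.history = []  # (params, score) pairs seen so far

    def suggest(self):
        return self._rng.uniform(self._low, self._high)

    def tell(self, params, score):
        self.history.append((params, score))


def async_search(objective, optimizer, n_trials, n_workers):
    """Keep n_workers evaluations in flight; refill as soon as one finishes."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    submitted = 0
    pending = {}  # future -> params it is evaluating
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while submitted < n_trials or pending:
            # Producer: top up the pool with fresh suggestions.
            while submitted < n_trials and len(pending) < n_workers:
                params = optimizer.suggest()
                pending[pool.submit(objective, params)] = params
                submitted += 1
            # Consumer: take whatever finished first and update the optimizer.
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                optimizer.tell(pending.pop(future), future.result())
    # Best (params, score) pair, treating higher scores as better.
    return max(optimizer.history, key=lambda pair: pair[1])


optimizer = RandomSuggester(0.0, 4.0, seed=1)
best = async_search(lambda x: -(x - 2.0) ** 2, optimizer, n_trials=20, n_workers=4)
```

Threads keep the sketch short; CPU-bound model fitting would more likely need a process pool, which is where the robustness concerns come in.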

This parallel computing pattern is much more involved than those that
joblib supports. We want to evolve joblib to be more flexible, but we
want to do this while keeping its robustness and its simplicity, so
there is a lot of work to do on this side.

Hyperopt implements all these patterns, and more, but with fairly
involved code and more dependencies.

Gaël