[Scikit-learn-general] Project/PR Idea for Faster Automated Model Search

Discussion:

Pedro Rodriguez

2016-01-27 17:01:02 UTC

Hi,

I am considering working on a project that would result in a PR to
scikit-learn, but I would like to check that something like it doesn't
already exist or isn't already in progress (in or out of scikit-learn).

Goal: Implement the TuPAQ algorithm described at
http://web.cs.ucla.edu/~ameet/tupaq_socc.pdf to build something similar to
GridSearchCV.

Result: Potentially much faster training time over the parameter/model
space than GridSearchCV

Description of Algorithm:
1. Train all models for some number of iterations to kick-start the search
2. Drop all models that are not within some margin of the best model
3. Repeat steps 1 and 2 based on some heuristic
4. Return the best model
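
A minimal sketch of the four steps above, not the TuPAQ implementation from the paper. It assumes every candidate supports partial_fit, that one "iteration" means one pass over the training data, and that the stopping heuristic is simply a fixed number of rounds. The function name early_termination_search and its parameters (n_rounds, passes_per_round, margin) are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split


def early_termination_search(models, X_train, y_train, X_val, y_val,
                             n_rounds=5, passes_per_round=2, margin=0.05):
    """Train all candidates a little, drop the laggards, repeat."""
    classes = np.unique(y_train)
    alive = dict(models)  # name -> estimator that supports partial_fit
    scores = {}
    for _round in range(n_rounds):
        # Step 1: advance every surviving model by a few passes over the data.
        for model in alive.values():
            for _pass in range(passes_per_round):
                model.partial_fit(X_train, y_train, classes=classes)
        # Step 2: keep only models within `margin` of the current best.
        scores = {name: m.score(X_val, y_val) for name, m in alive.items()}
        best = max(scores.values())
        alive = {n: m for n, m in alive.items() if scores[n] >= best - margin}
        # Step 3: the "heuristic" is assumed to be a fixed round count.
    # Step 4: return the best surviving model.
    best_name = max(alive, key=lambda n: scores[n])
    return best_name, alive[best_name], scores[best_name]


X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
candidates = {
    f"alpha={a}": SGDClassifier(alpha=a, random_state=0)
    for a in (1e-5, 1e-3, 1e-1, 10.0)
}
best_name, best_model, best_score = early_termination_search(
    candidates, X_tr, y_tr, X_val, y_val)
```

The candidates here differ only in the alpha regularization strength; a real search would span more parameters and model families.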

Existing Code:
I didn't find anything like this in scikit-learn. The closest thing I found was
https://github.com/hyperopt/hyperopt-sklearn, but it doesn't include some of
the other methods used in the paper (such as early model termination).

Thanks!

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

***@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience

Andreas Mueller

2016-01-27 19:34:32 UTC

Hi.
Also check out this:
https://github.com/scikit-learn/scikit-learn/pull/5491

auto-sklearn (which uses meta-learning) might also be of interest to you.

From your description, TuPAQ seems to assume that there is some notion
of iterations.
That is true only for some models. It might be easier to run models on
subsets of the data.
That's actually something DataRobot does to screen models faster.
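
A rough sketch of the subset idea, which works for any estimator because it only needs fit and score. This is an assumed scheme for illustration, not DataRobot's actual procedure: the name screen_on_subsets, the fractions, and the margin are all hypothetical.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def screen_on_subsets(models, X_train, y_train, X_val, y_val,
                      fractions=(0.1, 0.3, 1.0), margin=0.05):
    """Fit candidates on growing subsets, dropping weak ones each round."""
    alive = list(models)
    scores = {}
    for frac in fractions:
        n = max(1, int(frac * len(X_train)))
        scores = {}
        for name in alive:
            # Refit a fresh copy on the first n rows (data assumed shuffled).
            model = clone(models[name]).fit(X_train[:n], y_train[:n])
            scores[name] = model.score(X_val, y_val)
        best = max(scores.values())
        alive = [name for name in alive if scores[name] >= best - margin]
    return alive, scores


X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svc": SVC(),
}
survivors, final_scores = screen_on_subsets(
    candidates, X_tr, y_tr, X_val, y_val)
```

Unlike iteration-based stopping, each round refits from scratch, so the savings come only from the cheap early rounds pruning the field.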

I don't think TuPAQ is ready for inclusion in scikit-learn (way too
fresh, 2 citations?).
But if you want to create a scikit-learn compatible implementation,
please go ahead, that would be great to have for reference.

cheers,
Andy

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Pedro Rodriguez

2016-01-27 20:15:57 UTC

Thanks for the response, Andy.

The main thing I wanted to get out of asking was:
1. Is this a reasonable thing to try?
2. Has it been done before?

I would want to make it scikit-learn compatible, but having it be a PR
isn't my main goal (only a possible plus). It looks like this might be
interesting to do, and comparing against auto-sklearn would give me an idea
of how well it does against a well-thought-out, well-tuned take on automated
model search.

Your point about the iterations assumption is fair. The algorithm also
assumes that an iteration is equivalent across algorithms, which isn't true.
That is something else I would be interested in looking at (using a time
budget instead).
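
One way the time-budget idea could look, as a sketch only: each model trains until a wall-clock budget runs out instead of for a fixed iteration count. It still assumes the model supports partial_fit; train_for_budget and its clock parameter are hypothetical names.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier


def train_for_budget(model, X, y, classes, budget_seconds,
                     clock=time.perf_counter):
    """Call partial_fit until the time budget is spent; return pass count."""
    deadline = clock() + budget_seconds
    n_passes = 0
    while True:
        # Always do at least one pass so the model is usable afterwards.
        model.partial_fit(X, y, classes=classes)
        n_passes += 1
        if clock() >= deadline:
            break
    return n_passes


X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = SGDClassifier(random_state=0)
n_passes = train_for_budget(model, X, y, np.unique(y), budget_seconds=0.05)
```

With equal budgets, a cheap model gets many passes and an expensive one gets few, which makes the comparison fairer in time than counting iterations.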

Pedro


Andreas Mueller

2016-01-27 23:00:06 UTC

Post by Pedro Rodriguez
Thanks for response Andy,
1. Is this a reasonable thing to try?

Yes.

Post by Pedro Rodriguez
2. Has it been done before?

Not for TuPAQ afaik.

Post by Pedro Rodriguez
I would want to make it scikit-learn compatible, but having it be a PR
isn't my main goal (only a possible plus). Looks like this might be
interesting to do and comparing against auto-sklearn would give me an
idea of how well it does against a well thought out/tuned take on
automated model search.
The iterations assumption is fair, it also assumes that an iteration
across algorithms is equivalent which isn't true. That would be
something else I would be interested in looking at (use time budget
instead)

Not all algorithms even have a concept of iteration. What is an
iteration for a Random Forest? What is an iteration for PCA? What is an
iteration for an SVM?
Sometimes, even if you could stop early, the result might be nonsensical.
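
To make this concrete, a small check of which estimators expose partial_fit. Treating partial_fit as a proxy for "has a usable notion of iteration" is an assumption made here, not a scikit-learn definition, and supports_incremental_training is a hypothetical helper.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC


def supports_incremental_training(estimator):
    """Rough proxy: can the estimator be trained a bit at a time?"""
    return hasattr(estimator, "partial_fit")


estimators = [SGDClassifier(), RandomForestClassifier(), PCA(), SVC()]
support = {
    type(est).__name__: supports_incremental_training(est)
    for est in estimators
}
for name, ok in support.items():
    print(f"{name}: {ok}")
```

Of these four, only SGDClassifier exposes partial_fit, so an iteration-based search would need a different stopping mechanism (or data subsets) for the rest.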