Discussion:
[Scikit-learn-general] Sampling in grid_search randomized_grid_search
Sebastian Raschka
2016-02-19 16:36:29 UTC
Hi, Stelios,
I am wondering, how did you implement this tweak? Just a thought, but instead of adding extra functionality inside the GridSearch class, what about using a random training data selector (transformer) as a pipeline object? Something along the lines of

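class RandomRowSelector(object):
    def __init__(self):
        pass

    def _some_random_sampling_function(self, X, y):
        # placeholder: should return the indices of the rows to keep
        raise NotImplementedError

    def transform(self, X, y):
        sampled_rows = self._some_random_sampling_function(X, y)
        # index y by rows only, so that a 1-D label array also works
        return X[sampled_rows, :], y[sampled_rows]

    def fit(self, X, y=None):
        return self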

Best,
Sebastian
On Feb 19, 2016, at 7:56 AM, Stylianos Kampakis wrote:
Hi everyone,
I was thinking of implementing a tweak that makes it possible to sample randomly from a dataset when using grid search. This would be particularly useful for big datasets. The sampling takes place during each round of grid search.
Does anyone think this would be worth submitting to scikit-learn?
Best regards,
Stelios
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Gael Varoquaux
2016-02-19 16:42:41 UTC
That won't work: a transformer that changes the number of samples breaks
the scikit-learn pipeline.

Please add this use case to the PR on the scikit-learn enhancement
proposal that discusses a possible modification to scikit-learn:
https://github.com/scikit-learn/enhancement_proposals/pull/2

Cheers,

Gaël
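
To illustrate the objection above: in a scikit-learn pipeline, an intermediate step's transform receives X and returns only the transformed X, while y is handed on to the later steps unchanged. The sketch below is a toy model of that contract in plain Python, not scikit-learn's own code; RowSampler, LengthCheckingEstimator and fit_pipeline are hypothetical names made up for the illustration.

```python
class RowSampler:
    """Hypothetical transformer that keeps every other row of X."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # The transformer contract returns only X, so y cannot be subsampled here.
        return X[::2]


class LengthCheckingEstimator:
    """Stand-in final estimator that, like real ones, needs one label per row."""

    def fit(self, X, y):
        if len(X) != len(y):
            raise ValueError("X has %d rows but y has %d labels" % (len(X), len(y)))
        return self


def fit_pipeline(transformers, estimator, X, y):
    """Toy model of a pipeline fit: X flows through each step, y is passed along unchanged."""
    for step in transformers:
        X = step.fit(X, y).transform(X)
    return estimator.fit(X, y)


X = [[float(i)] for i in range(10)]
y = [i % 2 for i in range(10)]

try:
    fit_pipeline([RowSampler()], LengthCheckingEstimator(), X, y)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

Once the sampler has dropped rows from X, the final estimator sees fewer rows than labels and refuses to fit, which is why row sampling has to happen outside the pipeline steps.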
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
Andreas Mueller
2016-02-22 16:51:37 UTC
You can just do this via a CV object. For example, use
StratifiedShuffleSplit(n_splits=5, train_size=.1, test_size=.1)
and your training and test sets will be randomly sampled, disjoint 10%
subsets of the data, repeated 5 times.
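
To make the idea concrete, here is a small hand-rolled sketch of what such a CV object yields: for each repetition, a pair of disjoint, class-balanced index samples. It is a plain-Python illustration under my own assumptions, not scikit-learn's implementation; the function name stratified_shuffle_split and its arguments train_frac, test_frac and seed are invented for the example. In scikit-learn's model_selection module the corresponding arguments of StratifiedShuffleSplit are train_size, test_size and n_splits (older releases in sklearn.cross_validation used n_iter), and the splitter is passed to the grid search through its cv argument.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(y, n_splits=5, train_frac=0.1, test_frac=0.1, seed=0):
    """Yield (train_idx, test_idx) pairs of disjoint, class-balanced random samples.

    Hand-rolled illustration only; this is not scikit-learn's implementation.
    """
    rng = random.Random(seed)
    # Group row indices by class label so each class is sampled separately.
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            n_train = round(len(shuffled) * train_frac)
            n_test = round(len(shuffled) * test_frac)
            # Consecutive slices of one shuffle can never overlap.
            train_idx.extend(shuffled[:n_train])
            test_idx.extend(shuffled[n_train:n_train + n_test])
        yield sorted(train_idx), sorted(test_idx)


# 200 rows, two classes with a 3:1 imbalance.
labels = [0] * 150 + [1] * 50
splits = list(stratified_shuffle_split(labels))
for train_idx, test_idx in splits:
    # Sizes of the two samples and their (empty) intersection.
    print(len(train_idx), len(test_idx), set(train_idx) & set(test_idx))
```

Each hyperparameter setting is then fitted on a fresh 10% sample and scored on a different 10%, so the grid search never touches the full dataset in a single fit.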
Stylianos Kampakis
2016-02-22 20:17:37 UTC
Ah, thanks! A much better solution than what I did :)

Regards,
Stelios