Discussion:
How to present parameter search results
Joel Nothman
2013-06-02 00:56:09 UTC
Permalink
TL;DR: a list of `namedtuple`s is a poor solution for parameter search
results; here I suggest better alternatives.

I would like to draw some attention to #1787 which proposes that structured
arrays be used to return parameter search (e.g. GridSearchCV) results. A
few proposals have sought additional parameter search outputs (e.g.
training scores and times in #1742; multiple test metrics, such as P and R
where F1 is the objective, or per-class performance); structured arrays may
not be the right answer, but some solution should be selected.

In scikit-learn 0.13, results are a list of triples (parameters, mean
score, fold scores). Using tuples, or `namedtuple`s as in the current dev
version, is a particularly poor solution:

* it is not extensible: people will expect it to have a fixed length, and
changes in namedtuple length break unpickling.
* it doesn't look like the output of other estimators, afaik.
* it is not especially convenient to access.

We need a format that can support more fields. As far as I can see this
means one of:

1. a sequence of dicts
2. a sequence of namespaces (like `namedtuple`s but not iterable)
3. a dict of arrays
4. many attributes, each an array, on the estimator
5. many attributes, each an array, on a custom results object
6. a structured array / recarray

All of these require the fields to be named (something not discussed enough
at #1787). Except for (2), (4) and (5) where descriptors can be used to
deprecate names and transform values, all those names and their values must
remain fixed across versions. I think (4) is most compatible with
scikit-learn's use of attributes and parameters (coindexed arrays are
common).
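
For concreteness, here is a rough sketch of what (4) could feel like from the
user's side; `DummySearch` and the attribute names are made up for
illustration, not a real or proposed API:

import numpy as np

# Hypothetical option (4): coindexed arrays stored as attributes,
# mirroring how estimators expose things like coef_ and classes_.
class DummySearch(object):
    def __init__(self):
        self.grid_params_ = [{'C': 0.01}, {'C': 0.1}, {'C': 1.0}]
        self.mean_test_score_ = np.array([0.67, 0.94, 0.97])
        # one row per candidate, one column per CV fold
        self.fold_test_scores_ = np.array([[0.64, 0.67, 0.70],
                                           [0.92, 0.94, 0.96],
                                           [0.96, 0.97, 0.98]])

search = DummySearch()
best = int(np.argmax(search.mean_test_score_))
print(search.grid_params_[best], search.fold_test_scores_[best].std())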

Structured arrays are good because they can be accessed in all dimensions
(search candidates, folds where relevant, and fields). They are bad because
they are not familiar to scikit-learn users and can be quirky to work with
(particularly if some fields have `dtype=object`).

It seems the common use-case for this data is to select one or more
candidates by their parameter values, and then to explore a few fields,
such as scores or times, their means or standard deviations. Structured
arrays (6) make this easy in some cases because slicing by index and the
zipped iteration over selected fields are included out of the box (but
`zip` is still needed to mix per-candidate and per-fold data).
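
By way of comparison, a minimal sketch of option (6) with a plain numpy
structured array; the field names and numbers are only illustrative:

import numpy as np

# Illustrative structured array: one record per candidate.
results = np.array(
    [({'C': 0.01}, 0.67, [0.64, 0.67, 0.70]),
     ({'C': 0.1},  0.94, [0.92, 0.94, 0.96]),
     ({'C': 1.0},  0.97, [0.96, 0.97, 0.98])],
    dtype=[('parameters', object),
           ('mean_test_score', float),
           ('fold_test_scores', float, (3,))])

# Slicing candidates and iterating over selected fields comes for free:
top = results[np.argsort(results['mean_test_score'])[-2:]]
for params, mean in zip(top['parameters'], top['mean_test_score']):
    print(params, mean)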

It could be possible to enable this sort of functionality with (4) or (5)
-- slicing over the search candidates; iterating fields in parallel;
aggregating over folds -- but this increases API complexity and reinvents
the wheel (*).

So, please: consider the alternatives (**); and please don't lock in a list
of `namedtuple`s.

- Joel
(*) Essentially we're replicating `pandas.DataFrame` except that our
per-fold data is 2d and so doesn't fit into their `Series`. I guess having
a format that can be easily imported into a `DataFrame` (3, 6) has
advantages. See also #1034.
(**) My preferences are (4) for its simplicity, familiarity and
flexibility; and (6) because it can be easily transformed and uses an
appropriate, existing numpy data structure.
Olivier Grisel
2013-06-07 10:02:21 UTC
Permalink
TL;DR: the parameter search results data structure choice should
anticipate new use-cases

Thanks Joel for the detailed analysis.

In the current situation, I think I myself like:

5. many attributes, each an array, on a custom results object

This makes it possible to write a `__repr__` method on that object
that could write a statistical summary of the top 10 or so candidate
parameterizations.

I think we should keep `best_params_`, `best_estimator_` and
`best_score_` as quick-access convenience accessors, even if they are
redundant with the detailed content of the search results.

However, to move the discussion forward on model evaluation results,
there are three additional use-cases not addressed by the current
design that I would like to have addressed somehow at some point in
the future:

A- Fault tolerance and handling missing results caused by evaluation errors

How to handle partial results? Sometimes some combinations of the
parameters will trigger runtime errors, for instance if the evaluation
raises an exception because the estimator fails to converge
(ill-conditioning), hits numeric overflow / underflow (apparently this
can happen in our SGD Cython code and raises a ValueError,
to be debugged), or runs out of memory...

I think the whole search should not crash if one evaluation fails
after 3 hours of computation and many successful evaluations. The
error should be collected and the evaluation iteration should be
excluded from the final results statistics.

B- Being able to monitor partial results and interrupt the search
without waiting for the end (e.g. by handling KeyboardInterrupt using
an async job scheduling API)

Also, even if the current joblib API does not allow for that, I think
it would be very useful to make it possible at some point for the
user to monitor the current progress of the search and interrupt it
without losing access to the evaluation results collected up to that
point.

C- Being able to warm-start a search with previously collected results

C1: Refining the search space: Submit a new grid or parameter sampler
that focuses the search at a finer scale around an interesting area in
existing dimensions, and optionally trims dimensions that the user
deems useless according to past results.

C2: Refining the cross-validation: the user might want to perform a
first search with a very low number of CV iterations (e.g. 1 or 2
iterations of shuffle split) to get a coarse overview of the
interesting part of the search space, then trim the parameter grid to
a smaller yet promising grid, and then add more CV iterations only for
those parameters so as to get finer estimates of the mean validation
scores by reducing the standard error of the mean across random CV
folds.

Note: C2 is only useful for the (Stratified)ShuffleSplit cross
validation where you can grow n_iter or change random_state to get as
many CV splits as you want, provided the dataset is large enough.

In order to be able to address A, B and C in the future, I think the
estimator object should adopt a simple primary data structure that is
a growable list of individual (parameter, CV-fold)-scoped evaluations,
and then provide the user with methods to introspect them simply, such
as: find the top 10 parameters by average validation score across the
currently available CV folds (some CV folds could be missing due to
partial evaluation caused by A (failures) or B (interrupted
computation)).

Each item in this list could have:

- parameters_id: unique parameter set integer identifier (e.g. a deep
hash or random index)
- parameters: the parameter settings dict
- cv_id: unique CV object integer identifier (hash of the CV
object or random index)
- cv_iter_index: the CV fold iteration integer index
- validation_score_name: the primary validation score (to be used for
ranking models)

Optional attributes we could add in the future:

- training score to be able to estimate under-fitting (if non-zero)
and over-fitting by diffing with the validation score
- more training and validation scores (e.g. precision, recall, AUC...)
- more evaluation metrics that are not scores but are useful for model
analysis (e.g. a confusion matrix for classification)
- fitting time
- prediction time (could be complicated to separate out of the complete
scoring time due to our Scorer API, which currently hides it).

Then, to compute the mean score for a given parameter set, one could
group by parameters_id (e.g. using a Python `defaultdict(list)` with
parameters_id as key); see the sketch below.
Advanced users could also convert this log of evaluations into a pandas
dataframe and then do joins / group-bys themselves to compute various
aggregate statistics across the dimensions of their choice.
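
A minimal sketch of that flat log and the group-by, with illustrative
field names (this is not an existing scikit-learn structure):

from collections import defaultdict
import numpy as np

# One dict per (parameter setting, CV fold) evaluation.
evaluation_log = [
    {'parameters_id': 0, 'parameters': {'C': 0.1}, 'cv_iter_index': 0,
     'validation_score': 0.92},
    {'parameters_id': 0, 'parameters': {'C': 0.1}, 'cv_iter_index': 1,
     'validation_score': 0.96},
    {'parameters_id': 1, 'parameters': {'C': 1.0}, 'cv_iter_index': 0,
     'validation_score': 0.97},
]

# Group by parameters_id; folds that failed or were interrupted are
# simply absent from the log.
scores_by_params = defaultdict(list)
for record in evaluation_log:
    scores_by_params[record['parameters_id']].append(record['validation_score'])
mean_scores = dict((pid, np.mean(scores))
                   for pid, scores in scores_by_params.items())

The same list also drops straight into `pandas.DataFrame(evaluation_log)`
for the joins / group-bys mentioned above.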

Finally, there is an additional use case that I have in mind, even if
it is possibly less of a priority than the others:

D: warm starting with larger subsamples of the dataset

Make it possible to start the search on a small subsample of the
dataset (e.g. 10% of the complete dataset), then continue with a larger
subset (e.g. 20% of the dataset), to be able to identify the most
promising parameterizations quickly and evaluate how sensitive they
are to a doubling of the dataset size. That would make it possible to
select a smaller grid for a parameter search on the full dataset, and
also to compute learning curves for bias-variance analysis of the
individual parameters.

--
Olivier
Joel Nothman
2013-06-09 02:38:35 UTC
Permalink
Post by Olivier Grisel
A- Fault tolerance and handling missing results caused by evaluation errors
I don't think this affects the output format, except where we can actually
get partial results for a fold, or if we want to report successful folds
and ignore others for a single candidate parameter setting. But I wonder if
that just makes things much too complicated.
Post by Olivier Grisel
B: Being able to monitor partial results and interrupt search
before waiting for the end (e.g. by handling KeyBoardInterrupt using an
async job scheduling API)

So the stop and resume case just means the results need to be appendable...?

In general, I don't think Parallel's returning a list is of great benefit
here. Working with an iterable would be more comfortable.
Post by Olivier Grisel
C1: Refining the search space
Similarly, it should be possible to have fit append further results.
Post by Olivier Grisel
C2: Refining the cross-validation
and
Post by Olivier Grisel
D: warm starting with larger subsamples of the dataset
I would think in these cases it's better to create a new estimator and/or
keep results separate.
Something you missed: the ability to get back diagnostics on the quality /
complexity of the model, e.g. coefficient sparsity.

These suggestions do make me consider storage in an external database (a
blob store, or an online spreadsheet) as hyperopt allows. I think "allows"
is important here: when you get to that scale of experimentation, you
probably don't want results logged only in memory. But we need a sensible
default for working with a few thousand candidates.

Except for purity of parallelism, I don't see why you would want to store
each fold result for a single candidate separately. I don't see the
use-case for providing them separately to the user (except where one fold
failed and another succeeded). As far as I'm concerned, the frontend should
hide that.

I do see that providing all fields together for a single candidate is the
most common use-case and argues against providing parallel arrays (but not
against a structured array / recarray).

Finally, the single most important thing I can see about making results
explorable is not providing candidate parameter settings only as dicts, but
splitting the dicts out so that you can query by the value of each
parameter, and group over others.

This may be getting into crazy land, and certainly close to reimplementing
Pandas for the 2d case, or recarrays with benefits, but: imagine we had a
SearchResult object with:
* attributes like fold_test_score, fold_train_score, fold_train_time, each
a 2d array.
* __getattr__ magic that produces mean_test_score, mean_train_time, etc.
and std_test_score, std_train_time on demand (weighted by some
samples_per_fold attr if iid=True); see the sketch after this list.
* attributes like param_C that would enable selecting certain candidates by
their parameter settings (through numpy-style boolean queries).
* __getitem__ that can pull out one or more candidates by index (and
returns a SearchResult).
* a method that returns a dict of selected 1d array attributes for
Pandas-style (or spreadsheet? in that case a list of dicts) integration
* a method that zips over selected attributes for simple iteration.
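
A minimal sketch of the on-demand derivation mentioned in the second
bullet; names and behaviour are only illustrative, and simpler than a
real implementation would need to be:

import numpy as np

class SearchResult(object):
    def __init__(self, fold_test_score):
        # shape: (n_candidates, n_folds)
        self.fold_test_score = np.asarray(fold_test_score)

    def __getattr__(self, name):
        # Derive mean_* / std_* lazily from the matching fold_* array.
        for prefix, func in (('mean_', np.mean), ('std_', np.std)):
            if name.startswith(prefix):
                return func(getattr(self, 'fold_' + name[len(prefix):]), axis=1)
        raise AttributeError(name)

res = SearchResult([[0.92, 0.96], [0.97, 0.95]])
print(res.mean_test_score, res.std_test_score)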

Is this crazy, or does it do exactly what we want? or both? And how does it
not meet the needs of your wishlist, Olivier (except where the number of
folds differ)?

- Joel
Joel Nothman
2013-06-09 02:46:27 UTC
Permalink
Post by Joel Nothman
* a method that zips over selected attributes for simple iteration.
And sure, a method that performs
self[np.argsort(self.mean_test_score)[-k:]] to get the k best results...
Joel Nothman
2013-06-09 09:40:44 UTC
Permalink
Again, it's probably over the top, but I think it's a useful interface
(prototyped at https://github.com/jnothman/scikit-learn/tree/search_results):

>>> from __future__ import print_function
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>> from sklearn.svm import SVC
>>> iris = load_iris()
>>> grid = {'C': [0.01, 0.1, 1], 'degree': [1, 2, 3]}
>>> search = GridSearchCV(SVC(kernel='poly'),
...                       param_grid=grid).fit(iris.data, iris.target)
>>> res = search.results_
>>> res.best().mean_test_score
0.97333333333333338
>>> res
<9 candidates. Best results:
  <0.973 for {'C': 0.10000000000000001, 'degree': 3}>,
  <0.967 for {'C': 1.0, 'degree': 3}>,
  <0.967 for {'C': 1.0, 'degree': 2}>, ...>
>>> for tup in res.zipped('parameters', 'mean_test_score',
...                       'std_test_score'):
...     print(*tup)
...
{'C': 0.01, 'degree': 1} 0.673333333333 0.033993463424
{'C': 0.01, 'degree': 2} 0.926666666667 0.00942809041582
{'C': 0.01, 'degree': 3} 0.966666666667 0.0188561808316
{'C': 0.10000000000000001, 'degree': 1} 0.94 0.0163299316186
{'C': 0.10000000000000001, 'degree': 2} 0.966666666667 0.0188561808316
{'C': 0.10000000000000001, 'degree': 3} 0.973333333333 0.00942809041582
{'C': 1.0, 'degree': 1} 0.966666666667 0.0249443825785
{'C': 1.0, 'degree': 2} 0.966666666667 0.00942809041582
{'C': 1.0, 'degree': 3} 0.966666666667 0.0188561808316
Olivier Grisel
2013-06-09 16:25:13 UTC
Permalink
Post by Joel Nothman
Again, it's probably over the top, but I think it's a useful interface
(prototyped at https://github.com/jnothman/scikit-learn/tree/search_results):
I very much like that but I still think that we should keep the raw
evaluation log to make it easier to implement future extensions.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Joel Nothman
2013-06-09 23:20:55 UTC
Permalink
Firstly, a note that I've added that example to the doctest on my branch,
with some extensions to show selecting over parameter values and grouping
over named fields (e.g. identifying the 'C' with the best result per
'degree').

I think hyperopt's use of mongodb (as an alternative) sounds a lot like what
you're proposing. The other case we should eventually support is finding
the best result while keeping no log whatsoever. In the meantime, I would
like to give users an interface to access more than just the score for the
full set of results; but yes, it could become merely an option for log
handling / analysis.
Post by Olivier Grisel
- they badly handle missing / partial results or at least there is not
uniform solution as missing data markers would depend on the dtype of
the column, e.g.: NaNs for floats, -1 as a marker for ints, None for
dtype=object? Furthermore missing results are pre-allocated.
mrecarrays handle the masking issues, albeit providing a bit of a clumsy
interface (http://numpy-discussion.10968.n7.nabble.com/mrecarray-indexing-behaviour-td33532.html).
I currently use such masking for missing parameters in cases like yours:

param_grid = [
{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

It gets a bit messy, but selecting by parameter value still works as
expected. And yes, the preallocation is a bit of a problem; this takes up
unnecessary space, but generally not as much unnecessary space as a series
of dicts! (Admittedly array storage of string parameters is a bit wasteful
of memory when stored with dtype=np.string_ rather than dtype=object.)
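
For illustration only, plain numpy masked arrays already give the basic
behaviour described here (a toy sketch, not the code on my branch):

import numpy as np
import numpy.ma as ma

# 'gamma' only applies to the 'rbf' candidates, so it is masked elsewhere.
kernel = np.array(['linear', 'linear', 'rbf', 'rbf'], dtype=object)
gamma = ma.masked_invalid([np.nan, np.nan, 0.001, 0.0001])

# Selecting by parameter value still works as expected:
print(gamma[kernel == 'rbf'])
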
Post by Olivier Grisel
- they do not naturally handle change in dimension sizes or number of
dimensions:

No, they don't. My current solution does not handle changes in number of
folds / dimensions. It handles the subset of data with two dimensions of
the same size (with possibly-masked parameters and maybe results in the
future too). I think that's still pretty useful in most cases; and it could
perhaps have a different storage backend with the same frontend to handle
the heterogeneous size case.

Btw, one thing I haven't implemented on SearchResult is an __array__ method
that returns a mrecarray of all parameters and result means and stds (where
dtypes allow), suitable for import into pandas or export to CSV.
Olivier Grisel
2013-06-09 16:18:49 UTC
Permalink
Post by Joel Nothman
Post by Olivier Grisel
A- Fault tolerance and handling missing results caused by evaluation errors
I don't think this affects the output format, except where we can actually
get partial results for a fold, or if we want to report successful folds and
ignore others for a single candidate parameter setting. But I wonder if that
just makes things much too complicated.
It's not complicated to store successful results in a list and failed
parameters + matching error tracebacks in another.

The log of successful evaluations could be either a list of dicts or a
list of namedtuples. The list-of-dicts option is probably more flexible
if we want to make it possible for the user to collect additional
evaluation attributes, for instance by passing a callback; see the
sketch below.
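
A toy sketch of that separation; `fit_and_score` is a stand-in for the
real per-candidate evaluation, and none of these names exist in
scikit-learn:

import traceback

def fit_and_score(parameters):
    # Stand-in for fitting and scoring one candidate on one CV fold.
    if parameters['C'] <= 0:
        raise ValueError("ill-conditioned")
    return 0.9

successes, failures = [], []
for parameters in [{'C': 1.0}, {'C': 0.0}]:
    try:
        successes.append({'parameters': parameters,
                          'score': fit_and_score(parameters)})
    except Exception:
        failures.append({'parameters': parameters,
                         'traceback': traceback.format_exc()})
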
Post by Joel Nothman
Post by Olivier Grisel
B: Being able to monitor partial results and interrupt search before
waiting for the end (e.g. by handling KeyBoardInterrupt using an async job
scheduling API)
So the stop and resume case just means the results need to be appendable...?
Yes, mostly. But that also means that we should be able to compute mean
scores over 2 out of 5 folds and then recompute the mean scores later
when we get access to all 5 folds' results.

Hence my proposal is to store the raw dummy list of evaluations and
offer public methods to compute user-friendly aggregate summaries of
the partial or complete results.
Post by Joel Nothman
In general, I don't think Parallel's returning a list is of great benefit
here. Working with an iterable would be more comfortable.
Yes, we might need to make joblib.Parallel evolve to support task
submission and async retrieval to implement this. I think this is one
of the possible design goals envisioned by Gael as a possible evolution
of the joblib project.
Post by Joel Nothman
Post by Olivier Grisel
C1: Refining the search space
Similarly, it should be possible to have fit append further results.
Yes.
Post by Joel Nothman
Post by Olivier Grisel
C2: Refining the cross-validation
and
Post by Olivier Grisel
D: warm starting with larger subsamples of the dataset
I would think in these cases it's better to create a new estimator and/or
keep results separate.
Although I think those two are very important to manage the
exploration / exploitation trade-off faced by ML researchers and
practitioners, I also agree they could be addressed in later
evolutions of scikit-learn, or maybe even as separate projects such as
https://github.com/jaberg/hyperopt or
https://github.com/pydata/pyrallel

I would just like to emphasize that storing the raw evaluation log
as a dummy Python list would make it possible to deal with this kind
of future evolution if we ever decide to implement it directly in
scikit-learn.

Hence I think the data structure that stores the evaluation results
should be as simple as possible and avoid making any assumptions about
the kind of aggregation or the number of axes we will collect during
the search.

Basically, adding support for sub-sampling will add a new axis for
possible aggregations, and if we use 2D numpy recarrays as the primary
data structure with one row per parameter setting, we won't be able to
implement that use case at all without breaking the API once again.
Post by Joel Nothman
Something you missed: the ability to get back diagnostics on the quality /
complexity of the model, e.g. coefficient sparsity.
Yes. I think we could extend the fit_grid_point API to make it
possible to pass an arbitrary python callback that would have access
to the fitted estimator and the CV fold and collect any kind of
additional model properties to be included in the search report.
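
For instance, such a callback might look roughly like this; the
signature is hypothetical, since no such hook exists yet:

import numpy as np

def extra_diagnostics(estimator, train_idx, test_idx):
    # Collect additional model properties, e.g. coefficient sparsity.
    coef = getattr(estimator, 'coef_', None)
    if coef is None:
        return {}
    return {'coef_sparsity': float(np.mean(coef == 0))}
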
Post by Joel Nothman
These suggestions do make me consider storage in an external database (a
blob store, or an online spreadsheet) as hyperopt allows. I think "allows"
is important here: when you get to that scale of experimentation, you
probably don't want results logged only in memory. But we need a sensible
default for working with a few thousand candidates.
I agree, but I think we should keep that thread
Post by Joel Nothman
Except for purity of parallelism, I don't see why you would want to store
each fold result for a single candidate separately. I don't see the use-case
for providing them separately to the user (except where one fold failed and
another succeeded).
To make it easy to:

- deal with partial / incomplete results (either for fault tolerance
or early stopping / monitoring)

- extend the size of an existing dimension (e.g. collecting 5 random
CV folds instead of 3) in a warm restart of the search.

- add a new dimension (e.g. subsamples of the dataset), possibly in
warm restart of the search instance.

by not making any assumptions about the kind of estimates the user will
want in future versions of the lib.
Post by Joel Nothman
As far as I'm concerned, the frontend should hide that.
Yes that's why I propose to provide public methods to compute
interesting aggregates from the raw evaluation log.
Post by Joel Nothman
I do see that providing all fields together for a single candidate is the
most common use-case and argues against providing parallel arrays (but not
against a structured array / recarray).
structured arrays / recarrays have 2 issues:

- they handle missing / partial results badly, or at least there is no
uniform solution, as missing data markers would depend on the dtype of
the column (e.g. NaN for floats, -1 as a marker for ints, None for
dtype=object?). Furthermore, missing results are pre-allocated.

- they do not naturally handle changes in dimension sizes or number of dimensions.
Post by Joel Nothman
Finally, the single most important thing I can see about making results
explorable is not providing candidate parameter settings only as dicts, but
splitting the dicts out so that you can query by the value of each
parameter, and group over others.
Yes, but if we go for the simple evaluation log list I propose, this
can always be implemented by dedicated methods.

Furthermore be aware that the number of parameters is not always the
same for each result item of a GridSearchCV:

See: http://scikit-learn.org/stable/modules/grid_search.html#gridsearchcv

This is a valid param grid:

param_grid = [
{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

The gamma attribute is only present when `kernel == 'rbf'`.

Expanding this into columns of a recarray is not very natural, I think.
This is similar to the sparsity issue mentioned earlier.
Post by Joel Nothman
This may be getting into crazy land, and certainly close to reimplementing
Pandas for the 2d case, or recarrays with benefits, but: imagine we had a
* attributes like fold_test_score, fold_train_score, fold_train_time, each a
2d array.
* __getattr__ magic that produced mean_test_score, mean_train_time, etc. and
std_test_score, std_train_time on demand (weighted by some samples_per_fold
attr if iid=True).
* attributes like param_C that would enable selecting certain candidates by
their parameter settings (through numpy-style boolean queries).
* __getitem__ that can pull out one or more candidates by index (and returns
a SearchResult).
* a method that return a dict of selected 1d array attributes for
Pandas-style (or spreadsheet? in that case a list of dicts) integration
* a method that zips over selected attributes for simple iteration.
Is this crazy, or does it do exactly what we want? or both? And how does it
not meet the needs of your wishlist, Olivier (except where the number of
folds differ)?
Interesting, but I am not sure I understand it all. Can you give an
example of a typical series of instructions that would leverage such a
SearchResult object from an interactive Python session to introspect
it?

Furthermore, such a SearchResult instance could always be computed on
demand, or at the end of the computation, from the raw evaluation log.
Or it could even wrap the raw evaluation log internally.

Basically, I am advocating Event Sourcing [1] as a design goal for the
primary data structure used to store the evaluation results. Let us
make as few assumptions as possible about the kind of data we want to
collect and how the user will aggregate those data to find the best models.

[1] http://martinfowler.com/eaaDev/EventSourcing.html

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2013-06-09 16:21:21 UTC
Permalink
Post by Olivier Grisel
Post by Joel Nothman
Is this crazy, or does it do exactly what we want? or both? And how does it
not meet the needs of your wishlist, Olivier (except where the number of
folds differ)?
Interesting but I am not sure I understand it all. Can you give an
example of a typical series of instructions that would leverage such a
SearchResult object from an interactive python sessions to introspect
it?
Ignore that, I had not read your later responses when I replied.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Romaniuk, Michal
2013-06-07 13:13:52 UTC
Permalink
It would be great if there was a way to access the parameter search results as a numpy ndarray, with one axis for each parameter and one additional axis for the cross-validation folds. This would make it easy to visualise the grid search results, compute the mean, median or variance for each grid point, etc.

Regards,
Michal
Andreas Mueller
2013-06-07 13:26:39 UTC
Permalink
Post by Romaniuk, Michal
It would be great if there was a way to access the parameter search results as a numpy ndarray, with one axis for each parameter and one additional axis for the cross-validation folds. This would make it easy to visualise the grid search results, compute the mean, median or variance for each grid point, etc.
The problem with that is that it does not carry over to randomized or
optimized search.
I already implemented that once and abandoned it again.
Gael Varoquaux
2013-06-07 14:00:14 UTC
Permalink
Post by Romaniuk, Michal
It would be great if there was a way to access the parameter search
results as a numpy ndarray, with one axis for each parameter and one
additional axis for the cross-validation folds. This would make it easy
to visualise the grid search results, compute the mean, median or
variance for each grid point, etc.
That only works if these parameters are on an nD grid, which is not at
all guaranteed.
Joel Nothman
2013-06-08 09:46:20 UTC
Permalink
But where it is the case, an index into the results (so that you can use
np.asarray(results)[grid.build_index()] in the desired manner) is possible.
https://github.com/scikit-learn/scikit-learn/pull/1842

On the other hand, as long as you can get an array of parameter values for
each parameter name (e.g.
https://github.com/jnothman/scikit-learn/tree/parameters_mrecarray), even
if not a grid, you can transform your data with pandas.DataFrame or similar.
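
As a toy illustration of that route, with made-up numbers and assuming
per-parameter value arrays are already at hand:

import pandas as pd

df = pd.DataFrame({'C': [0.01, 0.1, 1.0, 0.01],
                   'degree': [2, 2, 3, 3],
                   'mean_test_score': [0.93, 0.97, 0.97, 0.96]})
# e.g. best score observed for each degree:
print(df.groupby('degree')['mean_test_score'].max())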

And Olivier, I'm looking forward to reading your response, but will need a
moment more than I have right now!

- Joel


On Sat, Jun 8, 2013 at 12:00 AM, Gael Varoquaux <
Post by Gael Varoquaux
Post by Romaniuk, Michal
It would be great if there was a way to access the parameter search
results as a numpy ndarray, with one axis for each parameter and one
additional axis for the cross-validation folds. This would make it easy
to visualise the grid search results, compute the mean, median or
variance for each grid point, etc.
That only works if these parameters are on the nD grid, which is not at
all garanteed.
------------------------------------------------------------------------------
1. A cloud service to automate IT design, transition and operations
2. Dashboards that offer high-level views of enterprise services
3. A single system of record for all IT processes
http://p.sf.net/sfu/servicenow-d2d-j
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general