Discussion:
GSoC 2015 Proposal: Multiple Metric Learning
Raghav R V
2015-03-22 23:57:47 UTC
Hi,

1. This is my proposal for the multiple metric learning project as a wiki
page:
https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements

Possible mentors: Andreas Mueller (amueller) and Joel Nothman (jnothman)

Any feedback/suggestions/additions/deletions would be awesome. :)

2. Given that there is huge interest among students in learning about ML,
do you think it would be within the scope of (and beneficial to) scikit-learn
to have all the exercises and/or concepts from a good-quality book (ESL /
PRML / Murphy) or an academic course like Ng's CS229 (not the less rigorous
Coursera version) implemented using sklearn? Or perhaps we could instead
enhance our tutorials and examples to be a self-study guide for learning
ML?
I have included this in my GSoC proposal, but was not quite sure whether it
would be a useful idea!

Or would it be better if I simply add more examples?

Please let me know your views!!

Thanks


R
Ronnie Ghose
2015-03-23 00:00:18 UTC
1. the link is broken
2. that sounds quite difficult and unfortunately conducive to cheating
------------------------------------------------------------------------------
Dive into the World of Parallel Programming: The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Raghav R V
2015-03-23 00:17:42 UTC
Post by Ronnie Ghose
1. the link is broken
Ah! Sorry :) -
https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements

2. that sounds quite difficult and unfortunately conducive to cheating
Hmm... Should I then simply opt for adding more examples?
------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website,
sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for
all
things parallel software development, from weekly thought leadership blogs
to
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Raghav R V
2015-03-23 00:52:23 UTC
Two things:

* The subject should have been "Multiple Metric Support in grid_search and
cross_validation modules and other general improvements" and not multiple
metric learning! Sorry for that!
* The link was not available due to the trailing "." (dot), which has been
fixed now!

Thanks
R
Andreas Mueller
2015-03-23 20:41:59 UTC
Can you please also upload it to Melange?
Andreas Mueller
2015-03-23 21:40:58 UTC
Hi Raghav.

I feel that your proposal lacks some focus.
I'd remove these two:

Mallows' Cp for LASSO / LARS
Implement a built-in abs-max scaler and Nesterov's momentum, and finish up
the Multilayer Perceptron module.

And, as discussed in this thread, probably also:
Forge a self-sufficient ML tutorial based on scikit-learn.

If you feel like your proposal doesn't have enough material (not sure about that),
two things that could be added and are more related to the
cross-validation and grid-search part
(but probably difficult from an API standpoint) are making CV objects
(aka path algorithms, or generalized cross-validation)
work together with GridSearchCV.
The other would be how to allow early stopping using a validation set.
The two are probably related (imho).
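The early-stopping idea can be sketched in a few lines of plain Python. Everything below (the `fit_one_epoch` / `validation_score` callables, the `patience` knob) is illustrative, not existing scikit-learn API:

```python
def fit_with_early_stopping(fit_one_epoch, validation_score,
                            max_epochs=100, patience=5):
    """Run training epochs, stopping once the validation score has
    not improved for `patience` consecutive epochs."""
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        fit_one_epoch()                 # one pass over the training set
        score = validation_score()      # evaluate on the held-out set
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break                       # patience exhausted, stop early
    return best_score, best_epoch

# toy demo: the score improves for three epochs, then plateaus
scores = iter([0.5, 0.6, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7])
best, best_epoch = fit_with_early_stopping(
    lambda: None, lambda: next(scores), max_epochs=10, patience=3)
```

The open API question for GridSearchCV would be where the validation split comes from and how it interacts with the CV folds.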

Olivier also mentioned cross-validation for out-of-core (partial_fit)
algorithms.
I feel that is not as important, but might also tie into your proposal.

Finishing the refactoring of model_evaluation in three days seems a bit
optimistic, if you include reviews.

For sample_weight support, I'm not sure whether there are obvious ways to
extend sample_weight to all the algorithms that you mentioned.
How does it work for spectral clustering and agglomerative clustering
for example?

In general, I feel you should focus on fewer things, and more on
the details of what to do for each.
Otherwise the proposal looks good.
For the wiki, having links to the issues might be helpful.

Thanks for the application :)

Andy
Raghav R V
2015-03-24 12:04:39 UTC
Hi Andy,

Thanks a lot for your feedback. I'll update my proposal wiki based on
your guidelines and also submit it to Melange today!


Thanks,


R
Joel Nothman
2015-03-24 12:40:38 UTC
I agree with everything Andy says. I think the core developers are very
enthusiastic to have a project along the lines of "Finish all the things
that need finishing", but it's very impractical to do so much context
switching both for students and mentors/reviewers.

One of the advantages of GSoC is that it creates specialisation: on the one
hand, a user becomes expert in what they tackle; on the other, reviewers
and mentors can limit their attention to the topic at hand. So please, try
to focus a little more.
Vlad Niculae
2015-03-24 23:39:17 UTC
Hi Raghav, hi everyone,

If I may, I have a very high-level comment on your proposal. It clearly shows that you are very involved in the project and understand the internals well. However, I feel like it’s written from a way too technical perspective. Your proposal contains implementation details, but little or no discussion of why each change is important and how it impacts users. Taking a step back and writing such discussion can help gain perspective, which is important for planning.

This is equally important for your weekly blog posts: they should provide an interesting read for more than just scikit-learn developers. So try not to think of them as a chore or a requirement; they're a chance to reach out to the community and to show off scikit-learn's ease of use and clean API for tasks that normally get tedious to write by hand. It won't be as easy to write about them as it would be if you worked on some shiny new model, but if you do it right, that makes it even better: everybody needs cross-validation and model selection!

Which leads me to finer comments:

1. The design of multiple metric support is important and would bring an immense usability gain. At the moment, most non-trivial model selection cases require custom code.

A while ago there was a mailing list discussion about using Pandas data frames for managing the complex multi-dimensional structure that arises. Of course, scikit-learn will never have a Pandas dependency, but we can try to make it as easy as possible to return things that plug seamlessly with Pandas. We won’t be able to show this off in documentation and examples, but it can make for a shiny blog post.
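To make the "plug seamlessly with Pandas" idea concrete, here is a minimal sketch: search results are flattened into one flat dict per (parameter setting, fold), which `pandas.DataFrame(records)` would accept directly. The container shape, function name, and the metric numbers are all made up for illustration:

```python
def as_records(results):
    """Flatten {params: {metric: [per-fold scores]}} into one flat
    dict per (params, fold) pair -- the shape pandas.DataFrame accepts."""
    records = []
    for params, metrics in results.items():
        n_folds = len(next(iter(metrics.values())))
        for fold in range(n_folds):
            row = {"fold": fold}
            row.update(dict(params))        # one column per parameter
            for metric, scores in metrics.items():
                row[metric] = scores[fold]  # one column per metric
            records.append(row)
    return records

# made-up scores: two parameter settings, two folds, two metrics
results = {
    (("C", 1.0),): {"accuracy": [0.80, 0.82], "f1": [0.75, 0.78]},
    (("C", 10.0),): {"accuracy": [0.85, 0.84], "f1": [0.80, 0.79]},
}
records = as_records(results)
# a pandas user could now simply do: pandas.DataFrame(records)
```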

2. Also on multiple metric support, you say “one iteration per metric (as is done currently).” What does this refer to, where is it done this way?

3. How does multiple metric support interfere with model selection APIs? Suddenly there is no more “best_{score|params|estimator}_”. There is an API discussion to be had there, and your review of possible options would be a great addition to the proposal. For example, will model selection objects gain a “criterion” function, that maybe defaults to getting the first specified metric? If so, could this API be used to make global decisions, e.g. "the model which is within 1 standard error of the best score, but has the largest C?” Or should it essentially just return a number per parameter configuration, that we then sort by?
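One possible shape for such a `criterion` API, sketched in plain Python with entirely hypothetical names (this is not a settled proposal):

```python
def rank_candidates(candidates, criterion):
    """candidates: list of (params, {metric: mean_score}) pairs.
    criterion: a metric name, or a callable mapping the per-metric
    score dict to a single number.  Returns candidates best-first."""
    if callable(criterion):
        key = lambda cand: criterion(cand[1])
    else:
        key = lambda cand: cand[1][criterion]
    return sorted(candidates, key=key, reverse=True)

cands = [
    ({"C": 1.0}, {"accuracy": 0.80, "f1": 0.79}),
    ({"C": 10.0}, {"accuracy": 0.85, "f1": 0.77}),
]
best_params, best_scores = rank_candidates(cands, "f1")[0]  # picks C=1.0
```

A criterion that only sees one candidate at a time cannot express global decisions like the one-standard-error rule; that would need a richer hook operating on the whole candidate list.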

4. There is another API discussion about `sample_weight`: is that the only parameter that we want to route to scoring? I have some applications where I want some notion of `sample_group`. (This would allow using scikit-learn directly for e.g. query-grouped search results ranking.) I proposed the `sample_*` API convention but it has quite a few downsides; if I remember correctly Joel proposed a param_routing API where you would pass a routing dict like {'sample_group': ['fit', 'score']}: such an API would be much more extensible.
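A minimal sketch of what such a routing dict might do in practice; the function name, the defaulting rule, and the dict shape are all hypothetical:

```python
def split_routed_params(params, routing):
    """routing maps each parameter name to the set of methods it should
    be forwarded to; unrouted parameters default to fit only."""
    fit_params, score_params = {}, {}
    for name, value in params.items():
        destinations = routing.get(name, {"fit"})
        if "fit" in destinations:
            fit_params[name] = value
        if "score" in destinations:
            score_params[name] = value
    return fit_params, score_params

# sample_weight goes to both fit and score; sample_group to fit only
fit_p, score_p = split_routed_params(
    {"sample_weight": [1, 2, 1], "sample_group": [0, 0, 1]},
    {"sample_weight": {"fit", "score"}, "sample_group": {"fit"}},
)
```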

Wrapping up 3 and 4, I would make sure to reserve time in the timeline for API discussion and convergence, especially given that we are trying to reach an API freeze. This will *not* be easy. It wouldn't hurt to factor in time for PR review as well. This might make you rethink the timeline a bit.

5. Nitpicks:
* There are some empty spaces in your proposal: 4, 5 in the abstract, 5, 6 in the details section, and two weeks in the timeline.
* updation -> update
* Mr. Blondel’s first name is spelled Mathieu :)
* I would try to rephrase point #8 in the detailed section. Reading the proposal I had no idea what that point is saying.
* There’s something left over about Nesterov momentum in the timeline.
* Are you seriously planning to work 8x7? I thought full time means 8x5.
* In “About me” you spell Python inconsistently (should be uppercased), "no where" -> nowhere, “I, nevertheless” -> “I nevertheless”, september -> September.

Hope all my comments can help strengthen your proposal!

Yours,
Vlad
I agree with everything Andy says. I think the core developers are very enthusiastic to have a project along the lines of "Finish all the things that need finishing", but it's very impractical to do so much context switching both for students and mentors/reviewers.
One of the advantages of GSoC is that it creates specialisation: on the one hand, a user becomes expert in what they tackle; on the other, reviewers and mentors can limit their attention to the topic at hand. So please, try to focus a little more.
Hi Raghav.
I feel that your proposal lacks some focus.
Mallow's Cp for LASSO / LARS
Implement built in abs max scaler, Nesterov's momentum and finish up the Multilayer Perceptron module.
And as discussed in this thread probably also
Forge a self sufficient ML tutorial based on scikit-learn.
If you feel like your proposal has not enough material (not sure about that),
two things that could be added and are more related to the cross-validation and grid-search part
(but probably difficult from an API standpoint) are making CV objects (aka path algorithms, or generalized cross-validation)
work together with GridSearchCV.
The other would be how to allow early stopping using a validation set.
The two are probably related (imho).
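The early-stopping idea can be illustrated with a small stdlib-only sketch (the function name and score sequence are invented): monitor a validation score after each fit step, e.g. each `partial_fit` call or each point on a regularization path, and stop once it fails to improve for a few rounds:

```python
# Hypothetical sketch of early stopping on a held-out validation set.
def fit_with_early_stopping(step_scores, patience=2):
    """step_scores: validation score observed after each fit step.
    Returns (best_step, best_score); stops once the score has not
    improved for `patience` consecutive steps."""
    best_score, best_step, waited = float("-inf"), -1, 0
    for step, score in enumerate(step_scores):
        if score > best_score:
            best_score, best_step, waited = score, step, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation score has stalled/degraded
    return best_step, best_score

# Validation score improves, plateaus, then degrades:
scores = [0.70, 0.78, 0.83, 0.82, 0.81, 0.79]
print(fit_with_early_stopping(scores))  # (2, 0.83)
```

The same monitoring loop is what ties the CV-object and early-stopping discussions together: both need a hook that evaluates the partially fitted model on held-out data at each step.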
Olivier also mentioned cross-validation for out-of-core (partial_fit) algorithms.
I feel that is not as important, but might also tie into your proposal.
Finishing the refactoring of model_evaluation in three days seems a bit optimistic, if you include reviews.
For sample_weight support, I'm not sure if there are obvious ways to extend sample_weight to all the algorithms that you mentioned.
How does it work for spectral clustering and agglomerative clustering for example?
In general, I feel you should rather focus on less things, and more on the details of what to do there.
Otherwise the proposal looks good.
For the wiki, having links to the issues might be helpful.
Thanks for the application :)
Andy
* The subject should have been "Multiple Metric Support in grid_search and cross_validation modules and other general improvements" and not multiple metric learning! Sorry for that!
* The link was not available due to the trailing "." (dot), which has been fixed now!
Thanks
R
1. the link is broken
Ah! Sorry :) - https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements.
2. that sounds quite difficult and unfortunately conducive to cheating
Hmm... Should I then simply opt for adding more examples instead?
Hi,
1. This is my proposal for the multiple metric learning project as a wiki page - https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements.
Possible mentors : Andreas Mueller (amueller) and Joel Nothman (jnothman)
Any feedback/suggestions/additions/deletions would be awesome. :)
2. Given that there is a huge interest among students in learning about ML, do you think it would be within the scope of/beneficial to skl to have all the exercises and/or concepts, from a good quality book (ESL / PRML / Murphy) or an academic course like NG's CS229 (not the less rigorous coursera version), implemented using sklearn? Or perhaps we could instead enhance our tutorials and examples, to be a self study guide to learn about ML?
I have included this in my GSoC proposal but was not quite sure if this would be a useful idea!!
Or would it be better if I simply add more examples?
Please let me know your views!!
Thanks
R
------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Gael Varoquaux
2015-03-25 06:32:00 UTC
Permalink
Post by Vlad Niculae
1. The design of multiple metric support is important and would bring an immense usability gain.
But it will also require a framework of its own. I would say that this is
to be considered in a second step.

G
Raghav R V
2015-03-25 06:59:24 UTC
Permalink
Hi Vlad!!

Thanks a tonne for the detailed review of my proposal. :)
Post by Vlad Niculae
Your proposal contains implementation details, but little or no
discussion of why each change is important and how it impacts users

Yes, I'll add a section discussing the motivation for the various
deliverables (which actually needs to be strengthened a bit).
Post by Vlad Niculae
weekly blog posts [...] they’re a great opportunity to reach out to the
community. [...] plug seamlessly with Pandas. We won’t be able to show this
off in documentation and examples, but it can make for a shiny blog post.

Sure! This is an interesting perspective. Quite frankly, I used to blog with
a dev audience in mind... From now on, I'll make sure my weekly blog posts
showcase an important feature that was recently contributed, and I'll write
them with our user base as the audience!

By "one iteration per metric (as is done currently)" I meant that we
currently refit the grid search for every single metric that we want the
model to be optimized with respect to... (I think I worded this wrong!
Thanks for pointing it out!)
Post by Vlad Niculae
How does multiple metric support interfere with model selection APIs?
Refactoring the search / cv objects into model_selection involves splitting
the files / moving parts of the code to another file without a clean git
move (git is blind to such moves), and hence any merged changes that touch
the grid search code et al. need to be manually rebased, which may be
error prone... This was the reason I intend to work on it during the
month of April itself and would love to see it merged ASAP...
Post by Vlad Niculae
Suddenly there is no more “best_{score|params|estimator}_”. There is an
API discussion to be had there, and your review of possible options would
be a great addition to the proposal.

Yes sure I'll add a few lines discussing the same..

This also reminds me of the following related issues (#2733, #1034/#1020,
#2079/#1842) which all have great ideas, which could be used too...
Post by Vlad Niculae
There is another API discussion about `sample_weight` [...] Wrapping up
3+4 I would make sure to reserve time in the timeline for API discussion
and convergence [...]

Hmm, thanks! I really need to decide on the amount of time that can
justifiably be allocated for discussions... It is probably better to
pipeline my deliverables across multiple goals, so that one can be worked
on while another is being reviewed.
Post by Vlad Niculae
There’s something left over about Nesterov momentum in the timeline.
I should have removed that! Sorry... (It was a leftover from a previous
version of my proposal.)
Post by Vlad Niculae
Mr. Blondel’s first name is spelled Mathieu
Ah sorry Mathieu ;)
Post by Vlad Niculae
Are you seriously planning to work 8x7? I thought full time means 8x5.
Yes I am okay with this :) At least I think so :p


Thanks a lot for all your comments! I will address them along with the
other comments and will reflect the changes back to the wiki too :)


Have a great day!! :)


R
Post by Vlad Niculae
Hi Raghav, hi everyone,
If I may, I have a very high-level comment on your proposal. It clearly
shows that you are very involved in the project and understand the
internals well. However, I feel like it’s written from a way too technical
perspective. Your proposal contains implementation details, but little or
no discussion of why each change is important and how it impacts users.
Taking a step back and writing such discussion can help gain perspective,
which is important for planning.
This is equally important in terms of your weekly blog posts: they should
provide an interesting read for more than just scikit-learn developers. So
try not to think of your GSoC blog posts as a chore/requirement—they’re a
great opportunity to reach out to the community. Your blog posts will be a
great opportunity to show off scikit-learn’s ease of use and clean API for
tasks that can normally get tedious to write manually. It won’t be as easy
to write about them as it would be if you worked on some shiny new model,
but if you do it right, this makes it even better: everybody needs
cross-validation and model selection!
1. The design of multiple metric support is important and would bring an
immense usability gain. At the moment, most non-trivial model selection
cases require custom code.
Andreas Mueller
2015-03-25 19:29:06 UTC
Permalink
Post by Vlad Niculae
Hi Raghav, hi everyone,
If I may, I have a very high-level comment on your proposal. It clearly shows that you are very involved in the project and understand the internals well. However, I feel like it’s written from a way too technical perspective. Your proposal contains implementation details, but little or no discussion of why each change is important and how it impacts users. Taking a step back and writing such discussion can help gain perspective, which is important for planning.
Great comment! (as are your following points).
Post by Vlad Niculae
3. How does multiple metric support interfere with model selection APIs? Suddenly there is no more “best_{score|params|estimator}_”. There is an API discussion to be had there, and your review of possible options would be a great addition to the proposal. For example, will model selection objects gain a “criterion” function, that maybe defaults to getting the first specified metric? If so, could this API be used to make global decisions, e.g. "the model which is within 1 standard error of the best score, but has the largest C?” Or should it essentially just return a number per parameter configuration, that we then sort by?
Actually I would not fiddle with this. Why not always the first one? The
rest is just additional information.
Post by Vlad Niculae
4. There is another API discussion about `sample_weight`: is that the only parameter that we want to route to scoring? I have some applications where I want some notion of `sample_group`. (This would allow to use scikit-learn directly for e.g. query-grouped search results ranking.) I proposed the `sample_*` API convention but it has quite a few downsides; if I remember correctly Joel proposed a param_routing API where you would pass a routing dict {‘sample_group’: ‘fit’, ‘score’}: such an API would be much more extensible.
Yep, we need to have this discussion at some point.


Andy
Raghav R V
2015-03-25 22:27:13 UTC
Permalink
Hi all,

thanks a lot for the comments!

I've just edited/formatted my proposal based on all of your comments...

https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2015-Proposal:-Multiple-metric-support-for-CV-and-grid_search-and-other-general-improvements

Only thing to be done is to plan what I should do for the month of July...
( For August I intend to finish any leftovers and clean up the tutorials /
documentation / docstrings )

I have the following options for July -
* discussing and attempting an implementation of generalized CV and early
stopping as suggested by @amueller
* evaluating and attempting to implement, or at least document, how
out-of-core grid search / CV can be done as suggested by @ogrisel
* a new CV generator that is a blend of `ShuffleSplit` and `LeavePLabel` as
suggested by @ogrisel (I have a feeling this is trivial and can be
completed in one or two weeks max)
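That blended generator could be roughly sketched as follows (a stdlib-only sketch; the function name and parameters are hypothetical): shuffle the distinct labels and hold out a random subset of whole labels per iteration, so that no label is ever split across train and test:

```python
import random

# Hypothetical sketch of a CV generator blending ShuffleSplit and
# LeavePLabelOut: each iteration holds out a random subset of whole
# labels as the test set.
def label_shuffle_split(labels, n_iter=3, test_fraction=0.25, seed=0):
    rng = random.Random(seed)
    unique = sorted(set(labels))
    n_test = max(1, int(len(unique) * test_fraction))
    for _ in range(n_iter):
        shuffled = unique[:]
        rng.shuffle(shuffled)
        test_labels = set(shuffled[:n_test])
        train = [i for i, lab in enumerate(labels) if lab not in test_labels]
        test = [i for i, lab in enumerate(labels) if lab in test_labels]
        yield train, test

labels = [0, 0, 1, 1, 2, 2, 3, 3]
for train, test in label_shuffle_split(labels):
    # no label ever appears on both sides of a split
    assert not {labels[i] for i in train} & {labels[i] for i in test}
```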

Kindly let me know how you feel about this revised proposal and also let me
know which one I could do for the month of July.
Raghav R V
2015-03-26 12:40:08 UTC
Permalink
Hey Gael,

I am sorry that I missed this comment of yours -
Post by Gael Varoquaux
Post by Vlad Niculae
1. The design of multiple metric support is important and would bring
an immense usability gain.
Post by Gael Varoquaux
But it will also require a framework of its own. I would say that this is
to be considered in a second step.

Could you expand a little on this? Do you mean that I should probably
allocate time for considering the framework and API involved?

Thanks,

Raghav RV (ragv)
Andreas Mueller
2015-03-23 14:18:08 UTC
Permalink
Post by Raghav R V
2. Given that there is a huge interest among students in learning
about ML, do you think it would be within the scope of/beneficial to
skl to have all the exercises and/or concepts, from a good quality
book (ESL / PRML / Murphy) or an academic course like NG's CS229 (not
the less rigorous coursera version), implemented using sklearn? Or
perhaps we could instead enhance our tutorials and examples, to be a
self study guide to learn about ML?
I have included this in my GSoC proposal but was not quite sure if
this would be a useful idea!!
We cover most of the algorithms in ESL. We don't want to cover all of
PRML / Murphy, as we don't want to include general graphical models, and
some of the Bayesian models we have need polish before we include more.
For exercises: As far as I can see / remember, all exercises are
mathematical, and mostly proofs. I don't see how scikit-learn would help
with that.

For practical purposes, I currently know of 2 (3?) sklearn books
published with PACKT. There is also an O'Reilly book coming up:
http://shop.oreilly.com/product/0636920030515.do
Matthieu Brucher
2015-03-23 14:23:25 UTC
Permalink
Post by Andreas Mueller
For practical purposes, I currently know of 2 (3?) sklearn books
http://shop.oreilly.com/product/0636920030515.do
2 general books, 1 cookbook and I think there is another one
half-written as well. Didn't know about O'Reilly, good to know!

Cheers,

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Music band: http://liliejay.com/
Raghav R V
2015-03-23 16:40:03 UTC
Permalink
Thanks for all the good comments!! I'll replace that section of my proposal
with some other more important work! :)

On Mon, Mar 23, 2015 at 7:53 PM, Matthieu Brucher <