Discussion:
GSoC applications are open (based on Fwd: [Soc2013-general] Student Application Template (Applications start April 22!))
Vlad Niculae
2013-04-24 01:57:17 UTC
Dear students interested in applying for this year's GSoC with scikit-learn:

Applications opened a couple of days ago. Scikit-learn is a
suborganization of the PSF this year, as in previous years, so you
will apply with the PSF as the organization, specifying that you will
work on scikit-learn. The PSF were kind enough to provide instructions
and a template for the proposals, so you are invited to peruse them. I
am forwarding the specific e-mail here.

I would like to stress that there are two strict requirements: you must
have contributed some code to scikit-learn, and you must blog weekly
about your project.

But most importantly, please discuss your proposal on the mailing
list, the sooner the better. There have been some discussions so far,
perhaps less active than they should have been, but I hope ideas and
directions have crystallised a bit and we will soon see good
discussion and competitive proposals.

Yours,
Vlad

---------- Forwarded message ----------
From: Terri Oda <***@zone12.com>
Date: Mon, Apr 22, 2013 at 7:22 AM
Subject: [Soc2013-general] Student Application Template (Applications
start April 22!)
To: soc2013-***@python.org


As hopefully all of you are aware, student applications to GSoC will
be opening April 22 19:00 UTC (tomorrow to me) and closing May 3rd. I
highly recommend that you all submit applications early -- you can
modify them up until the final deadline. Google will not extend the
deadline for any reason, including technical problems with the melange
system (which have been known to happen at the last minute in the
past), so the sooner you can get an application in the better!

We have a template to help you prepare your application with the PSF:

http://wiki.python.org/moin/SummerOfCode/ApplicationTemplate2013

Your sub-organizations may have additional requirements; ask them if
there's any extra information they need from you.

Please note a few things we ask for that are not always required by other orgs:
* We do require students to blog about their projects, so you will
need to set up a GSoC blog for weekly status updates and any other
thoughts you wish to record about your project.

* We do require students to submit a link to some sort of code sample,
preferably a patch to the sub-org to which you are applying. Talk to
your mentors if you're uncertain what would be appropriate.

* Don't forget to put the name of your sub-organization (e.g.
OpenHatch, MNE-Python) into the title of your application.

If you're not sure about how to write a good proposal, ask your
prospective mentors: they're the ones who will be deciding if they
hire you or not, so they get the final word as to what a good proposal
looks like for them.

Terri
Vlad Niculae
2013-04-24 03:46:48 UTC
Sorry to reply to myself but I want to point something else out to all
possible GSoC students:

All the proposals we have had so far add new algorithms. In my
opinion this is always welcome, provided that a few conditions are
met: the algorithm should have proved to be generally useful, and
it should fit the scikit-learn API and spirit.

However, I think it would be nice to have some proposals that focus on
internals: consistency, clean up, refactoring of modules that need it
or documentation improvements. As long as the task is measurable,
closed-ended and well-defined, I think such a project could really
push scikit-learn towards version 1.0, whereas adding more and more
algorithms actually amounts to ant-steps backwards in terms of the
overall tightness of the package.

Just my 2 yen,
Vlad.
Mathieu Blondel
2013-04-24 08:27:54 UTC
Post by Vlad Niculae
However, I think it would be nice to have some proposals that focus on
internals: consistency, clean up, refactoring of modules that need it
or documentation improvements. As long as the task is measurable,
closed-ended and well-defined, I think such a project could really
push scikit-learn towards version 1.0, whereas adding more and more
algorithms are actually ant-steps backwards in terms of global
tightness of the package.
For this kind of project, I think it is better if the candidate is a core
developer or has excellent knowledge of the code base. Otherwise, it is
likely to fail (not to mention that it would be hard to write the proposal).

Mathieu
Mathieu Blondel
2013-04-24 21:45:26 UTC
Something I would like to see in the scikit, if someone is looking for an
idea, is biclustering:

http://en.wikipedia.org/wiki/Biclustering

Mathieu
Kemal Eren
2013-04-24 21:56:54 UTC
Hi Mathieu and team,

If you are looking for biclustering algorithms I could certainly do that. I
did my Master's thesis on it and wrote this software:
http://bmi.osu.edu/hpc/software/bibench/. Its biclustering algorithms are
wrappers to existing tools. It would be really nice to have Python/Cython
implementations in scikit-learn.

A couple days ago I already suggested a GSOC project on stacked
generalization. Either project would be interesting to me.

Best regards,
Kemal
Mathieu Blondel
2013-04-24 23:56:09 UTC
Hi Kemal,
Post by Kemal Eren
If you are looking for biclustering algorithms I could certainly do that.
http://bmi.osu.edu/hpc/software/bibench/. Its biclustering algorithms are
wrappers to existing tools. It would be really nice to have Python/Cython
implementations in scikit-learn.
The biclustering project would be my personal favorite. It's nice that you
have a code base to start from. I will try to see if it's not too late to
register as a mentor. Also, I see that you already have a few pull-requests
under review. This is nice, since this is a requirement of the PSF for
eligibility to the GSOC.

What algorithms do you have in mind? If you decide to go for biclustering,
you can send us a proposal draft on the mailing-list (in another thread).

Thanks,
Mathieu
Vlad Niculae
2013-04-25 00:08:18 UTC
The Baader-Meinhof phenomenon in action -- only 2 days ago I saw a
talk about information-theoretic biclustering (aka co-clustering)
applied to opinion mining of video game reviews and the method caught
my attention. An efficient implementation would be very nice, but it
will definitely require a new API.
Mathieu Blondel
2013-04-25 00:56:50 UTC
Could you elaborate why it would require a new API?

Mathieu
Vlad Niculae
2013-04-25 01:26:01 UTC
If we are talking about the same thing, you are returning clusters of
samples and features together (i.e. rows and columns). So where K-means
returns a 1D array of cluster labels, here the output would be two
arrays, one of shape (n_samples,) and one of shape (n_features,). An
alternative would be a list of length `n_clusters` where each element
is a pair of lists holding row and column indices, respectively. But I
believe the first option is consistent enough with our current API.
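
To make the two output formats above concrete, here is a minimal sketch
using NumPy. The label values and the helper name
`labels_to_index_pairs` are invented for illustration; nothing here is
an existing scikit-learn function.

```python
import numpy as np

# Hypothetical result of biclustering a 4 x 5 data matrix into 2 biclusters.
# First format: one label array per axis, as K-means does for rows only.
row_labels = np.array([0, 0, 1, 1])       # shape (n_samples,)
col_labels = np.array([0, 1, 1, 0, 1])    # shape (n_features,)


def labels_to_index_pairs(row_labels, col_labels, n_clusters):
    """Second format: a list of (row indices, column indices) pairs,
    one pair per bicluster, built from the two label arrays."""
    return [
        (np.flatnonzero(row_labels == k).tolist(),
         np.flatnonzero(col_labels == k).tolist())
        for k in range(n_clusters)
    ]


pairs = labels_to_index_pairs(row_labels, col_labels, n_clusters=2)
```

One difference between the formats: the two label arrays give every row
and every column exactly one cluster, whereas the list of index pairs
could also describe biclusters that overlap or that leave some rows and
columns unassigned.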
Mathieu Blondel
2013-04-25 01:54:59 UTC
Post by Vlad Niculae
If we are talking about the same thing, you are returning clusters of
samples and features together (ie rows and columns). So if in K-means
we return a 1D array with cluster labels, here the output would be two
arrays, one of (n_samples,) and one of (n_features,). Another
alternative would be a list of length `n_clusters` where each element
is a pair of lists of row, respectively column indices. But I believe
the first one can be uniform enough wrt our current API.
I think you are talking about the predict method. In the case of the fit
method, I think we only need fit(X). Then the fitted attributes could be
row_clusters_ where row_clusters_[i, k] = 1 means that the row i belongs to
cluster k and col_clusters_ where col_clusters_[j, k] = 1 means that
column j belongs to cluster k.

Mathieu
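
The fit(X) design described in this message could look roughly like the
sketch below. Only the attribute layout (`row_clusters_` and
`col_clusters_` as 0/1 indicator matrices) comes from the message. The
class name `ToyBicluster`, the helper `submatrix`, and the rule used
inside `fit` are assumptions made for illustration, and the rule is a
placeholder rather than a real biclustering algorithm.

```python
import numpy as np


class ToyBicluster:
    """Hypothetical estimator; only the fitted-attribute layout matters."""

    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Placeholder rule, NOT a real biclustering algorithm: split rows
        # and columns into n_clusters groups by the rank of their mean.
        row_labels = self._rank_groups(X.mean(axis=1))
        col_labels = self._rank_groups(X.mean(axis=0))
        # Indicator matrices: entry [i, k] is 1 if item i is in cluster k.
        self.row_clusters_ = self._indicator(row_labels)
        self.col_clusters_ = self._indicator(col_labels)
        return self

    def _rank_groups(self, values):
        # Rank each value, then map ranks 0..n-1 onto groups 0..n_clusters-1.
        ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
        return ranks * self.n_clusters // len(values)

    def _indicator(self, labels):
        out = np.zeros((len(labels), self.n_clusters), dtype=int)
        out[np.arange(len(labels)), labels] = 1
        return out


def submatrix(X, model, k):
    """Return the block of X covered by bicluster k."""
    rows = np.flatnonzero(model.row_clusters_[:, k])
    cols = np.flatnonzero(model.col_clusters_[:, k])
    return X[np.ix_(rows, cols)]


X = np.array([[1, 2, 8, 9],
              [2, 3, 9, 10],
              [6, 7, 1, 0]])
model = ToyBicluster(n_clusters=2).fit(X)
# When each row belongs to exactly one cluster, a label array can be
# recovered from the indicator matrix with argmax.
row_labels = model.row_clusters_.argmax(axis=1)
block = submatrix(X, model, 0)
```

Compared with one label array per axis, an indicator matrix can also mark
a row or column as belonging to several clusters at once, which may
matter for methods that produce overlapping biclusters.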
Vlad Niculae
2013-04-25 01:58:22 UTC
Exactly, I was talking about predict and about the state of the
estimator. It seemed much more difficult before I thought about it
better :)
Kemal Eren
2013-04-25 20:25:39 UTC
Okay, then I'll put together a biclustering proposal tomorrow after work.
It will be a difficult task to come up with a good set of core algorithms,
because the field is so varied. There are over a hundred published methods,
each of which formulates the biclustering problem differently. Any
particular algorithms you would like to see in scikit-learn?

Best,
Kemal