Discussion:
GSoC 2012 pre-application
Lee Zamparo
2012-03-30 04:30:46 UTC
Permalink
Hello everyone,

I'm a prospective applicant to GSoC 2012, and am drafting a proposal.
I would really appreciate if you could spare some time to give me
feedback. My proposal is centred around sklearn.cluster, so I would
like to ask Andreas Muller, Olivier Grisel or Lars Buitinck if they
would consider being potential mentors.

Here is the link to the Google doc containing my application:
https://docs.google.com/document/d/180TbWNahVmlLiVEUNYU9nSPPeUwJ3DY4N5b3aRXhaQo/edit

Once again, I am very grateful for any advice or feedback you can provide.

Thanks,

Lee.
Gael Varoquaux
2012-03-30 05:19:23 UTC
Permalink
Hi Lee,

Welcome! Thanks for preparing a proposal. My impression, looking at it,
is that it seems a bit light for 2.5 months of work. It is pretty much
centered around implementing one algorithm, weighted k-means.

Cheers,

Gael
------------------------------------------------------------------------------
Try Windows Azure free for 90 days Click Here
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info
Robert Layton
2012-03-30 05:24:25 UTC
Permalink
I agree with Gael, but also like the general idea.

One method for increasing the scope would be to add other spectral
clustering algorithms to the project. Then create a testing example,
comparing them in terms of space/time/efficacy for different datasets.

Thoughts?

Robert
--
Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)
Olivier Grisel
2012-03-30 13:21:12 UTC
Permalink
Post by Gael Varoquaux
Hi Lee,
Welcome! Thanks for preparing a proposal. My impression looking at it, is
that it seems a bit light for 2.5 months of work. It is pretty much
centered around implementing one algorithm, weighted k-means.
One way to complement this proposal would be to take over the
development of Power Iteration Clustering (PIC). I am pretty sure PIC
can be a scalable alternative to Spectral Clustering.

https://github.com/scikit-learn/scikit-learn/pull/138

Here is the paper:

http://www.cs.cmu.edu/~wcohen/postscript/icml2010-pic-final.pdf
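For readers unfamiliar with PIC, here is a minimal NumPy sketch of the idea as it is usually described: run a truncated power iteration on the row-normalised affinity matrix and cluster the resulting one-dimensional embedding. It is not the code from the pull request above. The function names, the tolerance and the toy data are assumptions made for illustration, and a simple largest-gap split stands in for the final k-means step to keep the example short.

```python
import numpy as np


def power_iteration_embedding(affinity, max_iter=100, tol=1e-5):
    """Return a 1-D embedding from truncated power iteration on D^-1 A."""
    A = np.asarray(affinity, dtype=float)
    degree = A.sum(axis=1)
    W = A / degree[:, None]              # row-normalised affinity matrix
    v = degree / degree.sum()            # degree-based starting vector
    prev_delta = np.full(len(v), np.inf)
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()     # keep the L1 norm equal to 1
        delta = np.abs(v_new - v)
        v = v_new
        # Stop early, once the per-step change has itself stopped changing.
        if np.max(np.abs(delta - prev_delta)) < tol / len(v):
            break
        prev_delta = delta
    return v


def split_by_largest_gaps(v, n_clusters):
    """Cut the sorted embedding at its (n_clusters - 1) largest gaps."""
    order = np.argsort(v)
    gaps = np.diff(v[order])
    cuts = np.argsort(gaps)[-(n_clusters - 1):] if n_clusters > 1 else []
    labels_sorted = np.zeros(len(v), dtype=int)
    for cut in cuts:
        labels_sorted[cut + 1:] += 1     # everything after a cut moves up one label
    labels = np.empty(len(v), dtype=int)
    labels[order] = labels_sorted
    return labels


# Toy data (an assumption): two blobs of different sizes, dense RBF affinity.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(10.0, 0.5, size=(40, 2))])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
A = np.exp(-sq_dists / 2.0)
labels_pic = split_by_largest_gaps(power_iteration_embedding(A), n_clusters=2)
```

The early stop matters: run to full convergence, the vector becomes uniform and the cluster structure is lost. The sketch uses a dense affinity matrix; the scalability argument relies on a sparse one, where each iteration is a single sparse matrix-vector product.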

Another interesting task related to clustering is to try to implement
model selection using the stability of the clustering algorithm across
partially overlapping training sets as a meta performance metric.

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5211310

http://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf
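A rough sketch of what stability-based selection could look like, under several assumptions: a plain Lloyd's k-means and an unadjusted Rand index stand in for whichever clustering algorithm and agreement measure would actually be used (the second paper above argues for adjusted measures), and the function names, subsample fraction and toy data are all made up for illustration.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):      # leave empty clusters where they are
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)
    return float(np.mean(same_a[upper] == same_b[upper]))


def stability(X, k, n_pairs=10, frac=0.8, seed=0):
    """Mean agreement between clusterings of overlapping subsamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    size = int(frac * n)
    scores = []
    for _ in range(n_pairs):
        idx_a = rng.choice(n, size=size, replace=False)
        idx_b = rng.choice(n, size=size, replace=False)
        labels_a = np.full(n, -1)
        labels_b = np.full(n, -1)
        labels_a[idx_a] = kmeans_labels(X[idx_a], k, rng)
        labels_b[idx_b] = kmeans_labels(X[idx_b], k, rng)
        shared = np.intersect1d(idx_a, idx_b)   # points seen by both runs
        scores.append(rand_index(labels_a[shared], labels_b[shared]))
    return float(np.mean(scores))


# Toy data (an assumption): two well-separated blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(60, 2)),
               rng.normal(10.0, 0.5, size=(60, 2))])
scores = {k: stability(X, k) for k in (2, 3, 4, 5)}
best_k = max(scores, key=scores.get)   # most stable number of clusters
```

With scikit-learn available, `sklearn.cluster.KMeans` and `sklearn.metrics.adjusted_mutual_info_score` could replace the two helper functions. Whether the most stable setting is also the most useful one on realistic data is exactly the open question raised here.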

I am aware that this is still a bit experimental (still an open
research area) but I would really like to invest some time to check on
some realistic datasets whether this unsupervised model selection
strategy works in practice. If it proves useful, then its practicality
could motivate the inclusion of such tooling in the scikit-learn
project despite not being established yet (disclaimer: this is my own
opinion and is open to debate).

Anyway, to strengthen the GSoC proposal it would be necessary to do
some actual code contributions before the GSoC proposal submission
deadline.

That can involve fixing bugs in master, contributing small
improvements in new pull requests, or even starting some work on the
Power Iteration Clustering branch, such as rebasing it on top of the
current master and starting to write some tests.

Lee, what is your github account? Do you have prior experience with
Numpy / Scipy / Cython development?

Also about kernel k-means: I don't know this algorithm myself. Do you
have practical evidence that this approach really works in a
scalable way? E.g. an implementation in another language that works
and beats spectral clustering on realistic datasets?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2012-04-02 14:51:56 UTC
Permalink
Post by Olivier Grisel
Lee, what is your github account? Do you have prior experience with
Numpy / Scipy / Cython development?
Also about kernel k-means: I don't know this algorithm myself. Do you
have practical evidence that this approach is really working a
scalable way? e.g. an implementation in another language that works
and beat spectral clustering on realistic datasets?
Lee, could you answer those questions and the comments by Mathieu,
Bertrand and Gael?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2012-03-30 06:03:20 UTC
Permalink
Could you elaborate on your strategies for speeding up kernel k-means? As far
as I know, kernel k-means is very expensive.

Mathieu
Bertrand Thirion
2012-03-30 10:38:03 UTC
Permalink
Regarding clustering algorithms, I would suggest having a look at convex
formulations, which can be advantageous for convergence/stability
compared with standard algorithms that never have any guarantee. Among
others:

- http://www.icml-2011.org/papers/419_icmlpaper.pdf
- http://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCkQFjAA&url=http%3A%2F%2Fpeople.csail.mit.edu%2Fpolina%2Fpapers%2FLashkariGolland_NIPS07.pdf&ei=4Ip1T-eUHYK90QWV6NjHDQ&usg=AFQjCNFCTuLQ2q1j9LBz3TPlV5Bdf6TZXQ&sig2=bpzB9HSYc3OI1ICnWY92Og

I must say, however, that I haven't looked at those in detail, and I'm
not sure which one should be preferred. The pros and cons of each algo
should be discussed as a preliminary step. I'm not sure whether anybody
has enough hindsight on these techniques.

My 2c,

Bertrand
Andreas
2012-03-30 11:03:26 UTC
Permalink
Hi Lee.
I'd have to have a look at the papers again to judge this better.
Maybe I'll have time on the weekend.

What I would also like to see in the clustering module would be a
more scalable mean shift and maybe also quickshift.

Cheers,
Andy
Gael Varoquaux
2012-03-30 11:24:45 UTC
Permalink
Our affinity propagation really doesn't scale. I quickly tried to make it work on sparse matrices, but it turned out to be some work. That said, it's a crappy algorithm :$.

Gael

Lee Zamparo
2012-04-02 16:06:06 UTC
Permalink
Hi everyone,

Thanks for all your comments on my proposal. I apologize for not
responding earlier, and I'll try to address each of your concerns or
comments in this mail.

@Olivier: my GitHub account is lzamparo. I don't have any prior
Cython development experience, but I do have some exposure to Numpy
and Scipy through some contributions to the CellProfiler project. The
kernel k-means algorithm works by replacing Euclidean distance from
points to their cluster centres in the input space with Euclidean
distance in the kernel feature space (section 2.1 of [1] in my
proposal). The authors show that it is equivalent to the
normalized-cut formulation of spectral clustering. While I have not
implemented it myself, the paper shows that it performs well (and
quickly) on the Pendigits data set (from the UCI machine learning
repository), as well as the Rosetta Inpharmatics yeast gene expression
data set.
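To make the feature-space distance concrete, here is a minimal sketch of unweighted kernel k-means. It relies on the standard identity ||phi(x_i) - m_c||^2 = K_ii - (2/|c|) sum_{j in c} K_ij + (1/|c|^2) sum_{j,l in c} K_jl, so only kernel values are ever needed. It is not the weighted variant from the paper cited in the proposal; the function names, the seed-point initialisation, the kernel bandwidth and the toy data are assumptions for illustration.

```python
import numpy as np


def feature_space_sq_dists(K, labels, n_clusters):
    """Squared distance from every point to every cluster mean in feature space."""
    n = K.shape[0]
    diag = np.diag(K)
    dists = np.full((n, n_clusters), np.inf)   # empty clusters stay at infinity
    for c in range(n_clusters):
        members = labels == c
        size = members.sum()
        if size == 0:
            continue
        dists[:, c] = (diag
                       - 2.0 * K[:, members].sum(axis=1) / size
                       + K[np.ix_(members, members)].sum() / size ** 2)
    return dists


def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Lloyd-style iterations using only the kernel matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    # Initialise by assigning each point to the nearest of a few random seed points.
    seeds = rng.choice(n, size=n_clusters, replace=False)
    diag = np.diag(K)
    labels = (diag[:, None] - 2.0 * K[:, seeds] + diag[seeds][None, :]).argmin(axis=1)
    for _ in range(n_iter):
        new_labels = feature_space_sq_dists(K, labels, n_clusters).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


# Toy data (an assumption): two blobs and a fairly wide RBF kernel.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
               rng.normal(10.0, 0.5, size=(30, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_rbf = np.exp(-sq / (2.0 * 5.0 ** 2))
labels_kkm = kernel_kmeans(K_rbf, n_clusters=2)
```

With a linear kernel (`K = X @ X.T`) the distances reduce to ordinary squared Euclidean distances to the centroids, which is a convenient sanity check. This sketch holds the full N x N kernel matrix in memory, so it says nothing about the memory savings claimed for the method.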

The reason I think it will beat other formulations of spectral
clustering is that the affinity matrix need not be stored in memory,
which can be a problem for very large data sets. Also, the kernel
matrix need not be sparsified a priori, which is sometimes the case
for spectral clustering. I think it would be a nice addition to
sklearn.

@Gael: I agree, my proposal is a bit light for 2.5 months of
work. I had prepared a list of 'nice to haves' for my original
proposal, but did not include it for the sake of brevity. The idea was
to implement a large margin multi-class metric learning algorithm
(K.Q Weinberger, L.K. Saul. Distance Metric Learning for Large Margin
Nearest Neighbour Classification. JMLR 10 (2009) 207-244), which is
intended to learn a metric for multi-way nearest neighbour
classification, but which I thought could also be a nice
pre-processing step for clustering. The gist is that it learns a
Mahalanobis distance that optimizes multi-class hinge loss. The
metric is applied to the training set as a linear transformation,
which could then be followed by K-means in the transformed space.
However, in light of the suggestions by Bertrand and Olivier, I'm more
inclined to include an implementation of power iteration clustering
(see Olivier's reply) or exemplar based clustering (see Bertrand's
reply).
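As a small worked check of the "metric as a linear transformation" remark: a Mahalanobis metric M = L^T L gives the same squared distances as plain Euclidean distance after mapping every point x to L x, which is why ordinary k-means can be run in the transformed space. The matrix L below is random and only stands in for a learned one; no metric learning is performed in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
L = rng.normal(size=(3, 3))      # hypothetical stand-in for a learned linear map
M = L.T @ L                      # induced Mahalanobis matrix (positive semi-definite)


def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = x - y
    return float(diff @ M @ diff)


X_transformed = X @ L.T          # apply x -> L x to every row of X
d_metric = mahalanobis_sq(X[0], X[1], M)
d_euclid = float(((X_transformed[0] - X_transformed[1]) ** 2).sum())
# d_metric and d_euclid agree up to floating-point rounding.
```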

@Mathieu: My proposal for speeding up kernel k-means is two-fold. The
first would be caching of values for the kernel function, while the
second is a triangle-inequality based scheme to cut down on the number
of distance evaluations required. The authors update a K x K matrix and
a K x N matrix that are used to estimate a lower bound on the distance
from points to all new potential cluster centres, and only compute the
distances when any lower bound is smaller than the distance from a
point to its current centre. The experiments in the paper show it
saves a lot of distance calculation time, which dominates the running
time for K-means.
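A simplified sketch of the triangle-inequality idea for ordinary (non-kernel) k-means, in the input space. It uses only the K x K centre-to-centre distances and the rule "if d(c_a, c_b) >= 2 d(x, c_a) then d(x, c_b) >= d(x, c_a)", so centre c_b can be skipped without computing its distance. The K x N lower-bound bookkeeping described above is left out, and the function name and toy data are assumptions.

```python
import numpy as np


def assign_with_pruning(X, centers, labels):
    """One assignment step that skips centres ruled out by the triangle inequality.

    Returns the new labels and the number of point-to-centre distances computed.
    """
    k = len(centers)
    center_dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    new_labels = labels.copy()
    n_computed = 0
    for i, x in enumerate(X):
        start = labels[i]
        best, best_dist = start, np.linalg.norm(x - centers[start])
        n_computed += 1
        for c in range(k):
            if c == start:
                continue
            # d(best, c) >= 2 d(x, best) implies d(x, c) >= d(x, best): skip c.
            if center_dists[best, c] >= 2.0 * best_dist:
                continue
            d = np.linalg.norm(x - centers[c])
            n_computed += 1
            if d < best_dist:
                best, best_dist = c, d
        new_labels[i] = best
    return new_labels, n_computed


# Toy data (an assumption): three blobs, centres slightly off their true positions.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels_true = np.repeat(np.arange(3), 50)
X = true_centers[labels_true] + rng.normal(0.0, 0.5, size=(150, 2))
centers = true_centers + 0.1
new_labels, n_computed = assign_with_pruning(X, centers, labels_true)
```

The result is the same assignment a brute-force nearest-centre search would give; the saving shows up in `n_computed`, which is far below `len(X) * len(centers)` once most points already sit close to their centre.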

Regarding the suggested additions, I'm interested in Olivier's
suggestion of Power Iteration Clustering, and seeing how it fares
against kernel K-means as well as the convex exemplar based clustering
paper suggested by Bertrand. I'll revise my proposal accordingly.

Thanks and apologies for the long reply, I made the mistake of getting
the list in digest mode.

Lee.
Olivier Grisel
2012-04-02 16:14:47 UTC
Permalink
Post by Lee Zamparo
Regarding the suggested additions, I'm interested in Olivier's
suggestion of Power Iteration Clustering, and seeing how it fares
against kernel K-means as well as the convex exemplar based clustering
paper suggested by Bertrand.  I'll revise my proposal accordingly.
Thanks and apologies for the long reply, I made the mistake of getting
the list in digest mode.
It's alright. Thanks for the reply. I think you can switch to mail by
mail subscriptions from the mailman UI if you want to.

Also, for the application to be successful it's not enough to write a
proposal, you should start contributing (small stuff) before the end
of April as the contribution workflow is part of the evaluation of the
student application.

See the intro of this page:

https://github.com/scikit-learn/scikit-learn/wiki/A-list-of-topics-for-a-google-summer-of-code-%28gsoc%29-2012

and more details on contributing here:

http://scikit-learn.org/dev/developers/index.html#contributing-code

Also please add a short summary of your proposal on the wiki for the
sake of completeness:

https://github.com/scikit-learn/scikit-learn/wiki/A-list-of-topics-for-a-google-summer-of-code-%28gsoc%29-2012
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Vlad Niculae
2012-04-04 13:59:29 UTC
Permalink
Hello guys,

Unfortunately I have come down with the flu, and therefore missed a good amount of time to work on GSoC 2012 proposals. I know that there's not much time left for review, but here is my pre-proposal for an overall speedup and benchmarking project.

https://docs.google.com/document/d/1SfygxG1xTcVrVXH7J1lgI9NEDCB--piKoFCI8SzTLAg/edit

I will fix as much of it as I can over the next two days. Please take a look. If you give me a pre-thumbs up, I'll put it up on the wiki as well ASAP.

Yours,
Vlad
Alexandre Gramfort
2012-04-04 18:19:27 UTC
Permalink
hello vlad,

hope you're doing better.

My gut feeling reading the proposal is that you clearly know what you're talking
about, as you know the code base well, but I think you should be more specific
about where the low-hanging fruit is and which modules deserve some love
in terms of speed.

Alex
Olivier Grisel
2012-04-04 18:32:19 UTC
Permalink
On 4 April 2012 at 20:19, Alexandre Gramfort
Post by Alexandre Gramfort
hello vlad,
hope you're doing better.
My gut feeling reading the proposal is that you clearly know what you're talking
about as you know well the code base but I think you should be more specific
about where the low hanging fruits are and which modules deserve some love
in terms of speed.
Maybe you could state explicitly that the work will include a
scalability profile of all the available models:

Pick a selection of ~5 different datasets with very different
n_samples, n_features and sparsity profiles and compile a list of all
the estimators that are able to converge to a usable model in less
than 1s, 10s, 100s or 1000s, and with less than 1GB of memory, for
instance.
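One possible shape for such a scalability table, sketched with the standard library only. `profile_fit`, `time_tier` and `toy_fit` are hypothetical names; `toy_fit` merely stands in for a real `estimator.fit(X, y)` call, and `tracemalloc` only sees allocations made through Python's allocator hooks and slows the run down, so the numbers are indicative at best.

```python
import time
import tracemalloc

TIME_TIERS = (1, 10, 100, 1000)   # seconds, the thresholds suggested above
MEMORY_LIMIT = 1 << 30            # 1 GB


def profile_fit(fit, *args):
    """Run fit(*args) once; return (elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    fit(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak


def time_tier(elapsed):
    """Smallest tier the run fits under, or None if it exceeds all of them."""
    for tier in TIME_TIERS:
        if elapsed < tier:
            return tier
    return None


def toy_fit(n_samples):
    # Hypothetical stand-in for `estimator.fit(X, y)` on a dataset of this size.
    return sorted(range(n_samples), key=lambda i: -i)


report = {}
for n_samples in (1_000, 100_000):
    elapsed, peak = profile_fit(toy_fit, n_samples)
    report[("toy_fit", n_samples)] = (time_tier(elapsed), peak < MEMORY_LIMIT)
```

Each entry of `report` maps an (estimator, dataset) pair to its time tier and whether it stayed under the memory limit, which is the kind of row the table would need.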

This kind of high level information would be a really nice complement
to the table in [1] for instance.

[1] http://scikit-learn.org/dev/modules/clustering.html

While doing so, you could use the cProfile / line_profiler modules
to help identify low hanging fruits.
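A minimal example of the cProfile route, using only the standard library. `workload` and `slow_inner` are made-up stand-ins for a real estimator call, and `profile_top` is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def slow_inner(n):
    # Deliberately wasteful inner loop so that it shows up in the profile.
    return sum(i * i for i in range(n))


def workload():
    # Hypothetical stand-in for something like `estimator.fit(X, y)`.
    return [slow_inner(20_000) for _ in range(20)]


def profile_top(func, n_lines=10):
    """Profile func() and return the top entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(n_lines)
    return stream.getvalue()


report_text = profile_top(workload)
print(report_text)
```

The functions near the top of the cumulative-time listing are the first candidates to inspect; line_profiler can then give a per-line breakdown of those functions.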
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Vlad Niculae
2012-04-05 12:25:59 UTC
Permalink
Hi everyone

I have updated my proposal thanks to your excellent suggestions.

I also pointed out the style of optimization that will be applied by linking to my blog post on optimizing orthogonal matching pursuit code. Unfortunately this will also flash the bug I introduced before everyone's eyes. I hope it doesn't look so bad… does it? :)

I have yet to point out obvious low-hanging fruits. What do you suggest?

Do you think the proposal makes it clear enough that it's not just about making stuff run faster, but also setting up a benchmarking system and making sure things stay fast and that new code will be easily benchable?

I plan to submit tonight.

Regards,
Vlad
Gael Varoquaux
2012-04-05 12:29:53 UTC
Permalink
Post by Vlad Niculae
I plan to submit tonight.
Do: you should be able to edit till the deadline.

To all the students: I must apologize, Alex (Gramfort) and I have a
conference deadline tonight, and we must worry about our graduate
students who are submitting to this conference. It has been taking a lot
of time, and we haven't been giving as much feedback as we should on the
proposals. This is not from lack of interest!

G
Olivier Grisel
2012-04-05 13:00:19 UTC
Permalink
Post by Vlad Niculae
Hi everyone
I have updated my proposal thanks to your excellent suggestions.
I also pointed out the style of optimization that will be applied by linking to my blog post on optimizing orthogonal matching pursuit code. Unfortunately this will also flash the bug I introduced before everyone's eyes. I hope it doesn't look so bad… does it? :)
I have yet to point out obvious low-hanging fruits. What do you suggest?
Do you think the proposal makes it clear enough that it's not just about making stuff run faster, but also setting up a benchmarking system and making sure things stay fast and that new code will be easily benchable?
I am OK with not listing the low hanging fruits in the GSoC
proposal and instead using the results of the first part (the
benchmarking step) of the GSoC to identify them in a principled manner.

The performance bottlenecks are rarely where you expect them to be.
Better to use a profiler than a semi-educated guess.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Vlad Niculae
2012-04-05 20:58:56 UTC
Permalink
Submitted: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/vladn/15002#

Just need to make some links blue, but the editor is tedious.

I'm afraid I won't have the time to write up a proposal for the matrix completion project. I still remain highly interested in it.

Best,
Vlad
Olivier Grisel
2012-04-05 23:32:36 UTC
Permalink
Post by Vlad Niculae
Submitted: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/vladn/15002#
Just need to make some links blue, but the editor is tedious.
I'm afraid I won't have the time to write up a proposal for the matrix completion project. I still remain highly interested in it.
Alright, next year then :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Lee Zamparo
2012-04-06 02:31:45 UTC
Permalink
Hi folks,

I haven't had time to work on and submit a patch to scikit-learn by
this evening, as I'm facing down a conference deadline. Thanks to
everyone who provided valuable feedback, and hopefully I'll be able to
submit next year.

Thanks,

Lee.

On Mon, Apr 2, 2012 at 12:14 PM, Olivier Grisel
Post by Olivier Grisel
Post by Lee Zamparo
Regarding the suggested additions, I'm interested in Olivier's
suggestion of Power Iteration Clustering, and seeing how it fares
against kernel K-means as well as the convex exemplar based clustering
paper suggested by Bertrand.  I'll revise my proposal accordingly.
Thanks and apologies for the long reply, I made the mistake of getting
the list in digest mode.
It's alright. Thanks for the reply. I think you can switch to mail by
mail subscriptions from the mailman UI if you want to.
Also, for the application to be successful it's not enough to write a
proposal, you should start contributing (small stuff) before the end
of April as the contribution workflow is part of the evaluation of the
student application.
 https://github.com/scikit-learn/scikit-learn/wiki/A-list-of-topics-for-a-google-summer-of-code-%28gsoc%29-2012
 http://scikit-learn.org/dev/developers/index.html#contributing-code
Also please add a short summary of your proposal on the wiki for the
 https://github.com/scikit-learn/scikit-learn/wiki/A-list-of-topics-for-a-google-summer-of-code-%28gsoc%29-2012
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2012-04-06 05:35:55 UTC
Permalink
Hi Lee,

As you can see on http://wiki.python.org/moin/SummerOfCode/2012, you have
another 10 days to meet these expectations. That said, if you do not have
time to pursue GSoC, you shouldn't apply. I stress that it is a
full-time job.

Good luck with your deadline,

Gael
Post by Lee Zamparo
Hi folks,
I haven't had time to work on and submit a patch to scikit-learn by
this evening, as I'm facing down a conference deadline. Thanks to
everyone who provided valuable feedback, and hopefully I'll be able to
submit next year.
Thanks,
Lee.
On Mon, Apr 2, 2012 at 12:14 PM, Olivier Grisel
Post by Olivier Grisel
Post by Lee Zamparo
Regarding the suggested additions, I'm interested in Olivier's
suggestion of Power Iteration Clustering, and seeing how it fares
against kernel K-means as well as the convex exemplar based clustering
paper suggested by Bertrand.  I'll revise my proposal accordingly.
Thanks and apologies for the long reply, I made the mistake of getting
the list in digest mode.
It's alright. Thanks for the reply. I think you can switch to mail by
mail subscriptions from the mailman UI if you want to.
Also, for the application to be successful it's not enough to write a
proposal, you should start contributing (small stuff) before the end
of April as the contribution workflow is part of the evaluation of the
student application.
 https://github.com/scikit-learn/scikit-learn/wiki/A-list-of-topics-for-a-google-summer-of-code-%28gsoc%29-2012
 http://scikit-learn.org/dev/developers/index.html#contributing-code
Also please add a short summary of your proposal on the wiki for the
 https://github.com/scikit-learn/scikit-learn/wiki/A-list-of-topics-for-a-google-summer-of-code-%28gsoc%29-2012
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info
Olivier Grisel
2012-04-06 06:57:14 UTC
Permalink
Post by Lee Zamparo
Hi folks,
I haven't had time to work on and submit a patch to scikit-learn by
this evening, as I'm facing down a conference deadline.  Thanks to
everyone who provided valuable feedback, and hopefully I'll be able to
submit next year.
You can submit the updated proposal on melange today and submit a
patch on scikit-learn before the student selection deadline (the
mentors will vote on the submitted proposal during the next couple of
days / weeks).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel