Thanks for all your comments on my proposal. I apologize for not
responding earlier, and I'll try to address each of your concerns or
comments in this mail.
@Olivier: my GitHub account is lzamparo. I don't have any prior
Cython development experience, but I do have some exposure to Numpy
and Scipy through contributions to the CellProfiler project. The
kernel k-means algorithm works by replacing the Euclidean distance from
points to their cluster centres in the input space with the Euclidean
distance in the kernel feature space (section 2.1 of the paper cited in my
proposal). The authors show that it is equivalent to the
normalized-cut formulation of spectral clustering. While I have not
implemented it myself, the paper shows that it performs well (and
quickly) on the Pendigits data set (from the UCI machine learning
repository), as well as on the Rosetta Inpharmatics yeast gene expression data set.
The reason I think it will beat other formulations of spectral
clustering is that the affinity matrix need not be stored in memory,
which can be a problem for very large data sets. Also, the kernel
matrix need not be sparsified a priori, which is sometimes required
for spectral clustering. I think it would be a nice addition to
scikit-learn's clustering module.
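To make the feature-space distance computation concrete, here is a rough
sketch (not taken from the paper; the function name, the choice of an RBF
kernel, and the precomputed kernel matrix are my own simplifications for
readability, whereas the proposed implementation would compute or cache
kernel entries on demand rather than store the full matrix):

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    def kernel_kmeans_distances(K, labels, n_clusters):
        # Squared feature-space distance from every point to every cluster mean:
        #   d^2(phi(x_i), m_c) = K_ii - (2/|c|) sum_{j in c} K_ij
        #                        + (1/|c|^2) sum_{j,l in c} K_jl
        n = K.shape[0]
        dist = np.empty((n, n_clusters))
        for c in range(n_clusters):
            members = np.where(labels == c)[0]
            size = len(members)
            if size == 0:
                dist[:, c] = np.inf          # empty cluster: never the closest
                continue
            within = K[np.ix_(members, members)].sum() / size ** 2
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / size
                          + within)
        return dist

    # toy usage: RBF kernel on random data, plain Lloyd-style relabelling
    X = np.random.randn(100, 5)
    K = rbf_kernel(X, gamma=0.5)
    labels = np.random.randint(0, 3, size=100)
    for _ in range(10):
        labels = kernel_kmeans_distances(K, labels, 3).argmin(axis=1)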
@Gael: I agree; my proposal is a bit light for only 2.5 months of
work. I had prepared an addendum of 'nice to haves' for my original
proposal, but did not include it for the sake of brevity. The idea was
to implement a large-margin multi-class metric learning algorithm
(K. Q. Weinberger and L. K. Saul, Distance Metric Learning for Large Margin
Nearest Neighbor Classification, JMLR 10 (2009) 207-244), which is
intended to learn a metric for multi-way nearest neighbour
classification, but which I thought could also be a nice
pre-processing step for clustering. The gist is that it learns a
Mahalanobis distance that optimizes multi-class hinge loss. The
metric is applied to the training set as a linear transformation,
which could then be followed by K-means in the transformed space.
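As a rough illustration of that pipeline (assuming an LMNN solver has
already produced a positive-definite Mahalanobis matrix M; the function
name and the Cholesky factorisation are my own choices here, not part of
the paper):

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_in_learned_metric(X, M, n_clusters):
        # d_M(x, y)^2 = (x - y)^T M (x - y) = ||L^T x - L^T y||^2 with M = L L^T,
        # so K-means under the learned metric is ordinary K-means on X L.
        L = np.linalg.cholesky(M)            # requires M positive definite
        return KMeans(n_clusters=n_clusters).fit_predict(X.dot(L))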
However, in light of the suggestions by Bertrand and Olivier, I'm more
inclined to include an implementation of power iteration clustering
(see Olivier's reply) or exemplar-based clustering (see Bertrand's reply).
@Mathieu: My proposal for speeding up kernel k-means is two-fold. The
first would be caching values of the kernel function, while the
second would be a triangle-inequality-based scheme to cut down on the number
of distance evaluations required. The authors maintain a K x K matrix and a K
x N matrix that are used to estimate lower bounds on the distances
from points to all new potential cluster centres, and only compute the
distances when any lower bound is smaller than the distance from a
point to its current centre. The experiments in the paper show it
saves a lot of distance calculation time, which dominates the running
time for K-means.
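Here is a sketch of just that pruning test (the bookkeeping that keeps the
bounds valid across centre updates, via the K x K matrix of
centre-to-centre distances, is omitted, and all names are mine rather than
the paper's):

    import numpy as np

    def prune_and_assign(X, centers, labels, lower):
        # lower[i, j] is a lower bound on ||X[i] - centers[j]||, kept up to
        # date across iterations with the triangle inequality in the full scheme.
        n_skipped = 0
        for i in range(X.shape[0]):
            d_current = np.linalg.norm(X[i] - centers[labels[i]])
            for j in range(centers.shape[0]):
                if j == labels[i]:
                    continue
                if lower[i, j] >= d_current:
                    n_skipped += 1           # bound proves centre j cannot win
                    continue
                d_j = np.linalg.norm(X[i] - centers[j])
                lower[i, j] = d_j            # an exact distance is a valid bound
                if d_j < d_current:
                    labels[i], d_current = j, d_j
        return labels, n_skipped

In the kernelized setting the exact evaluations inside the loop are the
expensive feature-space distances from the first sketch, which is where the
reported savings come from.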
Regarding the suggested additions, I'm interested in Olivier's
suggestion of Power Iteration Clustering, and in seeing how it fares
against kernel k-means as well as against the convex exemplar-based
clustering method from the paper Bertrand suggested. I'll revise my
proposal accordingly.
Thanks, and apologies for the long reply; I made the mistake of
subscribing to the list in digest mode.
Date: Mon, 2 Apr 2012 16:51:56 +0200
Subject: Re: [Scikit-learn-general] GSoC 2012 pre-application
Olivier Grisel wrote:
Lee, what is your github account? Do you have prior experience with
Numpy / Scipy / Cython development?
Also about kernel k-means: I don't know this algorithm myself. Do you
have practical evidence that this approach really works in a
scalable way, e.g. an implementation in another language that works
and beats spectral clustering on realistic datasets?
Lee, could you answer those questions and the comments by Mathieu,
Bertrand and Gael?
http://twitter.com/ogrisel - http://github.com/ogrisel