Discussion:
[Scikit-learn-general] kmeans distance function not configurable
Pieraut, Francis
2013-04-02 19:05:33 UTC
Hi guys,

Is there a simple way to change the distance function used in the kmeans implementation?

Thanks,
Francis
Andreas Mueller
2013-04-02 19:09:46 UTC
Hi Francis.
No. For most distance functions, doing k-means is highly non-trivial, as
the computation of the mean has to be replaced by a different computation.

If you know how to do that, implementing k-means in pure numpy is not
all that hard.

This question comes up quite a lot. Maybe we should write a FAQ entry or something.

Cheers,
Andy
------------------------------------------------------------------------------
Minimize network downtime and maximize team effectiveness.
Reduce network management and security costs.Learn how to hire
the most talented Cisco Certified professionals. Visit the
Employer Resources Portal
http://www.cisco.com/web/learning/employer_resources/index.html
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Kenneth C. Arnold
2013-04-02 19:31:53 UTC
If you want a Mahalanobis distance, though, you can instead just transform
your data using the Cholesky decomposition of the matrix that defines the
distance, and then run ordinary k-means on the transformed data.
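
A minimal sketch of that transformation, assuming the Mahalanobis matrix M is the inverse covariance of the data (the thread does not say which matrix is meant) and that it is positive definite. The helper name mahalanobis_whiten is made up for illustration. After the transform, plain Euclidean distance between rows equals the Mahalanobis distance between the original rows, so standard k-means can be run on Z.

```python
import numpy as np

def mahalanobis_whiten(X, M):
    """Map rows of X so Euclidean distance equals Mahalanobis distance under M.

    Hypothetical helper: M is assumed symmetric positive definite.
    """
    L = np.linalg.cholesky(M)  # M = L @ L.T, with L lower triangular
    return X @ L               # each row x becomes L.T @ x

rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mix

# Assumption: use the inverse sample covariance as the Mahalanobis matrix.
M = np.linalg.inv(np.cov(X, rowvar=False))
Z = mahalanobis_whiten(X, M)

# Compare the two distances for one pair of points.
diff = X[0] - X[1]
d_mahal = np.sqrt(diff @ M @ diff)
d_euclid = np.linalg.norm(Z[0] - Z[1])
```

Here d_mahal and d_euclid agree up to floating-point error, which is the property the trick relies on.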


-Ken
Pieraut, Francis
2013-04-02 20:26:34 UTC
Hi Andy & Ken,

Thanks, Ken, for the alternative, but I am using a cosine distance.
Andy, concerning the computation of the mean: that function would have to be configurable too, but the default mean is also suitable for cosine distance and Bregman divergences (see table 8.2, page 501 of http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf). Yes, I could easily implement k-means myself, but I would lose a lot of the benefits of the sklearn framework, such as the ability to easily compare several unsupervised algorithms. I was simply expecting the distance function to be configurable, as it is in many other sklearn functions.

On another note, do you know why metrics.cluster.unsupervised.silhouette_score requires the labels? I understand that we can compute the supervised version of the silhouette score, but I was looking for the unsupervised version. Even the help doesn't mention the labels anywhere.

I am trying to push for sklearn in my team; I am quite impressed so far.
Thanks,
Francis



Andreas Mueller
2013-04-02 21:05:46 UTC
Post by Pieraut, Francis
Hi Andy & Ken,
Thanks, Ken, for the alternative, but I am using a cosine distance.
Andy, concerning the computation of the mean: that function would have
to be configurable too, but the default mean is also suitable for cosine
distance and Bregman divergences (see table 8.2, page 501 of
http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf). Yes, I could easily
implement k-means myself, but I would lose a lot of the benefits of the
sklearn framework, such as the ability to easily compare several
unsupervised algorithms. I was simply expecting the distance function
to be configurable, as it is in many other sklearn functions.
You can implement it yourself and inherit from BaseEstimator (and
optionally ClusterMixin).
There is no magic to making a sklearn estimator; you just have to define
"fit" and "predict".

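As a rough illustration of that advice, here is a sketch of a cosine-based k-means (often called spherical k-means) wrapped as an estimator. The class name CosineKMeans, its parameters, and the random initialisation are assumptions made for this example; nothing like it is claimed to exist in scikit-learn. Rows are scaled to unit length, points are assigned to the center with the largest cosine similarity, and each center is the normalised sum of its members.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin

class CosineKMeans(ClusterMixin, BaseEstimator):
    """Hypothetical spherical k-means sketch; not part of scikit-learn."""

    def __init__(self, n_clusters=2, max_iter=100, random_state=0):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = random_state

    @staticmethod
    def _unit(X):
        # Scale each row to unit length; all-zero rows are left unchanged.
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.where(norms == 0, 1.0, norms)

    def fit(self, X, y=None):
        X = self._unit(np.asarray(X, dtype=float))
        rng = np.random.default_rng(self.random_state)
        # Assumption: initialise centers from randomly chosen data points.
        centers = X[rng.choice(len(X), self.n_clusters, replace=False)]
        for _ in range(self.max_iter):
            # Largest cosine similarity = smallest cosine distance.
            labels = np.argmax(X @ centers.T, axis=1)
            new_centers = centers.copy()
            for k in range(self.n_clusters):
                members = X[labels == k]
                if len(members):  # keep the old center if a cluster is empty
                    new_centers[k] = members.sum(axis=0)
            new_centers = self._unit(new_centers)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        self.cluster_centers_ = centers
        self.labels_ = np.argmax(X @ centers.T, axis=1)
        return self

    def predict(self, X):
        X = self._unit(np.asarray(X, dtype=float))
        return np.argmax(X @ self.cluster_centers_.T, axis=1)

X = np.array([[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]])
model = CosineKMeans(n_clusters=2).fit(X)
new_label = model.predict(np.array([[5.0, 0.0]]))[0]
```

Because the class defines fit and sets labels_, ClusterMixin supplies fit_predict, and BaseEstimator supplies get_params and set_params.
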
Actually, there are not many places in sklearn where you can pass
callables to customize an algorithm.
The thing is that you would need to provide a pairwise distance measure
and a function to compute the center, and these should be compatible.
If they are not, the algorithm might not stop (afaik). So should the
algorithm check whether computing the center does the right thing? Or
should it check for infinite loops?
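To make the compatibility concern concrete, here is a sketch of a Lloyd-style loop that takes both callables and guards against non-termination with an iteration cap. The function generic_kmeans and its signature are hypothetical, not an sklearn API. The example pairs Manhattan distance with the coordinate-wise median (k-medians), which is one compatible pair; an incompatible pair would only be caught here by the cap, not diagnosed.

```python
import numpy as np

def generic_kmeans(X, init_centers, pairwise_dist, center_fn, max_iter=100):
    """Lloyd-style loop with user-supplied callables (hypothetical helper).

    pairwise_dist(X, C) -> (n_samples, n_clusters) array of distances.
    center_fn(members)  -> one center for the rows in `members`.
    Returns (centers, labels, n_iter, converged).
    """
    centers = np.array(init_centers, dtype=float)
    labels = None
    for n_iter in range(1, max_iter + 1):
        new_labels = np.argmin(pairwise_dist(X, centers), axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return centers, labels, n_iter, True  # assignments are stable
        labels = new_labels
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members):  # keep the old center for empty clusters
                centers[k] = center_fn(members)
    return centers, labels, max_iter, False  # stopped by the iteration cap

def manhattan(X, C):
    # Pairwise L1 distances between rows of X and rows of C.
    return np.abs(X[:, None, :] - C[None, :, :]).sum(axis=2)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels, n_iter, converged = generic_kmeans(
    X, X[[0, 3]], manhattan, lambda m: np.median(m, axis=0)
)
```

Returning a converged flag rather than raising leaves it to the caller to decide whether hitting the cap is an error.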
Post by Pieraut, Francis
On another note, do you know why
metrics.cluster.unsupervised.silhouette_score requires the labels? I
understand that we can compute the supervised version of the
silhouette score, but I was looking for the unsupervised version. Even
the help doesn't mention the labels anywhere.
The labels here are the cluster assignments you want to evaluate. Just
passing X is no good, is it ;)
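
A small sketch of what that means in practice: the score is computed for a given assignment of points to clusters, so the same X gets different scores for different labelings. The toy data and both label vectors are made up for illustration; in real use the labels would come from whatever clustering algorithm is being evaluated. The import path is the public sklearn.metrics one.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated groups of three points each (toy data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Cluster assignments to evaluate; no ground truth is involved.
good_labels = np.array([0, 0, 0, 1, 1, 1])  # matches the two groups
bad_labels = np.array([0, 1, 0, 1, 0, 1])   # mixes the two groups

good = silhouette_score(X, good_labels)
bad = silhouette_score(X, bad_labels)
```

The labeling that matches the two groups scores close to 1, while the mixed labeling scores much lower.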

Cheers,
Andy
Gael Varoquaux
2013-04-03 09:19:33 UTC
Post by Pieraut, Francis
Andy, concerning the computation of the mean: that function would have
to be configurable too, but the default mean is also suitable for cosine
distance and Bregman divergences (see table 8.2, page 501 of
http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf).
I think that you are misunderstanding the table (granted, it's
confusing): the mean there is a Fréchet mean, computed with the distance
of interest as the distance in its definition. In general, this can be
non-trivial to code.
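
For reference, the usual definition of a Fréchet mean can be written as below. The symbols (points x_i, candidate center c, distance d) are notation chosen here, not taken from the thread or the linked table. With the Euclidean distance, the minimiser is the ordinary arithmetic mean; for other distances it generally has to be found by a separate optimisation.

```latex
m^{*} = \operatorname*{arg\,min}_{c} \sum_{i=1}^{n} d(x_i, c)^{2},
\qquad
d(x, c) = \lVert x - c \rVert_2
\;\Longrightarrow\;
m^{*} = \frac{1}{n} \sum_{i=1}^{n} x_i .
```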
Post by Pieraut, Francis
I was simply expecting the distance function to be configurable, as it is
in many other sklearn functions.
Features in scikit-learn are pretty much implemented as people contribute
them. It seems that nobody so far has needed this feature badly enough to
contribute it.

Cheers,

Gaël
