Discussion:
[Scikit-learn-general] Other distance metrics for kNN
Mathias Verbeke
2012-01-04 15:22:03 UTC
Permalink
Dear all,

I just started working with Scikit Learn and I'm currently using the
Nearest Neighbors module. In the documentation is stated that it currently
only supports the Euclidean distance metric, and I was wondering if it
would be easy to extend it with other distance metrics? Since it uses the
scipy.sparse matrices as input, I was thinking about the distance metrics
in scipy.distance.spatial. Would that be possible, or were there certain
considerations to only allow for Euclidean distance?

Best and thanks in advance,

Mathias
Olivier Grisel
2012-01-04 15:31:58 UTC
Permalink
Post by Mathias Verbeke
Dear all,
I just started working with Scikit Learn and I'm currently using the Nearest
Neighbors module. In the documentation is stated that it currently only
supports the Euclidean distance metric, and I was wondering if it would be
easy to extend it with other distance metrics? Since it uses the
scipy.sparse matrices as input, I was thinking about the distance metrics in
scipy.distance.spatial.
scipy.spatial.distance does not work on scipy.sparse matrices, only on
numpy arrays AFAIK. The kNN classifier only works with sparse matrices
with the "bruteforce" mode as BallTree and kd-tree do not work with
scipy.sparse matrices either.
Post by Mathias Verbeke
Would that be possible, or were there certain
considerations to only allow for Euclidean distance?
Would be great to make this pluggable indeed. This should be quite
easy for the brute force mode. For the ball tree mode that will
require to dive into the cython code and read the reference paper to
check whether any assumption on the metrics is used or not (or just
ask Jake :).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Jacob VanderPlas
2012-01-04 15:59:04 UTC
Permalink
It would be very straightforward to extend neighbors to use any
minkowski distance (see
https://github.com/scikit-learn/scikit-learn/issues/351 ).
BallTree does not yet work for arbitrary distance metrics, but this is
not due to any inherent limitation: it will work for any metric which
satisfies the triangle inequality.

I've been thinking about how to extend BallTree to work with other
metrics: for speed, it should use C-style function pointers to compute
the distance (python functions are too slow). This is difficult because
different distance metrics have different ancillary data associated
(i.e. integer p for minkowski distance, matrix V for mahalanobis
distance, etc). Using C++ style class abstraction, this sort of thing
could be accomplished fairly easily and readably. But that doesn't feel
"cythony" to me.

If someone has a good idea about how one could specify these distance
metrics from python code, with optional ancillary parameters, and
convert these specifications into code for fast distance computation
within cython, I think Mathias' suggestion could be accomplished with a
bit of effort.
Jake
Post by Olivier Grisel
Post by Mathias Verbeke
Dear all,
I just started working with Scikit Learn and I'm currently using the Nearest
Neighbors module. In the documentation is stated that it currently only
supports the Euclidean distance metric, and I was wondering if it would be
easy to extend it with other distance metrics? Since it uses the
scipy.sparse matrices as input, I was thinking about the distance metrics in
scipy.distance.spatial.
scipy.spatial.distance does not work on scipy.sparse matrices, only on
numpy arrays AFAIK. The kNN classifier only works with sparse matrices
with the "bruteforce" mode as BallTree and kd-tree do not work with
scipy.sparse matrices either.
Post by Mathias Verbeke
Would that be possible, or were there certain
considerations to only allow for Euclidean distance?
Would be great to make this pluggable indeed. This should be quite
easy for the brute force mode. For the ball tree mode that will
require to dive into the cython code and read the reference paper to
check whether any assumption on the metrics is used or not (or just
ask Jake :).
Olivier Grisel
2012-01-04 16:39:03 UTC
Permalink
It should be possible to define a cython struct with a function pointer:

http://docs.cython.org/src/reference/language_basics.html#function-pointers

I have no experience with that kind of cython constructs though.

Alternatively that could be the right opportunity to introduce the
cython template system we want to use for templating the dtypes of
input and output data in cython extensions.
--
Olivier
Gael Varoquaux
2012-01-04 17:35:40 UTC
Permalink
Post by Jacob VanderPlas
If someone has a good idea about how one could specify these distance
metrics from python code, with optional ancillary parameters, and
convert these specifications into code for fast distance computation
within cython,
How about using small Cython classes with one or two methods? This is
what ended up working well for the decision trees (see _tree.pyx).

Gael
Jacob VanderPlas
2012-01-04 18:12:35 UTC
Permalink
Post by Gael Varoquaux
Post by Jacob VanderPlas
If someone has a good idea about how one could specify these distance
metrics from python code, with optional ancillary parameters, and
convert these specifications into code for fast distance computation
within cython,
How about using small Cython classes with one or two methods? This is
what ended up working well for the decision trees (see _tree.pyx).
Interesting... I won't have a lot of time as the quarter ramps up, but
I'll start experimenting with that when I can. If it could be done
without overhead that would slow the euclidean case (probably the most
common use case) then I think it would be a worthwhile improvement.
Jake
Jacob VanderPlas
2012-01-05 16:33:01 UTC
Permalink
Here's a small example I coded up that shows how I envision including
multiple distance metrics in BallTree

https://gist.github.com/1565998

The idea is that you create functions to compute distance which expose C
function pointers, so that the ball tree cython code can call these
without python overhead. I'd be curious to hear peoples' thoughts
Jake
Post by Gael Varoquaux
Post by Jacob VanderPlas
If someone has a good idea about how one could specify these distance
metrics from python code, with optional ancillary parameters, and
convert these specifications into code for fast distance computation
within cython,
How about using small Cython classes with one or two methods? This is
what ended up working well for the decision trees (see _tree.pyx).
Gael
------------------------------------------------------------------------------
Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
infrastructure or vast IT resources to deliver seamless, secure access to
virtual desktops. With this all-in-one solution, easily deploy virtual
desktops for less than the cost of PCs and save 60% on VDI infrastructure
costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Mathias Verbeke
2012-01-05 17:00:26 UTC
Permalink
Hi,

First, thanks for all the answers! Waauw, really interesting discussion. I
have only basic Python skills, and never programmed in Cython (together
with a lot of time constraints, as most of you probably), but I would like
to give it a try to add new distance metrics to the brute force method.
Would it be possible to give some pointers to what should need to be
done/changed to add e.g. the keyword p as was mentioned in Jake's first
reply?

Cheers and thanks,

Mathias


On Thu, Jan 5, 2012 at 5:33 PM, Jacob VanderPlas <
Post by Jacob VanderPlas
Here's a small example I coded up that shows how I envision including
multiple distance metrics in BallTree
https://gist.github.com/1565998
The idea is that you create functions to compute distance which expose C
function pointers, so that the ball tree cython code can call these
without python overhead. I'd be curious to hear peoples' thoughts
Jake
Post by Gael Varoquaux
Post by Jacob VanderPlas
If someone has a good idea about how one could specify these distance
metrics from python code, with optional ancillary parameters, and
convert these specifications into code for fast distance computation
within cython,
How about using small Cython classes with one or two methods? This is
what ended up working well for the decision trees (see _tree.pyx).
Gael
------------------------------------------------------------------------------
Post by Gael Varoquaux
Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
infrastructure or vast IT resources to deliver seamless, secure access to
virtual desktops. With this all-in-one solution, easily deploy virtual
desktops for less than the cost of PCs and save 60% on VDI infrastructure
costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
infrastructure or vast IT resources to deliver seamless, secure access to
virtual desktops. With this all-in-one solution, easily deploy virtual
desktops for less than the cost of PCs and save 60% on VDI infrastructure
costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Jacob VanderPlas
2012-01-05 17:30:56 UTC
Permalink
Mathias,
I'm glad you're excited to work on this! I think starting with just the
minkowski p-distance in this case is a good idea, and it would be a
great way for you to gain experience with the code. I'd do the following:

- in sklearn/neighbors/base.py, add a parameter `p` to the NeighborsBase
class. This class holds the basic parameters needed for all distance
computations.
- Also in this class, when BallTree is constructed, this parameter `p`
needs to be passed to it.
- Next in the places in this file where the distances are actually
computed, you need to make use of this parameter `p` to make sure the
correct distance metric is returned. For Ball Tree, this is accomplished
when it is constructed. For KDtree, `p` is passed as a parameter when
finding the nearest neighbors. For brute, you'll need to modify the
code to use minkowski rather than euclidean distance.
- Next the rest of the neighbors module needs to be modified to accept a
keyword `p` with default value 2, and pass it to the _init_params function.
- Finally, some tests and examples should be created.

If you'd like to start on this, go ahead and fork scikit-learn on github
and start a new branch. This has the advantage that I can look over
your code if you have questions! If anything is unclear, feel free to
email me off-list or to get in touch with me within github: my username
is jakevdp.
Jake
Post by Mathias Verbeke
Hi,
First, thanks for all the answers! Waauw, really interesting
discussion. I have only basic Python skills, and never programmed in
Cython (together with a lot of time constraints, as most of you
probably), but I would like to give it a try to add new distance
metrics to the brute force method. Would it be possible to give some
pointers to what should need to be done/changed to add e.g. the
keyword p as was mentioned in Jake's first reply?
Cheers and thanks,
Mathias
On Thu, Jan 5, 2012 at 5:33 PM, Jacob VanderPlas
Here's a small example I coded up that shows how I envision including
multiple distance metrics in BallTree
https://gist.github.com/1565998
The idea is that you create functions to compute distance which expose C
function pointers, so that the ball tree cython code can call these
without python overhead. I'd be curious to hear peoples' thoughts
Jake
Post by Gael Varoquaux
Post by Jacob VanderPlas
If someone has a good idea about how one could specify these
distance
Post by Gael Varoquaux
Post by Jacob VanderPlas
metrics from python code, with optional ancillary parameters, and
convert these specifications into code for fast distance
computation
Post by Gael Varoquaux
Post by Jacob VanderPlas
within cython,
How about using small Cython classes with one or two methods?
This is
Post by Gael Varoquaux
what ended up working well for the decision trees (see _tree.pyx).
Gael
------------------------------------------------------------------------------
Post by Gael Varoquaux
Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need
a complex
Post by Gael Varoquaux
infrastructure or vast IT resources to deliver seamless, secure
access to
Post by Gael Varoquaux
virtual desktops. With this all-in-one solution, easily deploy
virtual
Post by Gael Varoquaux
desktops for less than the cost of PCs and save 60% on VDI
infrastructure
Post by Gael Varoquaux
costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
infrastructure or vast IT resources to deliver seamless, secure access to
virtual desktops. With this all-in-one solution, easily deploy virtual
desktops for less than the cost of PCs and save 60% on VDI
infrastructure
costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------
------------------------------------------------------------------------------
Ridiculously easy VDI. With Citrix VDI-in-a-Box, you don't need a complex
infrastructure or vast IT resources to deliver seamless, secure access to
virtual desktops. With this all-in-one solution, easily deploy virtual
desktops for less than the cost of PCs and save 60% on VDI infrastructure
costs. Try it free! http://p.sf.net/sfu/Citrix-VDIinabox
------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Gael Varoquaux
2012-01-05 17:16:17 UTC
Permalink
Post by Jacob VanderPlas
Here's a small example I coded up that shows how I envision including
multiple distance metrics in BallTree
https://gist.github.com/1565998
The idea is that you create functions to compute distance which expose C
function pointers, so that the ball tree cython code can call these
without python overhead. I'd be curious to hear peoples' thoughts
You are cimporting malloc and free. I have a personnal difficult
relationship with those two old friends. However, it seems not to be used
in the code. I just wanted to check.

Why do you have a not implemented method? I would favor no method at all
in this case.

Also, I don't like the 'tmp' attribute of the cv_matrix structure: I must
confess that I don't know what it mean (same thing for dist_params.VI).

But I think that the fact that I am fussing on such minor details really
means that I think that the general design is good. What do others think?

Gaƫl
Jacob VanderPlas
2012-01-05 17:33:28 UTC
Permalink
Gael Varoquaux wrote:
<snip>
Post by Gael Varoquaux
You are cimporting malloc and free. I have a personnal difficult
relationship with those two old friends. However, it seems not to be used
in the code. I just wanted to check.
I initially used malloc and free, but settled on the `tmp` pointer to
avoid this (see below). These imports can be removed.
<