Discussion:
[Scikit-learn-general] what is 'distance' in the nearest-neighbors module when p=2? p>2?
Thomas Dent
2013-01-21 19:53:48 UTC
Hi all,

I just started using sklearn nearest-neighbors for classification & would like to apply my own distance weighting function.

To do this I need to know exactly what the 'distance' that is fed to the function represents. (Current documentation doesn't give me an immediate answer.)

For example if I set p=2 do I get the Euclidean distance, i.e. the square root of the sums of squares of coordinate differences; or the square of it?

If p>2 do I get the distance, in the sense of the p-th root of sum of p-th powers, or the p-th power of distance?

Thanks,
Tom

--
-----------------------------------------
Institute for Gravitational Physics
(Albert Einstein Institute)
Callinstr. 38
D-30167 Hannover, Germany
Willi Richert
2013-01-21 19:57:28 UTC
Yes, it is the parameter p in the Minkowski distance:
http://en.wikipedia.org/wiki/Minkowski_distance.

I think the master branch should have this information in the docs.
Post by Thomas Dent
Hi all,
I just started using sklearn nearest-neighbors for classification & would
like to apply my own distance weighting function.
To do this I need to know exactly what the 'distance' that is fed to the
function represents. (Current documentation doesn't give me an immediate
For example if I set p=2 do I get the Euclidean distance, i.e. the square
root of the sums of squares of coordinate differences; or the square of it?
If p>2 do I get the distance, in the sense of the p-th root of sum of p-th
powers, or the p-th power of distance?
Thanks,
Tom
--
-----------------------------------------
Institute for Gravitational Physics
(Albert Einstein Institute)
Callinstr. 38
D-30167 Hannover, Germany
------------------------------------------------------------------------------
Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
http://p.sf.net/sfu/learnmore_122412
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Thomas Dent
2013-01-21 20:45:25 UTC
Hi,

my question was whether the 'distance' function used for weighting returns the Minkowski distance as defined in Wikipedia, or instead the p-th power of it.

For the p=2 which is treated separately I see in the neighbors/base.py source file (class KNeighborsMixin)

elif self.p == 2:
dist = pairwise_distances(X, self._fit_X, 'euclidean',
squared=True)

(which gave me some doubt) and then a few lines later

if self.p == 2:
return np.sqrt(dist[j, neigh_ind]), neigh_ind

seems to indicate we do get the Euclidean distance rather than the square of it (which is goodâ€¦).

Should the corresponding case hold for other p values?

Thanks, Tom

--
-----------------------------------------
Institute for Gravitational Physics
(Albert Einstein Institute)
Callinstr. 38
D-30167 Hannover, Germany
Yes, it is the parameter p in the Minkowski distance: http://en.wikipedia.org/wiki/Minkowski_distance.
I think the master branch should have this information in the docs.
Hi all,
I just started using sklearn nearest-neighbors for classification & would like to apply my own distance weighting function.
To do this I need to know exactly what the 'distance' that is fed to the function represents. (Current documentation doesn't give me an immediate answer.)
For example if I set p=2 do I get the Euclidean distance, i.e. the square root of the sums of squares of coordinate differences; or the square of it?
If p>2 do I get the distance, in the sense of the p-th root of sum of p-th powers, or the p-th power of distance?
Thanks,
Tom
--
-----------------------------------------
Institute for Gravitational Physics
(Albert Einstein Institute)
Callinstr. 38
D-30167 Hannover, Germany
------------------------------------------------------------------------------
Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
http://p.sf.net/sfu/learnmore_122412
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
http://p.sf.net/sfu/learnmore_122412_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Lars Buitinck
2013-01-21 21:13:21 UTC
Post by Thomas Dent
my question was whether the 'distance' function used for weighting returns the Minkowski distance as defined in Wikipedia, or instead the p-th power of it.
Good question. Judging from the code, it would seem that only
Euclidean distance is specialized when the brute force algorithm is
used; I'm not sure about the ball tree algorithm. This may lead to
inconsistent results when distance weighting is enabled.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Jake Vanderplas
2013-01-21 22:39:32 UTC
Post by Lars Buitinck
Post by Thomas Dent
my question was whether the 'distance' function used for weighting returns the Minkowski distance as defined in Wikipedia, or instead the p-th power of it.
Good question. Judging from the code, it would seem that only
Euclidean distance is specialized when the brute force algorithm is
used; I'm not sure about the ball tree algorithm. This may lead to
inconsistent results when distance weighting is enabled.
For the ball tree, the p-th root is returned. Unless you take the p-th
root, the distance is not a proper metric and the ball tree algorithm
would fail.
Jake
Lars Buitinck
2013-01-21 22:41:35 UTC
Post by Jake Vanderplas
Post by Lars Buitinck
Post by Thomas Dent
my question was whether the 'distance' function used for weighting returns the Minkowski distance as defined in Wikipedia, or instead the p-th power of it.
Good question. Judging from the code, it would seem that only
Euclidean distance is specialized when the brute force algorithm is
used; I'm not sure about the ball tree algorithm. This may lead to
inconsistent results when distance weighting is enabled.
For the ball tree, the p-th root is returned. Unless you take the p-th
root, the distance is not a proper metric and the ball tree algorithm
would fail.
Alright. Then this should be easy to fix.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Thomas Dent
2013-01-22 14:15:15 UTC
Hi Jake -

ok, this answers the question. I would say the standard definition of Euclidean or Minkowski distance [i.e. with the root] is what we expect here, and also is needed to satisfy the triangle inequality, so no problem.

Thanks, Tom

--
-----------------------------------------
Institute for Gravitational Physics
(Albert Einstein Institute)
Callinstr. 38
D-30167 Hannover, Germany
Post by Lars Buitinck
Post by Jake Vanderplas
Post by Lars Buitinck
Post by Thomas Dent
my question was whether the 'distance' function used for weighting returns the Minkowski distance as defined in Wikipedia, or instead the p-th power of it.
Good question. Judging from the code, it would seem that only
Euclidean distance is specialized when the brute force algorithm is
used; I'm not sure about the ball tree algorithm. This may lead to
inconsistent results when distance weighting is enabled.
For the ball tree, the p-th root is returned. Unless you take the p-th
root, the distance is not a proper metric and the ball tree algorithm
would fail.
Alright. Then this should be easy to fix.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
------------------------------------------------------------------------------
Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
http://p.sf.net/sfu/learnmore_122412
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Willi Richert
2013-01-21 19:56:57 UTC
Yes, it is the parameter p in the Minkowski distance:
http://en.wikipedia.org/wiki/Minkowski_distance.

I think the master branch should have this information in the docs.
Post by Thomas Dent
Hi all,
I just started using sklearn nearest-neighbors for classification & would
like to apply my own distance weighting function.
To do this I need to know exactly what the 'distance' that is fed to the
function represents. (Current documentation doesn't give me an immediate
For example if I set p=2 do I get the Euclidean distance, i.e. the square
root of the sums of squares of coordinate differences; or the square of it?
If p>2 do I get the distance, in the sense of the p-th root of sum of p-th
powers, or the p-th power of distance?
Thanks,
Tom
--
-----------------------------------------
Institute for Gravitational Physics
(Albert Einstein Institute)
Callinstr. 38
D-30167 Hannover, Germany
------------------------------------------------------------------------------
Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
http://p.sf.net/sfu/learnmore_122412
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general