Discussion:
Implementing Locality-Sensitive Hashing for approximate neighbor search
Maheshakya Wijewardena
2014-03-04 16:09:48 UTC
Permalink
Hi,

Currently, unsupervised neighbor search is implemented in
sklearn.neighbors. Three algorithms are used to perform neighbor
search:

- kd tree
- ball tree
- brute force method

kd tree and ball tree are implemented separately in Cython.
Both of them use a binary tree (binary_tree.pxi). These implementations
are used in the NeighborsBase class.

As Locality-Sensitive Hashing will also be used for approximate nearest
neighbor search, its implementation should also be a part of
sklearn.neighbors. But locality-sensitive hashing algorithms do not involve
a binary tree implementation. Therefore it has to be implemented separately
and used in sklearn.neighbors as another algorithm. Is this approach
correct?

I'm trying to implement a base LSH class and the hashing algorithms in
Cython. To approximate neighbor search using those algorithms, multiple
hash tables will be created with those hash functions, so an efficient way
to store them is required.

These are some of the issues I ran into while trying to implement LSH for
approximate neighbor search. Is this an appropriate approach, or are there
better ways?
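
(For concreteness, the multiple-hash-tables setup described above could be
sketched as follows. This is only a rough illustration, not the proposed
sklearn.neighbors API; the class and parameter names are made up, and
random-hyperplane hashing is just one possible LSH family.)

```python
from collections import defaultdict

import numpy as np


class SimpleLSH:
    """Toy LSH index: several hash tables, each with its own random planes."""

    def __init__(self, n_tables=4, n_bits=8, dim=16, seed=0):
        rng = np.random.RandomState(seed)
        # One (n_bits x dim) projection matrix per hash table.
        self.planes = [rng.randn(n_bits, dim) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _hash(self, planes, x):
        # The sign of each projection gives one bit of the hash key.
        return tuple((planes @ x) > 0)

    def fit(self, X):
        self.X = np.asarray(X)
        for i, x in enumerate(self.X):
            for planes, table in zip(self.planes, self.tables):
                table[self._hash(planes, x)].append(i)
        return self

    def query(self, q, n_neighbors=1):
        # Union of candidates over all tables, then exact re-ranking.
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table[self._hash(planes, q)])
        candidates = sorted(candidates)
        dists = np.linalg.norm(self.X[candidates] - q, axis=1)
        order = np.argsort(dists)[:n_neighbors]
        return [candidates[i] for i in order]


rng = np.random.RandomState(42)
X = rng.randn(100, 16)
lsh = SimpleLSH().fit(X)
print(lsh.query(X[3]))  # a stored point always falls in its own bucket -> [3]
```

The "efficient storage" question is exactly about what replaces the Python
dicts here: the number of tables times the number of stored indices is what
the extensions discussed later in this thread try to keep small.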



Regards,
Maheshakya,
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Daniel Vainsencher
2014-03-04 20:01:00 UTC
Permalink
Hi scikit-learn, Maheshakya,

In my experience implementing and using LSH, the vanilla algorithms are
very impractical. Two extensions that help a lot (and combine nicely) are:

- Multi-probe LSH [1], where hash values neighboring the target's hash are
also used.
- LSH Forest [2], where fixed-length hashes are replaced by length-bounded
hashes; the hash used is just long enough to get sufficiently few
candidates for the target's nearest neighbor.

Together they make it feasible to limit the number of independent copies of
the index to a reasonable number. Incidentally, the LSH Forest approach may
be implemented using binary trees.

Daniel Vainsencher

[1] www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf
[2] www2005.org/docs/p651.pdf
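
(To make the LSH Forest idea concrete, here is a rough single-tree sketch;
the function names and parameters are hypothetical, and a sorted array of
hash strings stands in for the tree. The effective hash length for a query
shrinks until enough candidates fall in the matching prefix range.)

```python
import bisect

import numpy as np


def bit_hash(planes, x):
    """Maximum-length binary hash of x as a string, one bit per hyperplane."""
    return "".join("1" if b else "0" for b in (planes @ x) > 0)


def build_tree(X, planes):
    """One 'tree': all point hashes kept in sorted order with their indices."""
    return sorted((bit_hash(planes, x), i) for i, x in enumerate(X))


def query(hashes, planes, q, min_candidates=3):
    h = bit_hash(planes, q)
    # Shrink the prefix length until the matching range is large enough.
    for length in range(len(h), -1, -1):
        prefix = h[:length]
        lo = bisect.bisect_left(hashes, (prefix, -1))
        # Bits are "0"/"1", so prefix + "2" bounds every string with this prefix.
        hi = bisect.bisect_left(hashes, (prefix + "2", -1))
        if hi - lo >= min_candidates:
            return [i for _, i in hashes[lo:hi]]
    return [i for _, i in hashes]
```

The candidates returned would then be re-ranked by exact distance, as in
plain LSH; the point is that no fixed k has to be tuned per dataset.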
Maheshakya Wijewardena
2014-03-05 05:18:11 UTC
Permalink
Hi Daniel,
These alternative LSH strategies look like a good starting point, given the
existing implementations in scikit-learn. But why do you say the standard
algorithms are impractical?

Maheshakya


Daniel Vainsencher
2014-03-06 09:04:38 UTC
Permalink
Hi Maheshakya,

In regular LSH, a particular setting of the number of hash functions per
index (k) and the number of indexes (L) essentially determines the size of
the region in space from which candidates will be chosen in response to a
query. If queries q1 and q2 are in areas where the density of data points
is very different, then one will receive many more candidates than the
other, and thus be slower and more accurate. So tuning k and L involves a
tradeoff, and with varied density you cannot win.

Even worse: as you add data, your k and L stop being optimal. So tuning on
a significantly smaller sample is not effective, which makes tuning even
more expensive.

Going back to LSH Forest: while an implementation using binary trees is
possible and fine for testing, it is not very space or cache efficient.
Other data structures (B-trees and variants, see [2] for some discussion)
are more appropriate. The key operations are range queries and, depending
on the rate of data change, insertion.

Space efficiency of indexes is also affected by the hash type: if you use
binary hash functions, choose k in {32, 64} and use the appropriate integer
types to avoid numpy's awkwardness around bit vectors. LSH Forest has no
difficulty with such a high k in the index (as opposed to plain LSH).

Daniel Vainsencher
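
(A small illustration of the integer-packing suggestion; the function name
is made up. With k = 64 binary hash functions, each point's hash fits in a
single uint64, so the index becomes a flat integer array rather than an
(n, 64) bit matrix.)

```python
import numpy as np


def pack_hashes(X, planes):
    """Pack the k = 64 sign bits of X @ planes.T into one uint64 per point."""
    bits = (X @ planes.T > 0).astype(np.uint64)                # (n, 64)
    weights = np.uint64(1) << np.arange(64, dtype=np.uint64)   # bit weights
    return (bits * weights).sum(axis=1)                        # (n,) uint64


rng = np.random.RandomState(0)
X = rng.randn(1000, 32)
planes = rng.randn(64, 32)   # 64 random hyperplanes -> 64-bit binary hashes
codes = pack_hashes(X, planes)
print(codes.dtype, codes.shape)   # uint64 (1000,)
```

Sorting or range-querying such codes then works on plain integers, which
sidesteps numpy's lack of a compact bit-vector type.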
Maheshakya Wijewardena
2014-03-09 06:36:12 UTC
Permalink
Hi Daniel,

Thank you for pointing this out. I have also noticed that storing hash
tables of the training data and dealing with binary vectors might be
problematic when using plain LSH-based ANN.
However, before implementing LSH Forests, I need to get the approval of the
community. I'm looking forward to feedback on this.

Thank you,
Maheshakya.


Ronnie Ghose
2014-03-09 06:42:41 UTC
Permalink
Rather than approval of the community, you should ask for a mentor; I
assume you read the GSoC idea list. Being a mentor is a relatively heavy
commitment.


Maheshakya Wijewardena
2014-03-09 06:51:57 UTC
Permalink
Yes, I think Robert Layton is a possible mentor for this project. I have
already discussed some aspects of this with him in the issues on GitHub.
I was also thinking of getting the opinion of other contributors, as they
may have something to say about this, or suggestions for better ways of
implementing it.
Post by Ronnie Ghose
rather than approval of the community, you should ask for a mentor, as I
assume you read the GSoC idea list - being a mentor is a relatively heavy job
On Sun, Mar 9, 2014 at 1:36 AM, Maheshakya Wijewardena <
Post by Maheshakya Wijewardena
Hi Daniel,
Thank you for pointing this out. I have also noticed that storing hash
tables of the training data and dealing with binary vectors might be
problematic when using plain LSH-based ANN.
However, before implementing LSH forests, I need to get the approval of
the community. I'm looking forward to feedback on this.
Thank you,
Maheshakya.
On Thu, Mar 6, 2014 at 2:34 PM, Daniel Vainsencher <
Post by Daniel Vainsencher
Hi Maheshakya,
In regular LSH, a particular setting of the number of hash functions per
index (k) and the number of indexes (L) essentially determines the size of
the region in space from which candidates will be chosen in response to a
query. If queries q1 and q2 are in areas where the density of data points
is very different, then one will receive many more candidates than the
other, thus be slower and more accurate. Thus tuning k, L involves a
tradeoff, and with varied density, you cannot win.
Even worse: as you add data, your k, L stop being optimal. So tuning on
a significantly smaller sample is not effective, which makes tuning even
more expensive.
Going back to LSH Forest: while an implementation using binary trees is
possible and OK for testing, it is not very space- or cache-efficient. Other
data structures (B-trees and variants; see [2] for some discussion) are
more appropriate. The key operations are range queries and, depending on
the data change rate, insertion.
Space efficiency of the indexes is also affected by the hash type: if you
use binary hash functions, choose k in {32, 64} and use the appropriate
integer types to avoid numpy's awkwardness around bit vectors. LSH Forest
has no difficulty with such a high k in the index (as opposed to plain LSH).
Daniel Vainsencher
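A minimal sketch of the packed-hash representation suggested above, assuming random-hyperplane (signed-projection) hashing; the function names and layout here are illustrative assumptions, not an existing scikit-learn API:

```python
import numpy as np

rng = np.random.RandomState(42)

def make_hasher(n_features, k=32, rng=rng):
    """Return a function mapping points to k-bit hashes packed into uint32."""
    # Random hyperplanes: the sign of each projection contributes one bit.
    planes = rng.randn(k, n_features)
    # Bit weights 2^0 .. 2^(k-1); the dot product packs the bits into one int.
    weights = (1 << np.arange(k)).astype(np.uint32)

    def hash_points(X):
        bits = (X @ planes.T) > 0                        # (n_samples, k) booleans
        return (bits.astype(np.uint32) @ weights).astype(np.uint32)
    return hash_points

X = rng.randn(1000, 20)
h = make_hasher(n_features=20, k=32)
codes = h(X)
print(codes.dtype, codes.shape)
```

Packing k <= 64 bits into a single machine integer this way keeps each index entry fixed-width and cheap to compare or sort, which is what makes the integer-typed representation attractive over numpy bit vectors.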
Hi Daniel,
These alternative strategies of LSH are a good way to start, using the
existing implementations in scikit-learn. But why do you say the standard
algorithms are impractical?
Maheshakya
On Wed, Mar 5, 2014 at 1:31 AM, Daniel Vainsencher <
Post by Daniel Vainsencher
Hi scikit-learn, Maheshakya,
In my experience implementing and using LSH, the vanilla algorithms are
very impractical. Two extensions that help a lot (and combine nicely) are:
- Multi-probe LSH [1], where hash values neighboring the target's hash
are also used.
- LSH Forest [2], where fixed-length hashes are replaced by length-bounded
hashes; the hash used is just long enough to yield sufficiently few
candidates for the target's nearest neighbor.
Together they make it feasible to limit the number of independent
copies of the index to a reasonable number. Incidentally, the LSH Forest
approach may be implemented using binary trees.
Daniel Vainsencher
[1] www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf
[2] www2005.org/docs/p651.pdf
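To make the k (hash length) / L (number of tables) structure of vanilla LSH concrete, here is a minimal random-projection sketch; the class name, parameters, and scheme are illustrative assumptions, not an existing scikit-learn API:

```python
import numpy as np
from collections import defaultdict

rng = np.random.RandomState(0)

class VanillaLSH:
    """Minimal random-projection LSH: L hash tables keyed by k-bit codes (a sketch)."""

    def __init__(self, n_features, k=8, L=4, rng=rng):
        # One independent set of k random hyperplanes per table.
        self.planes = [rng.randn(k, n_features) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, planes, x):
        # Sign of each projection gives one bit; a tuple is a hashable key.
        return tuple((planes @ x) > 0)

    def fit(self, X):
        self.X = X
        for i, x in enumerate(X):
            for planes, table in zip(self.planes, self.tables):
                table[self._key(planes, x)].append(i)
        return self

    def candidates(self, q):
        # Union of the buckets the query falls into, one per table.
        out = set()
        for planes, table in zip(self.planes, self.tables):
            out.update(table.get(self._key(planes, q), []))
        return out

X = rng.randn(500, 10)
index = VanillaLSH(n_features=10).fit(X)
cands = index.candidates(X[0])
print(0 in cands)  # True: a stored point always collides with itself
```

Note how fixed k and L bake in one bucket granularity for the whole dataset; that is exactly the variable-density weakness the multi-probe and LSH Forest extensions address.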
Hi,
Currently, unsupervised neighbor search is implemented in
sklearn.neighbors. There are three algorithms used to perform neighbor
search.
- kd tree
- ball tree
- brute force method
kd tree and ball tree are implemented separately as Cython
implementations. Both of them use a binary tree (binary_tree.pxi). In the
NeighborsBase class, the above implementations are used.
As Locality Sensitive Hashing will also be used to approximate nearest
neighbors, its implementation should also be a part of
sklearn.neighbors. But Locality Sensitive Hashing algorithms do not
involve a binary tree implementation. Therefore it has to be implemented
separately and used in sklearn.neighbors as another algorithm. Is this
approach correct?
I'm trying to implement a base LSH class and the hashing algorithms
as Cython implementations. To approximate neighbor search with those
algorithms, multiple hash tables will be created from those hash functions,
so an efficient way to store them is required.
These are some of the issues I found troubling while trying to
implement LSH to approximate neighbor search. Is this an appropriate
approach, or are there better ways?
Regards,
Maheshakya,
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Robert Layton
2014-03-09 09:46:32 UTC
Permalink
I think I'd be better as a second mentor -- this would be my first GSoC of
any type.
That said, I am here to help!

*That said*, you'll need to make the case to the community -- LSH Forest
is a variation of a key algorithm. LSH by itself is definitely worthy
of inclusion in scikit-learn, and it looks like LSH Forest would be
sufficiently well-used (Google tells me 138 citations), but I think we need
the community's decision.
Maheshakya Wijewardena
2014-03-09 16:16:56 UTC
Permalink
Thank you Robert.
With this method, it is possible to use the binary tree implementation in
the *sklearn.neighbors* module to store the transformed data points. As the
majority of LSH algorithms produce binary values after hashing, the
concatenation of k hash values can be stored in a binary tree. The height of
each binary tree will be k, and there will be L binary trees (just as there
are L hash tables in the plain LSH-based ANN algorithm).

While searching for a first mentor for this project, I will keep
contributing as I can.

Best Regards,
Maheshakya.
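The scheme above -- concatenated k-bit hashes acting as root-to-leaf paths in L trees, with prefix matching to collect candidates -- might be sketched roughly as follows; the string codes and the `query` helper are illustrative assumptions, not a proposed final design:

```python
import numpy as np

rng = np.random.RandomState(1)

def hash_bits(X, planes):
    """k-bit binary hashes as bit strings: one tree path per sample."""
    return ["".join("1" if b else "0" for b in row) for row in (X @ planes.T) > 0]

k, L, n_features = 8, 3, 16
X = rng.randn(200, n_features)

# One "tree" per table: here just the list of length-k codes for each point.
forests = []
for _ in range(L):
    planes = rng.randn(k, n_features)
    forests.append((planes, hash_bits(X, planes)))

def query(q, min_candidates=5):
    # Descend to depth k, then shorten the matched prefix (move up the
    # tree) until enough candidates are collected -- the LSH Forest idea.
    cands = set()
    for planes, codes in forests:
        qcode = hash_bits(q.reshape(1, -1), planes)[0]
        for prefix_len in range(k, 0, -1):
            hits = {i for i, c in enumerate(codes)
                    if c[:prefix_len] == qcode[:prefix_len]}
            if len(hits) >= min_candidates:
                break
        cands |= hits
    return cands

print(0 in query(X[0]))  # True: the point itself always matches its full code
```

The linear scan over codes is of course only for illustration; with codes kept in sorted order, each prefix lookup becomes a range query, which is where an efficient ordered structure comes in.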
Gael Varoquaux
2014-03-11 02:17:05 UTC
Permalink
That said, you'll need to make the case to the community -- the Forest
LSH is a variation of a key algorithm. LSH by itself is definitely
worthy of inclusion in scikit-learn, and it looks like Forest LSH would
be sufficiently well-used (Google tells me 138 citations), but I think
we need the community's decision.
138 citations on Google Scholar isn't that much given the current
standards of inclusion in scikit-learn. I have probably missed the
discussion, but can someone make a compelling case for LSH Forest in a
few lines and evaluate the cost-to-benefit ratio of implementing it
(factoring in risk analysis and maintenance cost)?

Cheers,

Gaël
Daniel Vainsencher
2014-03-11 08:17:06 UTC
Permalink
My two cents; I'm not a scikit-learn contributor, but I have implemented this
before.

First, it's 138 citations on an important but simple data-structure/query
optimization: replace hashed maps with ordered maps for the range
queries, and use those to get the right number of candidates, instead of
too many or too few depending on local density. The problem of variable
density that it solves was pretty serious in my application, and seems
pretty generic.

The main risk is in trying too hard to optimize the ordered map for
efficiency and compactness. A Judy array would probably be ideal, but
it's C/C++ code with a patent in the background, and another B-tree might
be a hassle as a dependency or to maintain.

Fortunately, sorted arrays and binary search are already very efficient
ordered maps (as long as batched construction is acceptable), and
something more specialized can be swapped in later if justified. This
avoids the main risk completely for this GSoC at low cost.

Other than that, the construction code is the same as in LSH, and the
query code is novel but well localized. Maybe test code for some of the
other NN search methods can be reused.
Daniel Vainsencher
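A sketch of the sorted-array-as-ordered-map idea: with k-bit codes packed into integers and sorted once (batched construction), a prefix of length p corresponds to a contiguous code range, so `np.searchsorted` answers the range query directly. All names here are assumptions for illustration:

```python
import numpy as np

rng = np.random.RandomState(7)
k = 16

# Batched construction: sort the packed hash codes once, keep the permutation
# so query results can be mapped back to original point indices.
codes = rng.randint(0, 2 ** k, size=1000).astype(np.uint32)
order = np.argsort(codes)
sorted_codes = codes[order]

def prefix_range_query(qcode, prefix_len):
    """Indices of all points whose top `prefix_len` bits match qcode's."""
    shift = k - prefix_len
    lo = (qcode >> shift) << shift          # smallest code with this prefix
    hi = lo + (1 << shift)                  # one past the largest
    left = np.searchsorted(sorted_codes, lo, side="left")
    right = np.searchsorted(sorted_codes, hi, side="left")
    return order[left:right]

q = int(codes[0])
hits = prefix_range_query(q, prefix_len=8)
print(0 in hits)  # True: the queried point matches its own prefix
```

Shortening the prefix widens the range monotonically, which is exactly the "descend until few enough candidates" operation LSH Forest needs, and a more specialized ordered map can replace the sorted array later without changing this interface.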
Maheshakya Wijewardena
2014-03-12 13:52:20 UTC
Permalink
Hi Daniel,

You are suggesting that numpy arrays and an implementation of binary
search can be used to provide the functionality of an ordered map,
right? With this structure, trees for the LSH forest can be built. Or we
can go for a Cython implementation as well.

Gaël, what is the status of the evaluation of the cost-to-benefit ratio
for this? I would appreciate it if we can decide on a storage structure
for the LSH implementation quickly, as I have to make a sound plan for
this project.

Best Regards,
Maheshakya.


--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Daniel Vainsencher
2014-03-12 14:59:25 UTC
Permalink
Yes, at least the parts of ordered maps critical to implementing and
testing LSH-Forest. Removal would be expensive, and so would growing it
online.

But the usual usage of:
1. Index a million points
2. Run 10 million ANN queries

would be served very well by numpy.arrays (with dtype=int). Sorting and
binary search there are probably already optimized to death, I wouldn't
spend time on it. Then you can implement and test LSH-Forest itself
relatively quickly, leaving time to solve problems like "how many
candidates should be considered" etc.
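A hypothetical sketch of that index-once, query-many pattern with plain numpy int arrays (all names, sizes, and the candidate rule are invented for illustration, not the actual LSH-Forest query rule):

```python
import numpy as np

# Sketch: index once by sorting integer hash keys, then serve many queries
# with binary search, taking a variable-size window of nearby keys as the
# candidate set instead of a fixed-width bucket.

rng = np.random.default_rng(42)
hashes = np.sort(rng.integers(0, 2**20, size=10_000))  # step 1: index

def candidates(query_hash, m):
    """Indices (into the sorted array) of the m keys closest to query_hash."""
    pos = np.searchsorted(hashes, query_hash)
    lo, hi = max(0, pos - m), min(len(hashes), pos + m)
    window = np.arange(lo, hi)
    # keep the m keys with the smallest hash distance to the query
    return window[np.argsort(np.abs(hashes[window] - query_hash))[:m]]

idx = candidates(500_000, m=32)            # step 2: one of many cheap queries
assert len(idx) == 32
```

The point of the sketch: the number of candidates `m` is chosen per query, which is exactly the knob that fixed-width hash buckets lack.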

Daniel
Gael Varoquaux
2014-03-12 17:03:06 UTC
Permalink
Gaël, What is the status of the evaluation of the cost to benefit ratio for
this?
I don't know. I just cannot be an expert of everything machine-learning
related. Somebody needs to do a cost-benefit analysis. Ideally different
people should comment on it. And yes, it's challenging.
I would appreciate it if we can decide on storing structure for the LSH
implementation quickly as I have to make a sound plan for this project.
Well, anything that uses structure more complex than numpy arrays is much
harder to develop and maintain. You need to factor this in the
cost/benefit analysis. Once again, I stress: much harder!

G
Maheshakya Wijewardena
2014-03-12 20:00:43 UTC
Permalink
In this case, I think the entire structure can be implemented with numpy
arrays, as they support every operation required for this LSH forest
implementation. But I'm not familiar with this kind of cost/benefit
analysis; I need help from those who have more experience.
Robert, can you give me some assistance here?


Robert Layton
2014-03-12 23:00:30 UTC
Permalink
I apologise if I haven't been as active as I would have liked -- I had my
wisdom teeth removed on Tuesday (ouch!).

Daniel -- are you interested in becoming a mentor for this project?

On the LSH forest, I don't have experience with it, but I think Daniel's
point probably allows it to be included. It is an improvement on a standard
algorithm.
(That said, I've always had the opinion of erring on the side of inclusion,
but I understand the push-back on this).

On the data structure, it is my strong preference to make it work in
"native" numpy, if that is possible (i.e. without doing ugly hacks). This
is due mainly to long-term maintainability -- i.e. if we need to fix a bug
and you aren't around, will someone with numpy/scipy experience be able to
work out what is going on?

*That said*, it would be good to allocate some extra time for the analysis
of other data structures in the GSoC project scope. Something like "Week
8-12: Implement other data structure and test in comparison to existing
method". I'm not sure if the GSoC accepts projects with that sort of risk,
I'll find out.

Thanks,

Robert
Daniel Vainsencher
2014-03-13 08:12:04 UTC
Permalink
Robert Layton
2014-03-13 09:22:24 UTC
Permalink
Maheshakya Wijewardena
2014-03-13 10:44:38 UTC
Permalink
I am already working on that. Thank you Robert and Daniel.
Post by
Great. Make sure you sign up (Gael has posted the link elsewhere on this
mailing list).
Maheshakya -- can you start to put together a proposal. For now, I would
include LSH forests.
Post by
Hi Robert,
I would like to help with the subject matter aspects; integration (in
more sense than one) requires more familiarity with the scikit learn
project. If you want to co-mentor it together, that would be cool.
Other data structures worth evaluating (without depending on them) are
BTrees (complicated but do everything reasonably efficiently, and very
well known) and COLA [1]: efficient at random insertion and range
queries, still based on sorted arrays and binary search.
Post by Maheshakya Wijewardena
In this case, I think the entire structure can be implemented in
numpy arrays as they contain an (can be performed on) every operation
required this LSH forest implementation. But I'm not aware of this
cost/benefit analysis. I need help of those who have more experience.
Robert, can you give me some assistance here?
- Is the likely product useful? how hard to maintain?
- Are there parts of the proposal that are unclear or complicated, thus
risking delay?
We've discussed some of those. You should address them all in your
proposal (see [2]) which you should probably write as soon as possible,
the deadline is pretty close.
How should an LSH project be evaluated? one proxy measurement is
improvement on kNN speed vs. accuracy.
Daniel
[1] supertech.csail.mit.edu/papers/sbtree.pdf
[2]
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
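As a sketch of that proxy measurement (purely illustrative; the helper below is not an existing scikit-learn function, and the data is random):

```python
import numpy as np

# Illustrative proxy evaluation: compare an approximate kNN answer against
# exact brute-force kNN and report recall (fraction of true neighbors found).

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
queries, k = X[:10], 5

# exact kNN by brute force on squared Euclidean distances
d2 = ((queries[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
exact = np.argsort(d2, axis=1)[:, :k]

def recall_at_k(approx, exact):
    """Average fraction of the true k nearest neighbors that approx recovered."""
    hits = [len(set(a) & set(e)) for a, e in zip(approx, exact)]
    return sum(hits) / (exact.shape[0] * exact.shape[1])

# an exact answer trivially achieves perfect recall; an ANN method trades
# some recall away for query speed
assert recall_at_k(exact, exact) == 1.0
```

Plotting recall against wall-clock query time for several parameter settings gives the speed-vs-accuracy curve mentioned above.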
Gael Varoquaux
2014-03-13 16:59:53 UTC
Permalink
Post by
Other data structures worth evaluating (without depending on them) are
BTrees
Let's really avoid this for now. It's not like we have too many people
who understand the codebase and are available to maintain it.
To stress things again: we often see projects, whether they are GSOC or
elsewhere, that fail to deliver, because people implement plenty of fancy
things, and do not focus on the quality and maintainability. At the end
of the day, these projects leave _nothing_ for scikit-learn.

There is no point in shooting for the moon first. Remember the 80/20
tradeoff rule: 80% of the benefits (in terms of use cases, speed, ...) can
be achieved with 20% of the effort.

Gaël
Robert Layton
2014-03-13 22:15:10 UTC
Permalink
Thanks Gael. My thinking was to implement "Basic LSH with basic data
structures" and then spend some of the time working on seeing if moderate
improvements (i.e. a more complex data structure) can deliver benefits.
This way, we get the key deliverable, and spend some time trying to see if
we can do better.

I'd also like to see scalability added to the evaluation criteria!
Olivier Grisel
2014-03-14 09:38:17 UTC
Permalink
The problem is that, having had a look at the ANN literature in the
past, my gut feeling is that basic LSH is mostly worthless for
practical applications. I would rather not have a method in
scikit-learn that does not perform well enough (in terms of speed-up
vs. exact brute-force queries) to be useful on real data.

To decide, I think the best way to proceed would be to have an
evaluation / prototyping stage in the project: first hack an
implementation of the basic methods (at least random projections
and maybe others) in a gist outside of the scikit-learn codebase,
without worrying about documentation or compliance with the API, then
benchmark / profile it on representative datasets with various
statistical structures and compare the outcome to alternative methods
such as those implemented in FLANN and Annoy.

Actually there exist several Python implementations of vanilla LSH:

https://pypi.python.org/pypi/lshash
http://nearpy.io/
https://github.com/yahoo/Optimal-LSH/blob/master/lsh.py

It would be interesting to benchmark them all on the same datasets
(with various dimensionality, ranks and sparsity patterns).

Note that Radim already had a look at the Yahoo LSH implementation
and found it impractical to use:

http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2014-03-14 10:23:24 UTC
Permalink
Post by Olivier Grisel
The problem is that by having had a look at the ANN literature in the
past, my gut feeling is that basic LSH is mostly worthless for
practical applications. I would rather not have a method is
scikit-learn that is not performing good enough (in terms of speed up
vs exact brute force queries) to be useful on real data.
Quite clearly. We are not interested in such contribution. Indeed, for my
applications, something like random projections would work much better
than LSH, as far as I have seen.

So what are exactly the usecases and the usage patterns that this
proposal is addressing and for which we should see a big speedup?

Gaël
Olivier Grisel
2014-03-14 11:49:14 UTC
Permalink
Post by Gael Varoquaux
Post by Olivier Grisel
The problem is that by having had a look at the ANN literature in the
past, my gut feeling is that basic LSH is mostly worthless for
practical applications. I would rather not have a method is
scikit-learn that is not performing good enough (in terms of speed up
vs exact brute force queries) to be useful on real data.
Quite clearly. We are not interested in such contribution. Indeed, for my
applications, something like random projections would work much better
than LSH, as far as I have seen.
So what are exactly the usecases and the usage patterns that this
proposal is addressing and for which we should see a big speedup?
It is my understanding that the canonical application of Approx
Nearest Neighbors methods is for multi-media similarity search with
dense, medium-dimensional descriptor vectors. For instance, find the
top 100 similar images in a database of at least several million
images summarized by feature vectors ranging from 100s to 1000s of
descriptors based on possibly pooled BoW encodings of histogram based
local descriptors.

Spotify is using ANN queries for a very different use case: they do
matrix factorization of User-Item interaction matrix and perform
similarity queries between items in the latent space (I would say in
the order of 100 components).

See the readme for slightly more details: https://github.com/spotify/annoy

That sounds like a very useful application. We could try to build a
benchmark dataset by extracting the top 100 SVD components of the
movielens 10M matrix:

http://grouplens.org/datasets/movielens/

but that represents only 72K users and 10K items so I am not sure that
ANN queries would yield a speed up compared to brute force on such
data.

Another use case would be to implement k-Nearest Neighbors
classification on datasets with a dimension high enough to render
exact methods such as KD-tree and ball-tree inefficient (I would say
500+ features).
Gael Varoquaux
2014-03-14 12:32:44 UTC
Permalink
Post by Olivier Grisel
It is my understanding that the canonical application of Approx
Nearest Neighbors methods is for multi-media similarity search with
dense, medium dimensional descriptor vectors. For instance find the
top 100 similar images in a database of at least several millions
images summarized by feature vectors ranging from 100s to 1000s of
descriptors based on possibly pooled BoW encodings of histogram based
local descriptors.
Yes, agreed. However, I have the impression that making this work
requires more than algorithmic choices, in particular some form of
adapted storage.
Post by Olivier Grisel
Spotify is using ANN queries for a very different use case: they do
matrix factorization of User-Item interaction matrix and perform
similarity queries between items in the latent space (I would say in
the order of 100 components).
See the readme for slightly more details: https://github.com/spotify/annoy
Indeed, but they use random projections rather than LSH.
Post by Olivier Grisel
Another user case would be to implement k-Nearest Neighbors
classification on datasets with a dimension high enough to render
exact methods such as KD-tree and ball-tree inefficient (I would say
500+ features).
We do this quite often in the lab, and we simply use a randomized PCA on
the train set. It works very well.
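A minimal sketch of that workflow (dataset and parameters invented; modern scikit-learn spells randomized PCA as `PCA(svd_solver="randomized")` rather than a separate class):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Sketch: fit a randomized PCA on the train set, then run exact kNN in the
# reduced space. Sizes and parameters here are made up for illustration.

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 600))   # 600 features: awkward for kd/ball trees
X_query = rng.normal(size=(10, 600))

pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
Z_train = pca.fit_transform(X_train)     # reduce using train-set statistics only

nn = NearestNeighbors(n_neighbors=5).fit(Z_train)
dist, idx = nn.kneighbors(pca.transform(X_query))
assert idx.shape == (10, 5)
```

The kNN step stays exact; only the space it runs in is approximated, which is the contrast with LSH drawn in the question that follows.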

For simple KNN on numerical features, what is the evidence that LSH works
better than random projections? Forgive me for asking this question, I
may be unaware of the literature.

Gaël
Olivier Grisel
2014-03-14 12:52:35 UTC
Permalink
Post by Gael Varoquaux
Indeed, but they use random projections rather than LSH.
It is my understanding that Annoy implements ANN hashing by using a
data-driven forest of random projections rather than (data-independent)
uniformly distributed RPs to bucket the samples.

Both are hashing schemes, but Annoy focuses on high-density regions of
the dataset, hence is probably more efficient.
Post by Gael Varoquaux
We do this quite often in the lab, and we simply use a randomized PCA on
the train set. It works very well.
For simple KNN on numerical features, what is the evidence that LSH works
better than random projections? Forgive me for asking this question, I
may be unaware of the literature.
Hashing the samples into buckets, as vanilla LSH and the Annoy variant do, is:

- more memory efficient: you don't store the results of the
projections in memory, just the integer index of the buckets,
- faster at query time: you don't need to compute pairwise
distances against the whole (projected or not) dataset, only against the
samples in the same bucket.
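As a concrete illustration of the bucketing scheme described above, here is a minimal vanilla-LSH sketch (illustrative names, not any existing library's API): sign-thresholded random projections form the key, and only bucket keys plus integer sample indices are stored:

```python
import numpy as np

rng = np.random.RandomState(42)

def make_hasher(planes):
    """The LSH key is the sign pattern of the random projections, i.e.
    thresholded Random Projections packed into a hashable tuple of bits."""
    def hash_point(x):
        return tuple((planes @ x > 0).astype(int))
    return hash_point

n_samples, n_features, n_bits = 1000, 50, 10
X = rng.randn(n_samples, n_features)
hash_point = make_hasher(rng.randn(n_bits, n_features))

# Index: only bucket keys and integer sample indices are stored --
# the projection values themselves are thrown away.
buckets = {}
for i, x in enumerate(X):
    buckets.setdefault(hash_point(x), []).append(i)

# Query: hash the query point and compute exact distances only against
# the samples that collide with it in the same bucket.
q = X[3]
candidates = buckets[hash_point(q)]
dists = np.linalg.norm(X[candidates] - q, axis=1)
nearest = candidates[int(np.argmin(dists))]  # index 3 itself, at distance 0
```

In a realistic setup the signature length and the number of hash tables (k and L) would be tuned, which is exactly the usability problem discussed later in this thread.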
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2014-03-14 12:54:02 UTC
Permalink
BTW, to make things clearer: vanilla LSH is thresholded Random
Projections + bucketing, and queries are performed by hashing into
colliding buckets.

So LSH is a memory-efficient approximation to NN queries in RP space.
--
Olivier
Olivier Grisel
2014-03-14 12:56:12 UTC
Permalink
Post by Gael Varoquaux
Indeed, but they use random projections rather than LSH.
It is my understanding that Annoy implements ANN hashing by using
data-driven forests of Random Projections rather than (data-independent)
uniformly distributed RPs to bucket the samples, as
vanilla LSH does.

Both LSH and Annoy-Tree-LSH are hashing schemes, but Annoy focuses
on high-density
regions of the dataset, hence it is probably more efficient.
--
Olivier
Gael Varoquaux
2014-03-14 12:55:30 UTC
Permalink
Post by Olivier Grisel
Post by Gael Varoquaux
Indeed, but they use random projections rather than LSH.
It is my understanding that Annoy implements ANN hashing by using a
data-driven forests of Random Projections rather than (data
independent) uniformly distributed RPs to bucket the samples.
Both are hashing schemes, but Annoy focuses on high-density
regions of the dataset, hence it is probably more efficient.
Post by Gael Varoquaux
Post by Olivier Grisel
Another user case would be to implement k-Nearest Neighbors
classification on datasets with a dimension high enough to render
exact methods such as KD-tree and ball-tree inefficient (I would say
500+ features).
We do this quite often in the lab, and we simply use a randomized PCA on
the train set. It works very well.
For simple KNN on numerical features, what is the evidence that LSH works
better than random projections? Forgive me for asking this question, I
may be unaware of the literature.
- more memory efficient: you don't store the results of the
projections in memory, just the integer index of the buckets,
- faster at query time: you don't need to compute pairwise
distances against the whole (projected or not) dataset, only against the
samples in the same bucket.
OK. This is interesting. I think that the technical part of the proposal
should focus on these implementation considerations, as they are
important to make the proposal viable.

Gaël
Olivier Grisel
2014-03-14 13:07:04 UTC
Permalink
This implementation looks simple and has more readable code than most
(IMHO) and would be worth reviewing / benchmarking as well:

https://github.com/gamboviol/lsh
https://github.com/gamboviol/lsh/blob/master/lsh.py
--
Olivier
Daniel Vainsencher
2014-03-14 18:40:46 UTC
Permalink
Hi Olivier,

Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
paper? I've used it in practice, and it makes a significant improvement
vs. basic LSH. I've proposed in recent emails to the list how to
implement it based on sorting and binary search (in order to avoid the
dependency/maintenance burden of range-queries).

I think evaluating and tweaking vanilla LSH is a waste of time when an
obvious improvement that fixes a significant real-world flaw (control of
the number of candidates seen) in the method is available.

Daniel
Post by Olivier Grisel
Post by Robert Layton
Thanks Gael. My thinking was to implement "Basic LSH with basic data
structures" and then spend some of the time working on seeing if moderate
improvements (i.e. a more complex data structure) can deliver benefits. This
way, we get the key deliverable, and spend some time trying to see if we can
do better.
I'd also like to see scalability added to the evaluation criteria!
The problem is that by having had a look at the ANN literature in the
past, my gut feeling is that basic LSH is mostly worthless for
practical applications. I would rather not have a method in
scikit-learn that does not perform well enough (in terms of speed-up
vs. exact brute-force queries) to be useful on real data.
To decide, I think the best way to proceed would be to have an
evaluation / prototyping stage in the project that would first hack an
implementation of the basic methods (at least random projections
and maybe others) in a gist outside of the scikit-learn codebase, not
to worry about documentation and compliance with the API and benchmark
it / profile it on representative datasets with various statistical
structures and compare the outcome to alternative methods such as the
methods implemented in FLANN and Annoy.
https://pypi.python.org/pypi/lshash
http://nearpy.io/
https://github.com/yahoo/Optimal-LSH/blob/master/lsh.py
It would be interesting to benchmark them all on the same datasets
(with various dimensionality, ranks and sparsity patterns).
Note that Radim already had a look at the yahoo LSH implementation
http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/
Maheshakya Wijewardena
2014-03-14 21:36:03 UTC
Permalink
Isn't the algorithm used in Annoy somewhat similar to LSH-Forest? In
Annoy, a binary tree is built using the result of a random projection on
the training data points at each node. (Random projection is itself a method
to do LSH.) The main difference I see is the data-driven nature of the random
projection in Annoy (only when the metric for AnnoyIndex is 'euclidean').

Maheshakya


On Sat, Mar 15, 2014 at 12:10 AM, Daniel Vainsencher <
Post by Daniel Vainsencher
Hi Olivier,
Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
paper? I've used it in practice, and it makes a significant improvement
vs. basic LSH. I've proposed in recent emails to the list how to
implement it based on sorting and binary search (in order to avoid the
dependency/maintenance burden of range-queries).
I think evaluating and tweaking vanilla LSH is a waste of time, when an
obvious improvement that fixes a significant real world flaw (control of
number of candidates seen) in the method is available.
Daniel
Post by Olivier Grisel
Post by Robert Layton
Thanks Gael. My thinking was to implement "Basic LSH with basic data
structures" and then spend some of the time working on seeing if
moderate
Post by Olivier Grisel
Post by Robert Layton
improvements (i.e. a more complex data structure) can deliver benefits.
This
Post by Olivier Grisel
Post by Robert Layton
way, we get the key deliverable, and spend some time trying to see if
we can
Post by Olivier Grisel
Post by Robert Layton
do better.
I'd also like to see scalability added to the evaluation criteria!
The problem is that by having had a look at the ANN literature in the
past, my gut feeling is that basic LSH is mostly worthless for
practical applications. I would rather not have a method in
scikit-learn that does not perform well enough (in terms of speed-up
vs. exact brute-force queries) to be useful on real data.
To decide, I think the best way to proceed would be to have an
evaluation / prototyping stage in the project that would first hack an
implementation of the basic methods (at least random projections
and maybe others) in a gist outside of the scikit-learn codebase, not
to worry about documentation and compliance with the API and benchmark
it / profile it on representative datasets with various statistical
structures and compare the outcome to alternative methods such as the
methods implemented in FLANN and Annoy.
https://pypi.python.org/pypi/lshash
http://nearpy.io/
https://github.com/yahoo/Optimal-LSH/blob/master/lsh.py
It would be interesting to benchmark them all on the same datasets
(with various dimensionality, ranks and sparsity patterns).
Note that Radim already had a look at the yahoo LSH implementation
http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Maheshakya Wijewardena
2014-03-16 15:56:08 UTC
Permalink
Hi Olivier,

According to the FLANN documentation, among the neighbor search
methods there is an LSH-based approach. It uses multi-probe LSH, but that
type of index can only be used for matching binary features using Hamming
distances. So the LSH-based ANN used there is not usable in a general setting.

According to Radim (
http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/),
Annoy's performance is much better than FLANN's. The only method used in Annoy
is random projection trees, with a little tweak in how the random
hyperplanes are generated. Its structure is very similar to the LSH-Forest
implementation (which Daniel suggested). Maybe if we apply that method to
perform the random projections in LSH-Forest, it may show improvements, so I
suggest testing this approach as well during prototyping.

There is a lot of literature stating that vanilla LSH is not ideal to use in
practice, because it is not memory-efficient and because of the difficulty of
choosing its parameters, the signature length and the number of hash tables
(k and L). Robert, you have experience using the standard LSH algorithm,
right? Can you suggest any improvements that might come in handy while
prototyping?

Best Regards,
Maheshakya.


On Sat, Mar 15, 2014 at 3:06 AM, Maheshakya Wijewardena <
Post by Maheshakya Wijewardena
Isn't the algorithm used in Annoy somewhat similar to LSH-Forest? In
Annoy, a binary tree is built using the result of a random projection on
the training data points at each node. (Random projection is itself a method
to do LSH.) The main difference I see is the data-driven nature of the random
projection in Annoy (only when the metric for AnnoyIndex is 'euclidean').
Maheshakya
On Sat, Mar 15, 2014 at 12:10 AM, Daniel Vainsencher <
Post by Daniel Vainsencher
Hi Olivier,
Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
paper? I've used it in practice, and it makes a significant improvement
vs. basic LSH. I've proposed in recent emails to the list how to
implement it based on sorting and binary search (in order to avoid the
dependency/maintenance burden of range-queries).
I think evaluating and tweaking vanilla LSH is a waste of time, when an
obvious improvement that fixes a significant real world flaw (control of
number of candidates seen) in the method is available.
Daniel
Post by Olivier Grisel
Post by Robert Layton
Thanks Gael. My thinking was to implement "Basic LSH with basic data
structures" and then spend some of the time working on seeing if
moderate
Post by Olivier Grisel
Post by Robert Layton
improvements (i.e. a more complex data structure) can deliver
benefits. This
Post by Olivier Grisel
Post by Robert Layton
way, we get the key deliverable, and spend some time trying to see if
we can
Post by Olivier Grisel
Post by Robert Layton
do better.
I'd also like to see scalability added to the evaluation criteria!
The problem is that by having had a look at the ANN literature in the
past, my gut feeling is that basic LSH is mostly worthless for
practical applications. I would rather not have a method in
scikit-learn that does not perform well enough (in terms of speed-up
vs. exact brute-force queries) to be useful on real data.
To decide, I think the best way to proceed would be to have an
evaluation / prototyping stage in the project that would first hack an
implementation of the basic methods (at least random projections
and maybe others) in a gist outside of the scikit-learn codebase, not
to worry about documentation and compliance with the API and benchmark
it / profile it on representative datasets with various statistical
structures and compare the outcome to alternative methods such as the
methods implemented in FLANN and Annoy.
https://pypi.python.org/pypi/lshash
http://nearpy.io/
https://github.com/yahoo/Optimal-LSH/blob/master/lsh.py
It would be interesting to benchmark them all on the same datasets
(with various dimensionality, ranks and sparsity patterns).
Note that Radim already had a look at the yahoo LSH implementation
http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Olivier Grisel
2014-03-16 19:03:23 UTC
Permalink
Post by Daniel Vainsencher
Hi Olivier,
According to the documentation of the FLANN, among the neighbor search
methods, there is a LSH based approach. It uses multi-probe LSH. But that
type of index can only be used for matching binary features using Hamming
distances. So the LSH-ANN used in that is not usable in general setting.
According to
Radim(http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/)
ANNOYs performance is much better than FLANN.
No, it rather says the opposite (look at the precision-vs-speed
plots). Here is the conclusion by Radim:

"""
FLANN is faster, but Annoy is saner. I like how it has a single
parameter that controls accuracy that “just works”. It offers pleasant
accuracy/speed tradeoffs, a responsive developer, is memory lean,
concise, supports mmap for sharing an index between processes (FLANN
can only share the input matrix, and not the index itself).
"""

I agree with Radim that the focus on simple usage (a nicer
hyperparameter to trade off precision against query speed) is nice if the
raw speed is comparable to that of the other method, which has a less
user-friendly parameterization.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Maheshakya Wijewardena
2014-03-16 19:19:28 UTC
Permalink
Yes, I was considering the accuracy as well when speaking of
performance (actually, I put more weight on that).
I also agree with the focus on Annoy's simple usage and its trade-off of
precision against query speed.


On Mon, Mar 17, 2014 at 12:33 AM, Olivier Grisel
Post by Maheshakya Wijewardena
Post by Daniel Vainsencher
Hi Olivier,
According to the documentation of the FLANN, among the neighbor search
methods, there is a LSH based approach. It uses multi-probe LSH. But that
type of index can only be used for matching binary features using Hamming
distances. So the LSH-ANN used in that is not usable in general setting.
According to
Radim(
http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/
)
Post by Daniel Vainsencher
ANNOYs performance is much better than FLANN.
No, it rather says the opposite (look at the precision-vs-speed
"""
FLANN is faster, but Annoy is saner. I like how it has a single
parameter that controls accuracy that "just works". It offers pleasant
accuracy/speed tradeoffs, a responsive developer, is memory lean,
concise, supports mmap for sharing an index between processes (FLANN
can only share the input matrix, and not the index itself).
"""
I agree with Radim that the focus on simple usage (a nicer
hyperparameter to tradeoff precision with query speed) is nice if the
raw speed is comparable to the other method that has a less
user-friendly parameterization.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Olivier Grisel
2014-03-16 19:25:53 UTC
Permalink
Post by Maheshakya Wijewardena
Yes, I was considering the accuracy as well when speaking of
performance (actually, I put more weight on that).
Even so, on most plots the FLANN points dominate the Annoy
points (in a Pareto-optimal sense).

But the speed difference is not large enough to justify the added
complexity of FLANN IMHO.
--
Olivier
Maheshakya Wijewardena
2014-03-17 04:32:04 UTC
Permalink
OK, I failed to notice that in the Pareto-efficiency sense. But it still makes
sense to try the approach used in Annoy when the maintainability of the
code is considered.
Apart from that, as I mentioned earlier, the only LSH-based ANN method
in FLANN can only match binary features using Hamming
distances, so it won't be of much use for scikit-learn.

So in my opinion, it is better to give a high priority to LSH-Forest-based
ANN when prototyping.


On Mon, Mar 17, 2014 at 12:55 AM, Olivier Grisel
Post by Olivier Grisel
Post by Maheshakya Wijewardena
Yes, I was considering the accuracy as well when speaking of the
performance. (Actually put more weight to that)
Even though, on most plots the FLANN points are dominating the Annoy
plots (in a Pareto-optimal sense).
But the speed difference is not large enough to justify the added
complexity of FLANN IMHO.
--
Olivier
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Daniel Vainsencher
2014-03-17 06:17:22 UTC
Permalink
Olivier Grisel
2014-03-16 18:59:59 UTC
Permalink
Post by Daniel Vainsencher
Hi Olivier,
Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
paper? I've used it in practice, and it makes a significant improvement
vs. basic LSH. I've proposed in recent emails to the list how to
implement it based on sorting and binary search (in order to avoid the
dependency/maintenance burden of range-queries).
No, I did not read those papers (yet). As Maheshakya said, it might very
well be the method implemented in Annoy. Do you have sample Python
code around?
Post by Daniel Vainsencher
I think evaluating and tweaking vanilla LSH is a waste of time, when an
obvious improvement that fixes a significant real world flaw (control of
number of candidates seen) in the method is available.
If the obvious improvement does not come with a significant increment
in code complexity (and maintenance burden) then I agree, yes.

Getting rid of hyperparameters to select manually or via grid-search
is a big usability improvement and can make the method much more
practical to use.
--
Olivier
Daniel Vainsencher
2014-03-17 07:32:14 UTC
Permalink
Post by Olivier Grisel
Post by Daniel Vainsencher
Hi Olivier,
Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
paper? I've used it in practice, and it makes a significant improvement
vs. basic LSH. I've proposed in recent emails to the list how to
implement it based on sorting and binary search (in order to avoid the
dependency/maintenance burden of range-queries).
No I did not read those papers (yet).
I don't know the culture here, but since you seemed to summarize rather
than wait for someone to read up on LSH:
- Where LSH uses hash maps keyed on a random hash of length exactly
k, LSHF uses an ordered map (binary tree or BTree in the paper). This
enables queries of the form
"all keys whose prefix of length h<=k matches this hash". This trivially
enables a per-tree query of the type "give me the best q hashes (and maybe a
few more)" by doing queries with h=k, h=k-1, ... until satisfied (this
can be optimized in different ways).
- Then queries across multiple trees can be combined in different ways,
the simplest being "give me at least q/L candidates from each tree, take
the union". More complex variants don't seem to make much difference.
- So, the parameter k becomes essentially per-query, L can be determined
on its own, and we can determine the number of expensive "real distance"
comparisons we want to pay per query.

I offer the simple observation that sorted arrays support very fast
range queries, so that complex/expensive trees are not necessary for an
LSHF implementation.
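A minimal sketch of that observation (illustrative names and my own bit-packing choices, not code from the LSH-Forest paper): pack each k-bit hash into an integer, sort once, and answer prefix queries with binary search, shortening the prefix until enough candidates turn up:

```python
from bisect import bisect_left

import numpy as np

def build_tree_index(X, planes):
    """One tree of an LSH Forest backed by a plain sorted array instead of
    a BTree: each sample's k-bit hash is packed into an integer, and
    sorting the integers turns every prefix query into a binary-searchable
    contiguous range."""
    def key(x):
        packed = 0
        for b in (planes @ x > 0).astype(int):
            packed = (packed << 1) | int(b)
        return packed
    pairs = sorted((key(x), i) for i, x in enumerate(X))
    keys = [p[0] for p in pairs]     # sorted packed hashes
    ids = [p[1] for p in pairs]      # sample indices in the same order
    return keys, ids, key

def query_tree(keys, ids, key, n_bits, x, min_candidates):
    """Relax the matched prefix length h = k, k-1, ... (the descending-h
    scheme described above) until enough candidates are collected."""
    qkey = key(x)
    for h in range(n_bits, -1, -1):
        shift = n_bits - h
        lo_key = (qkey >> shift) << shift        # keep only the h-bit prefix
        lo = bisect_left(keys, lo_key)
        hi = bisect_left(keys, lo_key + (1 << shift))
        if hi - lo >= min_candidates:
            return ids[lo:hi]
    return ids  # h == 0 matches everything

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
planes = rng.randn(8, 10)
keys, ids, key = build_tree_index(X, planes)
cands = query_tree(keys, ids, key, 8, X[5], min_candidates=10)
```

Across L such trees, the "at least q/L candidates from each tree, take the union" combination described above is then a union of these per-tree candidate lists before the exact distance reranking.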

Post by Olivier Grisel
As Maheshakya said it might very
well be the method implemented in Annoy.
It seems similar but distinct; see a recent mail.
Post by Olivier Grisel
Do you have sample python
code around?
No, I implemented it in C#, and source is not available.
Post by Olivier Grisel
Post by Daniel Vainsencher
I think evaluating and tweaking vanilla LSH is a waste of time, when an
obvious improvement that fixes a significant real world flaw (control of
number of candidates seen) in the method is available.
If the obvious improvement does not come with a significant increment
in code complexity (and maintenance burden) then I agree, yes.
So, was the above sufficient to answer your prerequisite?
Post by Olivier Grisel
Getting rid of hyperparameters to select manually or via grid-search
is a big usability improvement and can make the method much more
practical to use.
In this case, performance becomes better than mere tuning would allow,
since it adapts to database density around the query point.

Daniel
Maheshakya Wijewardena
2014-03-17 19:55:58 UTC
Permalink
Hi,

I have written a proposal for this project and it is in the wiki,
https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search

It is not complete yet; I have not written the timeline, but I have
described my plan for the project. I think the prototyping part of this
project should be started during the community bonding period, alongside
fixing other issues, because it needs to be done as soon as possible. I will
update the proposal with my timeline soon.

Could you review it and give me your feedback?

Thank you,
Maheshakya.


On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher <
Post by Daniel Vainsencher
2014-03-14 19:40 GMT+01:00 Daniel Vainsencher <
Post by Daniel Vainsencher
Hi Olivier,
Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
paper? I've used it in practice, and it makes a significant improvement
vs. basic LSH. I've proposed in recent emails to the list how to
implement it based on sorting and binary search (in order to avoid the
dependency/maintenance burden of range-queries).
No I did not read those papers (yet).
I don't know the culture here, but since you seemed to summarize rather
- Where LSH uses hash maps keyed on a random hash of length exactly
k, LSHF uses an ordered map (binary tree or BTree in the paper). This
"all keys whose prefix of length h<=k matches this hash". This trivially
enables a per tree query of type "give me the best q hashes (and maybe a
few more)" by doing queries with h=k, h=k-1,... until satisfied (this
can be optimized in different ways).
- Then queries across multiple trees can be combined in different ways,
the simplest being "give me at least q/L candidates from each tree, take
the union". More complex variants don't seem to make much difference.
- So, the parameter k becomes essentially per-query, L can be determined
on its own, and we can determine the number of expensive "real distance"
comparisons we want to pay per query.
I offer the simple observation that sorted arrays support very fast
range queries, so that complex/expensive trees are not necessary for an
LSHF implementation.
As Maheshakya said it might very
well be the method implemented in Annoy.
It seems similar but distinct, see a recent mail.
Do you have sample python
code around?
No, I implemented it in C#, and source is not available.
Post by Daniel Vainsencher
I think evaluating and tweaking vanilla LSH is a waste of time, when an
obvious improvement that fixes a significant real world flaw (control of
number of candidates seen) in the method is available.
If the obvious improvement does not come with a significant increment
in code complexity (and maintenance burden) then I agree, yes.
So, was the above sufficient to answer your prerequisite?
Getting rid of hyperparameters to select manually or via grid-search
is a big usability improvement and can make the method much more
practical to use.
In this case, performance becomes better than mere tuning would allow,
since it adapts to database density around the query point.
Daniel
Robert Layton
2014-03-17 22:38:53 UTC
Permalink
Thanks for sending this.

The grammar needs updating, but I'll let you worry about proofreading.
Comments below:
- State some *evidence* for your experience with python, i.e. projects you
worked on or links to big commits (this is covered late in the application,
move some information forward).
- Don't choose a project because "it was on the list" -- why is the problem
important and what can *you* do to help?
- Be more confident! "This is the most vital part of this project as I
think." Ensure the whole project is planned and evaluated.
- Can you talk a bit more about which hashing algorithms will be
implemented? This is a major part of the work, but isn't appropriately
covered.
- In your abstract, I would make your plan explicit -- you want to implement
prototypes, evaluate them, and then finish the implementations. Make this
absolutely clear (currently, the reader has to piece it together for
themselves).
- References need to be done more thoroughly.

I think it's a good project; now you need to sell it!
Post by Maheshakya Wijewardena
Hi,
I have written a proposal for this project and it is in the wiki,
https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search
It is not complete yet. I have not written the timeline yet. But I have
described my plan for this project. I think the prototyping part of this
project should be started during the community bonding period, while fixing
other issues because that needs to be done as soon as possible. I will
update this with my timeline soon.
Can you review this and give me your feedback.
Thank you,
Maheshakya.
On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher <
Post by Daniel Vainsencher
2014-03-14 19:40 GMT+01:00 Daniel Vainsencher <
Post by Daniel Vainsencher
Hi Olivier,
Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
paper? I've used it in practice, and it makes a significant improvement
vs. basic LSH. I've proposed in recent emails to the list how to
implement it based on sorting and binary search (in order to avoid the
dependency/maintenance burden of range-queries).
No I did not read those papers (yet).
I don't know the culture here, but since you seemed to summarize rather
- Where LSH uses hash-maps keyed on the a random hash of length exactly
k, LSHF uses an ordered map (binary tree or BTree in the paper). This
"all keys whose prefix of length h<=k matches this hash". This trivially
enables a per tree query of type "give me the best q hashes (and maybe a
few more)" by doing queries with h=k, h=k-1,... until satisfied (this
can be optimized in different ways).
- Then queries across multiple trees can be combined in different ways,
the simplest being "give me at least q/L candidates from each tree, take
the union". More complex variants don't seem to make much difference.
- So, the parameter k becomes essentially per-query, L can be determined
on its own, and we can determine the number of expensive "real distance"
comparisons we want to pay per query.
I offer the simple observation that sorted arrays support very fast
range queries, so that complex/expensive trees are not necessary for an
LSHF implementation.
As Maheshakya said it might very
well be the method implemented in Annoy.
It seems similar but distinct, see a recent mail.
Do you have sample python
code around?
No, I implemented it in C#, and source is not available.
Post by Daniel Vainsencher
I think evaluating and tweaking vanilla LSH is a waste of time, when an
obvious improvement that fixes a significant real world flaw (control
of
Post by Daniel Vainsencher
number of candidates seen) in the method is available.
If the obvious improvement does not come with a significant increment
in code complexity (and maintenance burden) then I agree, yes.
So, was the above was sufficient to answer your prerequisite?
Getting rid of hyperparameters to select manually of via grid-search
is a big usability improvement and can make the method much more
practical to use.
In this case, performance becomes better than mere tuning would allow,
since it adapts to database density around the query point.
Daniel
Maheshakya Wijewardena
2014-03-18 14:47:40 UTC
Permalink
Thank you, Robert, for your valuable feedback. I will update my proposal
accordingly as soon as possible and let you know.

However, I prepared this according to the template given by the Python
organization, so I placed the information about my past projects in the
position the template specifies. If scikit-learn expects it to be
otherwise, I can change it.

I have another thing to discuss before finalizing my proposal. Mr. Wei
Dong has informed me that there is another state-of-the-art approach he
implemented for ANN search, and he asked whether it can be used in
the evaluation stage. This implementation is called
KGraph <http://www.kgraph.org/index.php?n=Main.Python>.
It includes comparisons with FLANN; I have started reading about it. Maybe
you can have a look into that as well.

Best regards,
Maheshakya
Post by Robert Layton
Thanks for sending this.
The grammar needs updating, but I'll let you worry about proofreading.
- State some *evidence* for your experience with Python, i.e. projects
you worked on or links to big commits (this is covered late in the
application, move some information forward).
- Don't choose a project because "it was on the list" -- why is the
problem important and what can *you* do to help?
- Be more confident! "This is the most vital part of this project as I
think." Ensure the whole project is planned and evaluated.
- Can you talk a bit more about which hashing algorithms will be
implemented? This is a major part of the work, but isn't appropriately
covered.
- In your abstract, make your plan explicit -- you want to
implement prototypes, evaluate them and then finish the implementations.
Make this absolutely clear (currently, the reader has to piece this
together for themselves)
- References need to be done more thoroughly.
I think it's a good project, now you need to sell it!
Post by Maheshakya Wijewardena
Hi,
I have written a proposal for this project and it is in the wiki,
https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search
It is not complete yet; I have not written the timeline. But I have
described my plan for this project. I think the prototyping part of this
project should be started during the community bonding period, alongside
fixing other issues, because it needs to be done as soon as possible. I
will update the proposal with my timeline soon.
Can you review it and give me your feedback?
Thank you,
Maheshakya.
On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher <
Post by Robert Layton
2014-03-14 19:40 GMT+01:00 Daniel Vainsencher <
Post by Daniel Vainsencher
Hi Olivier,
Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
paper? I've used it in practice, and it makes a significant
improvement
Post by Daniel Vainsencher
vs. basic LSH. I've proposed in recent emails to the list how to
implement it based on sorting and binary search (in order to avoid the
dependency/maintenance burden of range-queries).
No I did not read those papers (yet).
I don't know the culture here, but since you seemed to summarize rather
- Where LSH uses hash maps keyed on a random hash of length exactly
k, LSHF uses an ordered map (binary tree or B-tree in the paper). This
supports queries of the form "all keys whose prefix of length h<=k
matches this hash", which trivially enables a per-tree query of type
"give me the best q hashes (and maybe a few more)" by doing queries
with h=k, h=k-1, ... until satisfied (this can be optimized in
different ways).
- Then queries across multiple trees can be combined in different ways,
the simplest being "give me at least q/L candidates from each tree, take
the union". More complex variants don't seem to make much difference.
- So, the parameter k becomes essentially per-query, L can be determined
on its own, and we can determine the number of expensive "real distance"
comparisons we want to pay per query.
I offer the simple observation that sorted arrays support very fast
range queries, so that complex/expensive trees are not necessary for an
LSHF implementation.
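That observation can be sketched in a few lines of Python. This is a hypothetical illustration (the function names and the signed-random-projection hash family are my choices, not a scikit-learn API): each tree keeps its hash keys in a sorted list, and a query shrinks the prefix length h from k downward, using binary search for each prefix range, until enough candidates are found.

```python
import numpy as np
from bisect import bisect_left, bisect_right

def hash_points(X, planes):
    # Signed random projections: one bit per hyperplane.
    bits = (X @ planes.T) > 0
    return ["".join("1" if b else "0" for b in row) for row in bits]

def build_tree(X, k, rng):
    planes = rng.normal(size=(k, X.shape[1]))
    keys = hash_points(X, planes)
    order = sorted(range(len(keys)), key=keys.__getitem__)
    return planes, [keys[i] for i in order], order

def query_tree(q, planes, sorted_keys, order, min_candidates):
    key = hash_points(q.reshape(1, -1), planes)[0]
    for h in range(len(key), 0, -1):       # h = k, k-1, ... (per-query k)
        prefix = key[:h]
        # Binary search for the range of keys sharing this prefix.
        lo = bisect_left(sorted_keys, prefix)
        hi = bisect_right(sorted_keys, prefix + "\uffff")
        if hi - lo >= min_candidates:
            return [order[i] for i in range(lo, hi)]
    return list(order)                     # degenerate case: everything

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
planes, sorted_keys, order = build_tree(X, k=12, rng=rng)
candidates = query_tree(X[0], planes, sorted_keys, order, min_candidates=5)
assert 0 in candidates  # a point always shares every prefix with itself
```

Candidates from several such trees would then be unioned and re-ranked by true distance, as described above.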
As Maheshakya said, it might very well be the method implemented in Annoy.
It seems similar but distinct; see a recent mail.
Do you have sample Python code around?
No, I implemented it in C#, and the source is not available.
Post by Daniel Vainsencher
I think evaluating and tweaking vanilla LSH is a waste of time, when an
obvious improvement that fixes a significant real-world flaw (control of
the number of candidates seen) in the method is available.
If the obvious improvement does not come with a significant increment
in code complexity (and maintenance burden) then I agree, yes.
So, was the above sufficient to answer your prerequisite?
Getting rid of hyperparameters to select manually or via grid-search
is a big usability improvement and can make the method much more
practical to use.
In this case, performance becomes better than mere tuning would allow,
since it adapts to database density around the query point.
Daniel
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Daniel Vainsencher
2014-03-18 20:17:51 UTC
Permalink
Hi Maheshakya,

I've read your proposal. Remember that evaluating doesn't bring home the
bacon: you want an effective ANN solution that scikit-learn can actually
integrate. By all accounts, plain LSH is not competitive, so you
should evaluate it only to the extent that it requires almost no extra
effort. Since (to my understanding) scikit-learn prefers not to take on
any dependencies, and prefers not to maintain complex tree-based data
structures, you should be wary of where you spend implementation time. Do
you have a practical plan for implementing any promising variant other
than LSH-Forest?

Concretely, I would implement LSHForest first, then create a variant query
method that uses constant k (thus implementing plain LSH). Connect to one
or two libraries that are easy to integrate (FLANN, Annoy, KGraph?),
and evaluate variations against them. An implementation in
scikit-learn that is in the ballpark of any of those would be an
excellent achievement, in my opinion, for a GSoC. Is there benchmarking
code that you can adapt into a scikit-learn demo/benchmark?
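A minimal recall harness along these lines is easy to sketch (hypothetical names, not an existing scikit-learn benchmark): it compares any candidate-generating index, exposed here as a plain `query` callable, against exact brute-force neighbors.

```python
import numpy as np

def exact_knn(X, q, k):
    # Ground truth: indices of the k nearest points by Euclidean distance.
    d = np.linalg.norm(X - q, axis=1)
    return set(np.argsort(d)[:k])

def recall_at_k(X, queries, ann_query, k):
    # Fraction of true k-nearest neighbors recovered by the ANN index.
    hits = total = 0
    for q in queries:
        truth = exact_knn(X, q, k)
        approx = set(ann_query(q)[:k])
        hits += len(truth & approx)
        total += k
    return hits / total

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))

# Sanity check: a trivially exact "index" reaches recall 1.0.
brute = lambda q: list(np.argsort(np.linalg.norm(X - q, axis=1)))
print(recall_at_k(X, X[:10], brute, k=5))  # 1.0
```

The same `recall_at_k` call would take an LSHForest, FLANN, or Annoy query function in place of `brute`.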

Remember to allocate time to tweaking the query code: there are a couple
of different algorithms in the LSH-Forest paper, and many variants can be
invented.

Daniel
Robert Layton
2014-03-18 22:18:12 UTC
Permalink
I see what you are saying about the template, but you can also add extra
information up front!
My reasoning behind that comment -- I read your background and, at that
point, didn't understand why you were submitting for a Python project
specifically.

Afterwards, it is clear, but a rule of thumb for grant applications: don't
make the reviewers guess, and don't surprise them!
Maheshakya Wijewardena
2014-03-19 15:06:46 UTC
Permalink
Hi Robert,

Post by Robert Layton
I see what you are saying about the template, but you can also add extra
information up front!
I have indicated my previous machine-learning work in my personal
statement, but the links to and descriptions of my work remain in the
same position.

Post by Daniel Vainsencher
I've read your proposal. Remember that evaluating doesn't bring home the
bacon: you want an effective ANN solution that scikit-learn can actually
integrate. By all accounts, plain LSH is not competitive. Then you
should evaluate it only to the extent it requires almost no extra
effort. Since (to my understanding) scikit-learn prefers not to take on
any dependencies, and prefers not to maintain complex tree based data
structures, you should be wary where you spend implementation time. Do
you have a practical plan for implementing any promising variant other
than LSH-Forest?
Concretely, I would implement LSHForest first; create a variant query
method that uses constant k (thus implementing plain LSH). Connect to 1
or 2 of libraries that are easy to integrate (of FLANN, Annoy, kgraph?),
and evaluate variations comparing to them. An implementation in
scikit-learn that is in the ballpark of any of those would be an
excellent achievement in my opinion for a GSOC. Is there benchmarking
code that you can adopt into a scikit learn demo/benchmark?
Daniel, as you have mentioned, I won't put much effort into implementing the
vanilla LSH algorithm; I will just use existing Python implementations for
evaluation. Moreover, my priority is to prototype LSH-Forest and analyze it
(this is indicated in the updated proposal), but I will allocate some time
to Annoy as well. I am currently preparing my timeline for the project.

I think we do not need to connect the LSH-Forest prototype to those libraries;
we just need a quick hack for the data structure and get it tested. But the
results will have to be compared with them.

Currently, I do not have benchmark code, but I will design some in the way
that Radim used in his evaluation.

Best Regards,
Maheshakya
Post by Robert Layton
I see what you are saying about the template, but you can also add extra
information up the front!
My reasoning behind that comment -- I read your background and, at that
point, didn't understand why you were submitting for a python project
specifically.
Afterwards, it is clear, but a rule of thumb for grant applications: don't
make the reviewers guess, and don't surprise them!
Post by Daniel Vainsencher
Hi Maheshakya,
I've read your proposal. Remember that evaluating doesn't bring home the
bacon: you want an effective ANN solution that scikit learn can actually
integrate. By all accounts, plain LSH is not competitive. Then you
should evaluate it only to the extent it requires almost no extra
effort. Since (to my understanding) scikit-learn prefers not to take on
any dependencies, and prefers not to maintain complex tree based data
structures, you should be wary where you spend implementation time. Do
you have a practical plan for implementing any promising variant other
than LSH-Forest?
Concretely, I would implement LSHForest first; create a variant query
method that uses constant k (thus implementing plain LSH). Connect to 1
or 2 of libraries that are easy to integrate (of FLANN, Annoy, kgraph?),
and evaluate variations comparing to them. An implementation in
scikit-learn that is in the ballpark of any of those would be an
excellent achievement in my opinion for a GSOC. Is there benchmarking
code that you can adopt into a scikit learn demo/benchmark?
Remember to allocate time to tweaking the query code: there are a couple
of different algorithms in the LSH F paper, and many variants can be
invented.
Daniel
Post by Maheshakya Wijewardena
Thank you Robert for your valuable feedback. I will update my proposal
accordingly asap and let you know.
However, I prepared this according to the template given by Python
organization. So I have put information about the projects I have done
in that position according to template. If scikit-learn expects it to be
otherwise, I can change it.
I have another thing to discuss here before finalizing my proposal.
Mr. Wei Dong has informed me that there is another state-of-art approach
he implemented to perform ANN search and asked to see if that can be
used in the evaluating stage. This implementation is called KGraph
<http://www.kgraph.org/index.php?n=Main.Python>. It has comparisons
with
Post by Maheshakya Wijewardena
FLANN, I have started reading it. Maybe you can have a look into that as
well.
Best regards,
Maheshakya
Thanks for sending this.
The grammar needs updating, but I'll let you worry about
- State some /evidence/ for your experience with python, i.e.
projects you worked on or links to big commits (this is covered late
in the application, move some information forward).
- Don't choose a project because "it was on the list" -- why is the
problem important and what can *you* do to help?
- Be more confident! "This is the most vital part of this project as
I think." Ensure the whole project is planned and evaluated.
- Can you talk a bit more about which hashing algorithms will be
implemented? This is a major part of the work, but isn't
appropriately covered.
- In your abstract, I would make certain your plan -- you want to
implement prototypes, evaluate them and then finish the
implementations. Make this absolutely clear (currently, the reader
has to piece this together for themselves)
- References need to be done more thoroughly.
I think it's a good project, now you need to sell it!
On 18 March 2014 06:55, Maheshakya Wijewardena
Hi,
I have written a proposal for this project and it is in the
wiki,
https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search
Post by Maheshakya Wijewardena
It is not complete yet. I have not written the timeline yet, but
I have described my plan for this project. I think the
prototyping part of this project should be started during the
community bonding period, while fixing other issues, because it
needs to be done as soon as possible. I will update this with my
timeline soon.
Can you review this and give me your feedback?
Thank you,
Maheshakya.
On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher
2014-03-14 19:40 GMT+01:00 Daniel Vainsencher
Post by Daniel Vainsencher
Hi Olivier,
Have you looked at the LSH-Forest (and to a lesser extent, Multiprobe)
paper? I've used it in practice, and it makes a significant improvement
vs. basic LSH. I've proposed in recent emails to the list how to
implement it based on sorting and binary search (in order to avoid the
dependency/maintenance burden of range queries).
No, I did not read those papers (yet).
I don't know the culture here, but since you seemed to summarize rather
than read, here is a summary:
- Where LSH uses hash maps keyed on a random hash of length exactly
k, LSHF uses an ordered map (binary tree or B-tree in the paper). This
supports queries of the form "all keys whose prefix of length h <= k
matches this hash". This trivially enables a per-tree query of the type
"give me the best q hashes (and maybe a few more)" by doing queries with
h = k, h = k-1, ... until satisfied (this can be optimized in different
ways).
- Then queries across multiple trees can be combined in different ways,
the simplest being "give me at least q/L candidates from each tree, take
the union". More complex variants don't seem to make much difference.
- So, the parameter k becomes essentially per-query, L can be determined
on its own, and we can decide the number of expensive "real distance"
comparisons we want to pay per query.
I offer the simple observation that sorted arrays support very fast
range queries, so that complex/expensive trees are not necessary for an
LSHF implementation.
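The sorted-array idea can be sketched in a few lines of Python. To be clear, this is a hypothetical illustration and not scikit-learn code (nor the C# implementation mentioned below): hashes are random-hyperplane sign bits stored as bit strings, each "tree" is just a sorted list, and `bisect` recovers the range of keys sharing an h-bit prefix. All names (`build_tree`, `query_tree`, `lsh_forest_query`, `min_candidates`) are invented for the sketch.

```python
import bisect
import numpy as np

rng = np.random.default_rng(0)

def build_tree(X, k):
    """One 'tree' of the forest: hash every point to a k-bit key
    (random-hyperplane signs) and keep (key, index) pairs sorted."""
    planes = rng.standard_normal((k, X.shape[1]))
    keys = ["".join("1" if b else "0" for b in row) for row in X @ planes.T > 0]
    return planes, sorted(zip(keys, range(len(X))))

def query_tree(planes, tree, x, min_candidates):
    """Descend h = k, k-1, ...: keys sharing an h-bit prefix with the
    query form a contiguous range of the sorted list, found by bisection."""
    key = "".join("1" if b else "0" for b in planes @ x > 0)
    for h in range(len(key), -1, -1):
        lo = bisect.bisect_left(tree, (key[:h],))
        hi = bisect.bisect_left(tree, (key[:h] + "2",))  # "2" > any bit char
        if hi - lo >= min_candidates or h == 0:
            return {idx for _, idx in tree[lo:hi]}

def lsh_forest_query(forest, X, x, n_neighbors=3, min_candidates=10):
    """Union the candidates over the trees, then rank by true distance."""
    cand = set()
    for planes, tree in forest:
        cand |= query_tree(planes, tree, x, min_candidates)
    return sorted(cand, key=lambda i: np.linalg.norm(X[i] - x))[:n_neighbors]

X = rng.standard_normal((200, 16))
forest = [build_tree(X, k=12) for _ in range(4)]
print(lsh_forest_query(forest, X, X[0]))  # X[0] itself comes back first
```

Shortening the matched prefix until at least `min_candidates` keys fall in the range is the h = k, k-1, ... descent described above, and the union plus exact re-ranking follows the "at least q/L candidates from each tree, take the union" rule.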
As Maheshakya said, it might very
well be the method implemented in Annoy.
It seems similar but distinct; see a recent mail.
Do you have sample Python
code around?
No, I implemented it in C#, and the source is not available.
Post by Daniel Vainsencher
I think evaluating and tweaking vanilla LSH is a waste of time, when an
obvious improvement that fixes a significant real-world flaw (control of
the number of candidates seen) in the method is available.
If the obvious improvement does not come with a significant increment
in code complexity (and maintenance burden) then I agree, yes.
So, was the above sufficient to answer your prerequisite?
Post by Maheshakya Wijewardena
Getting rid of hyperparameters to select manually or via grid search
is a big usability improvement and can make the method much more
practical to use.
In this case, performance becomes better than mere tuning would allow,
since it adapts to database density around the query point.
Daniel
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Maheshakya Wijewardena
2014-03-19 19:38:37 UTC
Permalink
Hi all,
I have updated the timeline in my proposal. Can you review it and give me
your feedback before I finalize it?
Thank you,
Maheshakya.
Maheshakya Wijewardena
2014-03-20 03:05:48 UTC
Permalink
I have submitted my proposal to Melange. Now you can comment on it from
there as well.
http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/maheshakya/5649050225344512?validated=True#
Best regards,
Maheshakya
Maheshakya Wijewardena
2014-03-19 16:12:32 UTC
Permalink
Daniel,

Moreover, those libraries (FLANN, ANNOY, and KGraph) are not implemented in
pure Python, though they have Python wrappers. Their complexity is also
high (except for ANNOY). So integrating my prototypes into them will not be
feasible. I think the best approach is to have a quick, simple implementation
of LSH-Forest, only to test it on the selected datasets.

On Wed, Mar 19, 2014 at 8:36 PM, Maheshakya Wijewardena <
Hi Robert,
Post by Robert Layton
I see what you are saying about the template, but you can also add extra
information up the front!
I have made an indication of what I have done related to machine learning
previously in my personal statement. But the links and the description of my
work remain in the same position.
Post by Daniel Vainsencher
I've read your proposal. Remember that evaluating doesn't bring home the
bacon: you want an effective ANN solution that scikit-learn can actually
integrate. By all accounts, plain LSH is not competitive. Then you
should evaluate it only to the extent it requires almost no extra
effort. Since (to my understanding) scikit-learn prefers not to take on
any dependencies, and prefers not to maintain complex tree-based data
structures, you should be wary where you spend implementation time. Do
you have a practical plan for implementing any promising variant other
than LSH-Forest?
Concretely, I would implement LSHForest first; create a variant query
method that uses constant k (thus implementing plain LSH). Connect to 1
or 2 libraries that are easy to integrate (of FLANN, Annoy, KGraph?),
and evaluate variations comparing to them. An implementation in
scikit-learn that is in the ballpark of any of those would be an
excellent achievement, in my opinion, for a GSOC. Is there benchmarking
code that you can adopt into a scikit-learn demo/benchmark?
Daniel, as you have mentioned, I won't put much effort into implementing the
vanilla LSH algorithm. I will just use existing Python implementations for
evaluation. Moreover, my priority is to prototype LSH-Forest and analyze it
(this is indicated in the updated proposal). But I will allocate some time
to ANNOY as well. I am currently preparing my timeline for the project.
I think we do not need to connect the LSH-Forest prototype to those libraries.
We just need a quick hack for the data structure and get it tested. But the
results will have to be compared with them.
Currently, I do not have benchmark code. But I will design it in the
way that Radim has used in his evaluation.
Best Regards,
Maheshakya
Post by Robert Layton
I see what you are saying about the template, but you can also add extra
information up the front!
My reasoning behind that comment: I read your background and, at that
point, didn't understand why you were submitting for a Python project
specifically.
Don't make the reviewers guess, and don't surprise them!
Post by Daniel Vainsencher
Hi Maheshakya,
[...]
Remember to allocate time to tweaking the query code: there are a couple
of different algorithms in the LSH-Forest paper, and many variants can be
invented.
Daniel
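For contrast, the constant-k variant mentioned above (i.e. plain LSH) would reduce each tree to an ordinary hash table keyed on the full k-bit hash only. This is again a hypothetical sketch with invented names (`build_table`, `query`), using random-hyperplane hashes, not any library's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_table(X, k):
    """Plain LSH: one hash table keyed on the full k-bit hash only."""
    planes = rng.standard_normal((k, X.shape[1]))
    table = {}
    for i, bits in enumerate(X @ planes.T > 0):
        table.setdefault(bits.tobytes(), []).append(i)
    return planes, table

def query(tables, X, x, n_neighbors=3):
    """Union the exact-key buckets over the tables, rank by true distance."""
    cand = set()
    for planes, table in tables:
        cand |= set(table.get(((planes @ x) > 0).tobytes(), []))
    return sorted(cand, key=lambda i: np.linalg.norm(X[i] - x))[:n_neighbors]

X = rng.standard_normal((500, 8))
tables = [build_table(X, k=6) for _ in range(8)]
print(query(tables, X, X[0]))  # X[0] itself comes back first
```

With k fixed, an unlucky query can land in near-empty buckets and return too few candidates; there is no way to relax the match, which is exactly the "control of the number of candidates seen" flaw that LSH-Forest's variable prefix length fixes.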
Maheshakya Wijewardena
2014-03-19 19:17:50 UTC
Permalink
Hi all,
I have updated the timeline in my proposal. Can you review it and give me
your feedback before I finalize it.

Thank you,
Maheshakya.


On Wed, Mar 19, 2014 at 9:42 PM, Maheshakya Wijewardena <
Post by Maheshakya Wijewardena
Daniel,
Moreover, those libraries(FLANN, ANNOY and Kgraph) are not implemented in
pure Python though they have Python wrappers. Their complexity is also
high(except for ANNOY). So integrating my prototypes into those will not be
feasible. I think the best is to have quick simple implementation of
LSH-Forest only to test on the selected datasets.
On Wed, Mar 19, 2014 at 8:36 PM, Maheshakya Wijewardena <
Post by
Hi Robert,
I see what you are saying about the template, but you can also add extra
Post by Robert Layton
information up the front!
I have made an indication of what I have done related to machine learning
previously in the my personal statement. But the links and description my
work remain at the same position.
I've read your proposal. Remember that evaluating doesn't bring home the
Post by Robert Layton
bacon: you want an effective ANN solution that scikit learn can actually
integrate. By all accounts, plain LSH is not competitive. Then you
should evaluate it only to the extent it requires almost no extra
effort. Since (to my understanding) scikit-learn prefers not to take on
any dependencies, and prefers not to maintain complex tree based data
structures, you should be wary where you spend implementation time. Do
you have a practical plan for implementing any promising variant other
than LSH-Forest?
Concretely, I would implement LSHForest first; create a variant query
method that uses constant k (thus implementing plain LSH). Connect to 1
or 2 of libraries that are easy to integrate (of FLANN, Annoy, kgraph?),
and evaluate variations comparing to them. An implementation in
scikit-learn that is in the ballpark of any of those would be an
excellent achievement in my opinion for a GSOC. Is there benchmarking
code that you can adopt into a scikit learn demo/benchmark?
Daniel, as you have mentioned I wont put much effort on implementing the
vanilla LSH algorithm. I will just use existing Python implementations for
evaluation. Moreover, My priority is to prototype LSH-Forest and analyze it
(This is indicated in the updated proposal). But I will allocate sometime
to ANNOY as well.I am currently preparing my timeline for the project.
I think we do not need to connect LSH-Forest prototype to those
libraries. We just need a quick hack for the data structure and get it
tested. But the result will have to be compared with them.
Currently, I do not have a bench mark code. But I will design one in the
way that Radim has used in his evaluation.
Best Regards,
Maheshakya
Post by Robert Layton
I see what you are saying about the template, but you can also add extra
information up the front!
My reasoning behind that comment -- I read your background and, at that
point, didn't understand why you were submitting for a python project
specifically.
don't make the reviewers guess, and don't surprise them!
Post by Daniel Vainsencher
Hi Maheshakya,
I've read your proposal. Remember that evaluating doesn't bring home the
bacon: you want an effective ANN solution that scikit learn can actually
integrate. By all accounts, plain LSH is not competitive. Then you
should evaluate it only to the extent it requires almost no extra
effort. Since (to my understanding) scikit-learn prefers not to take on
any dependencies, and prefers not to maintain complex tree based data
structures, you should be wary where you spend implementation time. Do
you have a practical plan for implementing any promising variant other
than LSH-Forest?
Concretely, I would implement LSHForest first; create a variant query
method that uses constant k (thus implementing plain LSH). Connect to 1
or 2 of libraries that are easy to integrate (of FLANN, Annoy, kgraph?),
and evaluate variations comparing to them. An implementation in
scikit-learn that is in the ballpark of any of those would be an
excellent achievement in my opinion for a GSOC. Is there benchmarking
code that you can adopt into a scikit learn demo/benchmark?
Remember to allocate time to tweaking the query code: there are a couple
of different algorithms in the LSH F paper, and many variants can be
invented.
Daniel
Post by Maheshakya Wijewardena
Thank you Robert for your valuable feedback. I will update my proposal
accordingly asap and let you know.
However, I prepared this according to the template given by Python
organization. So I have put information about the projects I have done
in that position according to template. If scikit-learn expects it to
be
Post by Maheshakya Wijewardena
otherwise, I can change it.
I have another thing to discuss here before finalizing my proposal.
Mr. Wei Dong has informed me that there is another state-of-art
approach
Post by Maheshakya Wijewardena
he implemented to perform ANN search and asked to see if that can be
used in the evaluating stage. This implementation is called KGraph
<http://www.kgraph.org/index.php?n=Main.Python>. It has comparisons
with
Post by Maheshakya Wijewardena
FLANN, I have started reading it. Maybe you can have a look into that
as
Post by Maheshakya Wijewardena
well.
Best regards,
Maheshakya
On Tue, Mar 18, 2014 at 4:08 AM, Robert Layton <
Thanks for sending this.
The grammar needs updating, but I'll let you worry about
- State some /evidence/ for your experience with python, i.e.
projects you worked on or links to big commits (this is covered
late
Post by Maheshakya Wijewardena
in the application, move some information forward).
- Don't choose a project because "it was on the list" -- why is
the
Post by Maheshakya Wijewardena
problem important and what can *you* do to help?
- Be more confident! "This is the most vital part of this project
as
Post by Maheshakya Wijewardena
I think." Ensure the whole project is planned and evaluated.
- Can you talk a bit more about which hashing algorithms will be
implemented? This is a major part of the work, but isn't
appropriately covered.
- In your abstract, I would make certain your plan -- you want to
implement prototypes, evaluate them and then finish the
implementations. Make this absolutely clear (currently, the reader
has to piece this together for themselves)
- References need to be done more thoroughly.
I think it's a good project, now you need to sell it!
On 18 March 2014 06:55, Maheshakya Wijewardena
Hi,
I have written a proposal for this project and it is in the
wiki,
https://github.com/scikit-learn/scikit-learn/wiki/GSOC-2014--Project-Proposal-:-Locality-sensitive-Hashing-for-approximate-neighbor-search
Post by Maheshakya Wijewardena
It is not complete yet. I have not written the timeline yet.
But
Post by Maheshakya Wijewardena
I have described my plan for this project. I think the
prototyping part of this project should be started during the
community bonding period, while fixing other issues because
that
Post by Maheshakya Wijewardena
needs to be done as soon as possible. I will update this with
my
Post by Maheshakya Wijewardena
timeline soon.
Can you review this and give me your feedback.
Thank you,
Maheshakya.
On Mon, Mar 17, 2014 at 1:02 PM, Daniel Vainsencher
2014-03-14 19:40 GMT+01:00 Daniel Vainsencher
Post by Daniel Vainsencher
Hi Olivier,
Have you looked at the LSH-Forest (and to a lesser
extent, Multiprobe)
Post by Daniel Vainsencher
paper? I've used it in practice, and it makes a
significant improvement
Post by Daniel Vainsencher
vs. basic LSH. I've proposed in recent emails to the
list how to
Post by Daniel Vainsencher
implement it based on sorting and binary search (in
order to avoid the
Post by Daniel Vainsencher
dependency/maintenance burden of range-queries).
No I did not read those papers (yet).
I don't know the culture here, but since you seemed to
summarize rather
- Where LSH uses hash-maps keyed on the a random hash of
length exactly
k, LSHF uses an ordered map (binary tree or BTree in the
paper). This
"all keys whose prefix of length h<=k matches this hash".
This trivially
enables a per tree query of type "give me the best q
hashes
Post by Maheshakya Wijewardena
(and maybe a
few more)" by doing queries with h=k, h=k-1,... until
satisfied (this
can be optimized in different ways).
- Then queries across multiple trees can be combined in
different ways,
the simplest being "give me at least q/L candidates from
each tree, take
the union". More complex variants don't seem to make much
difference.
- So, the parameter k becomes essentially per-query, L can
be determined
on its own, and we can determine the number of expensive
"real distance"
comparisons we want to pay per query.
I offer the simple observation that sorted arrays support
very fast
range queries, so that complex/expensive trees are not
necessary for an
LSHF implementation.
As Maheshakya said it might very
well be the method implemented in Annoy.
It seems similar but distinct, see a recent mail.
Do you have sample python
code around?
No, I implemented it in C#, and source is not available.
Post by Daniel Vainsencher
I think evaluating and tweaking vanilla LSH is a waste
of time, when an
Post by Daniel Vainsencher
obvious improvement that fixes a significant real
world
Post by Maheshakya Wijewardena
flaw (control of
Post by Daniel Vainsencher
number of candidates seen) in the method is available.
If the obvious improvement does not come with a
significant increment
in code complexity (and maintenance burden) then I
agree,
Post by Maheshakya Wijewardena
yes.
So, was the above was sufficient to answer your
prerequisite?
Post by Maheshakya Wijewardena
Getting rid of hyperparameters to select manually of via
grid-search
is a big usability improvement and can make the method
much more
practical to use.
In this case, performance becomes better than mere tuning
would allow,
since it adapts to database density around the query
point.
Post by Maheshakya Wijewardena
Daniel
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka
Maheshakya Wijewardena
2014-03-14 11:02:22 UTC
Permalink
My thinking was to implement "Basic LSH with basic data structures" and
then spend some of the time working on seeing if moderate improvements
(i.e. a more complex data structure) can deliver benefits.
When the complexity of the data structures is considered, the standard
method contains multiple hash tables (which can be numpy 2d arrays) and the
forest contains binary trees (which can also be implemented with numpy
arrays). The extra operation required when indexing is binary search. But
the benefits of the forest method can easily compensate for this minor
complexity.
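As an illustrative sketch (mine, not the project's code), one tree of such a forest could be a sorted numpy array of integer hash keys, so that indexing a query costs only a binary search via np.searchsorted:

```python
import numpy as np

# Hypothetical sketch: one "tree" of an LSH forest stored as a sorted
# numpy array of integer hash keys. A query then needs only two binary
# searches (np.searchsorted) to locate its block of candidates.
hashes = np.array([3, 8, 8, 15, 21, 42])  # sorted hash keys of the stored points

def candidate_indices(query_hash, width):
    """Positions of stored keys within `width` of the query's key."""
    lo = np.searchsorted(hashes, query_hash - width, side="left")
    hi = np.searchsorted(hashes, query_hash + width, side="right")
    return np.arange(lo, hi)

print(candidate_indices(8, 5).tolist())  # [0, 1, 2] -> keys 3, 8, 8
```

Only these candidates would then be ranked by exact distance, which is the expensive step the forest keeps small.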

Yes. But ideally there would be other applications of LSH than KNN. That
would contribute to the 'benefit' part of the cost/benefit analysis.
Currently, the scope of this project is limited to approximate neighbor
search using LSH, but the range of applications of LSH is much wider. It
can even be combined with the hash kernel (the hashing trick in
sklearn.feature_extraction). Further, it can be used for hierarchical
clustering as well. But since the goal of this project is mainly
approximate neighbor search, shouldn't the evaluation criteria concentrate
more on that?

To decide, I think the best way to proceed would be to have an
evaluation / prototyping stage in the project that would first hack an
implementation of the basic methods (at least random projections
and maybe others) in a gist outside of the scikit-learn codebase, not
worry about documentation and compliance with the API, and benchmark
it / profile it on representative datasets with various statistical
structures and compare the outcome to alternative methods such as the
methods implemented in FLANN and Annoy.
Yes. This can be done as a preliminary step of the project to find the
best implementation method, so I had better include that part in my
proposal as well.

Maheshakya.
Thanks Gael. My thinking was to implement "Basic LSH with basic data
structures" and then spend some of the time working on seeing if moderate
improvements (i.e. a more complex data structure) can deliver benefits.
This way, we get the key deliverable, and spend some time trying to see if
we can do better.
I'd also like to see scalability added to the evaluation criteria!
Post by Gael Varoquaux
Post by Daniel Vainsencher
Other data structures worth evaluating (without depending on them) are
BTrees
Let's really avoid this for now. It's not like we have too many people
who understand the codebase and are available to maintain it.
To stress things again: we often see projects, whether they are GSOC or
elsewhere, that fail to deliver, because people implement plenty of fancy
things and do not focus on quality and maintainability. At the end
of the day, these projects leave _nothing_ for scikit-learn.
There is no point shooting for the moon first. Remember the 80/20
tradeoff rule: 80% of the benefits (in terms of use cases, speed, ...) can
be achieved with 20% of the effort.
Gaël
Gael Varoquaux
2014-03-13 16:57:27 UTC
Permalink
Post by Daniel Vainsencher
Other data structures worth evaluating (without depending on them) are
BTrees
Let's really avoid this for now. It's not like we have too many people
who understand the codebase and are available to maintain it.
Post by Daniel Vainsencher
- Is the likely product useful? how hard to maintain?
- Are there parts of the proposal that are unclear or complicated, thus
risking delay?
Yes.
Post by Daniel Vainsencher
How should an LSH project be evaluated? one proxy measurement is
improvement on kNN speed vs. accuracy.
Yes. But ideally there would be other applications of LSH than KNN. That
would contribute to the 'benefit' part of the cost/benefit analysis.

Gaël
1970-01-01 00:00:00 UTC
Permalink
Hi Robert,

I would like to help with the subject matter aspects; integration (in
more senses than one) requires more familiarity with the scikit-learn
project. If you want to co-mentor it together, that would be cool.

Other data structures worth evaluating (without depending on them) are
BTrees (complicated but do everything reasonably efficiently, and very
well known) and COLA [1]: efficient at random insertion and range
queries, still based on sorted arrays and binary search.
Post by Maheshakya Wijewardena
In this case, I think the entire structure can be implemented in
numpy arrays, as they support every operation required by this LSH
forest implementation. But I'm not aware of this cost/benefit analysis.
I need the help of those who have more experience.
Robert, can you give me some assistance here?
I understand "cost/benefit analysis" and "risk analysis" as meaning:
- Is the likely product useful? how hard to maintain?
- Are there parts of the proposal that are unclear or complicated, thus
risking delay?

We've discussed some of those. You should address them all in your
proposal (see [2]) which you should probably write as soon as possible,
the deadline is pretty close.

How should an LSH project be evaluated? one proxy measurement is
improvement on kNN speed vs. accuracy.
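A minimal sketch of that proxy metric (a hypothetical helper of mine, not part of any existing API): recall of the approximate neighbor lists against exact brute-force neighbors, which can then be plotted against query time.

```python
# Hypothetical helper sketching the proxy metric: recall of approximate
# kNN results against exact brute-force neighbors, averaged over queries.
def knn_recall(approx_neighbors, exact_neighbors):
    hits = [
        len(set(a) & set(e)) / len(e)
        for a, e in zip(approx_neighbors, exact_neighbors)
    ]
    return sum(hits) / len(hits)

exact = [[0, 1, 2], [3, 4, 5]]    # ground-truth 3-NN per query
approx = [[0, 2, 9], [3, 4, 5]]   # what the approximate index returned
print(knn_recall(approx, exact))  # (2/3 + 3/3) / 2 = 0.8333...
```

Sweeping the index parameters and plotting recall vs. queries per second gives the speed/accuracy tradeoff curve mentioned above.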

Daniel

[1] supertech.csail.mit.edu/papers/sbtree.pdf
[2]
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Robert Layton
I apologise if I haven't been as active as I would have liked -- I had
my wisdom teeth removed on Tuesday (ouch!).
Daniel -- are you interested in becoming a mentor for this project?
On the LSH forest, I don't have experience with it, but I think Daniel's
point probably allows it to be included. It is an improvement on a
standard algorithm.
(That said, I've always had the opinion of erring on the side of
inclusion, but I understand the push-back on this).
On the data structure, it is my strong preference to make it work in
"native" numpy, if that is possible (i.e. without doing ugly hacks).
This is due mainly to long-term maintainability -- i.e. if we need to
fix a bug and you aren't around, will someone with numpy/scipy
experience be able to work out what is going on?
/That said/, it would be good to allocate some extra time for the
analysis of other data structures in the GSoC project scope. Something
like "Week 8-12: Implement other data structure and test in comparison
to existing method". I'm not sure if the GSoC accepts projects with that
sort of risk, I'll find out.
Thanks,
Robert
In this case, I think the entire structure can be implemented in
numpy arrays, as they support every operation required by this LSH
forest implementation. But I'm not aware of this cost/benefit
analysis. I need the help of those who have more experience.
Robert, can you give me some assistance here?
On Wed, Mar 12, 2014 at 10:33 PM, Gael Varoquaux
Gaël, What is the status of the evaluation of the cost to benefit
ratio for this?
I don't know. I just cannot be an expert of everything machine-learning
related. Somebody needs to do a cost-benefit analysis. Ideally
different people should comment on it. And yes, it's challenging.
I would appreciate it if we can decide on a storing structure for the
LSH implementation quickly, as I have to make a sound plan for this
project.
Well, anything that uses structure more complex than numpy arrays is
much harder to develop and maintain. You need to factor this into the
cost/benefit analysis. Once again, I stress: much harder!
G
1970-01-01 00:00:00 UTC
Permalink

Great. Make sure you sign up (Gael has posted the link elsewhere on this
mailing list).

Maheshakya -- can you start to put together a proposal. For now, I would
include LSH forests.
Post by Daniel Vainsencher
Hi Robert,
I would like to help with the subject matter aspects; integration (in
more sense than one) requires more familiarity with the scikit learn
project. If you want to co-mentor it together, that would be cool.
Other data structures worth evaluating (without depending on them) are
BTrees (complicated but do everything reasonably efficiently, and very
well known) and COLA [1]: efficient at random insertion and range
queries, still based on sorted arrays and binary search.
Post by Maheshakya Wijewardena
In this case, I think the entire structure can be implemented in
numpy arrays, as they support every operation required by this LSH
forest implementation. But I'm not aware of this cost/benefit analysis.
I need the help of those who have more experience.
Robert, can you give me some assistance here?
- Is the likely product useful? how hard to maintain?
- Are there parts of the proposal that are unclear or complicated, thus
risking delay?
We've discussed some of those. You should address them all in your
proposal (see [2]) which you should probably write as soon as possible,
the deadline is pretty close.
How should an LSH project be evaluated? one proxy measurement is
improvement on kNN speed vs. accuracy.
Daniel
[1] supertech.csail.mit.edu/papers/sbtree.pdf
[2]
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Robert Layton
I apologise if I haven't been as active as I would have liked -- I had
my wisdom teeth removed on Tuesday (ouch!).
Daniel -- are you interested in becoming a mentor for this project?
On the LSH forest, I don't have experience with it, but I think Daniel's
point probably allows it to be included. It is an improvement on a
standard algorithm.
(That said, I've always had the opinion of erring on the side of
inclusion, but I understand the push-back on this).
On the data structure, it is my strong preference to make it work in
"native" numpy, if that is possible (i.e. without doing ugly hacks).
This is due mainly to long-term maintainability -- i.e. if we need to
fix a bug and you aren't around, will someone with numpy/scipy
experience be able to work out what is going on?
/That said/, it would be good to allocate some extra time for the
analysis of other data structures in the GSoC project scope. Something
like "Week 8-12: Implement other data structure and test in comparison
to existing method". I'm not sure if the GSoC accepts projects with that
sort of risk, I'll find out.
Thanks,
Robert
In this case, I think the entire structure can be implemented in
numpy arrays, as they support every operation required by this LSH
forest implementation. But I'm not aware of this cost/benefit
analysis. I need the help of those who have more experience.
Robert, can you give me some assistance here?
On Wed, Mar 12, 2014 at 10:33 PM, Gael Varoquaux
Post by Maheshakya Wijewardena
Gaël, What is the status of the evaluation of the cost to benefit
ratio for this?
I don't know. I just cannot be an expert of everything machine-learning
related. Somebody needs to do a cost-benefit analysis. Ideally
different people should comment on it. And yes, it's challenging.
Post by Maheshakya Wijewardena
I would appreciate it if we can decide on a storing structure for the
LSH implementation quickly, as I have to make a sound plan for this
project.
Well, anything that uses structure more complex than numpy arrays is
much harder to develop and maintain. You need to factor this into the
cost/benefit analysis. Once again, I stress: much harder!
G
1970-01-01 00:00:00 UTC
Permalink
LSH Forest is a bit similar to Annoy (multiple trees, tune essentially
only the number of them), but there is (at least) one significant
difference.

IIUC, Annoy has large leaves, with many points in them. LSHForest is best
used with a very high k (say, 64), where each point (that was not
duplicated) is very likely to have its own unique hash key.
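A quick way to see the "unique key" point (my sketch, assuming plain random-projection hashing rather than any existing LSHForest API): with k = 64 sign bits, distinct random points essentially never collide.

```python
import numpy as np

# Sketch assuming random-projection hashing (not any existing LSHForest
# API): k random hyperplanes assign each point a k-bit sign key. With
# k = 64, distinct random points almost surely receive distinct keys.
rng = np.random.RandomState(0)
X = rng.randn(100, 20)            # 100 points in 20 dimensions
planes = rng.randn(20, 64)        # 64 random hyperplanes -> 64-bit keys
bits = X @ planes > 0             # (100, 64) boolean hash bits
unique_keys = {tuple(row) for row in bits}
print(len(unique_keys))           # 100: every point got its own key
```

With every point in its own "leaf", candidate selection can be tuned down to the nearest few keys per tree, which is what enables the ~2L budget below.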

Thus, Annoy is probably lower bounded in the number of candidates it can
consider (compare raw distances with the query), whereas LSHForest can tune
that down to around 2L (two per tree). Each Annoy index can be cheaper
than a binary-tree-based LSHF implementation (because of fewer tree nodes);
this is probably part of the explanation of why they use so many trees (50
in their example; I've used 10 and even as few as 3 with LSH Forest).

Anyway, I would guess(!!) we can get performance similar to that of
Annoy or better, writing 0 non-Python code.

Daniel
Post by Maheshakya Wijewardena
Ok, I failed to notice that in a Pareto efficiency sense. But it makes
sense to try the approach used in Annoy, when the maintainability of the
code is considered.
Apart from that, as I have mentioned earlier, the only LSH-based ANN
method used in FLANN is usable for matching binary features using
Hamming distances. Using that won't be of much use for scikit-learn.
So in my opinion, it is better to give a high priority to LSH forest
based ANN when doing prototyping.
On Mon, Mar 17, 2014 at 12:55 AM, Olivier Grisel
2014-03-16 20:19 GMT+01:00 Maheshakya Wijewardena
Post by Maheshakya Wijewardena
Yes, I was considering the accuracy as well when speaking of the
performance. (Actually put more weight to that)
Even though, on most plots the FLANN points are dominating the Annoy
plots (in a Pareto-optimal sense).
But the speed difference is not large enough to justify the added
complexity of FLANN IMHO.
--
Olivier