<html style="direction: ltr;">
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
<style type="text/css">body p { margin-bottom: 5px; margin-top: 0pt; } </style>
<style type="text/css">body p { margin-bottom: 5px; margin-top: 0pt; } </style>
</head>
<body style="direction: ltr;"
bidimailui-detected-decoding-type="latin-charset" bgcolor="#FFFFFF"
text="#000000">
<div style="direction: ltr;" class="moz-cite-prefix">Hi Maheshakya,<br>
<br>
In regular LSH, a particular setting of the number of hash
functions per index (k) and the number of indexes (L) essentially
determines the size of the region in space from which candidates
will be chosen in response to a query. If queries q1 and q2 are in
areas where the density of data points is very different, then one
will receive many more candidates than the other, thus be slower
and more accurate. Thus tuning k, L involves a tradeoff, and with
varied density, you cannot win. <br>
<br>
Even worse: as you add data, your k, L stop being optimal. So
tuning on a significantly smaller sample is not effective, which
makes tuning even more expensive.<br>
<br>
Going back to LSH Forest, while an implementation using binary
trees is possible and ok for testing, but not very space or cache
efficient. Other data structures (b-trees and variants, see [2]
for some discussion) are more appropriate. The key operations are
range queries and depending on data change rate, insertion.<br>
<br>
Space efficiency of indexes is also affected by the hash type: if
you use a binary hash functions, choose k in {32, 64} and use the
appropriate integer types to avoid numpy's awkwardness around bit
vectors. LSH-Forest has no difficulty with such high k in the
index (as opposed to plain LSH).<br>
<br>
Daniel Vainsencher<br>
<br>
On 03/05/2014 07:18 AM, Maheshakya Wijewardena wrote:<br>
</div>
<blockquote
cite="mid:CA+t+wPjb6=dp750z7EYhEP-NRLAeJn=ERVVDf1u9N+***@mail.gmail.com"
type="cite">
<div dir="ltr">Hi <span
style="font-family:arial,sans-serif;font-size:13px">Daniel, </span>
<div><span style="font-family:arial,sans-serif;font-size:13px">These
alternative strategies of LSH are a good way to start using
the existing implementations of Scikit-learn. But why do you
say the standard algorithms are impractical?</span></div>
<div><span style="font-family:arial,sans-serif;font-size:13px"><br>
</span></div>
<div><span style="font-family:arial,sans-serif;font-size:13px">Maheshakya</span></div>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote"> On Wed, Mar 5, 2014 at 1:31 AM, Daniel
Vainsencher <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:***@gmail.com" target="_blank">***@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="direction:ltr" bgcolor="#FFFFFF" text="#000000">
<div>Hi scikit-learn, Maheshakya,<br>
<br>
In my experience implementing and using LSH, the vanilla
algorithms are very impractical. Two extensions that
help a lot (and combine nicely) are:<br>
- Multi-probe LSH [1], where hash values neighboring the
target's hash are also used.<br>
- LSH Forest [2], where fixed length hashes are replaced
by length bounded hashes; the hash used is just long
enough to get sufficiently few candidates for the
target's nearest neighbor. <br>
<br>
Together they make it feasible to limit the number of
independent copies of the index to a reasonable number.
Incidentally, the LSH Forest approach may be implemented
using binary trees.<br>
<br>
Daniel Vainsencher<br>
<br>
<cite>[1] <a moz-do-not-send="true"
href="http://www.cs.princeton.edu/cass/papers/mp"
target="_blank">www.cs.princeton.edu/cass/papers/mp</a><b>lsh</b>_vldb07.pdf</cite><br>
[2] <cite><a moz-do-not-send="true"
href="http://www2005.org/docs/p651.pdf"
target="_blank">www2005.org/docs/p651.pdf</a></cite>
<div>
<div class="h5"><br>
<br>
On 03/04/2014 06:09 PM, Maheshakya Wijewardena
wrote:<br>
</div>
</div>
</div>
<blockquote type="cite">
<div>
<div class="h5">
<div dir="ltr">Hi, <br>
<br>
Currently, unsupervised neighbor search is
implemented in sklearn.neighbors. There are three
algorithms used to perform neighbor search.
<div>
<ul>
<li>kd tree</li>
<li>ball tree</li>
<li>brute force method</li>
</ul>
kd tree and ball tree are implemented separately
as cython implementations. Both of them use
binary tree(binary_tree.pxi). In the
NeighborBass class, above implementations are
used.</div>
<div><br>
</div>
<div>As Locality Sensitivity Hashing will also be
used in approximating nearest neighbors, its'
implementation should also be a part of
sklearn.neighbors. But Locality sensitivity
hashing algorithms do not involve binary tree
implementation. Therefore it has to be
implemented separately and be used in
sklearn.neighbors as another algorithm.Is this
approach correct?</div>
<div><br>
</div>
<div>I'm trying to implement a base LSH class and
the hashing algorithms as cython
implementations. To approximate neighbor search
using those algorithms, multiple hash tables
will be created with those hash functions so,
for that an efficient way to store is required.</div>
<div><br>
</div>
<div>These are some of the issues I found
disturbing while I was trying to implement LSH
to approximate neighbor search. Is this an
appropriate approach to implement this or are
there any other better ways?</div>
<div><br>
</div>
<div> </div>
<div>
<div> <br clear="all">
<div>Regards,</div>
<div>Maheshakya,</div>
-- <br>
<div dir="ltr">Undergraduate,
<div>Department of Computer Science and
Engineering,</div>
<div>Faculty of Engineering.</div>
<div>University of Moratuwa,</div>
<div>Sri Lanka</div>
</div>
</div>
</div>
</div>
<br>
<fieldset></fieldset>
<br>
</div>
</div>
<pre>------------------------------------------------------------------------------
Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce.
With Perforce, you get hassle-free workflows. Merge that actually works.
Faster operations. Version large binaries. Built-in WAN optimization and the
freedom to use Git, Perforce or both. Make the move to Perforce.
<a moz-do-not-send="true" href="http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk" target="_blank">http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk</a></pre>
<br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
Scikit-learn-general mailing list
<a moz-do-not-send="true" href="mailto:Scikit-learn-***@lists.sourceforge.net" target="_blank">Scikit-learn-***@lists.sourceforge.net</a>
<a moz-do-not-send="true" href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general" target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a>
</pre>
</blockquote>
<br>
</div>
<br>
------------------------------------------------------------------------------<br>
Subversion Kills Productivity. Get off Subversion & Make
the Move to Perforce.<br>
With Perforce, you get hassle-free workflows. Merge that
actually works.<br>
Faster operations. Version large binaries. Built-in WAN
optimization and the<br>
freedom to use Git, Perforce or both. Make the move to
Perforce.<br>
<a moz-do-not-send="true"
href="http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk"
target="_blank">http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk</a><br>
_______________________________________________<br>
Scikit-learn-general mailing list<br>
<a moz-do-not-send="true"
href="mailto:Scikit-learn-***@lists.sourceforge.net">Scikit-learn-***@lists.sourceforge.net</a><br>
<a moz-do-not-send="true"
href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general"
target="_blank">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a><br>
<br>
</blockquote>
</div>
<br>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">Undergraduate,
<div>Department of Computer Science and Engineering,</div>
<div>Faculty of Engineering.</div>
<div>University of Moratuwa,</div>
<div>Sri Lanka</div>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">------------------------------------------------------------------------------
Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce.
With Perforce, you get hassle-free workflows. Merge that actually works.
Faster operations. Version large binaries. Built-in WAN optimization and the
freedom to use Git, Perforce or both. Make the move to Perforce.
<a class="moz-txt-link-freetext" href="http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk">http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk</a></pre>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Scikit-learn-general mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Scikit-learn-***@lists.sourceforge.net">Scikit-learn-***@lists.sourceforge.net</a>
<a class="moz-txt-link-freetext" href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general">https://lists.sourceforge.net/lists/listinfo/scikit-learn-general</a>
</pre>
</blockquote>
<br>
</body>
</html>