Discussion:
[Scikit-learn-general] Pairwise Distances for n closest matches
Debanjan Bhattacharyya
2016-02-01 06:14:35 UTC
Hi

I have written a method, pairwise_distances_argmin_min_n, on my local
"develop" branch. Its functionality is similar to
pairwise_distances_argmin_min, but it returns the n smallest distances
(both the indices and the distances) rather than only one. It also runs in
chunked (parallel) mode on sparse matrices, which required some code for
stacking and combining the per-chunk results.
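
A simplified sketch of the idea follows (the name and signature mirror the
description above, but this is illustrative, not the actual patch): compute
the distances chunk by chunk with pairwise_distances, and keep only the n
smallest per row via argpartition, so the full distance matrix is never
materialised.

import numpy as np
from sklearn.metrics import pairwise_distances

def pairwise_distances_argmin_min_n(X, Y, n=5, metric="euclidean",
                                    chunk_size=500):
    """For each row of X, return the indices into Y of its n nearest rows
    and the corresponding distances, processing X in chunks so the full
    |X| x |Y| distance matrix is never held in memory at once."""
    all_idx, all_dist = [], []
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Distances for this chunk only: shape (chunk_rows, n_samples_Y).
        D = pairwise_distances(chunk, Y, metric=metric)
        # argpartition picks the n smallest per row without a full sort;
        # the final argsort only orders those n columns.
        part = np.argpartition(D, n - 1, axis=1)[:, :n]
        rows = np.arange(D.shape[0])[:, None]
        order = np.argsort(D[rows, part], axis=1)
        idx = part[rows, order]
        all_idx.append(idx)
        all_dist.append(D[rows, idx])
    return np.vstack(all_idx), np.vstack(all_dist)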

This is particularly useful in word-vector models where you need to find
the n closest documents to an input document, given clustered vectors of
the documents.
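
A hypothetical usage for that kind of scenario, with made-up shapes and
random sparse vectors standing in for the clustered document vectors:

from scipy.sparse import random as sparse_random

X = sparse_random(1000, 300, density=0.01, format="csr", random_state=0)
queries, corpus = X[:200], X[200:]
idx, dist = pairwise_distances_argmin_min_n(queries, corpus, n=10)
print(idx.shape, dist.shape)   # (200, 10) and (200, 10): 10 closest per query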

I had a 40 GB NumPy array of shape 483,858 x 21,058 (where 21,058 is the
number of clusters), and I was trying to compute the pairwise distances
between the first 250,000 documents and the rest. A chunk of only 2,500
rows of the distance array returned by pairwise_distances already comes to
a 2 GB file, so the full distance matrix would have been around 200 GB.
That makes no sense when you only need the top 100 or 200 closest matches.
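
As a rough back-of-the-envelope check of those numbers, assuming float32
distance entries (which is what the roughly 2 GB per 2,500-row chunk
suggests):

n_queries, n_rest = 250_000, 483_858 - 250_000
print(2_500 * n_rest * 4 / 1e9)      # ~2.3 GB for one 2,500-row chunk
print(n_queries * n_rest * 4 / 1e9)  # ~234 GB for the full distance matrix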

Hence I implemented this function. I have tested its performance and it
is good.

Please let me know whether I should create a pull request for this and
contribute.

Thanks

Regards
Deb
Andy
2016-02-02 01:36:31 UTC
Hi Deb.
We don't really expose low-level functions like these, and only include
them if there is a particular use-case.
Why not create a pull request for scipy?

Cheers,
Andy