Debanjan Bhattacharyya
2016-02-01 06:14:35 UTC
Hi
I have written a method pairwise_distances_argmin_min_n in my "develop"
mode.
Its functionality is similar to pairwise_distances_argmin_min, but it
returns the n smallest distances for each row rather than only one (both
the indices and the distance values). It works in chunks (in parallel) on
sparse matrices, which required some extra code for stacking and combining
the partial results.
This is particularly useful in word vector models, where you need to find
the n documents closest to an input document, given clustered vectors of
the documents.
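To make the idea concrete, here is a minimal sketch of a chunked top-n
search. It is not the implementation described in this message: the name
argmin_min_n_sketch, its parameters and the default chunk size are
hypothetical, it runs serially rather than in parallel, and it assumes
n <= Y.shape[0]. It relies only on sklearn.metrics.pairwise_distances and
numpy.argpartition.

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def argmin_min_n_sketch(X, Y, n, chunk_size=2500, metric="euclidean"):
    """For each row of X, return the indices and distances of its n
    nearest rows in Y, sorted by increasing distance.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    n_rows = X.shape[0]
    indices = np.empty((n_rows, n), dtype=np.intp)
    distances = np.empty((n_rows, n), dtype=np.float64)

    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Only a (chunk_size, Y.shape[0]) block of distances is ever held
        # in memory, never the full distance matrix.
        block = pairwise_distances(X[start:stop], Y, metric=metric)

        # argpartition places the n smallest entries of each row in the
        # first n columns (in arbitrary order) without a full sort.
        part = np.argpartition(block, n - 1, axis=1)[:, :n]
        part_dist = np.take_along_axis(block, part, axis=1)

        # Sort just those n candidates per row.
        order = np.argsort(part_dist, axis=1)
        indices[start:stop] = np.take_along_axis(part, order, axis=1)
        distances[start:stop] = np.take_along_axis(part_dist, order, axis=1)

    return indices, distances
```

A call such as argmin_min_n_sketch(X, Y, n=100) keeps only two arrays of
shape (X.shape[0], 100) instead of the full distance matrix.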
I had a 40 GB NumPy array of shape 483858 x 21058 (where 21058 is the
number of clusters), and I was trying to compute the pairwise distances
between the first 250,000 documents and the rest. A chunk of only 2,500
rows of the resulting distance array from pairwise_distances takes up a
2 GB file, so the complete distance file would have been about 200 GB.
That makes no sense when all I need is the top 100 or 200 closest matches.
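The sizes above can be checked with some quick arithmetic. The sketch
below assumes 4-byte (float32) distance values, which this message does
not state, and 8 bytes each for an index and a distance in the top-n
result; the helper name dense_size_gb is hypothetical.

```python
def dense_size_gb(n_rows, n_cols, bytes_per_value=4):
    """Size in GB (10**9 bytes) of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value / 1e9


n_docs = 483858
n_queries = 250000            # the first 250,000 documents
n_rest = n_docs - n_queries   # the remaining documents

# One chunk of 2,500 query rows against all remaining documents.
chunk_gb = dense_size_gb(2500, n_rest)

# The full distance matrix for all 250,000 query rows.
full_gb = dense_size_gb(n_queries, n_rest)

# Keeping only the 200 nearest matches per query row:
# one 8-byte index plus one 8-byte distance per match (assumed dtypes).
top_n_gb = dense_size_gb(n_queries, 200, bytes_per_value=16)

print(chunk_gb, full_gb, top_n_gb)
```

Under these assumptions a chunk comes to a little over 2 GB and the full
matrix to a little over 200 GB, while the top-200 result stays under 1 GB.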
Hence I implemented this function. I have tested its performance, and it
is good.
Please let me know whether I should create a pull request for this and
contribute.
Thanks
Regards
Deb