Joel Nothman
2016-02-18 23:58:25 UTC
If not Stack Overflow, the appropriate venue for such questions is the
scikit-learn-general mailing list.
The current dbscan implementation is by default not memory-efficient:
it constructs a full pairwise distance matrix in the case where
kd-trees/ball trees cannot be used (e.g. with sparse matrices). This
matrix consumes n^2 floats, i.e. tens of gigabytes in your case.
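For scale: 100,000^2 = 10^10 entries at 8 bytes per float64 comes to
8 x 10^10 bytes, i.e. roughly 80 GB (half that if the matrix were
float32).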
We provide a couple of mechanisms for getting around this:
- You can precompute a sparse radius neighborhood graph (where missing
entries are presumed to be out of eps) in a memory-efficient way, and run
dbscan over this with metric='precomputed' (see the first sketch below).
- You can compress the dataset, either by removing exact duplicates if
these occur in your data, or by using BIRCH, so that a relatively small
number of representatives stands in for a large number of points. You can
then provide a sample_weight when fitting DBSCAN (see the second sketch
below).
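Here is a minimal sketch of the first option, not the one true
implementation: the helper name, chunk_size, eps and min_samples are
placeholders to tune for your data. Distances are computed one chunk of
rows at a time, so only a (chunk_size x n_samples) block is ever dense.
One caveat: converting to CSR drops exact-zero distances, so duplicate
points at distance 0 would not be recorded as neighbors; handle those
separately if they occur in your data.

import numpy as np
import scipy.sparse
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

def radius_neighbors_graph_chunked(X, eps, metric='l1', chunk_size=500):
    # Build the eps-radius neighborhood graph one chunk of rows at a
    # time; out-of-radius entries are zeroed and thus become implicit
    # (missing) entries of the sparse result.
    n_samples = X.shape[0]
    blocks = []
    for start in range(0, n_samples, chunk_size):
        dist = pairwise_distances(X[start:start + chunk_size], X,
                                  metric=metric)
        dist[dist > eps] = 0
        blocks.append(scipy.sparse.csr_matrix(dist))
    return scipy.sparse.vstack(blocks).tocsr()

X = scipy.sparse.rand(100000, 400, density=.01, format='csr')
graph = radius_neighbors_graph_chunked(X, eps=10)
labels = DBSCAN(eps=10, min_samples=10000,
                metric='precomputed').fit_predict(graph)

And a similarly hedged sketch of the second option; the Birch threshold
is a placeholder you would tune so that the number of subcluster centers
stays manageable:

import numpy as np
from sklearn.cluster import Birch, DBSCAN

# Compress the data set into subcluster centers; with n_clusters=None,
# birch.labels_ indexes the subclusters directly.
birch = Birch(threshold=0.5, n_clusters=None).fit(X)
centers = birch.subcluster_centers_
# Weight each center by the number of original points it represents.
weights = np.bincount(birch.labels_, minlength=centers.shape[0])
db = DBSCAN(eps=10, min_samples=10000, metric='l1')
db.fit(centers, sample_weight=weights)  # db.labels_ labels the centers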
I suspect this could be clearer in the documentation, and a pull request is
welcome.
Perhaps the default implementations of radius_neighbors and kneighbors in
the brute-force case should be more memory-sensitive; or dbscan should
return to, or at least offer an option for, searching for nearest
neighbors on demand rather than all in advance, since that up-front
search is the source of the high memory consumption.
Cheers; but please don't email developers personally, and continue
correspondence through the mailing list.
Joel
Dear Joel and Robert,
Sorry for contacting you directly; there may be a more
formal way of reaching you about this. Anyway, here is my question.
I tried using dbscan in scikit-learn v0.17 today and got a
MemoryError. After reading about it on Stack Overflow, I am still puzzled,
since my input is a compressed sparse row matrix of size 100,000 x 400
with density 0.01, which is far from huge (300 MB on disk).
Apparently, the reason is that I am using the l1 distance as the metric.
Please find below a sample of code to reproduce the error, and my
traceback. If you have any suggestions on working around this problem, I
would be very thankful.
You can reproduce the MemoryError without having to download my own data:

import scipy.sparse
from sklearn.cluster import dbscan

Y = scipy.sparse.rand(100000, 400, density=.01)
dbscan(Y, eps=10, min_samples=10000, metric='l1')
Also, here is the traceback I obtain after running the code: it seems
that initializing a dense matrix of zeros of size O(n^2) is not such a
good idea.
File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-94-0e23204d7925>", line 1, in <module>
    sklearn.cluster.dbscan(scipy.sparse.rand(100000,400,density=.01),metric='manhattan')
File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\cluster\dbscan_.py", line 146, in dbscan
    return_distance=False)
File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\neighbors\base.py", line 609, in radius_neighbors
    **self.effective_metric_params_)
File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 1207, in pairwise_distances
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 1054, in _parallel_pairwise
    return func(X, Y, **kwds)
File "C:\Users\ALefevre\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 516, in manhattan_distances
    D = np.zeros((X.shape[0], Y.shape[0]))
MemoryError
Augustin LEFEVRE | Consultant Senior | Ykems | www.ykems.com