Discussion:
[Scikit-learn-general] sklearn.preprocessing.normalize does not sum to 1
Ryan R. Rosario
2015-12-17 07:26:16 UTC
Permalink
Hi,

I have a very large dense numpy matrix. To avoid running out of RAM, I use np.float32 as the dtype instead of the default np.float64 on my system.

When I do an L1 normalization of the rows (axis=1) in my matrix in-place (copy=False), I frequently get rows that do not sum to 1. Since these are probability distributions that I pass to np.random.choice, these must sum to exactly 1.0.

pp.normalize(term, norm='l1', axis=1, copy=False)
sums = term.sum(axis=1)
sums[np.where(sums != 1)]

array([ 0.99999994, 0.99999994, 1.00000012, ..., 0.99999994,
0.99999994, 0.99999994], dtype=float32)
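For reference, here is a minimal self-contained version of what I am doing (random data standing in for my real matrix, and a plain numpy division standing in for sklearn's normalize, which amounts to the same thing for norm='l1' on non-negative dense rows):

```python
import numpy as np

rng = np.random.default_rng(0)
term = rng.random((5, 1000)).astype(np.float32)

# Row-wise L1 normalization, equivalent to
# normalize(term, norm='l1', axis=1) for non-negative dense data.
term /= term.sum(axis=1, keepdims=True)

sums = term.sum(axis=1)
print(sums)  # close to 1.0, but not every row is exactly 1.0
```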

I wrote some code to manually add or subtract each row's small difference from 1, which helps somewhat, but some rows still do not sum to 1.

Is there a way to avoid this problem?

— Ryan
------------------------------------------------------------------------------
Sebastian Raschka
2015-12-17 07:58:09 UTC
Permalink
Hm, since you are already having memory problems, longdouble wouldn't be an option I guess. However, what about using numpy.around to reduce the precision by a few decimals?
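Something along these lines, say (just a sketch with made-up numbers; rounding alone won't make the sum exact, so you would renormalize afterwards, and even then exactness isn't guaranteed):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5], dtype=np.float32)
p = np.around(p, decimals=4)  # drop a few decimals of precision
p /= p.sum()                  # renormalize after rounding
print(p.sum())
```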



Sent from my iPhone
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Dale Smith
2015-12-17 12:50:09 UTC
Permalink
Ryan,



Have you tried a small problem to see if the float32 dtype is causing you problems? float64 gives 15-17 significant digits of precision, so even there you may not get an exact 1.0 representation; with float32's 6-9 digits it is even less likely.



I am not sure this will help you, but take a look at numpy.memmap. You may be able to go back to float64.
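A rough sketch of the idea (the file path and shape here are made up):

```python
import os
import tempfile

import numpy as np

# Back the matrix with a file on disk so float64 fits without exhausting RAM.
path = os.path.join(tempfile.mkdtemp(), "term.dat")
term = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000, 500))

term[:] = np.random.random((1000, 500))
term /= term.sum(axis=1, keepdims=True)  # row-wise L1 normalization on disk
term.flush()
```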



https://github.com/lmjohns3/theanets/issues/59



After reading this carefully, I have more questions, so perhaps more digging is required.



I’d like to suggest that numpy code should not just “blow up” because of these types of issues. They are completely foreseeable. And perhaps someone on the numpy mailing list could help.




Dale Smith, Ph.D.
Data Scientist
d. 404.495.7220 x 4008 f. 404.795.7221
Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA 30305


------------------------------------------------------------------------------
Matthieu Brucher
2015-12-17 12:55:59 UTC
Permalink
The thing is that even if you sum each row and divide by the sum, summing
the quotients back may not give exactly 1.0. Each quotient is rounded to the
nearest representable value, so the rounding errors rarely cancel exactly;
this is the perennial "issue" in floating-point computation.
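A tiny example in float32:

```python
import numpy as np

x = np.ones(3, dtype=np.float32)
p = x / x.sum()   # each entry is the float32 closest to 1/3
s = p.sum()
# s is within a few ulps of 1.0, but whether it equals 1.0 *exactly*
# depends on how the individual rounding errors happen to cancel.
print(repr(s))
```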

Cheers,

Matthieu
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher

------------------------------------------------------------------------------
Dale Smith
2015-12-17 13:09:08 UTC
Permalink
Ryan, did you try passing the arrays, as they are, to np.random.choice? Do you get what you expect?

Dale Smith, Ph.D.
Data Scientist



d. 404.495.7220 x 4008   f. 404.795.7221
Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA 30305

    


------------------------------------------------------------------------------
Ryan R. Rosario
2015-12-17 20:01:56 UTC
Permalink
Thank you for the suggestions. The behavior persists after I tried them :-(. To answer Dale’s question, when I pass the array to random.choice, I get a ValueError that the probabilities do not sum to 1.
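For example, a vector whose sum is off by about 1e-7, which is roughly how far off my rows are after the float32 round-off, already trips the check (numbers here are made up):

```python
import numpy as np

p = np.array([0.5, 0.4999999])  # float64, sums to 0.9999999
try:
    np.random.choice(2, p=p)
    raised = False
except ValueError:  # e.g. "probabilities do not sum to 1"
    raised = True
print(raised)
```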

I found a line of code that seems to lead to the problem:

numpy.power(...)

I have to raise each element of the matrix to a certain power. Once I do this, normalizing the rows of the matrix does not always yield a row sum of 1. *Without* this line, the rows *always* sum to 1.

I removed the call to np.power and tested this with both sklearn’s normalize function and with np.apply_along_axis(lambda x: x / np.sum(x), 1, my_matrix), and both work. With the call to np.power, though, neither method yields rows that sum exactly to 1.

I suppose at this point this is more of a numpy question than a scikit-learn question.
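One workaround might be to redo the normalization of each row in float64 right before the np.random.choice call, since the residual error of a float64 division is far inside choice's tolerance (the helper name here is mine):

```python
import numpy as np

def as_choice_probs(row):
    # Hypothetical helper: recompute the normalization in float64 so the
    # residual error (on the order of n * 2**-52) stays well inside
    # np.random.choice's sum-to-1 tolerance.
    p = np.asarray(row, dtype=np.float64)
    return p / p.sum()

row = np.random.random(100).astype(np.float32)
row /= row.sum()  # the float32 sum may land slightly off 1.0
draw = np.random.choice(len(row), p=as_choice_probs(row))
```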

— Ryan
------------------------------------------------------------------------------