[Scikit-learn-general] Using Typed MemoryViews for Numpy Arrays

Discussion:

mahesh ravishankar

2016-02-11 02:09:57 UTC

Hi,

I am looking at scikit-learn as an application for prototyping a Python
module that exposes an array-like object I am developing. Going through the
Cython files, I see that in a lot of places the raw data buffer of a NumPy
array is accessed through the C field (i.e. the "data" field) exposed by the
Cython/NumPy interface. I am a relative newbie to Cython, but from my
understanding, using typed memoryviews (
http://docs.cython.org/src/userguide/memoryviews.html#memoryview-objects-and-cython-arrays)
is the recommended way of accessing data in an array object. I was
wondering whether this was done for legacy reasons or for performance reasons.

To evaluate my array object interface, I am thinking of changing
scikit-learn to use typed memoryviews. If there is interest in this, I can
push this change to scikit-learn. Any comments about why this would not be a
good idea are deeply appreciated.

Thanks,

--
Mahesh

Jacob Vanderplas

2016-02-11 04:15:55 UTC

Hi Mahesh,
Regarding the raw data access, what specific parts of the code are you
looking at?
Thanks,
Jake

Jake VanderPlas
Senior Data Science Fellow
Director of Research in Physical Sciences
University of Washington eScience Institute

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

mahesh ravishankar

2016-02-11 19:05:40 UTC

Hi Jacob,

For example, in _gradient_boosting.pyx (in //sklearn/ensemble/) the
function _predict_regression_tree_inplace_fast has a first parameter of
type np.float32_t*. When this function is called from predict_stages, the
first argument is X.data, where X is a numpy.ndarray. The reason this
works is that Cython knows that the underlying C object used for
numpy.ndarray has a field "data" of type char* that points to the raw
data buffer of the NumPy array.
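
As a rough illustration of the raw-pointer pattern described above, here is a pure-Python sketch that uses only the standard library. It is not the scikit-learn code: array.array stands in for the NumPy array, ctypes stands in for the Cython pointer cast, and read_cell is a hypothetical helper name chosen for this example.

```python
import array
import ctypes

# A 2 x 3 "matrix" of 32-bit floats stored row-major in one flat buffer.
# array.array is only a stand-in for numpy.ndarray here, and the sketch
# assumes the platform's C float is 4 bytes (the common case).
n_rows, n_cols = 2, 3
X = array.array("f", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Expose the storage as raw bytes (the analogue of a char* "data" field) ...
raw_bytes = (ctypes.c_char * (len(X) * X.itemsize)).from_buffer(X)
# ... and reinterpret it as a float pointer, like a cast to np.float32_t*.
data = ctypes.cast(raw_bytes, ctypes.POINTER(ctypes.c_float))


def read_cell(data, n_cols, row, col):
    """Hypothetical helper: index a flat row-major buffer by hand."""
    return data[row * n_cols + col]


value = read_cell(data, n_cols, 1, 2)  # last element of the second row
```

The pointer carries no shape, stride, or length information, so the caller has to pass the number of columns separately and nothing prevents an out-of-range offset.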

What I am planning to do is as follows. The current signature of the
function is

_predict_regression_tree_inplace_fast(np.float32_t* X , ...)

If this can be changed to

_predict_regression_tree_inplace_fast(np.float32_t[:, :] X, ...)

then this generalizes to any object X that exposes the buffer protocol
(described in PEP 3118). Any thoughts on whether this would be useful for
the scikit-learn community? I am probably going to make this change in
my local branch anyway. I can push these changes back to scikit-learn if
there is interest.
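
To make the buffer-protocol point concrete, here is a small pure-Python sketch. It uses the built-in memoryview rather than a Cython typed memoryview, and row_sums is a hypothetical function invented for this example, so treat it as an analogy for the proposed signature change rather than as the change itself.

```python
import array


def row_sums(X):
    """Hypothetical helper: sum each row of a 2-D float32 buffer."""
    # memoryview() accepts any object that exports the PEP 3118 buffer
    # protocol and raises TypeError for objects that do not.
    view = memoryview(X)
    if view.ndim != 2 or view.format != "f":
        raise ValueError("expected a 2-D buffer of 32-bit floats")
    n_rows, n_cols = view.shape
    return [sum(view[i, j] for j in range(n_cols)) for i in range(n_rows)]


# Two different exporters holding the same 2 x 3 data.
flat = array.array("f", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
from_array = memoryview(flat).cast("B").cast("f", shape=[2, 3])
from_bytes = memoryview(flat.tobytes()).cast("f", shape=[2, 3])

sums_a = row_sums(from_array)
sums_b = row_sums(from_bytes)
```

The function never asks what kind of object it was given; it only relies on the shape and format that the buffer reports, which is the property that would let a custom array-like object be passed in.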

Thanks,
Mahesh

mahesh ravishankar

2016-02-11 19:09:12 UTC

More information about how buffer objects work in Cython:
http://docs.cython.org/src/userguide/memoryviews.html#memoryview-objects-and-cython-arrays

Jacob Schreiber

2016-02-11 19:17:51 UTC

Hi Mahesh

Accessing the underlying buffer through a raw pointer, in the way you
identified, is significantly faster than using a typed memoryview for
reading and writing.

Jacob

Jacob Vanderplas

2016-02-11 19:18:20 UTC

Thanks Mahesh,
That particular code was committed in early 2012, which (if I remember
correctly) was before Cython supported the typed-memoryview interface. I
suspect this is why raw pointers were used. Looking at the code, it seems
that replacing them with typed memoryviews should be just fine (as long as
the boundscheck and wraparound directives are turned off).
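
For readers unfamiliar with the two checks mentioned above, this pure-Python sketch shows the behavior they refer to. It uses the built-in memoryview, which always performs both checks; in Cython, the boundscheck and wraparound compiler directives control whether the equivalent checks are generated for typed memoryview indexing. The variable names are illustrative only.

```python
import array

values = memoryview(array.array("f", [10.0, 20.0, 30.0]))

# Wraparound: a negative index is counted from the end of the buffer.
last = values[-1]

# Bounds checking: an out-of-range index raises an exception instead of
# reading past the end of the buffer.
try:
    values[3]
    out_of_range_caught = False
except IndexError:
    out_of_range_caught = True
```

With both directives disabled, Cython skips these checks on each access, so an invalid index is no longer caught and the code has to guarantee valid indices itself, just as it does with a raw pointer.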
Jake

mahesh ravishankar

2016-02-12 19:04:29 UTC

Thanks, Jacob V. and Jacob S.
I have forked scikit-learn on GitHub and will start making the changes
on my branch. I will send them for code review once I am done.

Mahesh

Jacob Schreiber

2016-02-12 19:24:16 UTC

I would be interested in knowing whether typed memoryviews can be used
without decreasing performance. Please ping me once you have results!
