Discussion:
[Scikit-learn-general] linear_model.SGDClassifier(): ValueError: ndarray is not C-contiguous when calling partial_fit()
Tom Kenter
2013-10-04 14:42:15 UTC
Dear all,

I am trying to run a linear_model.SGDClassifier() and have it update after
every example it classifies.
My code works for a small feature file (10 features), but with a bigger
feature file (some 80,000 features, very sparse) it fails straight away, the
first time partial_fit() is called.

This is what I do in pseudocode:

X, y = load_svmlight_file(train_file)
classifier = linear_model.SGDClassifier()
classifier.fit(X, y)

for every test_line in test file:
    test_X, test_y = getFeatures(test_line)
    # This gives me a Python list for X
    # and an integer label for y

    print "prediction: %f" % classifier.predict([test_X])

    classifier.partial_fit(csr_matrix([test_X]),
                           csr_matrix([Y_GroundTruth]),
                           classes=np.unique(y))

The error I keep getting for the partial_fit() line is:

  File "/datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 487, in partial_fit
    coef_init=None, intercept_init=None)
  File "/datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 371, in _partial_fit
    sample_weight=sample_weight, n_iter=n_iter)
  File "/datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 451, in _fit_multiclass
    for i in range(len(self.classes_)))
  File "/datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 517, in __call__
    self.dispatch(function, args, kwargs)
  File "/datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 312, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 136, in __init__
    self.results = func(*args, **kwargs)
  File "/datastore/tkenter1/epd/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 284, in fit_binary
    est.power_t, est.t_, intercept_decay)
  File "sgd_fast.pyx", line 327, in sklearn.linear_model.sgd_fast.plain_sgd (sklearn/linear_model/sgd_fast.c:7568)
ValueError: ndarray is not C-contiguous

I also tried feeding partial_fit() plain Python lists, or numpy arrays (which
are C-contiguous (order='C') by default, I thought), but this gives the same
result.
The classes argument is not the problem, I think: the same error appears
whether I leave it out or hard-code the correct classes.

I do notice that when I print the flags of the classifier's coef_ array, it
says:

Flags of coef_ array:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
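
For comparison, a freshly created numpy array is C-ordered unless you
explicitly ask for Fortran order - at least, that is my understanding of
plain numpy:

>>> import numpy as np
>>> a = np.ones((3, 4))            # default is order='C'
>>> a.flags['C_CONTIGUOUS']
True
>>> np.asfortranarray(a).flags['C_CONTIGUOUS']
False
>>> np.ascontiguousarray(np.asfortranarray(a)).flags['C_CONTIGUOUS']
True

So the Fortran ordering seems to come from inside the classifier, not from
anything I pass in.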

I am sure I am doing something wrong, but really, I don't see what...
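
In case it helps, here is a stripped-down version of the same pattern, with
random sparse data standing in for my real feature files (the shapes, number
of classes and density are made up):

import numpy as np
import scipy.sparse as sp
from sklearn import linear_model

# random sparse data standing in for the real 80,000-feature file
n_features = 80000
X = sp.rand(100, n_features, density=0.001, format='csr')
y = np.random.randint(0, 3, size=100)

classifier = linear_model.SGDClassifier()
classifier.fit(X, y)

# one "test" example, predicted and then fed back in with its label
test_X = sp.rand(1, n_features, density=0.001, format='csr')
print "prediction: %f" % classifier.predict(test_X)
classifier.partial_fit(test_X, [1], classes=np.unique(y))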

Any help appreciated!

Cheers,

Tom
Peter Prettenhofer
2013-10-08 08:01:40 UTC
Hi Tom,

that's a bug - I'll open a ticket for it.
A quick fix: call partial_fit instead of fit just before the ``for`` loop.
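
In code, the start of your script would then look something like this
(classes has to be passed on that first partial_fit call - just a sketch,
not the eventual fix in scikit-learn):

classifier = linear_model.SGDClassifier()
# one pass over the training data via partial_fit instead of fit
classifier.partial_fit(X, y, classes=np.unique(y))
# ... the partial_fit calls inside the loop stay as they are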

- Peter
Post by Tom Kenter
I am trying to run a linear_model.SGDClassifier() and have it update after
every example it classifies.
[...]
ValueError: ndarray is not C-contiguous
--
Peter Prettenhofer
Lars Buitinck
2013-10-09 15:51:32 UTC
Post by Peter Prettenhofer
that's a bug - I'll open a ticket for it.
A quick fix: call partial_fit instead of fit just before the ``for`` loop.
Peter, is this due to an optimization that turns coef_ into a
Fortran-ordered array? If so, I don't think we need it any longer with
NumPy 1.7 and the new sklearn.extmath.fast_dot:

In [1]: X = np.random.randn(10000, 200)

In [2]: Y = np.random.randn(200, 70)

In [3]: %timeit np.dot(X, Y)
100 loops, best of 3: 16.5 ms per loop

In [4]: Yf = asfortranarray(Y)

In [5]: %timeit np.dot(X, Yf)
100 loops, best of 3: 16.7 ms per loop

In [6]: numpy.__version__
Out[6]: '1.7.1'
Peter Prettenhofer
2013-10-09 17:30:56 UTC
great - thanks Lars - will prepare a PR
Post by Lars Buitinck
Peter, is this due to an optimization that turns coef_ into a
Fortran-ordered array? If so, I don't think we need it any longer with
NumPy 1.7 and the new sklearn.extmath.fast_dot.
[...]
--
Peter Prettenhofer
Lars Buitinck
2013-10-09 19:49:02 UTC
Post by Peter Prettenhofer
great - thanks Lars - will prepare a PR
I just realized that I forgot to benchmark the sparse case as well.
>>> X = fetch_20newsgroups_vectorized().data
>>> Y = np.random.randn(X.shape[1], 20)
>>> %timeit X * Y
10 loops, best of 3: 64 ms per loop
>>> Yf = np.asfortranarray(Y)
>>> %timeit X * Yf
10 loops, best of 3: 72.7 ms per loop
>>> Y = np.random.randn(X.shape[1], 200)
>>> Yf = np.asfortranarray(Y)
>>> %timeit X * Y
1 loops, best of 3: 381 ms per loop
>>> %timeit X * Yf
1 loops, best of 3: 498 ms per loop

Though I prefer a working SGD to a fast one that doesn't work ;)
