Discussion:
[Scikit-learn-general] Specify rather than learn sparse coding dictionary?
Ian Goodfellow
2011-12-05 21:42:18 UTC
Permalink
I'm interested in doing sparse coding with scikits.learn. It looks
like the way to do this is with
sklearn.decomposition.MiniBatchDictionaryLearning. Am I correct about
that?

If so:
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
I thought I could do this by passing the desired dictionary to the
dict_init parameter of the __init__
method and then never calling a fit method. However, when I attempted
this, the transform method appears to be broken: it complains that there
is no components_ attribute. Patching it to work without having called
fit doesn't seem trivial; simply filling in the components_ attribute
with the desired dictionary doesn't work either. Is there a way to do
what I want?
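
For concreteness, roughly what I tried (a minimal sketch; the random
arrays are only placeholders for my real dictionary and data, and the
first constructor argument is the number of atoms):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
external_dict = rng.randn(50, 64)   # stand-in for a dictionary learned elsewhere
X_new = rng.randn(200, 64)          # stand-in for the data I want to encode

# Pass the externally learned dictionary as the initialization, never call fit():
coder = MiniBatchDictionaryLearning(50, dict_init=external_dict)
# coder.transform(X_new)   # fails: components_ is only set inside fit()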

Thanks,
Ian
Alexandre Gramfort
2011-12-05 21:50:58 UTC
Permalink
Post by Ian Goodfellow
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
if you have the dictionaries then use a batch lasso or batch OMP with
precomputed gram to get the coefficients. That will give you the full
estimated model.
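
e.g. something along these lines (a rough sketch; D holds one atom per
column here, and the alpha / n_nonzero_coefs values are arbitrary):

import numpy as np
from sklearn.linear_model import Lasso, orthogonal_mp_gram

rng = np.random.RandomState(0)
D = rng.randn(100, 500)                  # columns are the dictionary atoms
D /= np.sqrt((D ** 2).sum(axis=0))       # unit-norm atoms
X = rng.randn(100, 20)                   # 20 signals, one per column

# batch OMP with a precomputed Gram matrix
gram = np.dot(D.T, D)                    # (n_atoms, n_atoms)
cov = np.dot(D.T, X)                     # (n_atoms, n_signals)
codes_omp = orthogonal_mp_gram(gram, cov, n_nonzero_coefs=10)

# or one Lasso problem per signal
codes_l1 = np.column_stack([
    Lasso(alpha=0.01, fit_intercept=False).fit(D, x).coef_ for x in X.T])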

Alex
Olivier Grisel
2011-12-05 22:03:38 UTC
Permalink
Post by Alexandre Gramfort
Post by Ian Goodfellow
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
if you have the dictionaries then use a batch lasso or batch OMP with
precomputed gram to get the coefficients. That will give you the full
estimated model.
There are slightly higher level utility functions too here:

http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode.html
http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.sparse_encode_parallel.html
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Ian Goodfellow
2011-12-05 22:07:52 UTC
Permalink
On Mon, Dec 5, 2011 at 4:50 PM, Alexandre Gramfort
Post by Alexandre Gramfort
Post by Ian Goodfellow
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
if you have the dictionaries then use a batch lasso or batch OMP with
precomputed gram to get the coefficients. That will give you the full
estimated model.
I do not understand. I have the dictionary already, so what is being estimated?
Post by Alexandre Gramfort
Alex
Olivier Grisel
2011-12-05 22:14:00 UTC
Permalink
Post by Ian Goodfellow
On Mon, Dec 5, 2011 at 4:50 PM, Alexandre Gramfort
Post by Alexandre Gramfort
Post by Ian Goodfellow
One experiment I want to do involves plugging in dictionaries that
were learned with other methods.
if you have the dictionaries then use a batch lasso or batch OMP with
precomputed gram to get the coefficients. That will give you the full
estimated model.
I do not understand. I have the dictionary already, so what is being estimated?
The loadings.

- original data matrix has shape (n_samples, n_features)
- dictionary has shape (n_components, n_features)
- loadings matrix has shape (n_samples, n_components)

The loadings are the result of sparse encoding the samples in the
dictionary space.
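
In numpy terms, just to fix the shapes (a sketch with random placeholder
arrays; only the shapes matter):

import numpy as np

n_samples, n_features, n_components = 200, 64, 50
X = np.random.randn(n_samples, n_features)              # original data
dictionary = np.random.randn(n_components, n_features)
code = np.zeros((n_samples, n_components))               # loadings, to be estimated

# once the sparse loadings are estimated, the model reconstructs the data as
assert np.dot(code, dictionary).shape == (n_samples, n_features)

sparse_encode is the helper that estimates such a code; check its
docstring for the exact argument order it expects.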
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Alexandre Gramfort
2011-12-05 22:14:40 UTC
Permalink
Post by Ian Goodfellow
I do not understand. I have the dictionary already, so what is being estimated?
well, I am not sure I follow now, but if you have the dictionary the
only missing part is the coefs of the decomposition:

X = dico x coefs

Alex
Andreas Mueller
2011-12-05 22:21:01 UTC
Permalink
Post by Alexandre Gramfort
Post by Ian Goodfellow
I do not understand. I have the dictionary already, so what is being estimated?
well I am not sure to follow now, but if you have the dictionary the
only missing part is the coefs of the decomposition.
X = dico x coefs
I think there is a little misunderstanding here.
As I understand Ian, he has estimated a dictionary on some dataset
and wants to use this dictionary to encode some "new" data.

You do not need to estimate anything to get the "model".

What you do want is to "transform" the new data so that it
is coded using the specified dictionary.
I think this is exactly what the sparse encoding method that Olivier
referenced is doing.

Hope I got it right and that this clears things up more than it confuses.

Cheers,
Andy
Olivier Grisel
2011-12-05 22:31:55 UTC
Permalink
Post by Andreas Mueller
Post by Alexandre Gramfort
Post by Ian Goodfellow
I do not understand. I have the dictionary already, so what is being estimated?
well I am not sure to follow now, but if you have the dictionary the
only missing part is the coefs of the decomposition.
X = dico x coefs
I think there is a little misunderstanding here.
As I understand Ian, he has estimated a dictionary on some dataset
and wants to use this dictionary to encode some "new" data.
You do not need to estimate anything to get the "model".
What you do want is to "transform" the new data so that it
is coded using the specified dictionary.
Yes.
Post by Andreas Mueller
I think this is exactly what the sparse encoding method that Olivier
referenced is doing.
Yes, and this method is just a wrapper for what Alexandre is explaining
(about LARS and OMP). Both replies are consistent.

Sparse coding is not a simple linear projection (as in the transform
method of PCA, for instance), because there is a penalty on the non-zero
loadings. Hence some "estimator fitting" still occurs at "transform"
time. Unless, of course, you use fixed arbitrary thresholding, which
might be enough for some tasks (e.g. sparse feature extraction for image
classification).
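
The thresholding shortcut could look like this for instance (just a
sketch, not what the lasso / OMP based encoders do):

import numpy as np

def threshold_encode(X, dictionary, alpha):
    # project linearly onto the atoms, then soft-threshold: no per-sample
    # optimization is involved, unlike the lasso / OMP encoders
    proj = np.dot(X, dictionary.T)          # (n_samples, n_components)
    return np.sign(proj) * np.maximum(np.abs(proj) - alpha, 0.0)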
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Ian Goodfellow
2011-12-05 22:50:46 UTC
Permalink
Post by Olivier Grisel
Post by Andreas Mueller
Post by Alexandre Gramfort
Post by Ian Goodfellow
I do not understand. I have the dictionary already, so what is being estimated?
well I am not sure to follow now, but if you have the dictionary the
only missing part is the coefs of the decomposition.
X = dico x coefs
I think there is a little misunderstanding here.
As I understand Ian, he has estimated a dictionary on some dataset
and wants to use this dictionary to encode some "new" data.
You do not need to estimate anything to get the "model".
What you do want is to "transform" the new data so that it
is coded using the specified dictionary.
Yes.
Post by Andreas Mueller
I think this is exactly what the sparse encoding method that Olivier
referenced is doing.
Yes, and this method is just a wrapper for what Alexandre is explaining
(about LARS and OMP). Both replies are consistent.
Sparse coding is not a simple linear projection (as in the transform
method of PCA, for instance), because there is a penalty on the non-zero
loadings. Hence some "estimator fitting" still occurs at "transform"
time. Unless, of course, you use fixed arbitrary thresholding, which
might be enough for some tasks (e.g. sparse feature extraction for image
classification).
I think I was mostly confused by the terminology-- I don't consider the code
to be part of a sparse coding model, nor to be estimated (I am aware that
sparse coding involves iterative optimization but I don't consider the optimizer
to be solving an estimation problem).

I don't understand exactly which interface Alexandre is suggesting I use.

To use the sparse_encode interface, should I pass a dictionary of shape
(num_data_features, num_code_elements) for X and a data matrix of shape
(num_data_features, num_examples) for Y?

I have tried doing that, but for alpha = 1. or alpha = 0.1 it returns a
matrix of all zeros, and for alpha = .01 it returns a code with NaNs in it.
David Warde-Farley
2011-12-05 23:24:59 UTC
Permalink
Post by Ian Goodfellow
I think I was mostly confused by the terminology-- I don't consider the code
to be part of a sparse coding model, nor to be estimated (I am aware that
sparse coding involves iterative optimization but I don't consider the optimizer
to be solving an estimation problem).
I don't understand exactly what interface Alexandre is saying to use.
To use the sparse_encode interface, should I pass a dictionary of shape
(num_data_features, num_code_elements) for X and a data matrix of shape
(num_data_features, num_examples) for Y?
I have tried doing that, but for alpha = 1. or alpha = 0.1 it returns
a matrix of
all zeros, and for alpha = .01 it returns a code with NaNs in it.
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
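
To be concrete, the check I have in mind is the stationarity (KKT)
condition, roughly like the sketch below. Note that it assumes the
penalized objective is normalized by n_samples, i.e.
(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1; with the
unnormalized form ||y - Xw||^2 + alpha * ||w||_1 the threshold changes
by a factor of 2 * n_samples, so the convention is worth double-checking
against the docs.

import numpy as np

def lasso_kkt_violation(X, y, w, alpha):
    # assumes the objective (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1
    n_samples = X.shape[0]
    grad = np.dot(X.T, np.dot(X, w) - y) / n_samples   # gradient of the smooth term
    active = w != 0
    # active coefficients: the gradient must cancel the penalty subgradient
    viol_active = np.abs(grad[active] + alpha * np.sign(w[active]))
    # inactive coefficients: the gradient magnitude must stay below alpha
    viol_inactive = np.clip(np.abs(grad[~active]) - alpha, 0.0, None)
    return max(np.max(viol_active, initial=0.0),
               np.max(viol_inactive, initial=0.0))

A solution is (approximately) optimal when the returned violation is
(approximately) zero.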

David
Alexandre Gramfort
2011-12-06 08:04:22 UTC
Permalink
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
Can you give us a sample X and y that shows the problem?

it should ultimately use the duality gap to stop the iterations but
there might be a corner case …

let me know

Alex
David Warde-Farley
2011-12-06 18:34:21 UTC
Permalink
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
Can you give us a sample X and y that shows the problem?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [33]: import numpy as np

In [34]: rng = np.random.RandomState(0)

In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /= np.sqrt((dictionary ** 2).sum(axis=0))

In [36]: signal = rng.normal(size=100) / 1000

In [37]: from sklearn.linear_model import Lasso

In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False, tol=1e-8)

In [39]: lasso.fit(dictionary, signal)
Out[39]:
Lasso(alpha=0.0001, copy_X=True, fit_intercept=False, max_iter=1000000.0,
normalize=False, precompute='auto', tol=1e-08)

In [40]: max(abs(lasso.coef_))
Out[40]: 0.0

In [41]: from pylearn2.optimization.feature_sign import feature_sign_search

In [42]: coef = feature_sign_search(dictionary, signal, 0.0001)

In [43]: max(abs(coef))
Out[43]: 0.0027295761244725018

And I'm pretty sure the latter result is the right one, since

In [45]: def gradient(coefs):
   ....:     gram = np.dot(dictionary.T, dictionary)
   ....:     corr = np.dot(dictionary.T, signal)
   ....:     return -2 * corr + 2 * np.dot(gram, coefs) + 0.0001 * np.sign(coefs)
   ....:


and

In [51]: max(abs(gradient(coef)[coef == 0]))
Out[51]: 9.9849794040467979e-05

In [52]: max(abs(gradient(lasso.coef_)))
Out[52]: 0.0065556004140218012

meaning that the coefficient with that maximum gradient entry should have
been activated, since the gradient from the linear portion should override
the sparsity penalty.
Olivier Grisel
2011-12-06 19:43:06 UTC
Permalink
Post by David Warde-Farley
Post by Alexandre Gramfort
Post by David Warde-Farley
This actually gets at something I've been meaning to fiddle with and report but haven't had time: I'm not sure I completely trust the coordinate descent implementation in scikit-learn, because it seems to give me bogus answers a lot (i.e., the optimality conditions necessary for it to be an actual solution are not even approximately satisfied). Are you guys using something weird for the termination condition?
Can you give us a sample X and y that shows the problem?
it should ultimately use the duality gap to stop the iterations but
there might be a corner case …
In [34]: rng = np.random.RandomState(0)
In [35]: dictionary = rng.normal(size=(100, 500)) / 1000; dictionary /=
np.sqrt((dictionary ** 2).sum(axis=0))
In [36]: signal = rng.normal(size=100) / 1000
In [37]: from sklearn.linear_model import Lasso
In [38]: lasso = Lasso(alpha=0.0001, max_iter=1e6, fit_intercept=False,
tol=1e-8)
In [3