Discussion:
[Scikit-learn-general] algorithm solve classical MDS with SVD
Terry Peng
2013-03-28 15:51:23 UTC
Hi all,

I was discussing with Nelle adding an algorithm that solves classical MDS with SVD. One thing we are not sure about is how to detect missing data so that we can fall back to SMACOF in that case.
My idea is to check whether there are any 0s among the non-diagonal elements.

what do you think?

Thanks & Regards,
--Terry
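
For readers following along, here is a minimal sketch of what is being proposed, assuming a complete, symmetric Euclidean distance matrix. It is not code from the thread or from scikit-learn: the function names are made up for illustration, and it uses a symmetric eigendecomposition (np.linalg.eigh), which coincides with the SVD when the centered matrix is positive semi-definite. The second helper implements the off-diagonal zero check suggested above.

```python
import numpy as np


def classical_mds(D, n_components=2):
    """Embed points from a complete distance matrix D of shape (n, n)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances: B = -1/2 * J D^2 J, J = I - 11^T / n.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # B is symmetric, so eigh can stand in for the SVD here.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues (round-off or non-Euclidean input).
    scales = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scales


def has_offdiagonal_zeros(D):
    """Heuristic from the post above: is there any zero outside the diagonal?"""
    D = np.asarray(D)
    off_diagonal = ~np.eye(D.shape[0], dtype=bool)
    return bool(np.any(D[off_diagonal] == 0))
```

On exact Euclidean distances the embedding reproduces the pairwise distances up to rotation and translation; the zero check would then decide whether to hand the problem to SMACOF instead.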
Nelle Varoquaux
2013-03-28 15:57:37 UTC
Hi Terry,

We need to find a uniform way over the whole scikit to indicate missing
data. Hence, 0 cannot be how missing data is spotted.
A solution would be to use "NaN", but it is not very satisfying either, as
it could lead one to think there is missing data when there isn't.

Maybe we should add an argument named missing, with how the missing data is
indicated in the matrices ? For example, the signature of the MDS, using
nan as missing data would be something like:

mds.fit(X, missing=np.nan)

Cheers,
N
Gael Varoquaux
2013-03-28 16:00:42 UTC
Post by Nelle Varoquaux
Maybe we should add an argument named missing, with how the missing data is
indicated in the matrices ? For example, the signature of the MDS, using nan as
missing data would be something like:
mds.fit(X, missing=np.nan)
This would be my favorite solution (however, the argument would go in
the class init). And missing=None (which would be the default) would mean
that SMACOF is not used.

By the way, thanks a lot for volunteering to do that, Terry, this is
something that had been missing for quite a while.

G
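
A rough sketch of the idea discussed above, with the marker passed to the constructor. Everything here is hypothetical: SketchMDS, missing_mask and choose_solver are illustrative names, not scikit-learn API, and only the solver choice is shown. Note that a NaN marker needs np.isnan, since NaN never compares equal to itself.

```python
import numpy as np


def missing_mask(D, missing=None):
    """Boolean mask of entries flagged as missing (all False if missing is None)."""
    D = np.asarray(D, dtype=float)
    if missing is None:
        return np.zeros(D.shape, dtype=bool)
    if isinstance(missing, float) and np.isnan(missing):
        # NaN != NaN, so an equality test would never match.
        return np.isnan(D)
    return D == missing


class SketchMDS:
    """Hypothetical estimator; only the choice of solver is illustrated."""

    def __init__(self, n_components=2, missing=None):
        self.n_components = n_components
        self.missing = missing

    def choose_solver(self, D):
        mask = missing_mask(D, self.missing)
        # The diagonal of a distance matrix is always 0, so it is ignored.
        np.fill_diagonal(mask, False)
        return "smacof" if mask.any() else "svd"
```

With missing=None the mask is empty and the SVD path is always taken, which matches the default behaviour suggested above.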
Mathieu Blondel
2013-03-28 16:52:49 UTC
On Fri, Mar 29, 2013 at 12:57 AM, Nelle Varoquaux
Post by Nelle Varoquaux
We need to find a uniform way over the whole scikit to indicate missing
data. Hence, 0 cannot be how missing data is spotted.
A solution would be to use "NaN", but it is not very satisfying either, as
it could lead one to think there is missing data when there isn't.
Encoding missing values with np.nan doesn't scale to very
high-dimensional problems with mostly missing values.
Personally, for encoding missing data, I just use sparse matrices.
Values which are actually zero can be stored explicitly in the .data
attribute.

Mathieu
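
To make the sparse encoding concrete, here is a small sketch, assuming a SciPy version that provides count_nonzero() on sparse matrices. Unstored entries play the role of missing values, while an observed zero is kept explicitly in .data. The variable names are illustrative only.

```python
import numpy as np
from scipy import sparse

# Three observed entries of a 3x3 matrix; one observed value is a true zero.
rows = np.array([0, 0, 1])
cols = np.array([1, 2, 2])
vals = np.array([2.0, 0.0, 5.0])
X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

stored = X.nnz                # number of stored entries, explicit zero included
nonzero = X.count_nonzero()   # stored entries whose value is not zero

# CSR -> CSC keeps the explicit zero as a stored entry.
stored_csc = X.tocsc().nnz

# A dense array can no longer tell "observed 0" from "not stored".
dense = X.toarray()

# eliminate_zeros() drops explicit zeros in place, losing the distinction.
Y = X.copy()
Y.eliminate_zeros()
stored_after = Y.nnz
```

Here stored counts three entries and nonzero counts two, so the pair (nnz, count_nonzero) is one way to check whether explicit zeros survived an operation.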
Lars Buitinck
2013-03-28 17:10:59 UTC
Post by Mathieu Blondel
Encoding missing values with np.nan doesn't scale to very
high-dimensional problems with mostly missing values.
Personally, for encoding missing data, I just use sparse matrices.
Values which are actually zero can be stored explicitly in the .data
attribute.
+1 for not storing missing values, but will scipy.sparse matrices work
correctly when .data has zeros, and will conversion between formats
retain them?
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Jacob Vanderplas
2013-03-28 17:19:44 UTC
Post by Lars Buitinck
Post by Mathieu Blondel
Encoding missing values with np.nan doesn't scale to very
high-dimensional problems with mostly missing values.
Personally, for encoding missing data, I just use sparse matrices.
Values which are actually zero can be stored explicitly in the .data
attribute.
+1 for not storing missing values, but will scipy.sparse matrices work
correctly when .data has zeros, and will conversion between formats
retain them?
It depends on the conversion. Some sparse matrix conversions keep explicit
zeros, some don't. We dealt with this in scipy.sparse.csgraph: there are
some utilities there that do the matrix conversions and make sure missing
entries & zero entries are distinguished correctly. The functions are a
bit graph-specific, but it might be useful to look at for some ideas.
Jake
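
As an illustration of the csgraph utilities mentioned above, the sketch below uses csgraph_from_dense and csgraph_to_dense with np.inf as the marker for a missing entry. The choice of np.inf as the null value is an assumption made for this example; the data mirrors the zero-weight-edge case described in the scipy.sparse.csgraph documentation.

```python
import numpy as np
from scipy.sparse.csgraph import csgraph_from_dense, csgraph_to_dense

# Dense matrix where np.inf marks a missing entry and 0 is a real value.
G = np.array([[np.inf, 2.0, 0.0],
              [2.0, np.inf, np.inf],
              [0.0, np.inf, np.inf]])

# Entries equal to null_value are left out; real zeros are stored explicitly.
G_sparse = csgraph_from_dense(G, null_value=np.inf)

# Going back, unstored entries are filled with the chosen null value.
G_back = csgraph_to_dense(G_sparse, null_value=np.inf)
```

Calling csr_matrix(G) directly would instead store the np.inf entries and drop the zeros, which is the confusion these helpers avoid.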
------------------------------------------------------------------------------
Own the Future-Intel® Level Up Game Demo Contest 2013
Rise to greatness in Intel's independent game demo contest.
Compete for recognition, cash, and the chance to get your game
on Steam. $5K grand prize plus 10 genre and skill prizes.
Submit your demo by 6/6/13. http://p.sf.net/sfu/intel_levelupd2d
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Nelle Varoquaux
2013-03-28 18:35:14 UTC
Post by Jacob Vanderplas
Post by Lars Buitinck
Post by Mathieu Blondel
Encoding missing values with np.nan doesn't scale to very
high-dimensional problems with mostly missing values.
Personally, for encoding missing data, I just use sparse matrices.
Values which are actually zero can be stored explicitly in the .data
attribute.
+1 for not storing missing values, but will scipy.sparse matrices work
correctly when .data has zeros, and will conversion between formats
retain them?
It depends on the conversion. Some sparse matrix conversions keep explicit
zeros, some don't. We dealt with this in scipy.sparse.csgraph: there are
some utilities there that do the matrix conversions and make sure missing
entries & zero entries are distinguished correctly. The functions are a
bit graph-specific, but it might be useful to look at for some ideas.
But in general, I don't think we can "force" the user to use sparse
matrices. They are an absolute pain to work with because of the
inconsistencies of interface with ndarray and conversion between sparse and
dense can be time consuming. Hence, I think we need to find something that
works both with dense and sparse matrices.

N
Kenneth C. Arnold
2013-03-28 18:52:29 UTC
On Thu, Mar 28, 2013 at 2:35 PM, Nelle Varoquaux
Post by Nelle Varoquaux
But in general, I don't think we can "force" the user to use sparse
matrices. They are an absolute pain to work with because of the
inconsistencies of interface with ndarray and conversion between sparse and
dense can be time consuming.
Every once in a while I think of refactoring SciPy's sparse matrices into
sparse arrays (with ndarray-like semantics) with the matrix interface as a
wrapper like np.matrix is, but it would be a big undertaking and so far
nobody else has seemed excited about it.

-Ken
Lars Buitinck
2013-03-28 22:24:52 UTC
On Thu, Mar 28, 2013 at 2:35 PM, Nelle Varoquaux
Post by Nelle Varoquaux
But in general, I don't think we can "force" the user to use sparse
matrices. They are an absolute pain to work with because of the
inconsistencies of interface with ndarray and conversion between sparse and
dense can be time consuming.
True. What is the use case we're thinking of? Most data missing or
most data zero and only some of it missing? The latter can be easily
handled by sparse matrices and ndarrays in the same way, but the
former is tricky.
Post by Nelle Varoquaux
Every once in a while I think of refactoring SciPy's sparse matrices into
sparse arrays (with ndarray-like semantics) with the matrix interface as a
wrapper like np.matrix is, but it would be a big undertaking and so far
nobody else has seemed excited about it.
I know the "need to rewrite scipy.sparse" itch. I for one would be
very excited if you were to volunteer ;)
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Mathieu Blondel
2013-03-29 02:26:02 UTC
Post by Lars Buitinck
I know the "need to rewrite scipy.sparse" itch. I for one would be
very excited if you were to volunteer ;)
I almost started this project twice but then I realized I would shoot
myself in the foot if I started it alone and gave up. If we do it as a
community project, I would gladly give a hand. The idea would be to
start it as a third-party project and merge it in numpy when it's
ready (IMO, sparse arrays are such a basic requirement that they
should go in numpy, not scipy). The project can be bootstrapped by
depending on scipy.sparse at first (the long term goal being to not
depend on it).

Regarding missing values, this could be helpful in the dense case:

Mathieu
Nelle Varoquaux
2013-03-29 07:26:07 UTC
Post by Kenneth C. Arnold
On Thu, Mar 28, 2013 at 2:35 PM, Nelle Varoquaux
Post by Nelle Varoquaux
But in general, I don't think we can "force" the user to use sparse
matrices. They are an absolute pain to work with because of the
inconsistencies of interface with ndarray and conversion between sparse and
dense can be time consuming.
True. What is the use case we're thinking of? Most data missing or
most data zero and only some of it missing? The latter can be easily
handled by sparse matrices and ndarrays in the same way, but the
former is tricky.
I think this really depends on the use case and the application. This is
what makes dealing with missing data so complicated. In the case of the
MDS, missing data is explicitly not taken into account in the objective
function: I'm not sure how much this extends to other algorithms.
Also in the case of the classical MDS, having 0 entries in the distance
matrix doesn't make much sense, so I'd tend to think 0 should be
considered as missing data. And you can have both the case of a lot of
missing data (for example if you have to measure the distances between two
points, and this is costly or time consuming, you would end up with a lot
of missing data), and no missing data at all.
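
To spell out what "not taken into account in the objective function" can look like, here is a minimal sketch of a raw stress with 0/1 weights, where pairs flagged as unobserved contribute nothing. The function name and the observed mask are assumptions for illustration; this is not the scikit-learn SMACOF implementation.

```python
import numpy as np


def weighted_stress(X, D, observed):
    """Sum over pairs i < j of w_ij * (||x_i - x_j|| - D_ij)^2, w_ij in {0, 1}."""
    X = np.asarray(X, dtype=float)
    D = np.asarray(D, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Count each pair once (strict upper triangle) and only if it was observed.
    upper = np.triu(np.ones(dist.shape, dtype=bool), k=1)
    use = upper & np.asarray(observed, dtype=bool)
    # np.where keeps placeholders such as NaN in D out of the sum.
    residual = np.where(use, dist - D, 0.0)
    return float((residual ** 2).sum())
```

Because unobserved pairs are simply skipped, the value does not depend on whatever placeholder (0, NaN, or anything else) sits in those positions of D.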
Post by Kenneth C. Arnold
Post by Nelle Varoquaux
Every once in a while I think of refactoring SciPy's sparse matrices into
sparse arrays (with ndarray-like semantics) with the matrix interface as a
wrapper like np.matrix is, but it would be a big undertaking and so far
nobody else has seemed excited about it.
I know the "need to rewrite scipy.sparse" itch. I for one would be
very excited if you were to volunteer ;)
Me too ! I could also give a hand.