Discussion:
manifold learning on abstract metric spaces
Jacob VanderPlas
2011-09-20 23:25:58 UTC
Hello,
I recently was contacted by someone interested in using manifold
learning methods on abstract metric spaces: that is, the training data
is a matrix of pairwise distances rather than a set of points. It would
be fairly straightforward to implement this for basic LLE and Isomap,
and could probably be done for the other manifold methods as well. Two
questions:
1) does this seem like a feature worth including in scikit-learn? Are
there common use-cases people can think of?
2) any ideas about the best interface to allow this? Because the format
of the input is so different from the normal use-case, it may be best to
make it a separate estimator. Perhaps `MetricLLE`, `MetricIsomap` or
something similar. Another option would be to have a keyword similar to
the `kernel='precomputed'` option in `KernelPCA`.
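As a rough sketch of what either option could look like from the user's
side (the estimator names and the `metric='precomputed'` keyword below are
only hypothetical, not an existing API):

import numpy as np
from sklearn.metrics import pairwise_distances

# Stand-in for "abstract" objects: here the distances happen to come from
# vectors, but any valid (n_samples, n_samples) distance matrix would do.
X = np.random.RandomState(0).rand(100, 5)
D = pairwise_distances(X)

# Option A: a dedicated estimator (hypothetical name)
# embedding = MetricIsomap(n_components=2).fit_transform(D)

# Option B: a keyword, analogous to kernel='precomputed' in KernelPCA
# embedding = Isomap(n_components=2, metric='precomputed').fit_transform(D)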
Any thoughts?
Jake
Robert Layton
2011-09-20 23:34:26 UTC
On 21 September 2011 09:25, Jacob VanderPlas wrote:
Post by Jacob VanderPlas
1) does this seem like a feature worth including in scikit-learn? Are
there common use-cases people can think of?
2) any ideas about the best interface to allow this?
I have very little knowledge of manifold learning in general, but I am all
for this.
DBSCAN (the clustering algorithm) optionally takes a precomputed distance
matrix, and I'd like to see that option in more algorithms where possible.
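For reference, a minimal sketch of that pattern, assuming a scikit-learn
version in which DBSCAN accepts metric='precomputed':

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(50, 3)
D = pairwise_distances(X)  # any precomputed (n_samples, n_samples) distances

# fit on the distance matrix itself rather than on the raw data
labels = DBSCAN(eps=0.5, min_samples=5, metric='precomputed').fit(D).labels_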

- Robert
Gael Varoquaux
2011-09-20 23:50:51 UTC
Post by Jacob VanderPlas
I recently was contacted by someone interested in using manifold
learning methods on abstract metric spaces: that is, the training data
is a matrix of pairwise distances rather than a set of points. It would
be fairly straightforward to implement this for basic LLE and Isomap,
and could probably be done for the other manifold methods as well.
Just a quick answer from someone who does too many things:

- It is a general pattern that can be found with many other algorithms,
therefore I think that it should be in the scikit

- I don't know which interface is right, but the problem pops up in
many different places in the scikit, and we should give it some
thought.

my 2 cents,

G
Robert Layton
2011-09-20 23:56:06 UTC
Post by Gael Varoquaux
- I don't know which interface is right, but the problem pops up in
many different places in the scikit, and we should give it some
thought.
I really like the metric='precomputed' concept, which allows both
specifying actual metrics (euclidean, manhattan) and passing in a
precomputed array. If the algorithm doesn't allow it for whatever reason*,
throw an error. The same interface works for kernels as well.

* k-means springs to mind - it's only 'proven' for Euclidean distance, which
means it should raise an error if anything else is passed to it. I have an
implementation that works solely on a distance matrix, but I don't know
whether it retains the guarantees of the base algorithm.
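Something along these lines (a purely illustrative helper, not an existing
scikit-learn function) would capture that "error out if unsupported"
behaviour:

def check_metric(metric, supported=('euclidean',), allow_precomputed=True):
    """Raise an error if the requested metric is not usable by an estimator.

    Hypothetical helper: an estimator such as k-means, which is only
    proven for Euclidean distance, would call it with
    allow_precomputed=False.
    """
    if metric == 'precomputed':
        if not allow_precomputed:
            raise ValueError("this estimator does not accept a precomputed "
                             "distance matrix")
        return
    if not callable(metric) and metric not in supported:
        raise ValueError("unsupported metric: %r" % metric)

# e.g. a hypothetical KMeans.fit would start with:
# check_metric(self.metric, supported=('euclidean',), allow_precomputed=False)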
Mathieu Blondel
2011-09-21 04:39:01 UTC
Post by Robert Layton
I really like the metric='precomputed' concept, which allows both
specifying actual metrics (euclidean, manhattan) and passing in a
precomputed array.
I like metric="precomputed" too and I think we should continue to
provide it when it's easy, but its main problem is its O(n_samples^2)
space complexity. It would be better if we had an efficient way to
work with triangular matrices, but that would only cut the memory
consumption in half. Also, the point of algorithms such as SVMs is
precisely that they have sparse solutions, so it seems like overkill to
precompute everything in advance.

The solution, for me, is to use a cache (in the spirit of a kernel
cache) and to give the algorithm a way to recompute pairwise
similarities / distances on demand, for example with a callable
object. That would be slow in Python, which makes me think that we
should start thinking about providing a Cython API, in addition to our
Python API, when appropriate.
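A rough pure-Python sketch of that idea (illustrative only; a real version
would live in Cython for speed and would bound the cache size):

import numpy as np

class CachedDistance(object):
    """Compute pairwise distances on demand and memoize them, instead of
    materializing the full O(n_samples^2) matrix up front."""

    def __init__(self, X, dist_func):
        self.X = X
        self.dist_func = dist_func          # callable(a, b) -> float
        self._cache = {}

    def __call__(self, i, j):
        key = (i, j) if i <= j else (j, i)  # distances are symmetric
        if key not in self._cache:
            self._cache[key] = self.dist_func(self.X[i], self.X[j])
        return self._cache[key]

# only the pairs the algorithm actually asks for are ever computed
X = np.random.RandomState(0).rand(10000, 20)
dist = CachedDistance(X, lambda a, b: np.sqrt(((a - b) ** 2).sum()))
print(dist(3, 7))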

My 2c,
Mathieu
Matthieu Brucher
2011-09-21 06:13:17 UTC
Hi,

I think some of the algorithms already offer this (Laplacian Eigenmaps for
instance).
I'm -1 for LLE, as LLE does not compute distances but rather weights based
on the points directly.

Matthieu
Post by Jacob VanderPlas
It would be fairly straightforward to implement this for basic LLE and
Isomap, and could probably be done for the other manifold methods as well.
Jacob VanderPlas
2011-09-21 15:40:34 UTC
To be clear, LLE's weights are found via a linear solution involving
covariances of local neighborhoods, which can be constructed from a
matrix of pairwise distances in a way analogous to that of metric MDS.
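A small numpy sketch of that construction (illustrative only, not the
scikit-learn implementation), using the identity
(x_i - x_j) . (x_i - x_k) = (d_ij^2 + d_ik^2 - d_jk^2) / 2:

import numpy as np

def lle_weights_from_distances(d_i, D_nbrs, reg=1e-3):
    """Reconstruction weights of one point w.r.t. its k nearest neighbors.

    d_i    : (k,) distances from the point to its k neighbors
    D_nbrs : (k, k) pairwise distances among those neighbors
    """
    # local "covariance" C_jk = (x_i - x_j) . (x_i - x_k), from distances only
    C = 0.5 * (d_i[:, None] ** 2 + d_i[None, :] ** 2 - D_nbrs ** 2)
    # regularize as in standard LLE, then solve C w = 1 and normalize
    C = C + reg * np.trace(C) * np.eye(len(d_i))
    w = np.linalg.solve(C, np.ones(len(d_i)))
    return w / w.sum()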
Jake
Post by Matthieu Brucher
Hi,
I think some of the algorithms already offer this (Laplacian Eigenmaps
for instance).
I'm -1 for LLE, as LLE does not compute distances but rather weights based
on the points directly.
Matthieu