Discussion:
[Scikit-learn-general] feature selection algo
Satrajit Ghosh
2012-06-15 12:58:05 UTC
fyi

---------- Forwarded message ----------
From: joshua vogelstein <***@jhu.edu>
Date: Fri, Jun 15, 2012 at 12:35 AM

http://jmlr.csail.mit.edu/papers/volume13/song12a/song12a.pdf

these guys define a nice nonlinear/nonparametric measure of correlation
that might be of interest to you.
Yaroslav Halchenko
2012-06-15 13:42:39 UTC
hm... interesting -- and there is no comparison against "minimizing
independence"? e.g. dCov measure
http://en.wikipedia.org/wiki/Distance_correlation which is really simple
to estimate and as intuitive as a correlation coefficient
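
For reference, a minimal NumPy sketch of the sample estimator described on the Wikipedia page linked above: double-center the pairwise distance matrix of each sample, then average their elementwise product. The function names (centered_distances, dcov2, dcor) are illustrative and not taken from any package. This is the biased (V-statistic) form and it needs O(n^2) memory.

```python
import numpy as np


def centered_distances(x):
    """Double-centered matrix of pairwise Euclidean distances between rows."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # a 1-D input is n observations of a scalar variable
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # subtract row and column means, add back the grand mean
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcov2(x, y):
    """Squared sample distance covariance: mean of A * B over all (k, l)."""
    return float((centered_distances(x) * centered_distances(y)).mean())


def dcor(x, y):
    """Sample distance correlation, a value in [0, 1]."""
    denom = np.sqrt(dcov2(x, x) * dcov2(y, y))
    if denom == 0.0:
        return 0.0  # at least one sample is constant
    return float(np.sqrt(max(dcov2(x, y), 0.0) / denom))


x = np.linspace(-1.0, 1.0, 101)
print(dcor(x, 3.0 * x + 2.0))  # linear dependence gives 1 up to rounding
```

Both inputs must have the same number of observations; a constant sample is mapped to 0 here by convention, which is a choice made for this sketch.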
--
Yaroslav O. Halchenko
Postdoctoral Fellow, Department of Psychological and Brain Sciences
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
Satrajit Ghosh
2012-06-15 14:10:37 UTC
hi yarik,

thanks for bringing up dCov. have you had a chance to play with it (their R
package [1]) or to try implementing it from their paper [2]? it seems that dCov,
as defined in the paper, doesn't apply to 2 vectors, but it can compare the
covariance of two datasets where n_samples > 1. their code also allows you
to pass two distance matrices into dCor and dCov, and dCor(x,x) != dCor(x).

i would love it if somebody could explain to me how dCor and dCov can be used
for 2 random vectors.
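
One way to read the paper's setup: each row of X (shape n x p) together with the matching row of Y (shape n x q) is one joint observation of the two random vectors, so p and q may differ while n must match. Two plain 1-D arrays are then just the case p = q = 1. The sketch below works from precomputed distance matrices; dcor_from_distances is a hypothetical helper written for illustration, not the energy package's API.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distances between the rows of an (n, p) array."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def double_center(d):
    """Remove row means and column means, add back the grand mean."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcor_from_distances(dx, dy):
    """Distance correlation from two (n, n) distance matrices."""
    a, b = double_center(dx), double_center(dy)
    dvar = np.sqrt((a * a).mean() * (b * b).mean())
    if dvar == 0.0:
        return 0.0
    return float(np.sqrt(max((a * b).mean(), 0.0) / dvar))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # n = 200 observations of a 3-vector
Y = np.column_stack([X[:, 0] ** 2,        # 2-vector that depends on X nonlinearly
                     np.sin(X[:, 1])])
Z = rng.normal(size=(200, 2))             # 2-vector drawn independently of X

dx = pairwise_distances(X)
dependent = dcor_from_distances(dx, pairwise_distances(Y))
independent = dcor_from_distances(dx, pairwise_distances(Z))
print(dependent, independent)             # the dependent pair scores higher
```

The independent pair does not come out as exactly zero because this estimator is biased upward in finite samples.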

all that said, i started implementing dCor/dCov for sklearn and stopped, as
i couldn't match my output with the R code (i didn't actually look at the
source there) and it seemed it wasn't going to be related to comparing two
samples.

cheers,

satra

[1] http://cran.r-project.org/web/packages/energy/index.html
[2] http://arxiv.org/pdf/0803.4101.pdf
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Yaroslav Halchenko
2012-06-15 14:19:28 UTC
I haven't had a chance to play with it extensively but I have a basic
implementation:
https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/misc/dcov.py
which still lacks statistical assessment, but provides dCov, dCor values

and yes, it is "inherently multivariate", but since it could also be
useful if applied in a univariate fashion, you can see those uv bool
arguments. The implementation as is is very memory hungry (even for the
multivariate case... which shouldn't be the case)
Satrajit Ghosh
2012-06-15 14:30:04 UTC
hi yarik,

here is my attempt:

https://github.com/satra/scikit-learn/blob/enh/covariance/sklearn/covariance/distance_covariance.py

i'll look at your code in detail later today to understand the uv=True case.

cheers,

satra
Yaroslav Halchenko
2012-06-15 14:45:17 UTC
uv=True is just to compute dCov/dCor between each pair of variables (so
similar to what corrcoef does) instead of taking them all as one
multivariate pattern.
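
A sketch of what such a univariate mode might look like: every column is treated as its own scalar variable and the result is a p x p matrix, analogous to np.corrcoef(data, rowvar=False). This is an illustration written for this thread, not the PyMVPA code, and dcor_matrix is a hypothetical name. Looping over pairs keeps only p centered n x n matrices in memory.

```python
import numpy as np


def centered_distances_1d(v):
    """Double-centered |v_k - v_l| matrix for one scalar variable."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcor_matrix(data):
    """Pairwise distance correlations between the columns of an (n, p) array."""
    data = np.asarray(data, dtype=float)
    centered = [centered_distances_1d(col) for col in data.T]
    p = len(centered)
    dcov2 = np.empty((p, p))
    for i in range(p):
        for j in range(i, p):
            dcov2[i, j] = dcov2[j, i] = (centered[i] * centered[j]).mean()
    dvar = np.sqrt(np.diag(dcov2))          # sqrt of each squared distance variance
    scale = np.outer(dvar, dvar)
    # guard against constant columns, whose distance variance is zero
    ratio = np.divide(dcov2, scale, out=np.zeros_like(dcov2), where=scale > 0)
    return np.sqrt(np.clip(ratio, 0.0, None))


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=300)
data = np.column_stack([x, x ** 2, rng.normal(size=300)])
m = dcor_matrix(data)
print(np.round(m, 2))                       # symmetric, ones on the diagonal
```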
j***@gmail.com
2012-06-15 19:50:46 UTC
trying to see how well this works, here's my gist
https://gist.github.com/2938402

bretzel and cosine have essentially zero distance correlation ??
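
As a sanity check for cases like this one: when y = cos(x) is sampled over a single symmetric period, Pearson correlation is near zero, but a distance correlation estimator should still report a clearly nonzero value because y is a deterministic function of x. The sketch below is an independent check written for illustration, not the code from the gist; the interval and sample size are arbitrary choices.

```python
import numpy as np


def dcor_1d(x, y):
    """Distance correlation of two equal-length 1-D samples (biased estimator)."""
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

    a, b = centered(np.asarray(x, float)), centered(np.asarray(y, float))
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return 0.0 if denom == 0.0 else float(np.sqrt(max((a * b).mean(), 0.0) / denom))


x = np.linspace(-np.pi, np.pi, 201)
y = np.cos(x)

pearson = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {pearson:.3f}")        # close to 0: cos is even, x is symmetric
print(f"dCor      = {dcor_1d(x, y):.3f}")  # well above 0: y is a function of x
```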

Josef
j***@gmail.com
2012-06-15 19:52:20 UTC
(the gist at https://gist.github.com/2938402 is univariate only)
Josef
Yaroslav Halchenko
2012-06-15 20:20:19 UTC
Here is a comparison with the output of my code (marked with >):

0.00458652660079 0.788017364828 0.00700027844478 0.00483928213727
0.145564526722 0.480124905375 0.422482399359 0.217567496918
6.50616752373e-07 7.99461373461e-05 0.00700027844478 0.0094610687282
0.120884106118 0.249205123601 0.422482399359 0.556949542822
0.00760091429285 0.819325410184 0.00909659010031 0.0094610687282
0.108943679969 0.195731527474 0.556245690994 0.556949542822
9.30536435524e-08 9.96844926893e-06 0.00921024880111 0.0094610687282
0.0362155612112 0.0648617517611 0.559754044038 0.556949542822
;)
j***@gmail.com
2012-06-15 20:42:00 UTC
I was only looking at Wikipedia (I downloaded the papers a while ago).

https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/misc/dcov.py#L160
looks like a double sum, but Wikipedia has only one sum, over an elementwise product.

When you have a nice implementation, I will (try to) borrow it :)

Josef
Yaroslav Halchenko
2012-06-15 20:50:09 UTC
Post by j***@gmail.com
https://github.com/PyMVPA/PyMVPA/blob/master/mvpa2/misc/dcov.py#L160
looks like a double sum, but wikipedia only has one sum, elementwise product.
sorry -- I might be slow -- what sum? there is only an outer product in

160: Axy = Ax[:, None] * Ay[None, :]
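
To make the shapes concrete: if Ax and Ay are stacks of centered distance matrices, one per variable, the quoted line pairs every variable of x with every variable of y by broadcasting, and the elementwise product from the Wikipedia formula then happens within each pair. The array layout below is an assumption made for illustration, not a reading of the PyMVPA source. The intermediate array holds p * q * n * n values, which is one plausible reason such an approach gets memory hungry; the einsum line computes the same numbers without it.

```python
import numpy as np


def centered_stack(data):
    """Stack of double-centered distance matrices, one per column: (p, n, n)."""
    d = np.abs(data.T[:, :, None] - data.T[:, None, :])
    return (d - d.mean(axis=1, keepdims=True)
              - d.mean(axis=2, keepdims=True)
              + d.mean(axis=(1, 2), keepdims=True))


rng = np.random.default_rng(0)
n, p, q = 50, 3, 2
x = rng.normal(size=(n, p))
y = rng.normal(size=(n, q))

Ax = centered_stack(x)                    # (p, n, n)
Ay = centered_stack(y)                    # (q, n, n)

Axy = Ax[:, None] * Ay[None, :]           # (p, q, n, n): all variable pairs at once
dcov2_broadcast = Axy.mean(axis=(2, 3))   # (p, q) squared distance covariances

# Same result without the 4-D intermediate: sum A * B per pair, divide by n^2
dcov2_einsum = np.einsum('ikl,jkl->ij', Ax, Ay) / n ** 2

print(Axy.shape, np.allclose(dcov2_broadcast, dcov2_einsum))  # (3, 2, 50, 50) True
```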
Post by j***@gmail.com
When you have a nice implementation, I will (try to) borrow it :)
do you mean I should remove all embedded comments? ;)

yeah, it could be improved. I just tried to stay descriptive and
generic so it could be used for both mass-univariate stats (like
corrcoef) and multivariate ones. pretty much the only difference
is in how the distances are computed ;) otherwise it is 100% the same
code.
j***@gmail.com
2012-06-15 22:19:36 UTC
I shouldn't read too fast; I didn't see that you vectorized uv.

(
And I found my silly mistake in the distance calculation while copying
from script to file.
compared to R, which returns sqrt:
>>> np.sqrt(dr) - 0.995387854787786
1.1102230246251565e-16
)

Josef
xinfan meng
2012-06-15 13:46:16 UTC
Submitted 5/07; Revised 6/11; Published 5/12

It takes such a long time ...
--
Best Wishes
--------------------------------------------
Meng Xinfan
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China