Discussion:
[Scikit-learn-general] Sprint next Friday
Gael Varoquaux
2011-03-26 10:21:11 UTC
This is just a reminder to everybody that we are having a sprint on the
scikit next Friday. It will be hosted at Logilab, in Paris, and in Boston
(AlexG do you have a location?). We will also be online on IRC for
remote participation.

The sprint is the best time to merge in a feature that you have had
half-implemented, or scratch a new itch. It is also a great way to learn
the best practices for contributing to the scikit, as core developers
will be available for quick interaction.

We will try to merge in many of the long-awaited pull requests. If you
have a pull request waiting for a merge and you are around to answer
questions, it will make the work easier.

Efficient sprinting comes with preparation. It would be great if
everybody could edit the corresponding wiki page
https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events to give
practical information: who is going to be there, what are people going to
work on, what tasks are there to be done and where can newcomers find
information to achieve these tasks.

See y'all soon!

Gaël
Alexandre Gramfort
2011-03-26 14:34:05 UTC
Post by Gael Varoquaux
This is just a reminder to everybody that we are having a sprint on the
scikit next Friday. It will be hosted at Logilab, in Paris, and in Boston
(AlexG do you have a location?).
The location at MIT is:

36-537, which is on the 5th floor of building 36 at MIT.

whereis.mit.edu/?go=36

Thanks, Satra, for booking the room!

If you're reading this, you're in Boston, and you're hesitating to come,
let me tell you that you shouldn't hesitate. Such a sprint is a great
opportunity to learn from others while improving the scikit!

Look forward to it.

Alex
Vincent Michel
2011-03-28 09:04:02 UTC
Hi,

The sprint will start at 9am, and will finish around 7pm.
Location in Paris is:

Logilab, 104 boulevard Louis-Auguste Blanqui, 75013
<http://maps.google.fr/maps?q=104%20boulevard%20blanqui,%20paris>
Metro 6 - Glacière
More details here: http://www.logilab.fr/contact

See you there!

Vincent
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Vlad Niculae
2011-03-29 20:41:48 UTC
Hello
I will also participate via IRC. My main tasks are in matrix
factorization: finishing the tests and writing examples for NMF, maybe
begin work on sparse PCA or kernel PCA or something on the list in my
thread. If not, I'll help with the other tasks as much as I can.

Should I announce this on the wiki and if so, in what form? I will
send a pull request for the NMF code at least the night before.

Anyway, I'll be on IRC as soon as I get home on Friday (is 9am UTC+1
Paris time?). See you then!

Vlad
Gael Varoquaux
2011-03-29 21:03:17 UTC
Post by Vlad Niculae
I will also participate via IRC. My main tasks are in matrix
factorization: finishing the tests and writing examples for NMF, maybe
begin work on sparse PCA or kernel PCA or something on the list in my
thread. If not, I'll help the other tasks as much as I can.
Hey Vlad,

I have a lot of interest in matrix factorization techniques, so I have
been trying to review your branch for the last week, when I had spare
cycles. Unfortunately, I haven't found time to dig deep in, and I can't
do it now. Let me give a few very general remarks about the cro, which is
what I had time to look at most. I don't love the use of the bunch object;
I think that I would prefer a dictionary, because it is a standard Python
object. Or you could use the Bunch as it is defined in
scikits.learn.datasets.base, so that we don't have two bunch objects.
Ideally, I'd like a 'flatter' data structure, simply based on lists,
but I can see why you had to do what you did. Also, I think that I'd
prefer a bit more functionality pushed into functions rather than methods,
as I don't think that the methods you have always need the structure
of an object. Functions are also often easier to test. I should delve into
this more, but...
Post by Vlad Niculae
Should I announce this on the wiki and if so, in what form?
What you have done is good.

By the way, are you still interested in submitting a Google summer of
code proposal? In which case, it would be good to start working on it,
as the deadline for proposals is nearing (this applies to anybody
wanting to do the GSoC on scikits.learn).

G
Vlad Niculae
2011-03-29 22:07:22 UTC
On Wed, Mar 30, 2011 at 12:03 AM, Gael Varoquaux
Post by Gael Varoquaux
Post by Vlad Niculae
I will also participate via IRC. My main tasks are in matrix
factorization: finishing the tests and writing examples for NMF, maybe
begin work on sparse PCA or kernel PCA or something on the list in my
thread. If not, I'll help the other tasks as much as I can.
Hey Vlad,
I have a lot of interest in matrix factorization techniques, so I have
been trying to review your branch for the last week, when I had spare
cycles. Unfortunately, I haven't found time to dig deep in, and I can't
do it now. Let me give a few very general remarks about the cro, which is
what I had time to look at most. I don't love the use of the bunch object
I think that I would prefer a dictionary, because it is a standard Python
object. Or you could use the Bunch as it is defined in
scikits.learn.datasets.base, that way we don't have 2 bunch objects.
I have indeed removed the bunch object in favour of the internal
named_tuple construct; I'm not sure whether I have pushed that change yet.
Among other unpushed changes are sparsity constraints in the NMF fit!
I can't wait to do some benchmarks and examples.
Post by Gael Varoquaux
Ideally, I'd like a more 'flat' data structure, simply based on lists,
but I can see why you had to do what you did. Also, I think that I'd
prefer a bit more functionality pushed in functions rather than methods,
as I don't think that the methods that you have always need the structure
of an object.
I will indeed try to refactor into functions. I am just more used to
this style.

Post by Gael Varoquaux
It's also often easier to test. I should delve into this
more, but...
Post by Vlad Niculae
Should I announce this on the wiki and if so, in what form?
What you have done is good.
By the way, are you still interested in submitting a Google summer of
code proposal? In which case, it would be good to start working on it,
as the deadline for proposals is nearing (this applies to anybody
wanting to do the GSoC on scikits.learn).
I am very interested indeed in the GSoC. I was waiting for updates
regarding the status of scikits-learn as a mentoring organization etc.
But I probably should get all the documents in order anyway.

Thanks!
Vlad
Gael Varoquaux
2011-03-30 04:49:51 UTC
Post by Vlad Niculae
I have indeed removed the bunch object in favour of the internal
named_tuple construct,
That's definitely what you would want to do in an ideal world, but the
named tuple is new in Python 2.6, and we are still trying to support
Python 2.5. You can try to use it, and fall back to the bunch if it's
not available.
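
A minimal sketch of that fallback, under stated assumptions: the local
Bunch class below is only a stand-in for the one in
scikits.learn.datasets.base, and make_record and the NMFResult name are
hypothetical helpers invented for this illustration, not code from the
branch under review.

```python
try:
    # collections.namedtuple was added in Python 2.6.
    from collections import namedtuple
except ImportError:
    # Older interpreters (e.g. Python 2.5): no named tuple available.
    namedtuple = None


class Bunch(dict):
    """Stand-in container: a dict whose keys are also attributes."""

    def __init__(self, **kwargs):
        dict.__init__(self, kwargs)
        self.__dict__ = self


def make_record(typename, **fields):
    """Return a named tuple if possible, otherwise a Bunch (hypothetical helper)."""
    if namedtuple is not None:
        record_type = namedtuple(typename, sorted(fields))
        return record_type(**fields)
    return Bunch(**fields)


# Both containers give the same attribute-style access to the caller.
result = make_record("NMFResult", W=[[1.0]], H=[[2.0]])
print(result.W, result.H)
```

Calling code only relies on attribute access (result.W), so it does not
need to know which of the two containers it received.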
Post by Vlad Niculae
Post by Gael Varoquaux
By the way, are you still interested in submitting a Google summer of
code proposal? In which case, it would be good to start working on it,
as the deadline for proposals is nearing (this applies to anybody
wanting to do the GSoC on scikits.learn).
I am very interested indeed in the GSoC. I was waiting for updates
regarding the status of scikits-learn as a mentoring organization etc.
We are accepting students via the PSF:
http://wiki.python.org/moin/SummerOfCode/2011
Post by Vlad Niculae
But I probably should get all the documents in order anyway.
Yes, you should work on your proposal, and work to meet the PSF
requirements.

Good luck,

Gaël
Mathieu Blondel
2011-03-30 03:17:59 UTC
Post by Vlad Niculae
I will also participate via IRC. My main tasks are in matrix
factorization: finishing the tests and writing examples for NMF, maybe
begin work on sparse PCA or kernel PCA or something on the list in my
thread. If not, I'll help the other tasks as much as I can.
I have an implementation of Kernel PCA (just need to polish the
documentation). I will try to put together a pull request so you can
review it during the sprint.

Mathieu
h***@gmail.com
2011-03-29 18:14:16 UTC
Hi

I see from [1] that one of the planned features for scikits.learn is
graphical models. I'm curious what the status regarding this feature
is?

During my master's I helped author a GM Python toolkit [2]. The code
never really made it out of 'research mode' and still lacks a bit
w.r.t. documentation, tests, optimization, etc. However, it does work.
Some parts are essentially ports of the Bayes Nets toolbox for Matlab
by K.Murphy. Perhaps there is something that could be re-used in the
SciKit.

In any case I would be interested in contributing to implementing this feature.

Regards,
Helge

[1] http://scikit-learn.sourceforge.net/
[2] http://dip.sun.ac.za/~hreikeras/grmpy_doc/

Gael Varoquaux
2011-03-29 20:05:09 UTC
Post by h***@gmail.com
I see from [1] that one of the planned features for scikits.learn is
graphical models. I'm curious what the status regarding this feature
is?
Not much is done :). I work with Gaussian graphical models for my own
research (for instance
http://hal.inria.fr/inria-00512451/PDF/paper.pdf), and regularized
covariance learning is of general interest in machine learning. In this
regard, it is clear that some Gaussian graphical model code will land in
the scikit at some point (for instance, I'd love to find the time to
implement
http://books.nips.cc/papers/files/nips23/NIPS2010_0109.pdf ).
Post by h***@gmail.com
During my master's I helped author a GM Python toolkit [2]. The code
never really made it out of 'research mode' and still lacks a bit
w.r.t. documentation, tests, optimization, etc. However, it does work.
Some parts are essentially ports of the Bayes Nets toolbox for Matlab
by K.Murphy. Perhaps there is something that could be re-used in the
SciKit.
There would definitely be some interest. Out of curiosity, can you
implement a simple interface with learners (we call them estimators)
implementing a fit(X) method, and with models described by a small set of
general objects? For instance, we like to use sparse matrices to
describe a graph. The reason I am asking is that if you need a
different API and layout than the rest of the scikit, it might make more
sense to keep the packages separate. On the other hand, if you can 'slot
in', it would be great, as one of the goals of the scikit is to make it
easy to combine and compare methods.
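
To make the convention concrete, here is a toy sketch in plain Python. It
is not scikit code: ThresholdGraphEstimator, its threshold parameter and
the edges_ attribute are hypothetical names, the correlation-thresholding
rule is chosen only for brevity, and a dict of retained edges stands in
for a real sparse matrix.

```python
class ThresholdGraphEstimator(object):
    """Toy estimator: link two variables when their |correlation| is high.

    Hypothetical example of the fit(X) convention, not part of the scikit.
    """

    def __init__(self, threshold=0.8):
        self.threshold = threshold

    def fit(self, X):
        # X is a list of samples, each a list of feature values.
        n_samples = len(X)
        n_features = len(X[0])
        means = [sum(row[j] for row in X) / float(n_samples)
                 for j in range(n_features)]
        centered = [[row[j] - means[j] for j in range(n_features)]
                    for row in X]

        def dot(a, b):
            return sum(row[a] * row[b] for row in centered)

        # Sparse description of the graph: only retained edges (a, b)
        # with a < b are stored, mapped to their correlation.
        self.edges_ = {}
        for a in range(n_features):
            for b in range(a + 1, n_features):
                norm = (dot(a, a) * dot(b, b)) ** 0.5
                if norm == 0:
                    continue  # constant column, correlation undefined
                corr = dot(a, b) / norm
                if abs(corr) >= self.threshold:
                    self.edges_[(a, b)] = corr
        return self


X = [[1.0, 2.0, 5.0],
     [2.0, 4.1, 1.0],
     [3.0, 6.0, 4.0],
     [4.0, 7.9, 2.0]]
model = ThresholdGraphEstimator(threshold=0.9).fit(X)
print(sorted(model.edges_))
```

Because everything is learned through fit(X) and the result is a generic
edge structure, such an object could be swapped for another graph learner
without changing the calling code.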

Thanks for your input,

Gael
h***@gmail.com
2011-03-29 21:03:20 UTC
Hi Gael

Thanks for your reply.

On Tue, Mar 29, 2011 at 10:05 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by h***@gmail.com
During my master's I helped author a GM Python toolkit [2]. The code
never really made it out of 'research mode' and still lacks a bit
w.r.t. documentation, tests, optimization, etc. However, it does work.
Some parts are essentially ports of the Bayes Nets toolbox for Matlab
by K.Murphy. Perhaps there is something that could be re-used in the
SciKit.
There would definitely be some interest. Out of curiosity, can you
implement a simple interface with learners (we call them estimators)
implementing a fit(X) method, and with models described by a small set of
general objects? For instance, we like to use sparse matrices to
describe a graph. The reason I am asking is that if you need a
different API and layout than the rest of the scikit, it might make more
sense to keep the packages separate. On the other hand, if you can 'slot
in', it would be great, as one of the goals of the scikit is to make it
easy to combine and compare methods.
It would certainly be possible to do fit(X) etc. In fact, this is
pretty much how the interface behaves already. However, the structure
of X is not a simple numarray. E.g. we use an empty list to indicate
latent nodes and a non-empty list for evidence nodes. For the latter
the number of list elements corresponds to the dimensionality of the
random variable associated with the observed node. We then nest these
lists into a list of observations indexed by the nodes. In the common
case that you have multiple observations per node, such a list is
created for each observation, and these are then joined into a new list
that eventually becomes the training data X. I.e.

X[i][j][k]

is the i-th observation of the k-th dimension of the j-th node. The
interface is fairly flexible in how it deals with missing data. If the
node is latent then

X[i][j] = [].
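
As a small illustration of this layout (the values and the helper names
is_latent and node_dimensions are made up for the example and are not
part of the toolkit):

```python
# Hypothetical toy data: 2 observations of a 3-node model.
# Node 0 is observed and 2-dimensional, node 1 is latent,
# node 2 is observed and 1-dimensional.
obs_0 = [[0.5, 1.2], [], [3]]
obs_1 = [[0.1, 0.9], [], [1]]

# Training data: one nested list per observation.
X = [obs_0, obs_1]


def is_latent(X, i, j):
    """True if node j carries no evidence in observation i."""
    return len(X[i][j]) == 0


def node_dimensions(X, i=0):
    """Dimensionality of each node in observation i (0 for latent nodes)."""
    return [len(node) for node in X[i]]


print(X[1][0][1])          # observation 1, node 0, dimension 1
print(is_latent(X, 0, 1))  # node 1 is latent in observation 0
print(node_dimensions(X))
```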

We already have the option of sparse adjacency matrices.

Note that the current implementation only deals with nodes that can be
represented as finite probability tables, i.e. discrete or observed
continuous variables. We cannot do inference for latent continuous
variables.