Discussion:
GSoC
(too old to reply)
Arnaud Joly
2014-03-10 09:46:11 UTC
Permalink
Hi,

Anything concerning the GSOC should pass by the scikit-learn
mailing list.

Thanks for your interest in the subject. If you intend to apply for a GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
and start contributing to scikit-learn.

Right now, three people have shown interest for this topic: Maheshakya (who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a given subject, only the best proposal will
be selected.

Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for which you are assigned as mentor: "Add scipy.sparse matrix support to the Decision Tree".
I am considering applying for this project as I have already faced the necessity of sparse matrices in sklearn and the outcome was not totally satisfactory.
Right now I am participating in a kaggle contest (http://www.kaggle.com/c/lshtc/) and facing several difficulties even with algorithms that already support spase matrices (even simple algorithms such as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and related algorithms such as random forest and extra trees) and, if the time constraint allows, improve as much as possible the general support for sparse matrices.
With this email, I want to show my interest and ask you if there are already any candidates for this place.
Best regards,
--
João
vamsi kaushik
2014-03-10 21:43:56 UTC
Permalink
hi Arnaud,

Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon

cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for a
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014<https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2014>
and start contributing to scikit-learn.
Right now, three people have shown interest for this topic: Maheshakya (who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a given
subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for
which you are assigned as mentor: "Add scipy.sparse matrix support to the
Decision Tree".
I am considering applying for this project as I have already faced the
necessity of sparse matrices in sklearn and the outcome was not totally
satisfactory.
Right now I am participating in a kaggle contest (
http://www.kaggle.com/c/lshtc/) and facing several difficulties even with
algorithms that already support spase matrices (even simple algorithms such
as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and
related algorithms such as random forest and extra trees) and, if the time
constraint allows, improve as much as possible the general support for
sparse matrices.
With this email, I want to show my interest and ask you if there are
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
vamsi kaushik
2014-03-10 22:51:25 UTC
Permalink
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).

And i have already contributed a little.
Please take a look at :

https://github.com/scikit-learn/scikit-learn/pull/2905

and also:
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727

cheers,
kaushik


On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for a
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014<https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2014>
and start contributing to scikit-learn.
Right now, three people have shown interest for this topic: Maheshakya (who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a given
subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for
which you are assigned as mentor: "Add scipy.sparse matrix support to the
Decision Tree".
I am considering applying for this project as I have already faced the
necessity of sparse matrices in sklearn and the outcome was not totally
satisfactory.
Right now I am participating in a kaggle contest (
http://www.kaggle.com/c/lshtc/) and facing several difficulties even
with algorithms that already support spase matrices (even simple algorithms
such as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and
related algorithms such as random forest and extra trees) and, if the time
constraint allows, improve as much as possible the general support for
sparse matrices.
With this email, I want to show my interest and ask you if there are
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Arnaud Joly
2014-03-11 07:42:34 UTC
Permalink
Thanks for your contribution.

Keep up!

Arnaud
Post by vamsi kaushik
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for a GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
and start contributing to scikit-learn.
Right now, three people have shown interest for this topic: Maheshakya (who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for which you are assigned as mentor: "Add scipy.sparse matrix support to the Decision Tree".
I am considering applying for this project as I have already faced the necessity of sparse matrices in sklearn and the outcome was not totally satisfactory.
Right now I am participating in a kaggle contest (http://www.kaggle.com/c/lshtc/) and facing several difficulties even with algorithms that already support spase matrices (even simple algorithms such as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and related algorithms such as random forest and extra trees) and, if the time constraint allows, improve as much as possible the general support for sparse matrices.
With this email, I want to show my interest and ask you if there are already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
vamsi kaushik
2014-03-11 22:07:45 UTC
Permalink
hi Joly,

I actually have a few questions:

Firstly, regarding the implementation of sparse functions. _tree.pxy is the
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree. The basic
implementation that i have in mind is that we duplicate the splitter
classes and The tree architecture itself. At each leaf node we might have a
sparse matrix representing the output data and at every non-leaf node,
splitting should be done on the sparse matrix at that node and produce few
other nodes with sparse output. So the project boils down to implementing
good access and construction methods for sparse matrices in cython. I just
want to know if this is a possible implementation criterion ?. And i would
also like to know the implementation plan that you have in mind ?

Secondly, In my previous mail i have shown you my contributions. I would
just like to know if the patches are upto the mark for a selection. If they
are, i will turn my attention to the DT sparse problem, else i will work on
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik <
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for a
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014<https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2014>
and start contributing to scikit-learn.
Right now, three people have shown interest for this topic: Maheshakya (who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a given
subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for
which you are assigned as mentor: "Add scipy.sparse matrix support to the
Decision Tree".
I am considering applying for this project as I have already faced the
necessity of sparse matrices in sklearn and the outcome was not totally
satisfactory.
Right now I am participating in a kaggle contest (
http://www.kaggle.com/c/lshtc/) and facing several difficulties even
with algorithms that already support spase matrices (even simple algorithms
such as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and
related algorithms such as random forest and extra trees) and, if the time
constraint allows, improve as much as possible the general support for
sparse matrices.
With this email, I want to show my interest and ask you if there are
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Felipe Eltermann
2014-03-12 12:08:38 UTC
Permalink
Hello Vamsi,

"Firstly, regarding the implementation of sparse functions. _tree.pxy is
the back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.

"The basic implementation that i have in mind is that we duplicate the
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and sparse
matrices.
I started a PR that is already in progress:
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)




On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions. _tree.pxy is
the back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree. The basic
implementation that i have in mind is that we duplicate the splitter
classes and The tree architecture itself. At each leaf node we might have a
sparse matrix representing the output data and at every non-leaf node,
splitting should be done on the sparse matrix at that node and produce few
other nodes with sparse output. So the project boils down to implementing
good access and construction methods for sparse matrices in cython. I just
want to know if this is a possible implementation criterion ?. And i would
also like to know the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I would
just like to know if the patches are upto the mark for a selection. If they
are, i will turn my attention to the DT sparse problem, else i will work on
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik <
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for a
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014<https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2014>
and start contributing to scikit-learn.
Right now, three people have shown interest for this topic: Maheshakya (who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for
which you are assigned as mentor: "Add scipy.sparse matrix support to the
Decision Tree".
I am considering applying for this project as I have already faced the
necessity of sparse matrices in sklearn and the outcome was not totally
satisfactory.
Right now I am participating in a kaggle contest (
http://www.kaggle.com/c/lshtc/) and facing several difficulties even
with algorithms that already support spase matrices (even simple algorithms
such as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and
related algorithms such as random forest and extra trees) and, if the time
constraint allows, improve as much as possible the general support for
sparse matrices.
With this email, I want to show my interest and ask you if there are
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Gilles Louppe
2014-03-12 12:44:30 UTC
Permalink
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions. _tree.pxy is the
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate the
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and sparse
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.

In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions. _tree.pxy is
the back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree. The basic implementation
that i have in mind is that we duplicate the splitter classes and The tree
architecture itself. At each leaf node we might have a sparse matrix
representing the output data and at every non-leaf node, splitting should be
done on the sparse matrix at that node and produce few other nodes with
sparse output. So the project boils down to implementing good access and
construction methods for sparse matrices in cython. I just want to know if
this is a possible implementation criterion ?. And i would also like to know
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I would
just like to know if the patches are upto the mark for a selection. If they
are, i will turn my attention to the DT sparse problem, else i will work on
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for a
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
and start contributing to scikit-learn.
Right now, three people have shown interest for this topic: Maheshakya (who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for
which you are assigned as mentor: "Add scipy.sparse matrix support to the
Decision Tree".
I am considering applying for this project as I have already faced the
necessity of sparse matrices in sklearn and the outcome was not totally
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several difficulties even with
algorithms that already support spase matrices (even simple algorithms such
as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and
related algorithms such as random forest and extra trees) and, if the time
constraint allows, improve as much as possible the general support for
sparse matrices.
With this email, I want to show my interest and ask you if there are
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Arnaud Joly
2014-03-12 14:59:02 UTC
Permalink
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity support to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.

The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.

Best,
Arnaud
Post by Gilles Louppe
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions. _tree.pxy is the
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate the
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and sparse
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions. _tree.pxy is
the back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree. The basic implementation
that i have in mind is that we duplicate the splitter classes and The tree
architecture itself. At each leaf node we might have a sparse matrix
representing the output data and at every non-leaf node, splitting should be
done on the sparse matrix at that node and produce few other nodes with
sparse output. So the project boils down to implementing good access and
construction methods for sparse matrices in cython. I just want to know if
this is a possible implementation criterion ?. And i would also like to know
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I would
just like to know if the patches are upto the mark for a selection. If they
are, i will turn my attention to the DT sparse problem, else i will work on
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for a
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
and start contributing to scikit-learn.
Right now, three people have shown interest for this topic: Maheshakya
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for
which you are assigned as mentor: "Add scipy.sparse matrix support to the
Decision Tree".
I am considering applying for this project as I have already faced the
necessity of sparse matrices in sklearn and the outcome was not totally
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several difficulties even with
algorithms that already support spase matrices (even simple algorithms such
as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and
related algorithms such as random forest and extra trees) and, if the time
constraint allows, improve as much as possible the general support for
sparse matrices.
With this email, I want to show my interest and ask you if there are
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
vamsi kaushik
2014-03-16 01:49:11 UTC
Permalink
Hi arnaud, Eltermann

Thanks for the reply. Firstly I am a noob here so excuse me for any stupid
questions. Eltermann Just my 2 cents, I felt your approach(in BestSplitter
) somewhat inefficient when you are calling the get_sparse_item. Firstly
even though we have indices of all the non zero items, it is being called
node_samples times, iterating over nnz times. An easier way of doing this
would be to sort the samples in sparse matrix along features and then
extracting if necessary. because the whole point in using Xf is to sort
them which becomes difficult when they are in an n-d array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my point
is when we are adding support for sparse matrices, why not exploit their
structure as much as we can. Arnaud is this feasible ? Eltermann anything
wrong in my thinking ?


cheers,
kaushik varanasi
Post by Arnaud Joly
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity support to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
Post by Gilles Louppe
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions. _tree.pxy
is the
Post by Gilles Louppe
Post by Felipe Eltermann
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate the
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and sparse
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions. _tree.pxy is
the back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree. The basic
implementation
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
that i have in mind is that we duplicate the splitter classes and The
tree
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
architecture itself. At each leaf node we might have a sparse matrix
representing the output data and at every non-leaf node, splitting
should be
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
done on the sparse matrix at that node and produce few other nodes with
sparse output. So the project boils down to implementing good access
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
construction methods for sparse matrices in cython. I just want to
know if
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
this is a possible implementation criterion ?. And i would also like
to know
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I
would
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
just like to know if the patches are upto the mark for a selection. If
they
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
are, i will turn my attention to the DT sparse problem, else i will
work on
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for
a
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
and start contributing to scikit-learn.
Maheshakya
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one
for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
which you are assigned as mentor: "Add scipy.sparse matrix support
to the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Decision Tree".
I am considering applying for this project as I have already faced
the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
necessity of sparse matrices in sklearn and the outcome was not
totally
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several difficulties
even with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
algorithms that already support spase matrices (even simple
algorithms such
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT
(and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
related algorithms such as random forest and extra trees) and, if
the time
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
constraint allows, improve as much as possible the general support
for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
sparse matrices.
With this email, I want to show my interest and ask you if there are
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
Post by Felipe Eltermann
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
vamsi kaushik
2014-03-16 13:35:14 UTC
Permalink
Hi,
and i suppose we have to change the bagging models as well ?


On Sun, Mar 16, 2014 at 7:19 AM, vamsi kaushik
Post by vamsi kaushik
Hi arnaud, Eltermann
Thanks for the reply. Firstly I am a noob here so excuse me for any stupid
questions. Eltermann Just my 2 cents, I felt your approach(in BestSplitter
) somewhat inefficient when you are calling the get_sparse_item. Firstly
even though we have indices of all the non zero items, it is being called
node_samples times, iterating over nnz times. An easier way of doing this
would be to sort the samples in sparse matrix along features and then
extracting if necessary. because the whole point in using Xf is to sort
them which becomes difficult when they are in an n-d array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my point
is when we are adding support for sparse matrices, why not exploit their
structure as much as we can. Arnaud is this feasible ? Eltermann anything
wrong in my thinking ?
cheers,
kaushik varanasi
Post by Arnaud Joly
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity support to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
Post by Gilles Louppe
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions. _tree.pxy
is the
Post by Gilles Louppe
Post by Felipe Eltermann
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate the
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and sparse
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions. _tree.pxy
is
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the back end cython code to handle the operations Splitting,
Evaluating
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
impurities at nodes and then constructing the tree. The basic
implementation
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
that i have in mind is that we duplicate the splitter classes and The
tree
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
architecture itself. At each leaf node we might have a sparse matrix
representing the output data and at every non-leaf node, splitting
should be
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
done on the sparse matrix at that node and produce few other nodes
with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
sparse output. So the project boils down to implementing good access
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
construction methods for sparse matrices in cython. I just want to
know if
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
this is a possible implementation criterion ?. And i would also like
to know
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I
would
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
just like to know if the patches are upto the mark for a selection.
If they
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
are, i will turn my attention to the DT sparse problem, else i will
work on
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply
for a
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
and start contributing to scikit-learn.
Maheshakya
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one
for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
which you are assigned as mentor: "Add scipy.sparse matrix support
to the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Decision Tree".
I am considering applying for this project as I have already faced
the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
necessity of sparse matrices in sklearn and the outcome was not
totally
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several difficulties
even with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
algorithms that already support spase matrices (even simple
algorithms such
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT
(and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
related algorithms such as random forest and extra trees) and, if
the time
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
constraint allows, improve as much as possible the general support
for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
sparse matrices.
With this email, I want to show my interest and ask you if there
are
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
Post by Felipe Eltermann
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Arnaud Joly
2014-03-17 13:19:59 UTC
Permalink
Hi,


The support for sparse matrices should exploit as much as possible the sparsity structure of the matrix
without blowing up memory as to obtain optimal performance. Note that the choice of the best sparse matrix
representation is still open (csc, csr, coo, ... matrix?).

To my knowldege, there is one implementation of sparse decision tree
https://github.com/fest/fest. But I don’t think that the license is BSD compatible.

If input sparsity is added, adaboost, GBRT, bagging and random
forest should support sparse input data.

Best,
Arnaud
Post by vamsi kaushik
Hi arnaud, Eltermann
Thanks for the reply. Firstly I am a noob here so excuse me for any stupid questions. Eltermann Just my 2 cents, I felt your approach(in BestSplitter ) somewhat inefficient when you are calling the get_sparse_item. Firstly even though we have indices of all the non zero items, it is being called node_samples times, iterating over nnz times. An easier way of doing this would be to sort the samples in sparse matrix along features and then extracting if necessary. because the whole point in using Xf is to sort them which becomes difficult when they are in an n-d array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my point is when we are adding support for sparse matrices, why not exploit their structure as much as we can. Arnaud is this feasible ? Eltermann anything wrong in my thinking ?
cheers,
kaushik varanasi
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity support to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
Post by Gilles Louppe
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions. _tree.pxy is the
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate the
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and sparse
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions. _tree.pxy is
the back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree. The basic implementation
that i have in mind is that we duplicate the splitter classes and The tree
architecture itself. At each leaf node we might have a sparse matrix
representing the output data and at every non-leaf node, splitting should be
done on the sparse matrix at that node and produce few other nodes with
sparse output. So the project boils down to implementing good access and
construction methods for sparse matrices in cython. I just want to know if
this is a possible implementation criterion ?. And i would also like to know
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I would
just like to know if the patches are upto the mark for a selection. If they
are, i will turn my attention to the DT sparse problem, else i will work on
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for a
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
and start contributing to scikit-learn.
Right now, three people have shown interest for this topic: Maheshakya
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for
which you are assigned as mentor: "Add scipy.sparse matrix support to the
Decision Tree".
I am considering applying for this project as I have already faced the
necessity of sparse matrices in sklearn and the outcome was not totally
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several difficulties even with
algorithms that already support spase matrices (even simple algorithms such
as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and
related algorithms such as random forest and extra trees) and, if the time
constraint allows, improve as much as possible the general support for
sparse matrices.
With this email, I want to show my interest and ask you if there are
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Vlad Niculae
2014-03-17 13:27:46 UTC
Permalink
This program is granted free of charge for research and education
purposes. However you must obtain a license from the author to use it
for commercial purposes.
Definitely FEST is not BSD compatible :(

Vlad
Hi,
The support for sparse matrices should exploit as much as possible the
sparsity structure of the matrix
without blowing up memory as to obtain optimal performance. Note that
the choice of the best sparse matrix
representation is still open (csc, csr, coo, ... matrix?).
To my knowldege, there is one implementation of sparse decision tree
https://github.com/fest/fest. But I don’t think that the license is BSD
compatible.
If input sparsity is added, adaboost, GBRT, bagging and random
forest should support sparse input data.
Best,
Arnaud
Post by vamsi kaushik
Hi arnaud, Eltermann
Thanks for the reply. Firstly I am a noob here so excuse me for any
stupid questions. Eltermann Just my 2 cents, I felt your approach(in
BestSplitter ) somewhat inefficient when you are calling the
get_sparse_item. Firstly even though we have indices of all the non
zero items, it is being called node_samples times, iterating over nnz
times. An easier way of doing this would be to sort the samples in
sparse matrix along features and then extracting if necessary. because
the whole point in using Xf is to sort them which becomes difficult
when they are in an n-d array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my
point is when we are adding support for sparse matrices, why not
exploit their structure as much as we can. Arnaud is this feasible ?
Eltermann anything wrong in my thinking ?
cheers,
kaushik varanasi
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity
support to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
On 12 March 2014 13:08, Felipe Eltermann
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions.
_tree.pxy is the
Post by Felipe Eltermann
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we
duplicate the
Post by Felipe Eltermann
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It
will be
Post by Felipe Eltermann
simpler if the same splitter class should handle both dense and
sparse
Post by Felipe Eltermann
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on
the open
Post by Felipe Eltermann
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both
implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions.
_tree.pxy is
Post by Felipe Eltermann
Post by vamsi kaushik
the back end cython code to handle the operations Splitting,
Evaluating
Post by Felipe Eltermann
Post by vamsi kaushik
impurities at nodes and then constructing the tree. The basic
implementation
Post by Felipe Eltermann
Post by vamsi kaushik
that i have in mind is that we duplicate the splitter classes
and The tree
Post by Felipe Eltermann
Post by vamsi kaushik
architecture itself. At each leaf node we might have a sparse
matrix
Post by Felipe Eltermann
Post by vamsi kaushik
representing the output data and at every non-leaf node,
splitting should be
Post by Felipe Eltermann
Post by vamsi kaushik
done on the sparse matrix at that node and produce few other
nodes with
Post by Felipe Eltermann
Post by vamsi kaushik
sparse output. So the project boils down to implementing good
access and
Post by Felipe Eltermann
Post by vamsi kaushik
construction methods for sparse matrices in cython. I just
want to know if
Post by Felipe Eltermann
Post by vamsi kaushik
this is a possible implementation criterion ?. And i would
also like to know
Post by Felipe Eltermann
Post by vamsi kaushik
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my
contributions. I would
Post by Felipe Eltermann
Post by vamsi kaushik
just like to know if the patches are upto the mark for a
selection. If they
Post by Felipe Eltermann
Post by vamsi kaushik
are, i will turn my attention to the DT sparse problem, else i
will work on
Post by Felipe Eltermann
Post by vamsi kaushik
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
On 10 Mar 2014, at 23:51, vamsi kaushik
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
On Mon, Mar 10, 2014 at 3:16 PM, Arnaud Joly
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to
apply for a
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
and start contributing to scikit-learn.
Maheshakya
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you.
Several
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
candidates may apply for the
same subject. However it is likely that if a GSOC is
awarded for a
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an
interesting one for
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
which you are assigned as mentor: "Add scipy.sparse matrix
support to the
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Decision Tree".
I am considering applying for this project as I have
already faced the
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
necessity of sparse matrices in sklearn and the outcome was
not totally
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several
difficulties even with
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
algorithms that already support spase matrices (even simple
algorithms such
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
as NB). In general I get some memory error and sometimes
segfaults.
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
I would be happy if I could implement the necessary support
for DT (and
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
related algorithms such as random forest and extra trees)
and, if the time
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
constraint allows, improve as much as possible the general
support for
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
sparse matrices.
With this email, I want to show my interest and ask you if
there are
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph
databases and
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free
book today!
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph
databases and
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph
databases and
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Felipe Eltermann
Post by vamsi kaushik
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph
databases and their
Post by Felipe Eltermann
Post by vamsi kaushik
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
Post by Felipe Eltermann
Post by vamsi kaushik
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Felipe Eltermann
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph
databases and their
Post by Felipe Eltermann
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
vamsi kaushik
2014-03-17 13:56:59 UTC
Permalink
hi arnaud,

yeah i have already gone through that implementation. I feel at least we
should have separate splitter classes for the node splitting to be more
efficient and flexible.
And as of the choice of matrix, I don't have one in mind but mostly the
elements are sorted along the feature columns. so I would say, csr maybe.
But there can be
other efficient formats too.

And as of the input matrices, we can follow the same procedure followed in
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx

so as to not break the user code. And moreover i have read somewhere on the
internet about sparse coding format, is it possible to use a more efficient
sparse structure within the tree algorithm. But the input types remain the
same, they get converted to another sparse structure once they are passed
to the tree, since we are working with non-zero element pointers, this
should not be memory/time inefficient. But my question is, is this
acceptable?
Post by Arnaud Joly
Hi,
The support for sparse matrices should exploit as much as possible the
sparsity structure of the matrix
without blowing up memory as to obtain optimal performance. Note that the
choice of the best sparse matrix
representation is still open (csc, csr, coo, ... matrix?).
To my knowldege, there is one implementation of sparse decision tree
https://github.com/fest/fest. But I don't think that the license is BSD
compatible.
If input sparsity is added, adaboost, GBRT, bagging and random
forest should support sparse input data.
Best,
Arnaud
Hi arnaud, Eltermann
Thanks for the reply. Firstly I am a noob here so excuse me for any stupid
questions. Eltermann Just my 2 cents, I felt your approach(in BestSplitter
) somewhat inefficient when you are calling the get_sparse_item. Firstly
even though we have indices of all the non zero items, it is being called
node_samples times, iterating over nnz times. An easier way of doing this
would be to sort the samples in sparse matrix along features and then
extracting if necessary. because the whole point in using Xf is to sort
them which becomes difficult when they are in an n-d array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my point
is when we are adding support for sparse matrices, why not exploit their
structure as much as we can. Arnaud is this feasible ? Eltermann anything
wrong in my thinking ?
cheers,
kaushik varanasi
Post by Arnaud Joly
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity support to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
Post by Gilles Louppe
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions. _tree.pxy
is the
Post by Gilles Louppe
Post by Felipe Eltermann
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate the
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and sparse
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions. _tree.pxy
is
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the back end cython code to handle the operations Splitting,
Evaluating
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
impurities at nodes and then constructing the tree. The basic
implementation
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
that i have in mind is that we duplicate the splitter classes and The
tree
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
architecture itself. At each leaf node we might have a sparse matrix
representing the output data and at every non-leaf node, splitting
should be
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
done on the sparse matrix at that node and produce few other nodes
with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
sparse output. So the project boils down to implementing good access
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
construction methods for sparse matrices in cython. I just want to
know if
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
this is a possible implementation criterion ?. And i would also like
to know
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I
would
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
just like to know if the patches are upto the mark for a selection.
If they
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
are, i will turn my attention to the DT sparse problem, else i will
work on
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply
for a
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
and start contributing to scikit-learn.
Maheshakya
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one
for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
which you are assigned as mentor: "Add scipy.sparse matrix support
to the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Decision Tree".
I am considering applying for this project as I have already faced
the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
necessity of sparse matrices in sklearn and the outcome was not
totally
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several difficulties
even with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
algorithms that already support spase matrices (even simple
algorithms such
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT
(and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
related algorithms such as random forest and extra trees) and, if
the time
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
constraint allows, improve as much as possible the general support
for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
sparse matrices.
With this email, I want to show my interest and ask you if there
are
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
Post by Felipe Eltermann
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
vamsi kaushik
2014-03-17 14:19:46 UTC
Permalink
Hi,
And of the fest implementation i would say we just rewrite the algorithm in
cython instead of writing wrappers for it, this dosn't need licence and is
readable for other contributors and also easily changeable in future. This
boils down to a few 500 lines of code in cython.


On Mon, Mar 17, 2014 at 7:26 PM, vamsi kaushik
Post by vamsi kaushik
hi arnaud,
yeah i have already gone through that implementation. I feel at least we
should have separate splitter classes for the node splitting to be more
efficient and flexible.
And as of the choice of matrix, I don't have one in mind but mostly the
elements are sorted along the feature columns. so I would say, csr maybe.
But there can be
other efficient formats too.
And as of the input matrices, we can follow the same procedure followed in
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx
so as to not break the user code. And moreover i have read somewhere on
the internet about sparse coding format, is it possible to use a more
efficient sparse structure within the tree algorithm. But the input types
remain the same, they get converted to another sparse structure once they
are passed to the tree, since we are working with non-zero element
pointers, this should not be memory/time inefficient. But my question is,
is this acceptable?
Post by Arnaud Joly
Hi,
The support for sparse matrices should exploit as much as possible the
sparsity structure of the matrix
without blowing up memory as to obtain optimal performance. Note that the
choice of the best sparse matrix
representation is still open (csc, csr, coo, ... matrix?).
To my knowldege, there is one implementation of sparse decision tree
https://github.com/fest/fest. But I don't think that the license is BSD
compatible.
If input sparsity is added, adaboost, GBRT, bagging and random
forest should support sparse input data.
Best,
Arnaud
Hi arnaud, Eltermann
Thanks for the reply. Firstly I am a noob here so excuse me for any
stupid questions. Eltermann Just my 2 cents, I felt your approach(in
BestSplitter ) somewhat inefficient when you are calling the
get_sparse_item. Firstly even though we have indices of all the non zero
items, it is being called node_samples times, iterating over nnz times. An
easier way of doing this would be to sort the samples in sparse matrix
along features and then extracting if necessary. because the whole point in
using Xf is to sort them which becomes difficult when they are in an n-d
array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my
point is when we are adding support for sparse matrices, why not exploit
their structure as much as we can. Arnaud is this feasible ? Eltermann
anything wrong in my thinking ?
cheers,
kaushik varanasi
Post by Arnaud Joly
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity support to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
Post by Gilles Louppe
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions. _tree.pxy
is the
Post by Gilles Louppe
Post by Felipe Eltermann
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate the
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and sparse
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions. _tree.pxy
is
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the back end cython code to handle the operations Splitting,
Evaluating
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
impurities at nodes and then constructing the tree. The basic
implementation
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
that i have in mind is that we duplicate the splitter classes and
The tree
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
architecture itself. At each leaf node we might have a sparse matrix
representing the output data and at every non-leaf node, splitting
should be
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
done on the sparse matrix at that node and produce few other nodes
with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
sparse output. So the project boils down to implementing good access
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
construction methods for sparse matrices in cython. I just want to
know if
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
this is a possible implementation criterion ?. And i would also like
to know
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I
would
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
just like to know if the patches are upto the mark for a selection.
If they
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
are, i will turn my attention to the DT sparse problem, else i will
work on
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
On 10 Mar 2014, at 23:51, vamsi kaushik <
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply
for a
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
and start contributing to scikit-learn.
Maheshakya
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one
for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
which you are assigned as mentor: "Add scipy.sparse matrix
support to the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Decision Tree".
I am considering applying for this project as I have already
faced the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
necessity of sparse matrices in sklearn and the outcome was not
totally
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several difficulties
even with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
algorithms that already support spase matrices (even simple
algorithms such
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
as NB). In general I get some memory error and sometimes
segfaults.
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
I would be happy if I could implement the necessary support for
DT (and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
related algorithms such as random forest and extra trees) and, if
the time
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
constraint allows, improve as much as possible the general
support for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
sparse matrices.
With this email, I want to show my interest and ask you if there
are
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
Post by Felipe Eltermann
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Arnaud Joly
2014-03-17 14:58:30 UTC
Permalink
Given the license, I don’t think you can read the code to make a new bsd implementation.

Arnaud
Hi,
And of the fest implementation i would say we just rewrite the algorithm in cython instead of writing wrappers for it, this dosn't need licence and is readable for other contributors and also easily changeable in future. This boils down to a few 500 lines of code in cython.
hi arnaud,
yeah i have already gone through that implementation. I feel at least we should have separate splitter classes for the node splitting to be more efficient and flexible.
And as of the choice of matrix, I don't have one in mind but mostly the elements are sorted along the feature columns. so I would say, csr maybe. But there can be
other efficient formats too.
And as of the input matrices, we can follow the same procedure followed in
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx
so as to not break the user code. And moreover i have read somewhere on the internet about sparse coding format, is it possible to use a more efficient sparse structure within the tree algorithm. But the input types remain the same, they get converted to another sparse structure once they are passed to the tree, since we are working with non-zero element pointers, this should not be memory/time inefficient. But my question is, is this acceptable?
Hi,
The support for sparse matrices should exploit as much as possible the sparsity structure of the matrix
without blowing up memory as to obtain optimal performance. Note that the choice of the best sparse matrix
representation is still open (csc, csr, coo, ... matrix?).
To my knowldege, there is one implementation of sparse decision tree
https://github.com/fest/fest. But I don’t think that the license is BSD compatible.
If input sparsity is added, adaboost, GBRT, bagging and random
forest should support sparse input data.
Best,
Arnaud
Post by vamsi kaushik
Hi arnaud, Eltermann
Thanks for the reply. Firstly I am a noob here so excuse me for any stupid questions. Eltermann Just my 2 cents, I felt your approach(in BestSplitter ) somewhat inefficient when you are calling the get_sparse_item. Firstly even though we have indices of all the non zero items, it is being called node_samples times, iterating over nnz times. An easier way of doing this would be to sort the samples in sparse matrix along features and then extracting if necessary. because the whole point in using Xf is to sort them which becomes difficult when they are in an n-d array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my point is when we are adding support for sparse matrices, why not exploit their structure as much as we can. Arnaud is this feasible ? Eltermann anything wrong in my thinking ?
cheers,
kaushik varanasi
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity support to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
Post by Gilles Louppe
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions. _tree.pxy is the
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate the
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and sparse
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions. _tree.pxy is
the back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree. The basic implementation
that i have in mind is that we duplicate the splitter classes and The tree
architecture itself. At each leaf node we might have a sparse matrix
representing the output data and at every non-leaf node, splitting should be
done on the sparse matrix at that node and produce few other nodes with
sparse output. So the project boils down to implementing good access and
construction methods for sparse matrices in cython. I just want to know if
this is a possible implementation criterion ?. And i would also like to know
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I would
just like to know if the patches are upto the mark for a selection. If they
are, i will turn my attention to the DT sparse problem, else i will work on
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply for a
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
and start contributing to scikit-learn.
Right now, three people have shown interest for this topic: Maheshakya
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for a
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting one for
which you are assigned as mentor: "Add scipy.sparse matrix support to the
Decision Tree".
I am considering applying for this project as I have already faced the
necessity of sparse matrices in sklearn and the outcome was not totally
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several difficulties even with
algorithms that already support spase matrices (even simple algorithms such
as NB). In general I get some memory error and sometimes segfaults.
I would be happy if I could implement the necessary support for DT (and
related algorithms such as random forest and extra trees) and, if the time
constraint allows, improve as much as possible the general support for
sparse matrices.
With this email, I want to show my interest and ask you if there are
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
vamsi kaushik
2014-03-18 14:16:13 UTC
Permalink
Hi Arnaud,

I just saw your mail regarding the fest implementation and dropped it from
the proposal.
Im currently working on my proposal and will post it on the wiki by tonight.
The core idea is to alter the _tree.pyx module and then extend sparse
support to ensemble methods and im also new to the community so i couldn't
fix a lot of patches.
I would like to know if there are other important points to mention in the
proposal regarding the project

cheers,
kaushik Varanasi


On Mon, Mar 17, 2014 at 7:49 PM, vamsi kaushik
Post by vamsi kaushik
Hi,
And of the fest implementation i would say we just rewrite the algorithm
in cython instead of writing wrappers for it, this dosn't need licence and
is readable for other contributors and also easily changeable in future.
This boils down to a few 500 lines of code in cython.
On Mon, Mar 17, 2014 at 7:26 PM, vamsi kaushik <
Post by vamsi kaushik
hi arnaud,
yeah i have already gone through that implementation. I feel at least we
should have separate splitter classes for the node splitting to be more
efficient and flexible.
And as of the choice of matrix, I don't have one in mind but mostly the
elements are sorted along the feature columns. so I would say, csr maybe.
But there can be
other efficient formats too.
And as of the input matrices, we can follow the same procedure followed in
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx
so as to not break the user code. And moreover i have read somewhere on
the internet about sparse coding format, is it possible to use a more
efficient sparse structure within the tree algorithm. But the input types
remain the same, they get converted to another sparse structure once they
are passed to the tree, since we are working with non-zero element
pointers, this should not be memory/time inefficient. But my question is,
is this acceptable?
Post by Arnaud Joly
Hi,
The support for sparse matrices should exploit as much as possible the
sparsity structure of the matrix
without blowing up memory as to obtain optimal performance. Note that
the choice of the best sparse matrix
representation is still open (csc, csr, coo, ... matrix?).
To my knowldege, there is one implementation of sparse decision tree
https://github.com/fest/fest. But I don't think that the license is BSD
compatible.
If input sparsity is added, adaboost, GBRT, bagging and random
forest should support sparse input data.
Best,
Arnaud
Hi arnaud, Eltermann
Thanks for the reply. Firstly I am a noob here so excuse me for any
stupid questions. Eltermann Just my 2 cents, I felt your approach(in
BestSplitter ) somewhat inefficient when you are calling the
get_sparse_item. Firstly even though we have indices of all the non zero
items, it is being called node_samples times, iterating over nnz times. An
easier way of doing this would be to sort the samples in sparse matrix
along features and then extracting if necessary. because the whole point in
using Xf is to sort them which becomes difficult when they are in an n-d
array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my
point is when we are adding support for sparse matrices, why not exploit
their structure as much as we can. Arnaud is this feasible ? Eltermann
anything wrong in my thinking ?
cheers,
kaushik varanasi
Post by Arnaud Joly
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity support
to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
Post by Gilles Louppe
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions.
_tree.pxy is the
Post by Gilles Louppe
Post by Felipe Eltermann
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate
the
Post by Gilles Louppe
Post by Felipe Eltermann
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will be
simpler if the same splitter class should handle both dense and
sparse
Post by Gilles Louppe
Post by Felipe Eltermann
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the open
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions.
_tree.pxy is
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the back end cython code to handle the operations Splitting,
Evaluating
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
impurities at nodes and then constructing the tree. The basic
implementation
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
that i have in mind is that we duplicate the splitter classes and
The tree
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
architecture itself. At each leaf node we might have a sparse matrix
representing the output data and at every non-leaf node, splitting
should be
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
done on the sparse matrix at that node and produce few other nodes
with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
sparse output. So the project boils down to implementing good
access and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
construction methods for sparse matrices in cython. I just want to
know if
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
this is a possible implementation criterion ?. And i would also
like to know
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I
would
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
just like to know if the patches are upto the mark for a selection.
If they
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
are, i will turn my attention to the DT sparse problem, else i will
work on
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
On 10 Mar 2014, at 23:51, vamsi kaushik <
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply
for a
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
and start contributing to scikit-learn.
Maheshakya
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded for
a
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting
one for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
which you are assigned as mentor: "Add scipy.sparse matrix
support to the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Decision Tree".
I am considering applying for this project as I have already
faced the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
necessity of sparse matrices in sklearn and the outcome was not
totally
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several
difficulties even with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
algorithms that already support spase matrices (even simple
algorithms such
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
as NB). In general I get some memory error and sometimes
segfaults.
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
I would be happy if I could implement the necessary support for
DT (and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
related algorithms such as random forest and extra trees) and,
if the time
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
constraint allows, improve as much as possible the general
support for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
sparse matrices.
With this email, I want to show my interest and ask you if there
are
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and their
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
Post by Felipe Eltermann
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
vamsi kaushik
2014-03-18 23:21:00 UTC
Permalink
Hi Arnaud,

I am looking forward to include these in my proposal
http://cam.zcu.cz/~brandner/PSfiles/pooch.pdf
http://arxiv.org/pdf/1301.6685v1.pdf

as algorithms for sparse storage within the decision tree. Have your say on
this one

cheers,
kaushik Varanasi


On Tue, Mar 18, 2014 at 7:45 PM, vamsi kaushik
Hi Arnaud,
I just saw your mail regarding the fest implementation and dropped it from
the proposal.
Im currently working on my proposal and will post it on the wiki by tonight.
The core idea is to alter the _tree.pyx module and then extend sparse
support to ensemble methods and im also new to the community so i couldn't
fix a lot of patches.
I would like to know if there are other important points to mention in the
proposal regarding the project
cheers,
kaushik Varanasi
On Mon, Mar 17, 2014 at 7:49 PM, vamsi kaushik <
Post by vamsi kaushik
Hi,
And of the fest implementation i would say we just rewrite the algorithm
in cython instead of writing wrappers for it, this dosn't need licence and
is readable for other contributors and also easily changeable in future.
This boils down to a few 500 lines of code in cython.
On Mon, Mar 17, 2014 at 7:26 PM, vamsi kaushik <
Post by vamsi kaushik
hi arnaud,
yeah i have already gone through that implementation. I feel at least we
should have separate splitter classes for the node splitting to be more
efficient and flexible.
And as of the choice of matrix, I don't have one in mind but mostly the
elements are sorted along the feature columns. so I would say, csr maybe.
But there can be
other efficient formats too.
And as of the input matrices, we can follow the same procedure followed in
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx
so as to not break the user code. And moreover i have read somewhere on
the internet about sparse coding format, is it possible to use a more
efficient sparse structure within the tree algorithm. But the input types
remain the same, they get converted to another sparse structure once they
are passed to the tree, since we are working with non-zero element
pointers, this should not be memory/time inefficient. But my question is,
is this acceptable?
Post by Arnaud Joly
Hi,
The support for sparse matrices should exploit as much as possible the
sparsity structure of the matrix
without blowing up memory as to obtain optimal performance. Note that
the choice of the best sparse matrix
representation is still open (csc, csr, coo, ... matrix?).
To my knowldege, there is one implementation of sparse decision tree
https://github.com/fest/fest. But I don't think that the license is
BSD compatible.
If input sparsity is added, adaboost, GBRT, bagging and random
forest should support sparse input data.
Best,
Arnaud
Hi arnaud, Eltermann
Thanks for the reply. Firstly I am a noob here so excuse me for any
stupid questions. Eltermann Just my 2 cents, I felt your approach(in
BestSplitter ) somewhat inefficient when you are calling the
get_sparse_item. Firstly even though we have indices of all the non zero
items, it is being called node_samples times, iterating over nnz times. An
easier way of doing this would be to sort the samples in sparse matrix
along features and then extracting if necessary. because the whole point in
using Xf is to sort them which becomes difficult when they are in an n-d
array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my
point is when we are adding support for sparse matrices, why not exploit
their structure as much as we can. Arnaud is this feasible ? Eltermann
anything wrong in my thinking ?
cheers,
kaushik varanasi
Post by Arnaud Joly
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity support
to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
Post by Gilles Louppe
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions.
_tree.pxy is the
Post by Gilles Louppe
Post by Felipe Eltermann
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate
the
Post by Gilles Louppe
Post by Felipe Eltermann
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will
be
Post by Gilles Louppe
Post by Felipe Eltermann
simpler if the same splitter class should handle both dense and
sparse
Post by Gilles Louppe
Post by Felipe Eltermann
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the
open
Post by Gilles Louppe
Post by Felipe Eltermann
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions.
_tree.pxy is
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the back end cython code to handle the operations Splitting,
Evaluating
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
impurities at nodes and then constructing the tree. The basic
implementation
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
that i have in mind is that we duplicate the splitter classes and
The tree
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
architecture itself. At each leaf node we might have a sparse
matrix
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
representing the output data and at every non-leaf node, splitting
should be
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
done on the sparse matrix at that node and produce few other nodes
with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
sparse output. So the project boils down to implementing good
access and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
construction methods for sparse matrices in cython. I just want to
know if
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
this is a possible implementation criterion ?. And i would also
like to know
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions. I
would
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
just like to know if the patches are upto the mark for a
selection. If they
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
are, i will turn my attention to the DT sparse problem, else i
will work on
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
On 10 Mar 2014, at 23:51, vamsi kaushik <
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to apply
for a
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
and start contributing to scikit-learn.
Maheshakya
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you. Several
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded
for a
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting
one for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
which you are assigned as mentor: "Add scipy.sparse matrix
support to the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Decision Tree".
I am considering applying for this project as I have already
faced the
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
necessity of sparse matrices in sklearn and the outcome was not
totally
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several
difficulties even with
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
algorithms that already support spase matrices (even simple
algorithms such
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
as NB). In general I get some memory error and sometimes
segfaults.
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
I would be happy if I could implement the necessary support for
DT (and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
related algorithms such as random forest and extra trees) and,
if the time
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
constraint allows, improve as much as possible the general
support for
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
sparse matrices.
With this email, I want to show my interest and ask you if
there are
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph
databases and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and their
Post by Gilles Louppe
Post by Felipe Eltermann
Post by vamsi kaushik
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Post by Felipe Eltermann
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and their
Post by Gilles Louppe
Post by Felipe Eltermann
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Gilles Louppe
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and
their
Post by Gilles Louppe
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
vamsi kaushik
2014-03-19 19:55:37 UTC
Permalink
Hi,

I have written a
proposal<https://github.com/scikit-learn/scikit-learn/wiki/GSoC-2014-Application:-Sparse-matrix-support-for-decision-tree>for
gsoc, can someone please review it .

cheers,
kaushik varanasi



On Wed, Mar 19, 2014 at 4:50 AM, vamsi kaushik
Hi Arnaud,
I am looking forward to include these in my proposal
http://cam.zcu.cz/~brandner/PSfiles/pooch.pdf
http://arxiv.org/pdf/1301.6685v1.pdf
as algorithms for sparse storage within the decision tree. Have your say
on this one
cheers,
kaushik Varanasi
On Tue, Mar 18, 2014 at 7:45 PM, vamsi kaushik <
Hi Arnaud,
I just saw your mail regarding the fest implementation and dropped it
from the proposal.
Im currently working on my proposal and will post it on the wiki by tonight.
The core idea is to alter the _tree.pyx module and then extend sparse
support to ensemble methods and im also new to the community so i couldn't
fix a lot of patches.
I would like to know if there are other important points to mention in
the proposal regarding the project
cheers,
kaushik Varanasi
On Mon, Mar 17, 2014 at 7:49 PM, vamsi kaushik <
Post by vamsi kaushik
Hi,
And of the fest implementation i would say we just rewrite the algorithm
in cython instead of writing wrappers for it, this dosn't need licence and
is readable for other contributors and also easily changeable in future.
This boils down to a few 500 lines of code in cython.
On Mon, Mar 17, 2014 at 7:26 PM, vamsi kaushik <
Post by vamsi kaushik
hi arnaud,
yeah i have already gone through that implementation. I feel at least
we should have separate splitter classes for the node splitting to be more
efficient and flexible.
And as of the choice of matrix, I don't have one in mind but mostly the
elements are sorted along the feature columns. so I would say, csr maybe.
But there can be
other efficient formats too.
And as of the input matrices, we can follow the same procedure followed in
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx
so as to not break the user code. And moreover i have read somewhere on
the internet about sparse coding format, is it possible to use a more
efficient sparse structure within the tree algorithm. But the input types
remain the same, they get converted to another sparse structure once they
are passed to the tree, since we are working with non-zero element
pointers, this should not be memory/time inefficient. But my question is,
is this acceptable?
Post by Arnaud Joly
Hi,
The support for sparse matrices should exploit as much as possible the
sparsity structure of the matrix
without blowing up memory as to obtain optimal performance. Note that
the choice of the best sparse matrix
representation is still open (csc, csr, coo, ... matrix?).
To my knowldege, there is one implementation of sparse decision tree
https://github.com/fest/fest. But I don't think that the license is
BSD compatible.
If input sparsity is added, adaboost, GBRT, bagging and random
forest should support sparse input data.
Best,
Arnaud
Hi arnaud, Eltermann
Thanks for the reply. Firstly I am a noob here so excuse me for any
stupid questions. Eltermann Just my 2 cents, I felt your approach(in
BestSplitter ) somewhat inefficient when you are calling the
get_sparse_item. Firstly even though we have indices of all the non zero
items, it is being called node_samples times, iterating over nnz times. An
easier way of doing this would be to sort the samples in sparse matrix
along features and then extracting if necessary. because the whole point in
using Xf is to sort them which becomes difficult when they are in an n-d
array. like for example
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.nonzero.html#scipy.sparse.csr_matrix.nonzero
gives the non zero elements. My solution might not be perfect but my
point is when we are adding support for sparse matrices, why not exploit
their structure as much as we can. Arnaud is this feasible ? Eltermann
anything wrong in my thinking ?
cheers,
kaushik varanasi
Post by Arnaud Joly
For the number of contributions, I would advise you to do as many
as possible to get familiar with the codebase and scikit-learn practices.
Of course starting to think and to work on how to add sparsity
support to decision tree
is important to grasp the challenges and write a convincing proposal.
Thus, the right proportion is difficult to suggest.
The goal of this GSOC is to have an efficient and a maintenable implementation.
The overall project plan should be dictated by this goal.
Best,
Arnaud
On 12 March 2014 13:08, Felipe Eltermann <
Post by Felipe Eltermann
Hello Vamsi,
"Firstly, regarding the implementation of sparse functions.
_tree.pxy is the
Post by Felipe Eltermann
back end cython code to handle the operations Splitting, Evaluating
impurities at nodes and then constructing the tree."
That's right.
"The basic implementation that i have in mind is that we duplicate
the
Post by Felipe Eltermann
splitter classes and The tree architecture itself."
Duplicating the code doesn't seem to be the best approach. It will
be
Post by Felipe Eltermann
simpler if the same splitter class should handle both dense and
sparse
Post by Felipe Eltermann
matrices.
https://github.com/scikit-learn/scikit-learn/pull/2848
If you want to help, you can read up the commits and help on the
open
Post by Felipe Eltermann
issues.
(however, I don't think this apply as a GSoC project though)
Just my 2 cents, but I am not sure having both dense and sparse
support in the same codebase is the best choice. Both
implementations
should in my opinion be separated, with the common code factored out
to avoid code dupplication.
In particular, having an *efficient* sparse splitter will require
specific changes to the splitter. I am really not sure having this
code intermingled with the dense support is the best option.
Post by Felipe Eltermann
On Tue, Mar 11, 2014 at 7:07 PM, vamsi kaushik <
Post by vamsi kaushik
hi Joly,
Firstly, regarding the implementation of sparse functions.
_tree.pxy is
Post by Felipe Eltermann
Post by vamsi kaushik
the back end cython code to handle the operations Splitting,
Evaluating
Post by Felipe Eltermann
Post by vamsi kaushik
impurities at nodes and then constructing the tree. The basic
implementation
Post by Felipe Eltermann
Post by vamsi kaushik
that i have in mind is that we duplicate the splitter classes and
The tree
Post by Felipe Eltermann
Post by vamsi kaushik
architecture itself. At each leaf node we might have a sparse
matrix
Post by Felipe Eltermann
Post by vamsi kaushik
representing the output data and at every non-leaf node,
splitting should be
Post by Felipe Eltermann
Post by vamsi kaushik
done on the sparse matrix at that node and produce few other
nodes with
Post by Felipe Eltermann
Post by vamsi kaushik
sparse output. So the project boils down to implementing good
access and
Post by Felipe Eltermann
Post by vamsi kaushik
construction methods for sparse matrices in cython. I just want
to know if
Post by Felipe Eltermann
Post by vamsi kaushik
this is a possible implementation criterion ?. And i would also
like to know
Post by Felipe Eltermann
Post by vamsi kaushik
the implementation plan that you have in mind ?
Secondly, In my previous mail i have shown you my contributions.
I would
Post by Felipe Eltermann
Post by vamsi kaushik
just like to know if the patches are upto the mark for a
selection. If they
Post by Felipe Eltermann
Post by vamsi kaushik
are, i will turn my attention to the DT sparse problem, else i
will work on
Post by Felipe Eltermann
Post by vamsi kaushik
the issues a bit more.
Post by Arnaud Joly
Thanks for your contribution.
Keep up!
Arnaud
On 10 Mar 2014, at 23:51, vamsi kaushik <
My name is actually Varanasi Vamsi Kaushik(yeah its pretty long).
And i have already contributed a little.
https://github.com/scikit-learn/scikit-learn/pull/2905
https://github.com/scikit-learn/scikit-learn/pull/2895
https://github.com/scikit-learn/scikit-learn/issues/2727
cheers,
kaushik
On Tue, Mar 11, 2014 at 3:13 AM, vamsi kaushik
Post by vamsi kaushik
hi Arnaud,
Vamsi kaushik is actually me.
Thanks for your reply, i'll get to the issue soon
cheers,
vamsi kaushik
Post by Arnaud Joly
Hi,
Anything concerning the GSOC should pass by the scikit-learn
mailing list.
Thanks for your interest in the subject. If you intend to
apply for a
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
GSOC, I suggest you to read
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-%28GSOC%29-2014
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
and start contributing to scikit-learn.
Maheshakya
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
(who is now
applying for a GSOC about LSH), Vamsi Kaushik and you.
Several
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
candidates may apply for the
same subject. However it is likely that if a GSOC is awarded
for a
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
given subject, only the best proposal will
be selected.
Best regards,
Arnaud
Hi Arnaud,
While browsing the current GSoC projects I saw an interesting
one for
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
which you are assigned as mentor: "Add scipy.sparse matrix
support to the
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Decision Tree".
I am considering applying for this project as I have already
faced the
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
necessity of sparse matrices in sklearn and the outcome was
not totally
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
satisfactory.
Right now I am participating in a kaggle contest
(http://www.kaggle.com/c/lshtc/) and facing several
difficulties even with
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
algorithms that already support spase matrices (even simple
algorithms such
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
as NB). In general I get some memory error and sometimes
segfaults.
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
I would be happy if I could implement the necessary support
for DT (and
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
related algorithms such as random forest and extra trees) and,
if the time
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
constraint allows, improve as much as possible the general
support for
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
sparse matrices.
With this email, I want to show my interest and ask you if
there are
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
already any candidates for this place.
Best regards,
--
João
------------------------------------------------------------------------------
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph
databases and
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Post by vamsi kaushik
Post by Arnaud Joly
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
Post by Felipe Eltermann
Post by vamsi kaushik
Post by Arnaud Joly
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Felipe Eltermann
Post by vamsi kaushik
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and their
Post by Felipe Eltermann
Post by vamsi kaushik
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book
today!
Post by Felipe Eltermann
Post by vamsi kaushik
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Post by Felipe Eltermann
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and their
Post by Felipe Eltermann
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases
and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
------------------------------------------------------------------------------
Learn Graph Databases - Download FREE O'Reilly Book
"Graph Databases" is the definitive new guide to graph databases and their
applications. Written by three acclaimed leaders in the field,
this first edition is now available. Download your free book today!
http://p.sf.net/sfu/13534_NeoTech
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Continue reading on narkive:
Loading...