Discussion:
Sparse matrix support for Decision tree implementation
Maheshakya Wijewardena
2014-01-22 05:13:19 UTC
Permalink
Hi,

I have been using scikit-learn's OneHotEncoder for data encoding, and the resulting sparse array is supported by only a few models, such as logistic regression, SVC, etc. When I convert those sparse matrices to dense matrices with a list comprehension or the toarray() function, the resulting arrays become too large for classifiers such as decision trees or any other tree-based classifier.
I saw a GSOC project idea about implementing this, as mentioned here:
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2014
I'm looking forward to applying for GSOC this year as well, so I would like to start working on this. Where can I get support for this? (There are no possible mentors assigned for this yet.)

Regards,
Maheshakya
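A minimal sketch of the failure mode described above (the data shapes are made up for illustration; only the contrast between sparse-friendly and dense-only estimators is the point):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical high-cardinality categorical data.
    rng = np.random.RandomState(0)
    X_cat = rng.randint(0, 10000, size=(100000, 4))
    y = rng.randint(0, 2, size=100000)

    enc = OneHotEncoder()
    X_sparse = enc.fit_transform(X_cat)   # scipy.sparse matrix, ~40k columns

    # Sparse-friendly models work directly on the encoded matrix:
    LogisticRegression().fit(X_sparse, y)

    # Trees require a dense array; densifying blows up memory:
    # 100000 x ~40000 float64 is roughly 30 GB.
    # DecisionTreeClassifier().fit(X_sparse.toarray(), y)  # MemoryError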
Arnaud Joly
2014-01-22 08:20:33 UTC
Permalink
Hi Maheshakya,

I could be one of the mentors for this GSOC.

If you want to apply for GSOC, I think this message from Gael and Mathieu is worth reading:
http://sourceforge.net/mailarchive/message.php?msg_id=31864881

Best,
Arnaud
Mathieu Blondel
2014-01-22 08:48:45 UTC
Permalink
Hi,

Something I was wondering is whether sparse support in decision trees would
actually be useful. Do decision trees (or ensembles of them like random
forests) work better than linear models for high-dimensional data?

It would be nice to take the News20 dataset, pre-select the top 10k
features (or more if possible) then measure test accuracy on the densified
dataset. I would be very interested in hearing the results.
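A rough sketch of the suggested experiment (the chi2 ranking, the 10k cutoff and the train/test split are my own choices, not prescribed above):

    from sklearn.cross_validation import train_test_split
    from sklearn.datasets import fetch_20newsgroups_vectorized
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.tree import DecisionTreeClassifier

    data = fetch_20newsgroups_vectorized(subset='all')
    # Keep the 10k highest-scoring features, then densify.
    X = SelectKBest(chi2, k=10000).fit_transform(data.data, data.target)
    X = X.toarray()   # ~19k x 10k float64, about 1.5 GB

    X_tr, X_te, y_tr, y_te = train_test_split(X, data.target,
                                              test_size=0.25, random_state=0)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print("decision tree test accuracy:", tree.score(X_te, y_te))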

And regardless of accuracy, some algorithms (e.g., GMM) scale very poorly
with n_features. I wonder whether that's also the case for decision trees.

Gilles, Peter, Arnaud, any opinion / experience?

Mathieu


Gilles Louppe
2014-01-22 08:57:30 UTC
Permalink
Mathieu,

I have no experience with forests on sparse data, nor have I seen much work
on the topic. I would be curious to investigate, however; there may be
problems for which this is useful. I know that Arnaud tried forests on the
(densified) 20newsgroups and it actually seems to work well.

In particular, I have the feeling that supporting sparse data in decision
trees would be helpful for boosting. I am quite confident GBRT would work
well on such data. What do you think, Peter?

Gilles
Paolo Losi
2014-01-31 10:58:30 UTC
Permalink
Post by Mathieu Blondel
Something I was wondering is whether sparse support in decision trees
would actually be useful. Do decision trees (or ensembles of them like
random forests) work better than linear models for high-dimensional data?
I share your point of view.
Post by Mathieu Blondel
It would be nice to take the News20 dataset, pre-select the top 10k
features (or more if possible) then measure test accuracy on the densified
dataset. I would be very interested in hearing the results.
I would also be interested in it. Applying random projection down to
100-1000 features would also be very interesting.
I'm also thinking of feature hashing with multiple probes, which I believe
is not currently implemented in sklearn.
The idea would be to map "literal" features or feature ids via murmurhash
to 128 bits, then group the hash into 16 groups of 8 bits. Each group is a
probe into the 256 projected features (8 bits), and the sign pattern of
the probes would alternate: -1, 1, -1, 1, ...
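A rough sketch of this multi-probe scheme as I read it (the function name and the 128-bit emulation via four seeded 32-bit murmur hashes are assumptions, not an existing sklearn API):

    import numpy as np
    from sklearn.utils import murmurhash3_32

    def multi_probe_hash(feature, n_buckets=256, n_probes=16):
        # Emulate a 128-bit hash with four seeded 32-bit murmur hashes.
        h = b''.join(murmurhash3_32(feature, seed=s, positive=True)
                     .to_bytes(4, 'little') for s in range(4))
        out = np.zeros(n_buckets)
        for i in range(n_probes):                  # 16 groups of 8 bits
            bucket = h[i]                          # each byte picks 1 of 256 buckets
            out[bucket] += 1.0 if i % 2 else -1.0  # alternating -1, 1 pattern
        return out

    print(multi_probe_hash("some_literal_feature").nonzero()[0])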

Has anyone ever performed such experiments?

Paolo
Arnaud Joly
2014-01-31 14:40:19 UTC
Permalink
Here are some results on the 20 newsgroups dataset:

Classifier       train-time    test-time    error-rate
-------------------------------------------------------
5-nn             0.0047s       13.6651s     0.5916
random forest    263.3146s     3.9985s      0.2459
sgd              0.2265s       0.0657s      0.2604


Optimizing everything properly should allow gaining a few percentage
points. Here is the gist, for those who want to play:
https://gist.github.com/arjoly/8732555

Best,
Arnaud
Caleb
2014-01-22 09:21:00 UTC
Permalink
Hi all,

I am using random forests to do deep learning/feature learning with the RandomTreesEmbedding in scikit-learn. It would be cool to apply a random forest to the learned features and induce a higher-level representation.

I have actually tried the naive approach of densifying the output from RandomTreesEmbedding and feeding it back into another one to get a second-level representation of the same data, and then applying an SVM to it. Not only is it extremely slow, the results also become worse.
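For the record, a minimal sketch of the stacking just described (RandomTreesEmbedding is the actual scikit-learn class name; the toy data and parameters are made up). The toarray() call is exactly the costly step that sparse tree support would remove:

    import numpy as np
    from sklearn.ensemble import RandomTreesEmbedding
    from sklearn.svm import LinearSVC

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 20)
    y = (X[:, 0] > 0).astype(int)

    emb1 = RandomTreesEmbedding(n_estimators=10, random_state=0)
    X_level1 = emb1.fit_transform(X)        # sparse one-hot leaf indicators

    # Second level: trees only accept dense input, so we must densify.
    emb2 = RandomTreesEmbedding(n_estimators=10, random_state=1)
    X_level2 = emb2.fit_transform(X_level1.toarray())

    clf = LinearSVC().fit(X_level2, y)      # LinearSVC handles sparse input fine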

However, I think sparse matrix support for decision trees is a worthwhile effort, as it would let me investigate more easily why the results get worse.

Just my 2 cents.

Caleb 



Maheshakya Wijewardena
2014-01-23 06:18:57 UTC
Permalink
Hi

As I see it, using sparse data we can enhance the descriptiveness of the
data while keeping it smaller than the dense data, without losing
information. Isn't that what trees generally need for improved accuracy?

I will try using sparse data on 20newsgroups data and let you know the
results.


Arnaud,
I've gone through those messages, and I've already started working on
patches. Last year I did a university project implementing Bagging in
scikit-learn. As Gilles had already begun that work, I was not able to get
my code merged. Moreover, I did not implement feature bootstrapping, as it
was beyond the scope of my original proposal for the project.
https://github.com/maheshakya/scikit-learn/blob/bagging2/sklearn/ensemble/bagging.py

I would appreciate it if you could review my implementation and give me
some feedback on what I can do further.

Thank you.
Olivier Grisel
2014-01-23 10:00:51 UTC
Permalink
Post by Maheshakya Wijewardena
As I see it, using sparse data we can enhance the descriptiveness of the
data while keeping it smaller than the dense data, without losing
information.
I don't understand what you mean by "sparse data we can enhance the
descriptiveness of the data".
Post by Maheshakya Wijewardena
I will try using sparse data on 20newsgroups data and let you know the
results.
What do you mean? The 20newsgroups data is inherently sparse, in the sense
that the extracted BoW features are mostly zero-valued. The problem is that
the current implementation of decision trees requires a dense
*representation* of that sparse data to work. Making decision trees work
on a sparse representation (e.g. a CSC sparse matrix) would require
re-implementing a lot of the code.
Post by Maheshakya Wijewardena
I would appreciate it if you could review my implementation and give me
some feedback on what I can do further.
I don't really see the point in spending time reviewing past
alternative implementations of existing features. There are already
129 pull requests that need reviewers' time:

https://github.com/scikit-learn/scikit-learn/pulls

In my opinion it would be much more productive to fix bugs in the
current code base.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2014-01-23 11:37:03 UTC
Permalink
Post by Maheshakya Wijewardena
I will try using sparse data on 20newsgroups data and let you know the
results.

What I was suggesting is to densify the News20 dataset (using a subset of
the features so that it fits in memory) and try it on our current
implementation. Of course it will be really slow but the goal is to
evaluate how decision trees would fare in terms of *accuracy*. A correct
sparse implementation should give exactly the same results (unless it
handles the zeros differently).

Don't get me wrong, adding sparse data support to decision trees is not
necessarily a bad idea: I'm just trying to evaluate the cost (in terms of
maintenance) vs. benefits. How much code in our current implementation
depends on the data representation?

Mathieu
Gilles Louppe
2014-01-23 11:42:18 UTC
Permalink
Post by Mathieu Blondel
How much code in our current implementation depends on the data
representation?

Not much, actually. It now basically boils down to writing a new splitter
object. Everything else remains the same. So I would say it amounts to
~300 lines of Cython (out of the 2300 lines in our implementation).
Felipe Eltermann
2014-01-23 12:25:40 UTC
Permalink
I'm testing different classifiers on a BoW problem, and last week I was
disappointed that I couldn't use scikit's DecisionTree.
However, using NaiveBayes was awesome! Thanks for this great piece of
software.
So, if you are planning to add support for scipy sparse matrices to
DecisionTree, I'd like to help.

Gilles, I read /sklearn/tree/tree.py and found that there are 4 methods
that receive X as a dense matrix:
BaseDecisionTree.fit()
BaseDecisionTree.predict()
DecisionTreeClassifier.predict_proba()
DecisionTreeClassifier.predict_log_proba()

fit() calls some Cython classes, which I think are the ones you referred to:
_tree.BestSplitter
_tree.PresortBestSplitter
_tree.RandomSplitter
_tree.Gini
_tree.Entropy
_tree.MSE
_tree.FriedmanMSE


Where to start?
Olivier Grisel
2014-01-23 13:55:39 UTC
Permalink
As Gilles said, have a look at the Splitters first. You probably want
feature-wise access to the input data, hence the scipy.sparse.csc_matrix
representation should be supported first. If you are not familiar with
the internal data structure of the CSC representation, here is a piece
of Cython code from another estimator that deals efficiently with CSC
sparse input data:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx#L227

which is called by:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/coordinate_descent.py#L450

Also have a look at:

http://docs.scipy.org/doc/scipy/reference/sparse.html
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix
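For readers new to CSC, a small pure-Python illustration of the layout those documents describe: for column j, the nonzero values live in data[indptr[j]:indptr[j+1]] and their row indices in indices[indptr[j]:indptr[j+1]], which is exactly the feature-wise access pattern a splitter needs:

    import numpy as np
    from scipy.sparse import csc_matrix

    X = csc_matrix(np.array([[1., 0., 0.],
                             [0., 0., 2.],
                             [3., 0., 4.]]))

    for j in range(X.shape[1]):
        lo, hi = X.indptr[j], X.indptr[j + 1]
        print("col %d: rows=%s values=%s" % (j, X.indices[lo:hi], X.data[lo:hi]))
    # col 0: rows=[0 2] values=[ 1.  3.]
    # col 1: rows=[] values=[]
    # col 2: rows=[1 2] values=[ 2.  4.]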
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Felipe Eltermann
2014-01-31 10:46:43 UTC
Permalink
OK, I finished reading _tree.pyx and now I understand the CSC sparse
matrix format.
I have a general view of what needs to be implemented.

I've never seriously used Cython. What are you using as a development
environment? How do you easily code/compile/test?


Olivier Grisel
2014-01-31 11:25:33 UTC
Permalink
Post by Felipe Eltermann
OK, I finished reading _tree.pyx and now I understand the CSC sparse
matrix format.
I have a general view of what is necessary to be implemented.
I've never seriously used Cython. What are you guys using as development
environment?
Just a good text editor and a shell terminal. What else?
Post by Felipe Eltermann
How to easily code/compile/test?
cython sklearn/tree/_tree.pyx
python setup.py build_ext -i
nosetests -s -v -x sklearn/tree/

You can check that you are not introducing too many new yellow lines
(C boilerplate) by using the HTML source annotation mode:

cython -a sklearn/tree/_tree.pyx
firefox sklearn/tree/_tree.html

Read the Cython documentation for more details: http://cython.org/
--
Olivier
Felipe Eltermann
2014-02-04 16:38:37 UTC
Permalink
Post by Mathieu Blondel
How much code in our current implementation depends on the data
representation?
Post by Gilles Louppe
Not much, actually. It now basically boils down to writing a new
splitter object. Everything else remains the same. So I would say it
amounts to ~300 lines of Cython (out of the 2300 lines in our
implementation).

Gilles,

The *Splitter* base class assumes X to be a dense, 2-dimensional numpy array.
Available splitters include *BestSplitter*, *PresortBestSplitter* and
*RandomSplitter* (all inheriting from Splitter).

That said, I think these are the possible architectures:

1-
- make "Splitter" base class support both dense and sparse representation
- implicitly assume that "BestSplitter", "PresortBestSplitter" and
"RandomSplitter" are dealing with dense structure
- create "SparseSplitter" that inherits "DenseSplitter" and implicitly
assumes the sparse structure

2-
- replace "Splitter" by "DenseSplitter"
- make "BestSplitter", "PresortBestSplitter" and "RandomSplitter" inherit
from "DenseSplitter"
- create "SparseSplitter"
- force "SparseSplitter" usage when the input is a sparse matrix

Which one should we use?
Also, which splitting approach should be used on SparseSplitter? Why not
"SparseBestSplitter", "SparsePresortBestSplitter" and
"SparseRandomSplitter"?
Arnaud Joly
2014-02-05 12:34:35 UTC
Permalink
I think I would go for the option that minimizes the amount of code duplication,
so I would probably start with 2. Since we no longer pickle the Splitter and
Criterion, the constructor arguments could be used to pass the X and y matrices.

Cheers,
Arnaud
Felipe Eltermann
2014-02-07 18:55:59 UTC
Permalink
Arnaud,

I added an issparse attribute to the Splitter base class.
By doing so, I think I managed to introduce sparse support without
replicating the Splitters' business logic.
I'm working on this branch [1]

I have two questions:
1- I removed a "with nogil" statement [2]. Is there a way to keep it?
2- The toy sparse input test is failing, and I think it's due to incorrect
values being read from X.data [3] (the method assumes the matrix is in
CSC format). Can you see something I'm doing wrong here?


[1] https://github.com/eltermann/scikit-learn/commits/DT-sparse
[2]
https://github.com/eltermann/scikit-learn/commit/8388bedff4e225cda9a1b2b6e3fc250bb7d22276#diff-a2cead4f3702cc4b9f76562bb2777edbL2297
[3]
https://github.com/eltermann/scikit-learn/commit/5ba9c367661446c3eba7e6ea54adc1ff5cdfd39f#diff-a2cead4f3702cc4b9f76562bb2777edbR1281
Arnaud Joly
2014-02-08 10:25:26 UTC
Permalink
Post by Felipe Eltermann
1- I removed a "with nogil" statement [2]. Is there a way to keep it?
In order to keep it, you can pass the numpy arrays indices, indptr, data
and the nnz count directly to the splitter. Detecting the input type can
easily be done in the Python code.
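A hedged sketch of that suggestion on the Python side (init_sparse/init_dense are hypothetical method names): hand the splitter plain typed arrays rather than a scipy matrix object, because attribute access on a Python object needs the GIL while indexing typed arrays does not:

    from scipy.sparse import issparse

    def init_splitter(splitter, X, y):
        # Detect the input type in Python, outside the nogil section.
        if issparse(X):
            X = X.tocsc()
            splitter.init_sparse(X.data, X.indices, X.indptr, X.nnz, y)
        else:
            splitter.init_dense(X, y)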
Post by Felipe Eltermann
2- Toy sparse input test is failing, and I think it's due to incorrect values being read from X.data [3] (the method considers the matrix is in CSC format). Can you see something I'm doing wrong here?
The tests seem to fail because you are also testing with the other splitters
(random splitter, presort best splitter), which don't implement sparse
matrix support.


I have looked a bit at your code and it’s a great start. It would be easier to help you if you open a pull request.

Best,
Arnaud
Olivier Grisel
2014-02-10 17:14:10 UTC
Permalink
Post by Arnaud Joly
I have looked a bit at your code and it's a great start. It would be easier
to help you if you open a pull request.
+1. Don't hesitate to open an early PR with the "[WIP]" marker as a
title prefix to emphasize that you don't consider it finished work yet
but are open to early feedback on your work.
--
Olivier
Felipe Eltermann
2014-02-11 15:07:23 UTC
Permalink
https://github.com/scikit-learn/scikit-learn/pull/2848
The current state of the implementation is explained in the PR comment.
Arnaud Joly
2014-01-24 09:47:17 UTC
Permalink
Post by Maheshakya Wijewardena
I would appreciate it if you could review my implementation and give me
some feedback on what I can do further.
Sorry, but I am currently short of time. If you sync your master branch
with the upstream repository, you can use GitHub to show the differences
between the current implementation and yours:
http://stackoverflow.com/questions/3792989/how-to-view-diff-of-a-forked-github-project
http://stackoverflow.com/questions/14500240/how-can-i-generate-a-diff-for-a-single-file-between-two-branches-in-github


(Here, by "you" I refer to Maheshakya Wijewardena, but also to any
prospective Google Summer of Code applicant.)

Everybody will appreciate it if you take care of some issues in the tracker.
You'll get to know more about scikit-learn: its goals, its vision, its
development practices, its contributors, ...
Start with small pull requests such as documentation fixes, small bugs or
issues tagged as easy, then progressively move on to bigger projects.

This is an opportunity for you to learn more about what you are signing up
for with a Google Summer of Code. It will allow you to think about the
subject of your proposal: are you sure that you want to work full time on
this subject for several months? If your proposal involves several specific
modules, it's a good idea to become more familiar with that part of the
code (algorithms, software design, problems, bugs, drawbacks, possible
improvements, related scientific literature, ...) and to start contributing.

Last but not least: committing early in the project is important, because
it lets us get to know you in the process.

Best,
Arnaud
Maheshakya Wijewardena
2014-01-24 11:14:52 UTC
Permalink
Hi Arnaud,
I have already synced my master branch with the upstream repository and
checked what the discrepancies were. There were some differences between
the implementations (I reused much of the forest code, so it was not that
different). I couldn't implement the Cython sections because I was not
experienced with it; I'm working on that too.

I have started working on this enhancement (as it is marked easy), and I
have passable experience with the forest code:

https://github.com/scikit-learn/scikit-learn/issues/2129

I'll start working on other easy issues and start committing as soon as
possible.

Regards,
Maheshakya