Discussion:
[Scikit-learn-general] Is it possible to specify the order of splitting in a decision tree with scikit-learn?
Rex
2015-07-01 03:37:45 UTC
Given three columns, ["A", "B", "C"], can we specify the order of
splitting, so that the tree first splits on the categories of "A", then "B",
and then the others?

Based on the documentation page for DecisionTreeClassifier, there is no such
option. Is there any way to work around it?

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Andreas Mueller
2015-07-01 15:08:31 UTC
Not really, as that kind of defeats the purpose of learning the tree.
You could build a series of stumps that first only get feature A, then
feature B, and then feature C.
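The stump-cascade idea could be sketched roughly like this; the helper function, toy data, and the returned dict format are my own invention for illustration, not part of scikit-learn:

```python
# Hypothetical sketch: grow a tree that splits on features in a fixed,
# user-specified order by fitting one depth-1 stump per level.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_ordered_stumps(X, y, feature_order):
    """Recursively fit depth-1 stumps, one feature per level.

    Returns a nested dict describing the learned splits; a full
    implementation would also need a matching predict routine.
    """
    if not feature_order or len(np.unique(y)) == 1:
        # Leaf: predict the majority class.
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}

    feat = feature_order[0]
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X[:, [feat]], y)

    if stump.tree_.node_count == 1:  # no useful split found on this feature
        return {"leaf": stump.predict(X[:1])[0]}

    threshold = stump.tree_.threshold[0]  # root node's split threshold
    left = X[:, feat] <= threshold
    return {
        "feature": feat,
        "threshold": threshold,
        "left": fit_ordered_stumps(X[left], y[left], feature_order[1:]),
        "right": fit_ordered_stumps(X[~left], y[~left], feature_order[1:]),
    }

# Toy data: force splits on feature 0 first, then 1, then 2.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)
tree = fit_ordered_stumps(X, y, feature_order=[0, 1, 2])
print(tree["feature"])  # first split is forced onto feature 0
```

Note that, unlike a real tree learner, this cascade cannot reconsider the split order per branch, which is exactly why it usually underperforms an unconstrained tree.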
------------------------------------------------------------------------------
Don't Limit Your Business. Reach for the Cloud.
GigeNET's Cloud Solutions provide you with the tools and support that
you need to offload your IT needs and focus on growing your business.
Configured For All Businesses. Start Your Cloud Today.
https://www.gigenetcloud.com/
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Sebastian Raschka
2015-07-01 16:16:55 UTC
Maybe a crazy idea, but I think it could be useful to have something like a "repeat_features" parameter that can be set to `False` so that features are not reused further down the tree.

E.g., let's say we have 1000 different drug molecules with certain chemical groups and have some sort of experimental data of whether they work or not. Using decision tree classification/regression without feature repetition could help to interpret which of the functional groups may be important -- here the focus is maybe not so much predictive performance but rather interpretability, something like "supervised" clustering.
Jacob Schreiber
2015-07-01 16:26:19 UTC
I don't think having that feature is a good idea. The great power of
decision trees (and ensembles of trees) is their ability to learn complicated
non-linearities, which involves splitting on a variable multiple times if
necessary. If you're looking for an interpretable feature selection method,
there are better alternatives.
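One such alternative (my own illustration; the thread doesn't name a specific method) is univariate feature selection:

```python
# Univariate feature selection with SelectKBest: rank features by an
# ANOVA F-score and keep the top k. Data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# Indices of the three features with the highest F-scores.
print(selector.get_support(indices=True))
```

Unlike split order in a tree, the per-feature scores here are directly interpretable on their own.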
Satrajit Ghosh
2015-07-01 16:29:25 UTC
the ensemble tree methods (random forests and extra-trees) give you some
intuition about features through the feature_importances_ attribute.
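For example (synthetic data, purely for illustration):

```python
# Impurity-based feature importances from a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One score per feature, normalized to sum to 1; higher means the feature
# contributed more impurity reduction across the forest's splits.
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```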

cheers,

satra
Dale Smith
2015-07-01 16:34:29 UTC
It is a crazy idea. It defeats the purpose of random forest, which is to introduce randomness in specific ways in order to achieve certain goals. Your idea, while perhaps appropriate for your use case, does not fit the algorithm you want to use. Why not investigate alternatives that better fit it?


Dale Smith, Ph.D.
Data Scientist
Sebastian Raschka
2015-07-01 17:45:16 UTC
Yes, and thanks for the answers; it was just a random idea.

But in all seriousness, which algorithm would you use for such a task? Here, the goal is not predictive performance but rather "inference":

I am collaborating with experimentalists who obtained measurements on a continuous 0.0 - 1.0 scale, and each sample has ~30 binary features. They basically want to "learn" from this data, for example, which combination of features was "important" to yield a response >= 0.5 (although this threshold is not fixed).
For example, using a decision tree, you could come up with something like

If feature A=1 --> response > 0.5
If feature B=0 --> response > 0.6
If feature C=1 ---> response > 0.7
etc.

Basically, association rule mining but with continuous outputs.
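One rough way to sketch that kind of rule extraction in scikit-learn (assuming a recent version where export_text is available; the data and feature names below are synthetic):

```python
# Fit a shallow regression tree on binary features and print its rules,
# as a stand-in for "association rules with a continuous output".
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.RandomState(42)
X = rng.randint(0, 2, size=(200, 5))                      # binary features
y = 0.3 * X[:, 0] + 0.2 * X[:, 1] + 0.05 * rng.rand(200)  # response in [0, 1]

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)

# export_text renders each root-to-leaf path as an if/then rule with the
# predicted mean response at the leaf.
rules = export_text(reg, feature_names=[f"feature_{i}" for i in range(5)])
print(rules)
```

With binary inputs, each feature can appear at most once along any path, so the printed paths read quite directly as rules over feature combinations.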
Jacob Schreiber
2015-07-01 19:00:04 UTC
If you are working with entirely binary data, then features will naturally not
be repeated along a path in the tree. I think you are discussing the more
general field of 'feature selection', though. There are a plethora of
algorithms which do that--try to identify