Marcos Wolff
2012-09-04 13:15:49 UTC
Hi,
I was wondering if there are any plans to implement the CHAID technique for
tree growing:
http://en.wikipedia.org/wiki/CHAID
Gordon Kass's 1980 paper:
http://ebookbrowse.com/gdoc.php?id=60655988&url=705c072c97190f9f1c59ac51aa72a258
SPSS uses it, and in my experience:
-it's very effective for multi-class classification; it outperformed CART
in every situation I tried (this may be an SPSS implementation issue, of
course)
-it is not sensitive to unbalanced datasets (no need for prior probabilities
if you have very few positive and very many negative instances in your
data)
-it can split the data into more than two partitions at each node (CART
only does binary partitions)
-it performs very well on large datasets because of the simplicity of the
algorithm (I used it for classification on a dataset of 350,000 rows and
200 columns of numeric, ordinal and categorical data)
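For anyone not familiar with CHAID, the core of its split selection can be
sketched roughly as follows. This is only a simplified illustration, not the
SPSS implementation and not an existing scikit-learn API: the function names
(chi2_sf, chi2_independence, best_split) and the toy data are made up for the
example. It runs a Pearson chi-square test of independence between each
categorical predictor and the class label and picks the predictor with the
smallest p-value. The full algorithm also merges similar categories and
applies a Bonferroni adjustment to the p-values; both steps are left out here.

```python
import math
from collections import Counter


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable with integer dof >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!
        term = math.exp(-half)
        total = term
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return total
    # Odd dof: erfc(sqrt(x/2)) plus a finite sum of half-integer power terms.
    total = math.erfc(math.sqrt(half))
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= half / (i + 0.5)
    return total


def chi2_independence(xs, ys):
    """Pearson chi-square test between two equally long categorical lists.

    Returns (statistic, degrees_of_freedom, p_value).
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    rows = Counter(xs)             # predictor category totals
    cols = Counter(ys)             # class totals
    stat = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            stat += (joint[(r, c)] - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(cols) - 1)
    p_value = chi2_sf(stat, dof) if dof > 0 else 1.0
    return stat, dof, p_value


def best_split(predictors, y):
    """Return (name, p_value) of the predictor most associated with y.

    Simplification: no category merging and no Bonferroni adjustment.
    """
    p_values = {name: chi2_independence(col, y)[2]
                for name, col in predictors.items()}
    name = min(p_values, key=p_values.get)
    return name, p_values[name]


if __name__ == "__main__":
    # Toy data, invented for this example.
    y = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
    predictors = {
        "colour": ["r", "r", "r", "g", "g", "g", "b", "b"],
        "size": ["s", "l", "s", "l", "s", "l", "s", "l"],
    }
    print(best_split(predictors, y))
```

In this simplified version, each distinct category of the chosen predictor
would become one child node, which is where the partitions into more than two
branches come from.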
I searched the scikit-learn issues on GitHub for requested features and
didn't see any mention of it (
https://github.com/scikit-learn/scikit-learn/issues/search?q=chaid )
I'm not really an experienced Python developer; I just use Python for data
cleaning, scraping and running data mining algorithms.
I don't know if I am experienced enough to develop this feature, but
I'll be happy to try, or to help the community do it, if you are interested.
What would you recommend I read or do, apart from reading this guide
http://scikit-learn.org/stable/developers/index.html#contributing-code,
if I wanted to contribute to developing this feature?
Thanks!! (and sorry for my English)
Marcos.