Discussion:
[Scikit-learn-general] Plans on implementing CHAID? (CHi-squared Automatic Interaction Detection)
Marcos Wolff
2012-09-04 13:15:49 UTC
Permalink
Hi,

I was wondering if there are plans to implement CHAID techniques for
tree growing.

http://en.wikipedia.org/wiki/CHAID
Gordon Kass's 1980 paper:
http://ebookbrowse.com/gdoc.php?id=60655988&url=705c072c97190f9f1c59ac51aa72a258

SPSS uses it, and:
-it's very effective for multi-class classification; it outperforms CART
in every situation (this may be an SPSS implementation issue, of course)
-it is not sensitive to unbalanced datasets (no need for prior probabilities
if you have very few positive and very many negative instances in your
data)
-it does multi-way partitions of the data (CART does only binary partitions)
-and it performs very well on large datasets because of the simplicity of
the algorithm (I used it for classification on a dataset of 350,000 rows and
200 columns of numeric, ordinal, and categorical data)
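The split-selection step at the heart of CHAID can be sketched in a few lines (a minimal illustration only, assuming SciPy is available; the real algorithm also merges statistically similar categories and applies a Bonferroni adjustment, and the helper name below is hypothetical):

```python
# Minimal sketch of CHAID's core idea: pick the categorical feature whose
# chi-squared test of independence against the class label is most
# significant. Category merging and the Bonferroni correction that full
# CHAID performs are omitted here.
from collections import Counter
from scipy.stats import chi2_contingency

def best_chaid_split(rows, target, features):
    """rows: list of dicts; target and features: key names (hypothetical API)."""
    best_feat, best_p = None, 1.0
    for feat in features:
        # Build the contingency table: feature category x class label.
        counts = Counter((r[feat], r[target]) for r in rows)
        cats = sorted({r[feat] for r in rows})
        labels = sorted({r[target] for r in rows})
        table = [[counts[(c, lab)] for lab in labels] for c in cats]
        _, p, _, _ = chi2_contingency(table)
        if p < best_p:
            best_feat, best_p = feat, p
    return best_feat, best_p
```

On each node, the feature with the smallest p-value would define a multi-way split, one branch per (merged) category.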

I searched the scikit-learn GitHub issues for requested features and didn't
see any mention of it (
https://github.com/scikit-learn/scikit-learn/issues/search?q=chaid )

I'm not really an experienced Python developer; I just use Python for data
cleaning, scraping, and running data mining algorithms.
I don't know if I'm experienced enough to develop this feature, but
I'll be happy to try, or to help the community do it, if you are interested.

What would you recommend I read or do, apart from reading this guide
http://scikit-learn.org/stable/developers/index.html#contributing-code,
if I wanted to contribute this feature?

Thanks!! (and sorry for my English)
Marcos.
Olivier Grisel
2012-09-04 13:33:06 UTC
Permalink
Hi,
Sounds interesting. I especially appreciate algorithms that are
scalable to at least medium-sized, real-world datasets :)

If you feel like contributing an implementation of this, please read
carefully the following guide:

http://scikit-learn.org/dev/developers/index.html

Also have a look at the existing pull requests (even if they are completely unrelated):

https://github.com/scikit-learn/scikit-learn/pulls

It's a good way to understand how the contribution / reviewing process
works in practice.

Beware that each contribution will have to be maintained in the future,
so it will add a burden on the developers of the project. This burden can
only be alleviated by extensive documentation, tests, usage examples,
and API and variable names consistent with the rest of the project.
Hence, don't expect a fast code-submit-and-forget contribution
process.

Also, more specific to this particular algorithm: in scikit-learn,
categorical features are traditionally encoded as one-hot binary
features stored in a scipy.sparse matrix. This data structure is a bit
peculiar, so you might want to have a look at existing implementations
of estimators that are able to deal with it before engaging in the
design process. Dict-like representations (typically used in
data mining) can be converted into sparse data using the DictVectorizer
class: http://scikit-learn.org/dev/modules/feature_extraction.html#loading-features-from-dicts
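For example, DictVectorizer turns a list of feature dicts into such a sparse one-hot matrix (a minimal sketch; the records below are made up for illustration):

```python
# Converting dict-style records into a sparse one-hot matrix with
# scikit-learn's DictVectorizer: string values become binary indicator
# columns, numeric values pass through unchanged.
from sklearn.feature_extraction import DictVectorizer

records = [  # hypothetical example records
    {"color": "red", "size": "L", "weight": 1.5},
    {"color": "blue", "size": "M", "weight": 0.7},
]
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(records)   # scipy.sparse matrix, shape (2, 5)
print(sorted(vec.vocabulary_))   # one column per category, plus 'weight'
```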
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Marcos Wolff
2012-09-04 14:11:53 UTC
Permalink
Hi Olivier,

Ok. I forked scikit-learn on GitHub, so I'm aware of pull requests and other
activities.

If I could contribute this feature, I would be glad to maintain it with
documentation, tests, usage examples, etc.
I think it's a very interesting algorithm and it's worth devoting time to it.

Oh yes, I'm aware of DictVectorizer. I stumbled upon it while trying to classify
that dataset I told you about, so I had to figure out how to make it work.

Thanks for the info! Any suggestion will be helpful, since I'm just starting
to contribute to a library, and even to use GitHub.

Congratulations on the mailing list; it's super responsive and everyone is
very well informed, with lots of useful suggestions!

Marcos.
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general