Discussion: Majority rule ensemble classifier
Sebastian Raschka
2015-01-11 00:13:30 UTC
Hi,

I wrote a short blog post about implementing a conservative majority rule ensemble classifier in scikit-learn, and someone asked me whether this would be interesting for the scikit-learn library.

The idea behind it is quite simple: use the weighted or unweighted majority rule across different classification models (naive Bayes, logistic regression, random forests, etc.) to predict the class label.

from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
# EnsembleClassifier is the class from the blog post linked below;
# X, y are the training data (defined elsewhere).

clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = GaussianNB()

eclf = EnsembleClassifier(clfs=[clf1, clf2, clf3], weights=[1, 1, 1])

for clf, label in zip([clf1, clf2, clf3, eclf],
                      ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_validation.cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

(more details in the blog post: http://sebastianraschka.com/Articles/2014_ensemble_classifier.html)
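For concreteness, here is a minimal sketch of what such a majority-vote classifier could look like. The class and attribute names are illustrative, not the exact code from the blog post, and it assumes integer-encoded class labels so that np.bincount works:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class MajorityVoteClassifier(BaseEstimator, ClassifierMixin):
    """Weighted hard-majority vote over a list of base classifiers (sketch)."""

    def __init__(self, clfs, weights=None):
        self.clfs = clfs
        self.weights = weights  # one vote weight per base classifier, or None

    def fit(self, X, y):
        # Fit a clone of every base classifier on the same training data.
        self.clfs_ = [clone(clf).fit(X, y) for clf in self.clfs]
        return self

    def predict(self, X):
        # One column of predicted labels per base classifier.
        predictions = np.asarray([clf.predict(X) for clf in self.clfs_]).T
        # For each sample, pick the label with the largest (weighted) vote count;
        # assumes labels are non-negative integers so np.bincount applies.
        return np.apply_along_axis(
            lambda row: np.argmax(np.bincount(row, weights=self.weights)),
            axis=1, arr=predictions)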

If you consider this useful, let me know, and I would be happy to contribute it to the scikit-learn library.

Best,
Sebastian
Andy
2015-01-14 02:21:35 UTC
Permalink
Hi Sebastian.
I think this might be useful, as these types of algorithms are often used
in competitions.
It would also be nice to provide a transform method, so that one could
also learn another model on top (like here:
http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html).
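Roughly along these lines (a hypothetical sketch: it assumes the ensemble gains a fit/transform interface, and reuses the clf1..clf3 and X, y names from the first message):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Hypothetical: if EnsembleClassifier.transform() emitted the base models'
# probability outputs as features, a second-level ("stacked") model could
# be trained on top of them inside a Pipeline.
stacked = Pipeline([
    ('ensemble', EnsembleClassifier(clfs=[clf1, clf2, clf3])),
    ('meta', LogisticRegression()),  # learns how to combine the base models
])
stacked.fit(X, y)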

Cheers,
Andy
Joel Nothman
2015-01-14 07:06:05 UTC
I wonder if these ensembles, while common, are too non-standard. Are there
well-analysed variants of these models in the literature, or standard ways
to configure them? If not, perhaps this is best presented as an example
rather than available in the library...
Andy
2015-01-14 14:07:17 UTC
Well, there is "stacking", but that is rarely used in practice, I think.
FeatureUnion is also more of an engineering tool than a theoretical one...
Sebastian Raschka
2015-01-14 17:39:05 UTC
Hi Andy and Joel,

thanks for the heads-up and the discussion. I agree that it is more of an engineering tool, which is why I hadn't considered asking about it initially. However, it seems that some people were interested in it (probably primarily Kagglers), so I just wanted to know whether something like this could be useful in the scikit-learn library.

I would be happy to add it as an example and/or an implementation, since I am a big fan of scikit-learn and would be glad to give something back if I can :)

I can dig into some literature over the weekend and see what I can find. But my feeling is that, as Andy said, it is more of an engineering tool (in contrast to bagging and AdaBoost).

So, shall I go ahead and open an issue in the GitHub repo to continue the discussion?

Andy, could you give me a quick follow-up on the transform method? I am wondering what the transform should return in this case.

Best,
Sebastian
Andy
2015-01-15 23:41:41 UTC
Post by Sebastian Raschka
So, shall I go ahead and open an issue in the GitHub repo to continue the discussion?
Yeah, or an early pull request if you already have code.
Post by Sebastian Raschka
Andy, could you give me a quick follow-up on the transform method? I am wondering what the transform should return in this case.
The concatenated probability outputs of everything in the ensemble, I'd think, so that the result is (n_samples, n_classes * n_estimators). Then you can learn a combination on top of that.
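In code, that could be as little as the following sketch (it assumes the fitted base classifiers are stored in a self.clfs_ list, as in the earlier sketch, and that they all implement predict_proba):

import numpy as np

def transform(self, X):
    # Stack each base classifier's class-probability estimates side by side,
    # yielding an array of shape (n_samples, n_classes * n_estimators)
    # that a meta-estimator can be trained on.
    return np.hstack([clf.predict_proba(X) for clf in self.clfs_])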
Sebastian Raschka
2015-01-26 05:42:44 UTC
It has been a pretty busy couple of weeks, but I finally found some time on this rather quiet Sunday evening and sent a pull request with code and documentation (https://github.com/scikit-learn/scikit-learn/pull/4161).
Post by Andy
The concatenated probability outputs of everything in the ensemble, I'd think, so that the result is (n_samples, n_classes * n_estimators). Then you can learn a combination on top of that.
Oh yes, that makes sense; I added it, so one can now build EnsembleEnsembleClassifiers ;)

Thanks for the tips!

Cheers,
Sebastian