My main motivation is mostly usability. In terms of development though, I've only really worked on decision trees, so my comments are heavily influenced by that experience.
Here are the three main reasons why I use scikit-learn:
- Simplicity (taking the cue from Olivier). If you've seen how difficult it is to prepare your dataset into Orange format, you will appreciate any package that operates directly on numpy arrays.
- Speed. The decision tree implementation of Orange takes about 25 seconds to train on the Madelon dataset, whereas the optimised version of scikit-learn takes well under a second. I can't really comment on other algorithms though.
- Readability. Algorithms implemented in scikit-learn are meant to be easily understood, to the point where anyone with enough knowledge of the algorithm should be able to go in and make changes if they wish. I like to think of it as executable pseudocode.
These are the main reasons why I use it, but the other ones mentioned (distributed code, licensing) are important too.
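To make the simplicity point concrete, here is a minimal sketch of training a decision tree directly on numpy arrays. It assumes a reasonably recent scikit-learn install; the synthetic data is only a stand-in (it is not Madelon), and the timing it prints says nothing about the Orange comparison above.

```python
import time

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT Madelon): 2000 samples, 50 features.
rng = np.random.RandomState(0)
X = rng.rand(2000, 50)
# The label depends on the first two features only.
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = DecisionTreeClassifier(random_state=0)

start = time.perf_counter()
clf.fit(X, y)  # plain ndarrays in, no dataset wrapper class needed
elapsed = time.perf_counter() - start

train_accuracy = clf.score(X, y)
print(f"fit took {elapsed:.3f} s, training accuracy {train_accuracy:.2f}")
```

The only preparation step is building the arrays themselves; there is no conversion into a library-specific dataset format.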
From: Denis Kochedykov <***@mail.ru>
Date: Sun, 04 Dec 2011 14:49:30
Subject: Re: [Scikit-learn-general] motivation for the lib, why re-implement existing stuff
Thanks for comments!
So, summarizing, sklearn versus Orange is:
- use plain arrays instead of classes for storing data-sets, features, etc
- use BSD rather than GPL license
- no framework, plain library of methods
If I got it right, it seems that creating sklearn was not a question of
Orange quality/usability, but more a question of a different development
approach.
That is, for users who're not going to sell their software (which is not
permitted by GPL), there is not much difference?
Of course, convenience for developers and simplicity mean a more viable
library in the long term.
Post by Olivier Grisel
- scikit-learn is a scikit (scientific python toolkit): it is meant to
be used by the scipy community and to play by its tacit rules: the
primary data structure is the plain old numpy array (or
scipy.sparse matrix): no machine learning specific class for samples,
- scikit-learn only depends on projects with non-viral open source
licenses (python, numpy, scipy and joblib are all BSD-like): hence
scikit-learn is BSD-like as well, to play fair in this permissive
ecosystem (being able to copy and paste any function or module of the
scikit-learn source code anywhere else is perfectly OK)
- scikit-learn focuses on implementing machine learning with as little
framework code as possible and lets other framework-oriented projects
reuse some of the scikit-learn modules if they want to do so: e.g. to
build a data mining GUI.
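The data-structure point in the first item above can be sketched as follows. This is only an illustration, assuming a recent scikit-learn and scipy; the choice of LogisticRegression and the toy data are assumptions made for the example, not something taken from the post.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# Toy data: roughly half of the entries are zeroed out so that a sparse
# representation makes sense.
rng = np.random.RandomState(0)
X_dense = rng.rand(200, 5)
X_dense[X_dense < 0.5] = 0.0
y = (X_dense[:, 0] > 0).astype(int)  # label: is the first feature non-zero?

# The same samples stored as a scipy.sparse CSR matrix.
X_sparse = sparse.csr_matrix(X_dense)

# The estimator takes either container directly; there is no
# library-specific sample or dataset class in between.
dense_model = LogisticRegression().fit(X_dense, y)
sparse_model = LogisticRegression().fit(X_sparse, y)

dense_acc = dense_model.score(X_dense, y)
sparse_acc = sparse_model.score(X_sparse, y)
print(f"dense accuracy {dense_acc:.2f}, sparse accuracy {sparse_acc:.2f}")
```

Because the inputs are ordinary numpy and scipy objects, any other tool in that ecosystem can produce or consume them without an adapter layer.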
Other scikit-learn contributors might have their own reasons to
contribute to scikit-learn rather than Orange.
Also, from a more trivial perspective, I like working on github using
pull-request based reviews as the main inter-developer communication
medium for code contributions. svn is such a pain once you have tasted a
decentralized tool like git or hg.
Scikit-learn-general mailing list