Discussion:
motivation for the lib, why re-implement existing stuff
Denis Kochedykov
2011-12-03 03:54:14 UTC
Permalink
Hi all,

I'm looking for an ML library for Python for our research team. I found
a quite comprehensive one - Orange - and a relatively new one -
scikits.learn.
Orange definitely looks good, given the number of methods implemented in
it, its maturity, and its GUI as a bonus.
But I'm a bit confused - if you guys started a new library, maybe there
is something wrong with Orange? Why did you need to re-implement what
has already been done, instead of using that lib as a foundation and
concentrating on adding new cool stuff or improving the existing?

I'm really interested. Thank you very much for any comments.
Denis.
Olivier Grisel
2011-12-03 11:14:35 UTC
Permalink
Post by Denis Kochedykov
Hi all,
I'm looking for an ML library for Python for our research team. I found
a quite comprehensive one - Orange - and a relatively new one -
scikits.learn.
Orange definitely looks good, given the number of methods implemented in
it, its maturity, and its GUI as a bonus.
But I'm a bit confused - if you guys started a new library, maybe there
is something wrong with Orange? Why did you need to re-implement what
has already been done, instead of using that lib as a foundation and
concentrating on adding new cool stuff or improving the existing?
Hi Denis,

In my opinion, here are the main reasons why scikit-learn cannot reuse Orange:

- scikit-learn is a scikit (scientific python toolkit): it is meant to
be used by the scipy community and to play by its tacit rules: the
primary data structure is the plain old numpy array (or scipy.sparse
matrix): no machine learning specific classes for samples, features,
datasets... (see the short sketch after this list)

- scikit-learn only has dependencies with non-viral open source
licenses (python, numpy, scipy and joblib are all BSD-like): hence
scikit-learn is BSD-like as well, to play fair in this permissive
ecosystem (being able to copy and paste any function or module of the
scikit-learn source code anywhere else is perfectly OK)

- scikit-learn focuses on implementing machine learning with as little
framework code as possible and lets other framework-oriented projects
reuse some of scikit-learn's modules if they want to do so, e.g. to
build a data-mining GUI.
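
To make the first point concrete, here is a minimal sketch of what that
looks like in practice (the digits dataset and LogisticRegression are
arbitrary choices, just for illustration):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X, y = digits.data, digits.target  # X is a plain 2-D numpy array, y a 1-D array

clf = LogisticRegression()
clf.fit(X[:1000], y[:1000])  # no Dataset/Feature/Sample wrapper classes needed
print(clf.score(X[1000:], y[1000:]))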

Other scikit-learn contributors might have their own reasons to
contribute to scikit-learn rather than Orange.

Also on a more trivial perspective, I like working on github using
pull-request based reviews as the main inter-developer communication
medium for code contributions. svn is such a pain once you tasted a
decentralized tool like git or hg.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2011-12-03 11:26:52 UTC
Permalink
Post by Olivier Grisel
Also on a more trivial perspective, I like working on github using
pull-request based reviews as the main inter-developer communication
medium for code contributions. svn is such a pain once you tasted a
decentralized tool like git or hg.
+1! github makes contributing to open-source software much more fun!

Mathieu
Denis Kochedykov
2011-12-04 06:49:30 UTC
Permalink
Hi Olivier,

Thanks for comments!

So, summarizing, sklearn versus Orange is:
- use plain arrays instead of classes for storing data-sets, features, etc
- use BSD rather than GPL license
- no framework, plain library of methods

If I got it right, it seems like creating sklearn was not a question of
Orange's quality/usability, but more a question of a different
development style/community.
That is, for users who aren't going to sell their software (which is
not permitted by the GPL), there is not much difference?
Of course, convenience for developers and simplicity mean a more viable
library in the long term.

Denis.
Post by Olivier Grisel
Hi Denis,
- scikit-learn is a scikit (scientific python toolkit): it is meant to
be used by the scipy community and to play by its tacit rules: the
primary data structure is the plain old numpy array (or scipy.sparse
matrix): no machine learning specific classes for samples, features,
datasets...
- scikit-learn only has dependencies with non-viral open source
licenses (python, numpy, scipy and joblib are all BSD-like): hence
scikit-learn is BSD-like as well, to play fair in this permissive
ecosystem (being able to copy and paste any function or module of the
scikit-learn source code anywhere else is perfectly OK)
- scikit-learn focuses on implementing machine learning with as little
framework code as possible and lets other framework-oriented projects
reuse some of scikit-learn's modules if they want to do so, e.g. to
build a data-mining GUI.
Other scikit-learn contributors might have their own reasons to
contribute to scikit-learn rather than Orange.
Also on a more trivial perspective, I like working on github using
pull-request based reviews as the main inter-developer communication
medium for code contributions. svn is such a pain once you tasted a
decentralized tool like git or hg.
b***@gmail.com
2011-12-04 08:57:51 UTC
Permalink
Hi Denis,

My main motivation is mostly usability. In terms of development though, I've only really worked on decision trees, so my comments are heavily influenced by that experience.
Here are the three main reasons why I use scikit-learn:

Simplicity (taking the cue from Olivier). If you've seen how difficult it is to get your dataset into Orange's format, you will appreciate any package that operates directly on numpy arrays.

Speed. The decision tree implementation of Orange takes about 25 seconds to train on the Madelon dataset, whereas the optimised version of scikit-learn takes well under a second. I can't really comment on other algorithms though.

Readability. Algorithms implemented in scikit-learn are meant to be easily understood, to the point where anyone with enough knowledge of the algorithm should be able to go in and make changes if they wish. I like to think of it as executable pseudocode.

These are the main reasons why I use it, but the other ones mentioned (distributed code, licensing) are important too.

Regards
Brian
Denis Kochedykov
2011-12-04 12:58:13 UTC
Permalink
Hi Brian,

Thanks, all points are quite important for me (for most users, I think).
Performance problems are surprising, considering Orange is mainly C++.

Denis.
Post by b***@gmail.com
Hi Denis,
My main motivation is mostly usability. In terms of development though, I've only really worked on decision trees, so my comments are heavily influenced by that experience.
Simplicity (taking the cue from Olivier). If you've seen how difficult it is to get your dataset into Orange's format, you will appreciate any package that operates directly on numpy arrays.
Speed. The decision tree implementation of Orange takes about 25 seconds to train on the Madelon dataset, whereas the optimised version of scikit-learn takes well under a second. I can't really comment on other algorithms though.
Readability. Algorithms implemented in scikit-learn are meant to be easily understood, to the point where anyone with enough knowledge of the algorithm should be able to go in and make changes if they wish. I like to think of it as executable pseudocode.
These are the main reasons why I use it, but the other ones mentioned (distributed code, licensing) are important too.
Regards
Brian
Gael Varoquaux
2011-12-05 06:51:12 UTC
Permalink
Post by Denis Kochedykov
Thanks, all points are quite important for me (for most users, I think).
Performance problems are surprising, considering Orange is mainly C++.
It's the algorithm that counts, more than the language.

G

Olivier Grisel
2011-12-04 09:48:51 UTC
Permalink
Post by Denis Kochedykov
Hi Olivier,
Thanks for comments!
- use plain arrays instead of classes for storing data-sets, features, etc
- use BSD rather than GPL license
- no framework, plain library of methods
If I got it right, it seems like creating sklearn was not a question of
Orange's quality/usability, but more a question of a different
development style/community.
That is, for users who aren't going to sell their software (which is
not permitted by the GPL)
You are allowed to sell GPL software (although it does not make much
sense, since users should be able to build it from source for free),
but you are not allowed to embed it in a non-GPL-compatible product (a
proprietary one, for instance).
Post by Denis Kochedykov
, there is not much difference?
I don't really know Orange but I think it's indeed pretty similar in
scope to what sklearn provides if you ignore the aforementioned 3
points.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Denis Kochedykov
2011-12-04 13:06:05 UTC
Permalink
Hi Olivier,
Post by Olivier Grisel
I don't really know Orange but I think it's indeed pretty similar in
scope to what sklearn provides if you ignore the aforementioned 3 points.
Definitely not ignoring them :) Some points are important for some
users, others are important for other users.
Performance/stability/transparency look important to me, while I'm fine
with using wrapper classes for data-sets, etc. (I don't even have to
memorize anything, thanks to Python's introspection), and the presence
of a framework looks like a plus to me.
A restrictive license looks like a problem, but if I don't license my
software, it's probably not important whether the embedded lib is BSD
or GPL (just a guess).

Denis.
David Warde-Farley
2011-12-04 09:50:35 UTC
Permalink
Post by Denis Kochedykov
Hi Olivier,
Thanks for comments!
- use plain arrays instead of classes for storing data-sets, features, etc
- use BSD rather than GPL license
- no framework, plain library of methods
If I got it right, it seems like creating sklearn was not a question of
Orange's quality/usability, but more a question of a different
development style/community.
That is, for users who aren't going to sell their software (which is
not permitted by the GPL), there is not much difference?
The GPL does not prohibit you from selling your software. The only
stipulation is that anyone who receives the software in binary form must also
receive the source code, and a copy of the license.

*That* person is then free to redistribute the software under the terms of
the GPL, including giving it away for free. So while you are free to sell
software, those you sell it to are free to give it away, and so forth.

That, and the GPL is viral, so the moment you import a GPLed library or
copy and paste a snippet of GPLed code, your entire project becomes a
"derivative work" as far as the FSF is concerned (and possibly under the
copyright law of some countries; AFAIK it has never been litigated).
Post by Denis Kochedykov
Of course, convenience for developers and simplicity means more viable
library in a long term.
When I last tried out Orange, it was very much a C++ library trying and
failing to masquerade as a Python library. The API was complicated and
prosaic, it didn't build very easily, and it was prone to hard crashes that
brought the interpreter down in flames. I don't know if things have improved
since then (this was probably 2008ish).

I've since moved into mostly dabbling in the kinds of algorithms that neither
scikit-learn nor Orange implement, but when I do require the use of an
off-the-shelf algorithm, I greatly prefer scikit-learn's approach to APIs
because, as a seasoned NumPy user, there's very little else I need to grasp
in order to use it. I don't need to spend half a day piecing together
somebody's notions of how best to decompose a learning task into a 30-piece
C++ class hierarchy: I look up the class I'm interested in, look at the
docstring for __init__() and fit(), and I'm done.
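
For instance, a throwaway sketch of that workflow (the estimator and toy
data below are arbitrary choices, purely for illustration):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(200, 5)             # samples: a plain 2-D numpy array
y = (X[:, 0] > 0.5).astype(int)  # labels: a plain 1-D numpy array

# SVC.__init__'s docstring lists the constructor parameters
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)                    # the fit(X, y) convention shared by the estimators
print(clf.predict(X[:5]))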

David
Denis Kochedykov
2011-12-04 13:16:56 UTC
Permalink
Hi David,

Thanks, very good points. That is:

1. C++ rather than Python (in fact, this looks like a plus to me -
performance, universality, etc.)
2. A complicated and inconvenient class structure and API in Orange
3. Instability(?)

I think I've heard enough good reasons to use sklearn :)

I asked here because I wasn't able to find such a comparison anywhere else.

Denis.
Post by David Warde-Farley
When I last tried out Orange, it was very much a C++ library trying and
failing to masquerade as a Python library. The API was complicated and
prosaic, it didn't build very easily, and it was prone to hard crashes that
brought the interpreter down in flames. I don't know if things have improved
since then (this was probably 2008ish).
I've since moved into mostly dabbling in the kinds of algorithms that neither
scikit-learn nor Orange implement, but when I do require the use of an
off-the-shelf algorithm, I greatly prefer scikit-learn's approach to APIs
because, as a seasoned NumPy user, there's very little else I need to grasp
in order to use it. I don't need to spend half a day piecing together
somebody's notions of how best to decompose a learning task into a 30-piece
C++ class hierarchy: I look up the class I'm interested in, look at the
docstring for __init__() and fit(), and I'm done.
David
David Warde-Farley
2011-12-04 13:41:04 UTC
Permalink
Post by Denis Kochedykov
Hi David,
Thanks, very good points. That is:
1. C++ rather than Python (in fact, this looks like a plus to me -
performance, universality, etc.)
I agree from the perspective of universality, but beware of the trap of
making speed generalizations about languages. A lot of the speed-critical
parts of sklearn are quite heavily optimized in Cython. I recall that their
coordinate descent (for generalized linear models) implementation compares
quite favourably against a widely used and cleverly written Fortran
implementation. Sounds like Brian has found the decision tree implementation
to be quite speedy as well.

Suffice it to say, it's possible to write quite fast Python code (and in my
experience, almost always possible to achieve C-like speeds with a dash of
Cython), and it's also possible to really drop the ball and write very slow
C/C++ code.
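
If you care about a specific case, a crude timing sketch along these
lines is usually enough to check for yourself rather than generalize
from the language (the synthetic dataset and the Lasso estimator are
arbitrary illustrative choices; the numbers will of course vary by
machine and version):

import time
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# a synthetic regression problem, just to have something to time
X, y = make_regression(n_samples=5000, n_features=500, random_state=0)

t0 = time.time()
Lasso(alpha=0.1).fit(X, y)
print("fit took %.3f s" % (time.time() - t0))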

David
Olivier Grisel
2011-12-04 16:28:42 UTC
Permalink
Post by David Warde-Farley
Post by Denis Kochedykov
Hi David,
Thanks, very good points. That is:
1. C++ rather than Python (in fact, this looks like a plus to me -
performance, universality, etc.)
I agree from the perspective of universality, but beware of the trap of
making speed generalizations about languages. A lot of the speed-critical
parts of sklearn are quite heavily optimized in Cython. I recall that their
coordinate descent (for generalized linear models) implementation compares
quite favourably against a widely used and cleverly written Fortran
implementation.
It depends on the data. The version in sklearn does not have a number
of important optimizations found in glmnet (an R frontend with a
Fortran backend) that can be critical for some n_informative /
n_features and n_features / n_samples ratios (I don't remember exactly
how; correlations between informative features might have an impact on
the convergence speed too).
Post by David Warde-Farley
Sounds like Brian has found the decision tree implementation
to be quite speedy as well.
The same remark applies here: the regression random forest is still
significantly slower in sklearn than in R's GBM. See the ongoing work
here:

https://github.com/scikit-learn/scikit-learn/pull/448
Post by David Warde-Farley
Suffice it to say, it's possible to write quite fast Python code (and in my
experience, almost always possible to achieve C-like speeds with a dash of
Cython), and it's also possible to really drop the ball and write very slow
C/C++ code.
Indeed, speed cannot be inferred from the implementation language: the
algorithm, the default parameters and the implementation are much more
important. All three vary from one module to another in sklearn and in
other libs.

If you want hard numbers on a specific task, I would suggest playing
with http://scikit-learn.github.com/ml-benchmarks/ and adding your own
dataset and library to it if they are not already represented.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel