- let the community (to put zero additional burden on the current maintainers)
maintain a fork of scikit-learn that provides no guarantees other than that it
is kept up to date with scikit-learn/master.
The problem with this is that our tracker would still fill up with problems
that are related to the fork, and not to master. To put things in
perspective, our tracker has 336 issues open and 1318 closed. Just keeping
track of those issues is very hard.
Thus the need for a different repo (e.g. scikit-learn-contrib, as suggested).
- people are welcome to add any algorithms to this (trivial, non-trivial, etc.)
What you are suggesting is very similar to things that have been tried
before, for instance the 'sandbox' in scipy. Experience has shown that such
code rots, because nobody feels responsible for it. It's been tried, it
failed; but if you feel like doing it, you should go ahead. Do you need
anything from us?
I would believe more in separate repos in a 'scikit-learn-contrib' github
organization, because it would give a feeling of responsibility to the
different owners of the repos.
- folks don't have to recreate packaging
I don't understand: if there are releases, and packaging, someone has to
do it. It doesn't happen just like this. It's actually a lot of work.
If it's just a fork, without any releases, what's the gain? In addition,
if somebody is not doing the work of making sure that it builds and runs
on various platforms, it will quite quickly stop working on different
versions of Python and on different platforms.
- it brings all the folks who are forking anyway together instead of splitting
off into forks (multiple forks are harder to use)
But someone has to make the merges :). So the work is there.
- it makes for increased availability of algorithms that may be useful in
practice but never make it out because the world is biased towards
Probably, provided that the project actually flies. But I really fear
code rot. The amount of work to keep the scikit-learn project going is
just huge. If nobody is doing this work, code rot would set in very
quickly.
- it doesn't add anything to the current maintainers' plates, nor take away
anything from the main project. Perhaps those wishing to add things will take
it upon themselves to maintain this fork.
As long as it is called differently, and _has a different import name_.
If not, I can easily foresee the situation where users complain about
scikit-learn and, after a long debugging session, we find that they are
running some weird fork.
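To make the point about import names concrete, here is a hedged sketch (the name `sklearn_fork` is purely hypothetical, not a real package): with a distinct top-level package, a debugging session can tell at a glance which project a user is actually running.

```python
# Hypothetical sketch: telling a fork apart from scikit-learn by import name.
# "sklearn_fork" is an assumed name for illustration, not a real package.
import importlib.util


def which_project():
    """Report which of the two candidate packages is importable, if any."""
    if importlib.util.find_spec("sklearn_fork") is not None:
        return "fork"
    if importlib.util.find_spec("sklearn") is not None:
        return "scikit-learn"
    return "neither"


print(which_project())
```

If the fork instead reused the `sklearn` import name, this check would be impossible, and every fork bug would look like a scikit-learn bug.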
I think that there is something flawed in the way you see the life of a
project like scikit-learn. You seem to think that it is just an
accumulation of code. That putting code together is enough to make a
project successful. But if that's the case, why don't you just create
something else, just anything else, and accumulate code? More
importantly, why do you want algorithms in scikit-learn? Why aren't you
happy with just code on the Internet that you can download? If you ask
yourself these questions, you will probably find where the value of
scikit-learn lies, and this will also tell you why there is a huge effort
in maintaining scikit-learn.
Things like this, e.g. sandboxes where there is no feeling of belonging to
a global project and no harmonizing effort, have been tried in the past.
They fail because of code rot. Actually, to put things in historical
perspective: a long time ago, there was a scipy 'sandbox' in the scipy SVN.
It didn't have much working code, mostly dead code. We hypothesized that
this was because of lack of visibility, so the 'sandbox' was cleaned up,
separated into some structure, and renamed 'scikits'. The scikits weren't
getting much traction inside the scipy codebase, because people were having
a hard time working there (back then it was an SVN, and there was also the
problem of compiling scipy, which is a bit hard). So we started pulling
things out of the SVN. And that's how the current scikits were born. Some
of these scikits took off, because they had clear project management:
releases.
It's interesting that almost ten years later, we are running into the same
problems. I think that this is not by chance. These evolutions happen for
the following reasons:
1. Projects are non-linearly hard to evolve. Bigger projects are
significantly harder to drive than small projects. This is a very real
law of project management, and it is underestimated by too many.
2. People want different things, and that's perfectly legitimate. The
statsmodels guys wanted control over p-values. The scikit-learn guys
wanted good prediction. Both use cases are valid (I am an avid user of
statsmodels), but doing both in the same project was much, much harder
than doing two projects.
Thus I think it is natural that an ecosystem of different projects, from
general to specific, takes shape. Yes, it's very important to keep the big
picture in mind, and for people with close-enough goals to unite, but only
in balance with point 1.
By the way, I care very much about the ecosystem. When we split off the
HMMs, I spent half a day making them a separate package, with a setup.py,
travis, a README, examples, and documentation. It took a good 4 hours.
Nothing happens for free. I did this even though I do not use HMMs at all.
In terms of action points, to summarize my position:
- You are free to create a fork. I strongly ask that you change the
import name; otherwise you will be putting a burden on the main project.
- What I think could work would be a scikit-learn-contrib organization with
different repositories in it. I see that Matthieu and Andy have the same
feeling. I think we all agree that it should be done. I am ready to
create the organization, and give you (and many others) the keys to it.
This has actually been studied. Here is one paper (out of probably many):