Discussion:
Does anybody work on the adaptive affinity propagation clustering algorithm?
Илья Патрушев
2014-12-02 14:06:25 UTC
Hi everybody,

As far as I am aware, there is no implementation of the adaptive affinity
propagation clustering algorithm in either the stable or the development
version of sklearn.
I have recently implemented the adaptive affinity propagation algorithm as
part of my image analysis project. I based my implementation on the paper
by Wang et al. (2007), their Matlab code, and sklearn's affinity
propagation implementation. It is not an exact port of the Matlab code,
since I slightly modified Wang's approach to handling oscillations and
added an optional upper limit on the number of clusters.
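To give a rough idea of the interface, here is an illustrative sketch only
(not the code I intend to submit): the class name and parameters such as
max_clusters are placeholders, and the real implementation adapts the
preference and damping following Wang et al. rather than scanning a fixed
grid.

import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances

class AdaptiveAffinityPropagation(BaseEstimator, ClusterMixin):
    """Illustrative sketch: scan the AP preference, keep the best solution."""

    def __init__(self, damping=0.9, max_iter=500, n_preferences=10,
                 max_clusters=None):
        self.damping = damping
        self.max_iter = max_iter
        self.n_preferences = n_preferences
        self.max_clusters = max_clusters

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # AP uses negative squared Euclidean distances as similarities; the
        # median is sklearn's default preference, so scan values around it.
        median_pref = np.median(-euclidean_distances(X, squared=True))
        best_score = -np.inf
        self.labels_ = np.zeros(X.shape[0], dtype=int)
        for factor in np.linspace(0.1, 2.0, self.n_preferences):
            ap = AffinityPropagation(damping=self.damping,
                                     max_iter=self.max_iter,
                                     preference=factor * median_pref)
            labels = ap.fit_predict(X)
            n_clusters = len(np.unique(labels))
            if n_clusters < 2 or n_clusters >= X.shape[0]:
                continue  # silhouette score is undefined in these cases
            if self.max_clusters is not None and n_clusters > self.max_clusters:
                continue
            score = silhouette_score(X, labels)
            if score > best_score:
                best_score, self.labels_ = score, labels
        return self

The point is simply that it drops in wherever AffinityPropagation is used,
exposing the usual fit / labels_ interface.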
I am planning to submit the code to sklearn eventually, so please let me
know if anybody is already working on this algorithm; we could join our
efforts and save some time.

Best wishes,
ilya.
--
Ilya Patrushev, PhD.
MRC National Institute for Medical Research
The Ridgeway
Mill Hill
London NW7 1AA
UK
Tel: 0208 816 2656
Fax: 0208 906 4477
Andy
2014-12-02 14:34:39 UTC
Hi Ilya.

Thanks for your interest in contributing.
I am not an expert in affinity propagation, so it would be great if you
could give some details of what the advantage of the method is.
The reference paper seems to be an arXiv preprint with 88 citations,
which would probably not qualify it for inclusion in scikit-learn;
see the FAQ:
http://scikit-learn.org/dev/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published

It might be a candidate for an external experimental / contribution
project, an idea that has been floating around for a while.

Cheers,
Andy
Tom Fawcett
2014-12-03 04:32:45 UTC
Post by Andy
Hi Ilya.
Thanks for your interest in contributing.
I am not expert in affinity propagation, so it would be great if you could give some details of what the advantage of the method is.
The reference paper seems to be an arxiv preprint with 88 citations, which would probably not qualify for inclusion in scikit-learn,
see the FAQ http://scikit-learn.org/dev/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published
Wow, I had not seen this FAQ. “As a rule we only add well-established algorithms. A rule of thumb is at least 3 years since publications, 1000+ cites and wide use and usefullness.” I was intending to contribute a rule learning system to scikit-learn, and/or descriptive learning methods. I guess those are both right out. I thought scikit-learn would welcome some variety, but 1000+ cites (sic) and wide use pretty much rule out anything but statistical learning. Among symbolic methods there is only one rather mediocre decision tree induction method.

Anyone know of another python framework that’s a little more welcoming?

-Tom
Gael Varoquaux
2014-12-03 06:05:22 UTC
Anyone know of another python framework that’s a little more welcoming?
Well, packages need a decision rule to filter out the massive number of
published algorithms, and implementing and maintaining the complete
literature isn't possible. Other packages may have other agendas, such as
implementing a specific part of the literature, or the papers by the
authors of the packages :).

I don't know of any that would be specifically open to symbolic methods,
but maybe there is one that I don't know of.

Gaël
Tom Fawcett
2014-12-03 08:12:41 UTC
The bottom line is that you or anyone else is welcome to fork the project and be as welcoming as you like. But the project thrives on the basis that it is well-contained and well-maintained, and that simply can't be assured of a project without restrictive criteria for inclusion.
I think this is the crux of what I don’t understand. You seem to view scikit-learn like the core Python library, which must be carefully curated because it’s basically an extension of the language. There’s usually only one official core library package for a given task, so it’s supported and its quality is guaranteed.

From my use of scikit-learn I view it more as a CRAN or CPAN (or PyPI) ecosystem: it’s a fairly loose framework supporting many plug-in modules of varying quality. There are many alternatives for a given task, so it’s much more of a pick-and-choose ensemble. That’s why I was surprised by the FAQ answer about contributions. It seems to me contributed modules should pass tests and respect the basic API structure. Beyond that I don’t see why scikit-learn imposes popularity thresholds on contributions.

But I didn’t come here to argue. I respect the immense work that’s gone into the project, and if that’s the way it’s run, so be it.

Regards,
-Tom
Gael Varoquaux
2014-12-03 08:28:42 UTC
Post by Tom Fawcett
From my use of scikit-learn I view it more as a CRAN or CPAN (or PyPi)
That's because you are not on the receiving end when there is a problem
with something coded in scikit-learn. The equivalent of CRAN is PyPI:
there are different package maintainers for each package.
Post by Tom Fawcett
it’s a fairly loose framework supporting many plug-in modules of
varying quality.
We would like it to be tight, and of high quality.
Post by Tom Fawcett
There are many alternatives for a given task so it’s much more of a
pick-and-choose ensemble.
I agree here, but we are bandwidth-limited, and for lack of time we need
to include only the most important alternatives.
Post by Tom Fawcett
Beyond that I don’t see why scikit-learn imposes popularity thresholds
on contributions.
Because code is a maintenance cost, and we need to balance the amount of
code we have against the size of the team (add to this the fact that
complexity scales non-linearly with the number of features).
Post by Tom Fawcett
But I didn’t come here to argue.
Sure. Thanks a lot for understanding. I am just trying to explain the
reasons behind this.

Cheers,

Gaël
federico vaggi
2014-12-03 08:30:40 UTC
I think the crux is this:

*From my use of scikit-learn I view it more as a CRAN or CPAN (or PyPi)
ecosystem: it’s a fairly loose framework supporting many plug-in modules of
varying quality.*

scikit-learn is not itself an ecosystem - it is a single package within the
ecosystem, and the leaders of the project try really hard to keep it
coherent. For example, the HMM module was recently split off because it
was really hard to shoehorn it into the default sklearn API.

As the surface area of scikit-learn grows, the maintenance costs grow, and
keeping everything high quality becomes much harder. There is a very, very
serious trade-off between versatility and quality, and sklearn errs on the
side of the latter.
Jacob Vanderplas
2014-12-03 08:35:17 UTC
Hi Tom,
If I might add a suggestion: I think it would be great if you developed
these ideas in a separate repository, made the API compatible with
scikit-learn, and released the code on PyPI. Then it will be out there and
available to anyone who wants to use it. That's what I've ended up doing at
times with algorithms that don't quite fit within a current package in the
ecosystem.
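For what it's worth, the packaging side is lightweight. A minimal setup.py
along the following lines is enough to get something pip-installable; the
project name and dependency list below are placeholders, not a
prescription:

from setuptools import setup, find_packages

setup(
    name="sklearn-rulelearn",  # placeholder name for the external package
    version="0.1.0",
    description="Rule-learning estimators with a scikit-learn-compatible API",
    packages=find_packages(),
    install_requires=["numpy", "scipy", "scikit-learn"],
)

From there, a PyPI upload makes it installable alongside scikit-learn
without touching the main project at all.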
Good luck,
Jake

Jake VanderPlas
Director of Research – Physical Sciences
eScience Institute, University of Washington
http://www.vanderplas.com
Joel Nothman
2014-12-03 07:09:47 UTC
Hi Tom,

Anyone is welcome to publish their implementations in a format compatible
with scikit-learn's estimators. However, the centralised project already
takes a vast amount of work (almost all of it unpaid) to maintain, even
while adopting a very restrictive scope. Incorporating less-established
techniques provides marginal benefit for huge costs, exacerbating the
potential for code rot and maintainer exhaustion.

That being said, the rule of thumb is only a rule of thumb. Counting
citations for a technique is not always straightforward, and it's often
practical to implement a recent variant of a well-established technique.
For example, LSH Forest is being adopted as a practical (in terms of free
parameters) variant of Locality Sensitive Hashing, although the LSH Forest
technique has only received <200 citations in 9 years. Even this is done at
some risk of the technique being superseded in the immediate term.

I'm not certain what qualifies as rule learning. But the 2000+ citations of
Liu et al.'s (1998) "Integrating classification and association rule mining"
suggest that this technique, or perhaps a recent variant, would be welcome.

Perhaps scikit-learn needs to strengthen and formalise its support for
external related projects that adopt its API design to implement less
established techniques. The listing at
https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets
lacks glamour, and could be easier to find and navigate.

The bottom line is that you or anyone else is welcome to fork the project
and be as welcoming as you like. But the project thrives on the basis that
it is well-contained and well-maintained, and that simply can't be assured
of a project without restrictive criteria for inclusion.
Mathieu Blondel
2014-12-03 09:04:58 UTC
Post by Joel Nothman
Hi Tom,
Anyone is welcome to publish their implementations in a format compatible
with scikit-learn's estimators. However, the centralised project already
takes a vast amount of work (almost all of it unpaid) to maintain, even
while adopting a very restrictive scope. Incorporating less-established
techniques provides marginal benefit for huge costs, exacerbating the
potential for code rot and maintainer exhaustion.
That being said, the rule of thumb is only a rule of thumb. Counting
citations for a technique is not always straightforward, and it's often
practical to implement a recent variant of a well-established technique.
For example, LSH Forest is being adopted as a practical (in terms of free
parameters) variant of Locality Sensitive Hashing, although the LSH Forest
technique has only received <200 citations in 9 years. Even this is done at
some risk of the technique being superseded in the immediate term.
I think 1000 citations is a bit too much to ask. We should probably update
the FAQ with something more reasonable, say 200 citations. That said,
I agree that the citation threshold is just an indicator. For example, SAG
and AdaGrad, which are being considered for inclusion, have around 75
and 250 citations currently.
Post by Joel Nothman
I'm not certain what qualifies as rule learning. But the 2000+ citations
of Liu et al's (1998) "Integrating classification and association rule
mining" suggest that this technique or perhaps a recent variant would be
welcome.
Perhaps scikit-learn needs to strengthen and formalise its support for
external related projects that adopt its API design to implement less
established techniques. The listing at
https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets
lacks glamour, and could be easier to find and navigate.
+1

We need to bring this page to the main documentation and make it more sexy.

M.
Gael Varoquaux
2014-12-03 09:07:39 UTC
Post by Mathieu Blondel
I think 1000 citations is a bit too much to ask. We should probably
update the FAQ with something more reasonable, like say 200 citations.
That said, I agree that the citation threshold is just an indicator.
For example, SAG and AdaGrad, which are being considered for
inclusion, have around 75 and 250 citations currently.
I agree. We should amend this sentence to say that if the paper is a
clear-cut improvement on top of a widely used method, it should be
examined.
Post by Mathieu Blondel
Perhaps scikit-learn needs to strengthen and formalise its support
for external related projects that adopt its API design to
implement less established techniques. The listing
at https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets lacks
glamour, and could be easier to find and navigate.
+1
+1
Post by Mathieu Blondel
We need to bring this page to the main documentation and make it more sexy.
Good with me.

G
Joel Nothman
2014-12-03 10:25:39 UTC
Post by Gael Varoquaux
I agree. We should amend this sentence to say that if the paper is a
clear-cut improvement on top of a widely used method, it should be
examined.
Done <http://scikit-learn.org/dev/faq.html>.
Satrajit Ghosh
2014-12-03 14:56:55 UTC
hi folks,

since this comes up from time to time and i completely understand the
needed focus and limited resources within scikit-learn, how about the
following approach:

- let the community (to put zero additional burden on the current
maintainers) maintain a fork of scikit-learn that provides no guarantees
other than that it is kept up to date with scikit-learn/master.
- people are welcome to add any algorithms to this (trivial, non-trivial,
recent)
- if things prove useful within this branch/fork/labs they can be
incorporated into the mainstream through the current standard PR mechanism

people will use it at their own discretion, but what it would allow is for
people to have a single place within which to toy with things while still
maintaining the core benefits of scikit-learn.

with the different kinds of data (types and sizes) coming online, algorithm
development has gone in many different directions. some variants focus on
speed/hardware, others on generalizability, yet others on domain-specific
apps, etc. what works in one domain/app may completely fail in another.

the hope here is that this fork would let interested people toy with this
developmental ecosystem as opposed to the stable, maintained ecosystem. the
key advantages of having a fork are that:
- folks don't have to recreate packaging
- it brings all the folks who are forking anyway together instead of
splitting off into forks (multiple forks are harder to use)
- it makes for increased availability of algorithms that may be useful in
practice but never make it out because the world is biased towards
loudspeakers
- it doesn't add anything to the current maintainers' plates, nor take away
anything from the main project. perhaps those wishing to add things will
take it upon themselves to maintain this fork.
- if you find that more people are using this fork rather than the
mainstream, that might tell you something about the current culture of
science and engineering in practice.
- fixes might come into this fork that can then be incorporated into
master, because more people end up toying within it
- if this fork goes bust, nobody cares.

you could even call the fork:

scikit-learn-minefield
scikit-learn-teenage-mutants
...
scikit-learn-labs

cheers,

satra
Mathieu Blondel
2014-12-03 15:26:41 UTC
Hi Satra,

I can't find the link but there was a discussion some time ago about
creating a scikit-learn-contrib organization on github for this purpose.

Two differences with what you suggest:
1) this wouldn't be a fork, i.e., the intersection with scikit-learn would
be empty
2) we were thinking of creating repositories for different sub-topics
(multilabel classification, kernel approximations, etc.)

Option 2) might require too much work in terms of making releases, so a
single global scikit-learn-contrib repository might be more realistic.

scikit-learn-contrib would have its own website
http://contrib.scikit-learn.org.

There would still be some work involved for minimal reviewing and
releasing, though.

Mathieu
Mathieu Blondel
2014-12-03 15:31:02 UTC
As you mentioned, popular methods from scikit-learn-contrib could be
promoted to scikit-learn.

Conversely, methods that become obsolete in scikit-learn could move to
scikit-learn-contrib to lower the maintenance burden.

Mathieu
Andy
2014-12-03 16:08:53 UTC
I really want to push this approach, and hope I have time to establish
it early next year.
And I think that a zero-intersection approach would be better than a
fork, as it avoids incompatible changes.
Gael Varoquaux
2014-12-03 16:18:47 UTC
- let the community (to put zero additional burden on the current maintainers)
maintain a fork of scikit-learn that provides no guarantees other than that
it is kept up to date with scikit-learn/master.
The problem with this is that we are still going to have our tracker
filled with problems that are related to the fork, and not master. To put
things in perspective, our tracker has 336 issues open, and 1318 closed.
Just keeping track of those issues is very hard.

Thus the need for a different repo (e.g. scikit-learn-contrib, as suggested
by Mathieu).
- people are welcome to add any algorithms to this (trivial, non-trivial,
recent)
What you are suggesting is very similar to things that have been tried as
a 'sandbox', for instance in scipy. Experience has shown that the code
rots, because nobody feels responsible for it. It's been tried, and it
fails, but if you feel like doing it, you should go ahead. Do you need
anything from us?

I would believe more in separate repos in a 'scikit-learn-contrib' github
organization, because it would give a feeling of responsibility to the
different owners of the repos.
- folks don't have to recreate packaging
I don't understand: if there are releases, and packaging, someone has to
do it. It doesn't just happen by itself. It's actually a lot of work.

If it's just a fork, without any releases, what's the gain? In addition,
if somebody is not doing the work of making sure that it builds and runs
on various platforms, quite quickly it will stop working on different
versions of Python and different platforms.
- it brings all the folks who are forking anyway together instead of splitting
off into forks (multiple forks are harder to use)
But someone has to be making the merges :). So the work is there.
- it makes for increased availability of algorithms that may be useful in
practice but never makes it out because the world is biased towards
loudspeakers
Probably, provided that the project actually flies. But I really fear
code rot. The amount of work to keep the scikit-learn project going is
just huge. If nobody is doing this work, code rot would set in very
quickly.
- it doesn't add anything to the current maintainers plates, nor take away
anything from the main project. perhaps those wishing to add things will take
it upon themselves to maintain this fork.
As long as it is called differently, and _has a different import name_.
If not, I can easily forecast the situation where users complain about
scikit-learn and, after a long debugging session, we find that they are
running some weird fork.


I think that there is something flawed in the way you see the life of a
project like scikit-learn. You seem to think that it is just an
accumulation of code, and that putting code together is enough to make a
project successful. But if that's the case, why don't you just create
something else, just anything else, and accumulate code? More
importantly, why do you want algorithms in scikit-learn? Why aren't you
happy with just code on the Internet that you can download? If you ask
yourself these questions, you will probably find where the value of
scikit-learn lies, and this will also tell you why there is a huge effort
in maintaining scikit-learn.


Things like this, e.g. sandboxes where there is no feeling of belonging to
a global project and no harmonizing effort, have been tried in the past.
They fail because of code rot. Actually, to put this in historical
perspective, a long time ago there was a scipy 'sandbox' in the scipy SVN.
It didn't have much working code, mostly dead code. We hypothesized that
it was because of a lack of visibility, so the 'sandbox' was cleaned,
separated into some structure, and renamed 'scikits'. Scikits weren't
getting much traction inside the scipy codebase, because people were
having a hard time working there (back then it was an SVN, but there was
also the problem of compiling scipy, which is a bit hard). So we started
pulling things out of the SVN. And that's how the current scikits were
born. Some of these scikits took off, because they had clear project
management: releases, documentation, quality.

It's interesting that almost ten years later, we are running into the same
problems. I think that this is not by chance. The reasons that these
evolutions happen are the following:

1. Projects are non-linearly hard to evolve. Bigger projects are harder to
   drive than small projects, and significantly so. This is a very true law
   of project management and is really underestimated by too many [1].

2. People want different things, and that's perfectly legitimate. The
   statsmodels guys wanted control over p-values. The scikit-learn guys
   wanted good prediction. Both use cases are valid (I am an avid user of
   statsmodels), but doing both in the same project was much, much harder
   than doing two projects.

Thus I think that it is natural that some ecosystem of different
projects, from general to specific, takes shape. Yes, it's very important
to keep the big picture in mind, and for people with close enough goals to
unite, but only in balance with point 1.

By the way, I care very much about the ecosystem. When we split off HMMs,
I spent half a day making them a separate package, with a setup.py, travis,
a README, examples, and documentation:
https://github.com/hmmlearn
It took a good 4 hours. Nothing happens for free. I did this even though I
do not use HMMs at all.


In terms of action points, to summarize my position:

- You are free to create a fork. I strongly ask that you change the
  import name, otherwise you will be putting a burden on the main
  scikit-learn maintainers.

- What I think could work would be a scikit-learn-contrib organization with
  different repositories in it. I see that Mathieu and Andy have the same
  feeling. I think we all agree that it should be done. I am ready to
  create the organization, and give you (and many others) the keys of the
  kingdom.

Gaël


[1] This has actually been studied. Here is one paper (out of probably
many): http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1702600
Joel Nothman
2014-12-03 23:55:13 UTC
While anything is better than publishing an extended fork of the main
repository, I would like to see someone cite an instance where an
open-slather contrib repository has been particularly successful
(especially one where diverse contributions are assured). In line with
Gaël's experience of sandbox code rot, I think it provides very little
benefit over distributed open-source repositories.

For example, let's say someone has implemented an algorithm (Affinity
Propagation is what triggered this discussion so you might consider that).
Someone else wants to come and add features to it, or even just clean the
code, but by this time the original contributor has moved on to greener
pastures and is not interested in responding to a pull request. Who has the
right, and who the responsibility, to say that this change should be
allowed? Does the contrib repository, too, require an army of maintainers
to familiarise themselves with a vast collection of moderate-quality code?
Without strict gatekeepers, a centralised repository provides almost
nothing, and with strict gatekeepers it entails exactly the issue that we
are trying to solve.

The model of a distributed plugin library (think Django) seems much more
successful when diversity and changing/variant needs are inevitable. Each
contribution is published individually on PyPI and/or open-source hosting,
and someone curates or facilitates a centralised library (like
djangopackages.com). When a contributor doesn't want to maintain anymore,
the project is forked; and the fittest survive.

At the same time, scikit-learn is already trying to facilitate external
contributions:

- it is working towards an estimator verification API
<https://github.com/scikit-learn/scikit-learn/issues/3810> so that it is
easy to test that externally-contributed estimators conform to many
scikit-learn API standards (see the usage sketch after this list).
Contributions to developing this are welcome!
- Gaël has commissioned a sphinx plugin
<https://github.com/sphinx-gallery/sphinx-gallery> to make it easy for
projects to build documentation by example as in scikit-learn's example
gallery <http://scikit-learn.org/stable/auto_examples/>. Perhaps this
could facilitate also displaying external examples in the contrib library
(but only if someone is willing to code up such a feature!).
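To make the first point concrete, the eventual idea is that an external
package's test suite could invoke the common checks with something like the
following. This is only a sketch: the final entry point is still being
discussed in the issue linked above, and scikit-learn's own StandardScaler
stands in here for an externally-contributed estimator.

from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import check_estimator

# Runs the shared API checks (get_params/set_params round-trips,
# fit/transform semantics, input validation, ...) and raises on violations.
check_estimator(StandardScaler())

An external estimator would be passed in the same way, giving it roughly
the same API guarantees as an estimator in the main repository.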

Making a template repository that people can clone to get started writing
an external package might be a nice extension of these ideas. Another idea
would be to have a conventional prefix for packages that extend
scikit-learn (just as django packages tend to be prefixed in PyPI by
django-).

Still, I think facilitating the construction of, and access to, external
projects will be much more wieldy than a centralised contrib repo, and may
even streamline contribution back to the main repository.
Post by Satrajit Ghosh
Post by Satrajit Ghosh
- let the community (to put zero additional burden on the current
maintainers)
Post by Satrajit Ghosh
maintain a fork of scikit-learn that provides no guarantees other than
it is
Post by Satrajit Ghosh
kept upto date with scikit-learn/master.
The problem with this is that we are still going to have our tracker
filled with problems that are related to the fork, and not master. To put
things in perspective, our tracker has 336 issue open, and 1318 closed.
Just keeping track on those issues is very hard.
Thus the need for a different repo (eg scikit-learn-contrib, as suggested
by Mathieu).
Post by Satrajit Ghosh
- people are welcome to add any algorithms to this (trivial, non-trivial,
recent)
What you are suggesting is very similar to things that have been tried as
a 'sandbox' for instance in scipy. Experience has shown that it code
rots, because nobody feels responsible for the code. It's been tried, it
fails, but if you feel like doing it, you should go ahead. Do you need
anything from us?
I would believe more in separate repos in a 'scikit-learn-contrib' github
organization, because it would give a feeling of responsibility to the
different owners of the repos.
Post by Satrajit Ghosh
- folks don't have to recreate packaging
I don't understand: if there are releases, and packaging, someone has to
do it. It doesn't happen just like this. It's actually a lot of work.
If it's just a fork, without any releases, what's the gain? In addition,
if somebody is not doing the work of making sure that it builds and run
on various platforms, quite quickly it will stop working on different
versions of Python and different platforms.
Post by Satrajit Ghosh
- it brings all the folks who are forking anyway together instead of
splitting off into forks (multiple forks are harder to use)
But someone has to be making the merges :). So the work is there.
Post by Satrajit Ghosh
- it makes for increased availability of algorithms that may be useful in
practice but never makes it out because the world is biased towards
loudspeakers
Probably, provided that the project actually flies. But I really fear
coderot. The amount of work to keep the scikit-learn project going is
just huge. If nobody is doing this work, coderot would come in very
quickly.
Post by Satrajit Ghosh
- it doesn't add anything to the current maintainers' plates, nor take away
anything from the main project. perhaps those wishing to add things will
take it upon themselves to maintain this fork.
As long as it is called differently, and _has a different import name_.
If not, I can easily forecast the situation where users are complaining
about scikit-learn and after a long debugging session we find that they
are running some weird fork.
I think that there is something flawed in the way you see the life of a
project like scikit-learn. You seem to think that it is just an
accumulation of code. That putting code together is enough to make a
project successful. But if that's the case, why don't you just create
something else, just anything else, and accumulate code? More
importantly, why do you want algorithms in scikit-learn? Why aren't you
happy with just code on the Internet that you can download? If you ask
yourself these questions, you will probably find where the value of
scikit-learn lies, and this will also tell you why there is a huge effort
in maintaining scikit-learn.
Things like this, eg sandboxes where there is no feeling of belonging to
a global project and no harmonizing effort, have been tried in the past.
They fail because of coderot. Actually, to put this in historical perspective,
a long time ago, there was a scipy 'sandbox', in the scipy SVN. It didn't
have much working, mostly dead code. We hypothesized that it was because
of lack of visibility, so the 'sandbox' was cleaned, separated in some
structure, and renamed 'scikits'. Scikits weren't getting much traction
inside the scipy codebase, because people were having a hard time working
there (back then it was an SVN, but there was also the problem of
compiling scipy, which is a bit hard). So we started pulling things out
of the SVN. And that's how the current scikits were born. Some of these
scikits took off, because they had a clear project management: releases,
documentation, quality.
It's interesting that almost ten years later, we are running into the same
problems. I think that this is not by chance. The reasons these situations
keep recurring are:
1. Projects are non-linearly hard to evolve. Bigger projects are harder to
drive than small projects, and significantly so. This is a very real
law of project management and is really underestimated by too many [1].
2. People want different things, and that's perfectly legitimate. The
statsmodels guys wanted control over p-values. The scikit-learn guys
wanted good prediction. Both use cases are valid (I am an avid user of
statsmodels), but doing both in the same project was much, much harder
than doing two projects.
Thus I think that it is natural that some ecosystem of different
projects, from general to specific, shapes up. Yes, it's very important to
keep the big picture in mind, and that people with close enough needs unite,
but only in balance with point 1.
By the way, I care very much about the ecosystem. When we split off the
HMMs, I spent half a day making them a separate package, with setup.py,
travis, etc.: https://github.com/hmmlearn
It did take a good 4 hours. Nothing happens for free. I did this even
though I do not use HMMs at all.
- You are free to create a fork. I strongly ask that you change the
import name, otherwise you will be putting a burden on the main
scikit-learn maintainers.
- What I think could work would be a scikit-learn-contrib organization with
different repositories in it. I see that Mathieu and Andy have the same
feeling. I think we all agree that it should be done. I am ready to
create the organization, and give you (and many others) the keys to the
kingdom.
Gaël
[1] This has actually been studied. Here is one paper (out of probably
many): http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1702600
Chris Holdgraf
2014-12-04 00:00:25 UTC
Permalink
Not really weighing in on this conversation, but this thread reminded me
just how awesome sklearn is. *So many thanks* to everybody on this list who
helps contribute, you all are awesome :)
--
_____________________________________

PhD Candidate in Neuroscience | UC Berkeley <http://hwni.org/>
Editor and Web Master | Berkeley Science Review
<http://sciencereview.berkeley.edu/>
_____________________________________
Satrajit Ghosh
2014-12-04 01:01:46 UTC
Permalink
hi gael and joel,

i'll insert a short response here. i actually agree with all the things
both of you said. i will however comment on two things:

1. algorithmic scenarios:

a. adding algorithms that can be built directly on top of the scikit-learn api
b. adding algorithms that require refactoring some, but not all, underlying
pieces.

in case a), i could simply have a python script, i don't need a fork, but
in case b), i need a fork.

2. i love decentralization, but the current architecture doesn't allow me
to do the very simple use-case. i want to compare models in scikit-learn to
models outside scikit-learn. what's nice about the api is that it makes
comparing models easy, i can search over various models. however, if i have
to install or merge 5 different scikit-learn forks to be able to compare
those algorithms that are not in scikit-learn, that becomes expensive. if i
could do this in an easier manner, i wouldn't really ask for a common
bleeding-edge repo.
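
to make the use-case concrete, what i mean is roughly the sketch below
(ExternalEstimator stands in for any hypothetical out-of-tree model; the
shared api is what makes the loop trivial):

    # sketch: comparing an in-tree and an out-of-tree classifier under one api.
    from sklearn.datasets import make_classification
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in later releases
    from sklearn.ensemble import RandomForestClassifier
    from some_external_package import ExternalEstimator   # hypothetical

    X, y = make_classification(n_samples=200, random_state=0)
    for model in (RandomForestClassifier(random_state=0), ExternalEstimator()):
        scores = cross_val_score(model, X, y, cv=5)
        print(model.__class__.__name__, scores.mean())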

cheers,

satra
Joel Nothman
2014-12-04 02:18:47 UTC
Permalink
I know what you mean by needing new features or refactoring inside the main
project. I've got a case that requires a more polymorphic definition of
sklearn.base.clone. I think such changes should be possible within the main
repo, and need to be argued by their proponent, with tests documenting
that a feature is required for external projects.
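
For context, clone rebuilds an estimator roughly as
type(est)(**est.get_params()), so the minimum an external estimator needs is
a BaseEstimator-style __init__ that just stores its arguments; a rough
sketch with a made-up estimator:

    # clone relies on constructor parameters round-tripping through get_params();
    # BaseEstimator derives get_params/set_params from the __init__ signature,
    # provided __init__ stores its arguments unchanged.
    from sklearn.base import BaseEstimator, clone

    class MyEstimator(BaseEstimator):
        def __init__(self, alpha=1.0, verbose=False):
            self.alpha = alpha
            self.verbose = verbose

    est = MyEstimator(alpha=0.5)
    fresh = clone(est)    # fresh, unfitted copy built from est.get_params()
    print(fresh.alpha)    # 0.5

My use case is one where that simple reconstruction is not flexible enough,
hence the wish for a more polymorphic clone.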

I don't see what's hard about comparing models from outside scikit-learn,
on the assumption that all the packages worth comparing are trivial to
install, and listed in scikit-learn's "Extension Library".
Satrajit Ghosh
2014-12-04 02:43:44 UTC
Permalink
hi joel,

Post by Joel Nothman
I don't see what's hard about comparing models from outside scikit-learn,
on the assumption that all the packages worth comparing are trivial to
install, and listed in scikit-learn's "Extension Library".
i was referring to the scenario where this wasn't a standalone package but
simply a fork of scikit-learn that someone coded a new model into. i agree
if extensions are built as standalone packages, it would be trivial to
install and use.

cheers,

satra
Lars Buitinck
2014-12-04 08:36:25 UTC
Permalink
Post by Joel Nothman
For example, let's say someone has implemented an algorithm (Affinity
Propagation is what triggered this discussion so you might consider that).
Someone else wants to come and add features to it, or even just clean the
code, but by this time the original contributor has moved onto greener
pastures and is not interested in responding to a pull request. Who has the
right, and who the responsibility, to say that this change should be
allowed? Does the contrib repository, too, require an army of maintainers to
familiarise themselves with a vast collection of moderate-quality code?
Without strict gatekeepers, a centralised repository provides almost
nothing, and with strict gatekeepers it entails exactly the issue that we
are trying to solve.
My thought exactly. Publishing separate packages is the way to go. The
other thing we still need to do is implement the utils/_utils split,
i.e., provide a stable set of utilities for extension writers to use.
(This is also exactly the part where a forked repo is going to run
into trouble: utils gets refactored often, without regard for
backwards compat, and when it is, the fork is either going to diverge
or all code in it has to be checked and updated.)
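
To make it concrete, the kind of code extension writers want to be able to
rely on is roughly the toy below (check_X_y / check_array as in the current
development utils; MeanRegressor is made up):

    # A toy out-of-tree estimator leaning on the validation helpers in
    # sklearn.utils; a stable home for these would stop extensions breaking
    # every time utils gets refactored.
    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin
    from sklearn.utils import check_X_y, check_array

    class MeanRegressor(BaseEstimator, RegressorMixin):
        """Toy regressor that always predicts the training-set mean."""

        def fit(self, X, y):
            X, y = check_X_y(X, y)      # shared input validation
            self.mean_ = y.mean()
            return self

        def predict(self, X):
            X = check_array(X)
            return np.full(X.shape[0], self.mean_)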
Rafael Calsaverini
2014-12-04 09:14:33 UTC
Permalink
Chiming in as a user who never contributed but who uses sklearn a lot
(yeah, I know, I need to find the time to help a little), I tend to agree
with this.

I know a couple of successful projects that have stand-alone plugins, with
standardized names and interfaces and an easy-to-find curated list or some
kind of advertising about plugins on the main project. Maybe even a contrib
repo which is just a collection of subrepos (but this invites some
maintenance-cost danger). Django and Flask are two examples.

In contrast, I know of no successful case of a separate contrib fork or
something like that. It might be better to follow former examples of
success.

That said, I guess every user of sklearn sees value in the package exactly
because of the incredible amount of time and effort put in by the
maintainers to guarantee a dependable codebase, with good docs, good
performance, with trustworthy code quality. And for that we should all be
thankful (and also try to help a bit, which I am in debt for). Ruining that
(by overloading the maintainers with work) will ruin the usefulness of the
package.
Andy
2014-12-04 16:48:32 UTC
Permalink
I feel that maintaining package infrastructure is quite some work, if
you want to have online documentation and continuous integration.
It took me a day to build the pystruct docs after I tried to update the
gallery from sklearn master.

I guess that having an example repo that has a build, travis and sphinx
setup would help.
I'm not sure if we can do readthedocs, which would decrease friction
even more.
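
For the sphinx part, the fragment a project needs in its docs/conf.py is
pretty small (option names as in the sphinx-gallery README; early versions
may differ):

    # Enable a scikit-learn-style example gallery in an extension's docs.
    extensions = [
        'sphinx.ext.autodoc',
        'sphinx_gallery.gen_gallery',
    ]
    sphinx_gallery_conf = {
        'examples_dirs': '../examples',   # where the example *.py scripts live
        'gallery_dirs': 'auto_examples',  # where the rendered gallery is written
    }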

Currently the related projects page is pretty hard to find, too :-/
http://scikit-learn.org/dev/related_projects.html
Post by Lars Buitinck
Post by Joel Nothman
For example, let's say someone has implemented an algorithm (Affinity
Propagation is what triggered this discussion so you might consider that).
Someone else wants to come and add features to it, or even just clean the
code, but by this time the original contributor has moved onto greener
pastures and is not interested in responding to a pull request. Who has the
right, and who the responsibility, to say that this change should be
allowed? Does the contrib repository, too, require an army of maintainers to
familiarise themselves with a vast collection of moderate-quality code?
Without strict gatekeepers, a centralised repository provides almost
nothing, and with strict gatekeepers it entails exactly the issue that we
are trying to solve.
My thought exactly. Publishing separate packages is the way to go. The
other thing we still need to do is implement the utils/_utils split,
i.e., provide a stable set of utilities for extension writers to use.
(This is also exactly the part where a forked repo is going to run
into trouble: utils gets refactored often, without regard for
backwards compat, and when it is, the fork is either going to diverge
or all code in it has to be checked and updated.)
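
To make the utils/_utils point concrete, here is a minimal sketch of a
third-party estimator that only touches the public pieces (sklearn.base and
the documented helpers in sklearn.utils), so that internal refactoring cannot
break it; the class itself is just a hypothetical placeholder:

import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
from sklearn.utils import check_random_state

class RandomBaselineClustering(BaseEstimator, ClusterMixin):
    """Toy extension estimator relying only on public scikit-learn API."""

    def __init__(self, n_clusters=2, random_state=None):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        # check_random_state is the kind of stable, documented utility
        # an extension writer should be able to depend on across releases
        rng = check_random_state(self.random_state)
        X = np.asarray(X)
        # placeholder logic: assign each sample to a random cluster
        self.labels_ = rng.randint(self.n_clusters, size=X.shape[0])
        return self

A package written this way survives refactoring of the private _utils
modules; one that imports private helpers has to be re-checked at every
release, which is exactly the divergence problem described above.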
Andy
2014-12-03 16:10:24 UTC
Permalink
Post by Mathieu Blondel
I think 1000 citations is a bit too much to ask. We should probably
update the FAQ with something more reasonable, like say 200 citations.
That said, I agree that the citation threshold is just an indicator.
For example, SAG and AdaGrad, which are being considered for
inclusion, have around 75 and 250 citations currently.
I totally made up this number and it might be too high. No-one
complained on the PR, but I'd be happy to lower it.

For SAG I am not super sure it is a great example. It is a year old and
the same group already published two improvements to it. So it seems
obsolete before it is merged.
Gael Varoquaux
2014-12-03 16:13:04 UTC
Permalink
Post by Andy
For SAG I am not super sure it is a great example. It is a year old and
the same group already published two improvements to it. So it seems
obsolete before it is merged.
Same feeling here, even though I have a personal need for it. That's
typically something that should go in a 'contrib', IMHO (I know that Alex
will not be happy to hear me say this).

G
Sturla Molden
2014-12-04 10:00:24 UTC
Permalink
Post by Tom Fawcett
Wow, I had not seen this FAQ. "As a rule we only add well-established
algorithms. A rule of thumb is at least 3 years since publication, 1000+
cites and wide use and usefulness."
A dumping ground for any kind of algorithm that "someone has found useful"
is not a good way to design a library. A known example is OpenSSL. Most of
its security problems can be traced back to that.

But PyPI is open for anyone who wants to publish a Python package.

Sturla
Tom Fawcett
2014-12-04 16:44:44 UTC
Permalink
Post by Sturla Molden
Post by Tom Fawcett
Wow, I had not seen this FAQ. "As a rule we only add well-established
algorithms. A rule of thumb is at least 3 years since publication, 1000+
cites and wide use and usefulness."
A dumping ground for any kind of algorithm that "someone has found useful"
is not a good way to design a library.
An unfortunate disparagement. Remember, we’re in the open-source world. Not too long ago closed-source people predicted open-source would be a dumping ground of low-quality code written by hobbyists and amateurs.
Post by Sturla Molden
A known example is OpenSSL. Most of its security problems can be traced back to that.
Can’t comment on this. I wouldn’t think a security library, whose components have to work together tightly, would be a good comparison to scikit-learn, which is more of a pick-and-choose collection of models with a loose API.
Post by Sturla Molden
But PyPI is open for anyone who wants to publish a Python package.
Fair enough. I’ll comment on this in another message.

Regards,
-Tom
Илья Патрушев
2014-12-03 13:08:17 UTC
Permalink
Hi Andy,

Adaptive Affinity Propagation is essentially an additional optimisation
layer on top of the original Affinity Propagation algorithm.
The Affinity Propagation algorithm works off the similarity matrix and tries
to identify a number of data points that would be the "centres" of clusters.
Its behaviour is governed by two parameters: the preferences (a vector of
size n_samples) and the damping factor.
The preferences are, on the one hand, a way to incorporate prior knowledge
about likely cluster centres; on the other hand, they control the number of
clusters produced by the algorithm. When there is no prior knowledge, the
preferences are set to the same value for all sample points. The general
relationship between the preference value and the number of clusters is:
the greater the value, the greater the number of clusters. The authors of
the Affinity Propagation algorithm recommend using the median similarity
value, but in the end one has to find the right preference value for each
new clustering problem.
The damping parameter defines the speed at which the algorithm updates its
responsibility/availability evidence. The higher the damping, the less prone
the algorithm is to oscillations, but convergence is slower.
Wang's solution is to run the Affinity Propagation algorithm starting with a
fairly high preference value (e.g. 0.5 of the median similarity). Once it
converges, the goodness of the clustering is measured (they suggest the
Silhouette index), the preference is decreased, and these steps are repeated
until the algorithm produces some minimal number of clusters. Along the way,
the presence of oscillations is monitored: should they appear, they are
controlled by increasing the damping parameter, and should the damping reach
its maximum value, by reducing the preference value.
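To give the idea in code, here is a rough sketch of such an adaptive loop
built on top of sklearn's existing AffinityPropagation (this is not my actual
implementation; the sweep range, step count and damping value below are just
placeholders):

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances

def adaptive_affinity_propagation(X, steps=10, damping=0.9, min_clusters=2):
    X = np.asarray(X)
    # AffinityPropagation's default similarity: negative squared distances
    S = -euclidean_distances(X, squared=True)
    median_sim = np.median(S)  # note: negative
    best_model, best_score = None, -np.inf
    # start at 0.5 * median (a "high" preference) and decrease it step by step
    for factor in np.linspace(0.5, 5.0, steps):
        ap = AffinityPropagation(preference=factor * median_sim,
                                 damping=damping).fit(X)
        n_clusters = len(np.unique(ap.labels_))
        if n_clusters < min_clusters:
            break  # preference already too low, stop the sweep
        if n_clusters >= X.shape[0]:
            continue  # degenerate solution, the Silhouette index is undefined
        score = silhouette_score(X, ap.labels_)
        if score > best_score:
            best_model, best_score = ap, score
    return best_model

The real algorithm also adapts the damping factor whenever oscillations are
detected, which the sketch above leaves out.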
The pdf in arXiv is the English translation of the original paper published
in Chinese.
I agree, Adaptive Affinity Propagation is not as widely used a method as the
FAQ requires; I should have looked at it beforehand. Maybe it can be
considered a clear-cut improvement of the Affinity Propagation algorithm?
Anyway, if it is not to be added to sklearn, I am quite happy to release it
via PyPI.

Best wishes,
ilya
--
Ilya Patrushev,
MRC National Institute for Medical Research
The Ridgeway
Mill Hill
London NW7 1AA
UK
Tel: 0208 816 2656
Fax: 0208 906 4477
Gael Varoquaux
2014-12-03 13:54:06 UTC
Permalink
Hi Ilya,

I'm actually really not excited about affinity propagation. Firstly, it's
slow. Clustering has pretty much two use cases. The first one is to find
latent meaningful structure. This is a hard problem in the sense of
learning theory, so to be able to trust the solution one needs many
samples. The second one is to reduce the problem size, by replacing
samples with centres. Both of these use cases are really relevant only
when there are many samples. Thus a slow clustering method is not very
useful. The second reason that I don't like affinity propagation is that
it has many parameters to set, and gives very strange/unstable results.

I think that the empirical comparison of clustering algorithms that we
have at the top of the clustering page:
http://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods
is quite telling in terms of the limitations of affinity propagation. I
have personally not seen it used in any non-trivial application (or in
academic papers interested in it theoretically).

Now, the enhancements that you are proposing try to tackle both
limitations of affinity propagation. So, on paper, they look great.
However, I am a computer scientist who publishes papers on methods, and
thus I know how weak a claim is when it is made in a paper by the authors
of the method. So I don't trust that a method actually has the benefits
it claims to have unless I see it demonstrated on many different
applications, by many different people. Experience has really taught me
this, and I must say that there are some methods that I regret pushing
into scikit-learn. That's why we have the requirements on the number of
citations. We find that a method that is really useful gets used, and
thus cited. One way of proving us wrong is to do an implementation
outside of scikit-learn, in a separate package, and, in the examples of
this package, show that the method solves very well problems that are
not solved well by the methods in scikit-learn.


Do you understand our line of thought? It's not against methods in
general, it's just that we are trying hard to find the right subset of
the literature that we should be struggling to keep alive and kicking.

Cheers,

Gaël
--
Gael Varoquaux
Researcher, INRIA Parietal
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
Илья Патрушев
2014-12-03 16:13:06 UTC
Permalink
Hi Gaël,

Sure, I understand the rationale behind the requirement of 1000+ cites
etc., and as I mentioned above, I am quite happy to release it via PyPI.
Wang et al. 2008 claim that their approach improves the correctness of
Affinity Propagation clustering (though it increases the running time).
Correct me if I am wrong, but from your reply it looks like you are not
persuaded by the paper and do not recommend including the algorithm in
sklearn.

Best wishes,
ilya.
--
Ilya Patrushev,
MRC National Institute for Medical Research
The Ridgeway
Mill Hill
London NW7 1AA
UK
Tel: 0208 816 2656
Fax: 0208 906 4477
Gael Varoquaux
2014-12-03 16:23:28 UTC
Permalink
Post by Илья Патрушев
Sure, I understand the rationale behind the requirement of 1000+ cites etc.,
and as I mentioned above, I am quite happy to release it via PyPI.
And put it in a scikit-learn-contrib repo? That would be sweet.
Post by Илья Патрушев
Wang et al. 2008 claim that their approach improves the correctness of
Affinity Propagation clustering (though it increases the running time).
Correct me if I am wrong, but from your reply it looks like you are not
persuaded by the paper and do not recommend including the algorithm in
sklearn.
Yes. But on the other hand, I do not hold the Truth. I would be very,
very happy to be proven wrong, and if clearly proven wrong, integrate it
in scikit-learn.

You know, I have no horses in this race. The algorithms that I develop
are not part of scikit-learn, and will never be, because of the
requirements that we have. I just want scikit-learn to be something
genuinely useful. Partly out of selfishness, because I have a research
team that is relying on it to do the data analysis.

Gaël
Илья Патрушев
2014-12-03 16:32:25 UTC
Permalink
Fair enough.

Best wishes,
ilya.
--
Ilya Patrushev,
MRC National Institute for Medical Research
The Ridgeway
Mill Hill
London NW7 1AA
UK
Tel: 0208 816 2656
Fax: 0208 906 4477