Discussion:
inclusion of a new classifier in scikits.learn
Victor Oliveira
2011-08-30 14:00:29 UTC
Permalink
Hello everyone,

I'm Victor Oliveira, a master's student in computer engineering at the
University of Campinas. I've been doing my thesis on the Optimum-Path
Forest [OPF] classifier [1,2,3,4] for image processing. To compare its
performance with other classifiers, I implemented it according to the
scikits.learn guidelines.

As you can see in the references, it is/has:

* naturally multi-class
* fast fitting and predicting
* good accuracy
* few parameters
* allows some superposition between clusters

The classifier comes in both a supervised and an unsupervised version.
I implemented it in C and used Cython to make a proper Python interface.

So, are you interested? There is some documentation and formatting
missing, but I'd be glad to do it if you say so!

LibOPF repository: https://github.com/victormatheus/LibOPF
Handwritten digits classification example:
https://github.com/victormatheus/LibOPF/blob/master/examples/handwritten.py

More about me and our lab:
http://meudepositodeideias.wordpress.com/
http://parati.dca.fee.unicamp.br/adesso/wiki/L2I/view/
http://www.liv.ic.unicamp.br/

[1] João P. Papa, Alexandre X. Falcão and Celso T. N. Suzuki.
Supervised Pattern Classification based on Optimum-Path Forest. Intl.
Journal of Imaging Systems and Technology, Wiley, Vol. 19, Issue 2,
pp. 120–131, Jun 2009.
[2] L.M. Rocha, F.A.M. Cappabianco, and A.X. Falcão. Data clustering
as an optimum-path forest problem with applications in image analysis.
International Journal of Imaging Systems and Technology, 19(2):50-68,
2009.
[3] A. X. Falcao, J. Stolfi, and R. A. Lotufo. The image foresting
transform: Theory, algorithms, and applications. IEEE Trans. on Patt.
Anal. Mach. Intell., 26(1):19-29, 2004.
[4] A.T. da Silva, A.X. Falcão, L.P. Magalhães: A new CBIR approach
based on relevance feedback and optimum-path forest classification.
Journal of WSCG, Vol.18, No.1-3, pp. 73-80.
Mathieu Blondel
2011-08-31 11:41:43 UTC
Permalink
Post by Victor Oliveira
So, are you interested? There is some documentation and formatting
missing, but I'd be glad to do it if you say so!
In general, scikit-learn's goal is to implement relatively well-known
and commonly-used algorithms (we still have a long way to go!). Adding
cutting-edge algorithms could become a maintenance burden as we would
have to constantly keep track of the new tiny enhancements. Of course,
the border is not always easy to draw: dictionary learning is quite a
recent area of machine learning and sometimes, we may even innovate
over the standard baseline implementations.

My first impression is that OPF doesn't fit in scikit-learn yet: it's
quite recent (2009) and hasn't received enough citations yet (30 according
to Google). Of course, there is always room for flexibility if you can
show that it works much better than existing algorithms in
scikit-learn. For instance, how does it fare in comparison to decision
trees or random forests? Whatever the final decision is, don't take it
personally; it just means that we must make decisions as to what goes
into the code base and what doesn't.

Even projects which don't make it to the code base can stick to the
basic scikit-learn API: fit, predict, transform. In any case, this
will make your algorithm easy to try out for our users, even if that
means installing it separately.
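
As a minimal sketch of what sticking to the fit / predict convention could look like in a separate project: the class below is a hypothetical toy nearest-centroid classifier (it is not OPF and not part of scikit-learn), and it assumes only NumPy.

```python
import numpy as np


class NearestCentroidSketch:
    """Hypothetical toy classifier following the fit / predict convention."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)  # shape (n_samples, n_features)
        y = np.asarray(y)               # shape (n_samples,)
        self.classes_ = np.unique(y)
        # One mean vector per class, shape (n_classes, n_features).
        self.centroids_ = np.array(
            [X[y == label].mean(axis=0) for label in self.classes_]
        )
        return self  # returning self allows clf.fit(X, y).predict(X_new)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared Euclidean distance from every sample to every centroid,
        # shape (n_samples, n_classes).
        diff = X[:, np.newaxis, :] - self.centroids_[np.newaxis, :, :]
        distances = (diff ** 2).sum(axis=2)
        return self.classes_[distances.argmin(axis=1)]


X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y_train = ["a", "a", "b", "b"]
clf = NearestCentroidSketch().fit(X_train, y_train)
predictions = clf.predict([[0.2, 0.4], [5.5, 4.8]])
print(predictions)
```

Any classifier exposing the same two methods can be swapped in for this one in user code, which is the point of keeping to the convention.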

What we could do is maintain a list of third-party projects which are
scikit-learn compatible. One disadvantage of doing so is that the list
could become a maintenance burden. I don't have any strong opinion on
this.

What do other people think?

Mathieu
Gael Varoquaux
2011-08-31 12:03:17 UTC
Permalink
Post by Mathieu Blondel
What do other people think?
I think exactly as you do. We have a policy of _not_ putting our own
algorithms in the scikit for this reason. However, we want to release
them in a scikit-compliant package. I would be very happy to see many of
these packages. First, they could serve as a maturing ground for
implementations that could land in the scikit if they become major
players. Second, they open the door to application-specific code. We just
had a discussion on computer-vision-related code. Every application of
machine learning needs custom code. In my group, we have heaps of -messy-
neuroimaging-specific code that we would like to release (we are trying
to find the time to clean it up).

G
Robert Layton
2011-08-31 12:33:19 UTC
Permalink
As a suggestion, could a scikits.learn/experimental branch be set up? It
would not be part of any official release, but it would allow 'new'
algorithms to go somewhere.
Advantages would include allowing algorithms to get some visibility (which
would help them gain the usage needed to eventually be moved into the main
branch).
It would have to be made perfectly clear, with a big red sticker somewhere,
that algorithms in experimental:
- Are not tested to the extent that main code is
- May break with future API changes, and those breakages will not be picked
up in the normal routine
- Will not have any breakages addressed
- May not work, or may not work as intended
- Are subject to change or even removal in future
- Have their PRs considered with a significantly lower priority than PRs
into the main branch

There would still be a basic requirement (PEP8, Pyflakes, doesn't break
other tests).

My thought basically is that if people want to open source their algorithms,
there should be encouragement to do that.

Thoughts? The disadvantages of having difficult code floating around may not
be worth the advantages, but it may be something to consider, if not as a
branch, then perhaps a code "cookbook" somewhere.
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
--
My public key can be found at: http://pgp.mit.edu/
Search for this email address and select the key from "2011-08-19" (key id:
54BA8735)
Older keys can be used, but please inform me beforehand (and update when
possible!)
Gael Varoquaux
2011-08-31 12:36:14 UTC
Permalink
Post by Robert Layton
Thoughts? The disadvantages of having difficult code floating around may
not be worth the advantages, but it may be something to consider, if not
as a branch, then perhaps a code "cookbook" somewhere.
In all projects, sandboxes have grown to be a problem over time, as
they would accumulate unmaintained code and nobody had the time or
the authority to clean them up.

I would really push for related projects on github: it offers all the
tracking and freedom that we need, and has the advantage of being the
exact same workflow as for the scikit-learn.

Gael
Olivier Grisel
2011-08-31 12:45:25 UTC
Permalink
Post by Gael Varoquaux
   Thoughts? The disadvantages of having difficult code floating around may
   not be worth the advantages, but it may be something to consider, if not
   as a branch, then perhaps a code "cookbook" somewhere.
In all projects, sandboxes have grown to be a problem over time, as
they would accumulate unmaintained code and nobody had the time or
the authority to clean them up.
I would really push for related projects on github: it offers all the
tracking and freedom that we need, and has the advantage of being the
exact same workflow as for the scikit-learn.
+1.

We could also add a page in the documentation or a new section in the
top-level README file to list pointers to 3rd party projects that
follow the scikit-learn conventions (fit / predict / transform API
with raw arrays or scipy sparse matrices as input, explicit variable
names for the shapes such as n_samples, n_features, n_components,
PEP8-compliant code...).
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Lars Buitinck
2011-08-31 12:50:47 UTC
Permalink
Post by Olivier Grisel
Post by Gael Varoquaux
I would really push for related projects on github: it offers all the
tracking and freedom that we need, and has the advantage of being the
exact same workflow as for the scikit-learn.
+1.
+1. Git makes it pretty easy to merge projects later on.
Post by Olivier Grisel
We could also add a page in the documentation or a new section in the
top-level README file to list pointers to 3rd party projects that
follow the scikit-learn conventions (fit / predict / transform API
with raw arrays or scipy sparse matrices as input, explicit variable
names for the shapes such as n_samples, n_features, n_components,
PEP8-compliant code...).
Maybe not the top-level Makefile, but a list of related software would be nice.
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
Alexandre Gramfort
2011-08-31 13:16:07 UTC
Permalink
Post by Mathieu Blondel
What we could do is maintain a list of third-party projects which are
scikit-learn compatible. One disadvantage of doing so is that the list
could become a maintenance burden. I don't have any strong opinion on
this.
I think it's a great idea. To reduce the maintenance burden, the list
of third-party projects could be a wiki page on the scikit GitHub, which can be
updated even by external contributors.

Alex
Gael Varoquaux
2011-08-31 13:19:36 UTC
Permalink
Post by Alexandre Gramfort
I think it's a great idea. To reduce the maintenance burden, the list
of third-party projects could be a wiki page on the scikit GitHub, which can be
updated even by external contributors.
+1. And we should reference it from the README/docs.
Mathieu Blondel
2011-08-31 14:07:48 UTC
Permalink
On Wed, Aug 31, 2011 at 10:16 PM, Alexandre Gramfort
Post by Alexandre Gramfort
I think it's a great idea. To reduce the maintenance burden, the list
of third-party projects could be a wiki page on the scikit GitHub, which can be
updated even by external contributors.
Good idea for the wiki. Do you know scikit-learn compliant projects
that we could use to bootstrap the list?

Mathieu
Alexandre Gramfort
2011-08-31 14:16:35 UTC
Permalink
Post by Mathieu Blondel
Good idea for the wiki. Do you know scikit-learn compliant projects
that we could use to bootstrap the list?
crab was also mentioned in the past on the mailing list but I don't know
what the status is.

Alex
Vincent Michel
2011-08-31 15:28:16 UTC
Permalink
The Wiki idea seems good to me.

It would also be nice to write one line with the name and reference papers
of the different algorithms developed in these side projects, in order to
keep track of what could be merged in the (near or not) future.

Vincent
Olivier Grisel
2011-08-31 12:51:55 UTC
Permalink
Hi Victor,

I haven't read the references yet but can you please summarize where
this algorithm shines in terms of predictive accuracy and speed w.r.t.
standard baseline algorithms such as RBF support vector machines and
random forest?

Does the model make an assumption that the inputs are 2D images (with
shape (n_samples, n_pixels_width, n_pixels_height)) or can it accept
any kind of input with shape (n_samples, n_features) without
assumption on the spatial correlation of individual features?
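
To make the two layouts concrete, here is a small sketch, assuming NumPy and a made-up batch of 4 blank 28x28 images. It shows how an image stack can be flattened into the generic (n_samples, n_features) layout; it says nothing about which layout OPF itself expects.

```python
import numpy as np

# Hypothetical batch: 4 blank images of 28x28 pixels each.
n_samples, n_pixels_width, n_pixels_height = 4, 28, 28
images = np.zeros((n_samples, n_pixels_width, n_pixels_height))

# Flatten each image into one row of features; -1 lets NumPy infer
# n_features = n_pixels_width * n_pixels_height.
X = images.reshape(n_samples, -1)
print(images.shape, "->", X.shape)
```

After flattening, neighbouring pixels are just neighbouring columns, so a model that takes X in this form cannot rely on the 2D arrangement unless it is told the original shape.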

Can you give us an idea of the empirical runtime for various values of
n_samples / n_features?
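
One possible way to collect such numbers is sketched below using only the standard library. The `placeholder_fit` function is a hypothetical stand-in workload, not OPF or any scikit-learn estimator; the idea is to replace it with the real `fit` call being measured.

```python
import random
import time


def make_data(n_samples, n_features, seed=0):
    """Build a random dataset as plain lists, with binary labels."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.randrange(2) for _ in range(n_samples)]
    return X, y


def time_fit(fit, n_samples, n_features):
    """Return the wall-clock seconds taken by one call to fit(X, y)."""
    X, y = make_data(n_samples, n_features)
    start = time.perf_counter()
    fit(X, y)
    return time.perf_counter() - start


def placeholder_fit(X, y):
    # Stand-in workload (per-feature sums); swap in the estimator under test.
    return [sum(column) for column in zip(*X)]


timings = {
    (n_samples, n_features): time_fit(placeholder_fit, n_samples, n_features)
    for n_samples in (100, 1000)
    for n_features in (10, 100)
}
for (n_samples, n_features), seconds in sorted(timings.items()):
    print(f"n_samples={n_samples:5d} n_features={n_features:4d} fit: {seconds:.6f}s")
```

A single run per grid point is noisy; repeating each measurement and reporting the minimum or median would give steadier figures.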

Also in scikit-learn we are trying to move away from C/C++ code to
write the algorithms directly in cython and avoid memory allocation
issues and make it easy to make all fitted models picklable (this is
very important IMHO).

There is a notable exception to this rule, which is liblinear / libsvm,
which we are not going to try to reimplement ourselves in cython since
the libsvm codebase is very mature and is the standard implementation of
the SMO algorithm for SVMs. Note that care was taken to make the
libsvm fitted models picklable in python too.
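
As an illustration of what "picklable" means in practice, here is a minimal sketch with a hypothetical toy model whose fitted state lives in an ordinary Python attribute. A wrapper that keeps its state only in memory allocated on the C side would typically need extra code to pass the same round trip.

```python
import pickle


class MeanPredictorSketch:
    """Hypothetical model whose fitted state is a plain Python attribute."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)  # the only fitted state
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


model = MeanPredictorSketch().fit([[0.0], [1.0], [2.0]], [1.0, 2.0, 3.0])

# Serialize the fitted model to bytes and load it back.
restored = pickle.loads(pickle.dumps(model))
same_predictions = restored.predict([[5.0]]) == model.predict([[5.0]])
print(same_predictions)
```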

For instance the balltree was recently rewritten in pure cython
(thanks Jake) and the decision trees pull requests are also being
reworked to only use cython.
--
Olivier
Victor Oliveira
2011-09-01 18:59:59 UTC
Permalink
Hi Olivier,

OPF is a general classifier; it works on feature vectors of any
dimension. Our experiments have shown that its accuracy is similar to or
better than that of RBF SVMs in most cases. Its performance is also better,
especially when there are many classes, because it is naturally
multi-class. We haven't compared it to random forests, so I can't
answer that for sure.
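
Some context for the "many classes" remark, as a sketch: libsvm handles multi-class problems with a one-vs-one scheme, training one binary SVM per pair of classes, so the number of binary subproblems grows quadratically with the number of classes. The helper name below is made up for illustration, and the count says nothing about actual runtimes of either method.

```python
from itertools import combinations


def one_vs_one_pairs(n_classes):
    """List every unordered pair of class indices (one binary problem each)."""
    return list(combinations(range(n_classes), 2))


# Number of binary subproblems, n_classes * (n_classes - 1) / 2.
n_binary_problems = {k: len(one_vs_one_pairs(k)) for k in (2, 10, 100)}
for n_classes, count in n_binary_problems.items():
    print(f"{n_classes:3d} classes -> {count:4d} binary problems")
```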

As a real example, I've compared it to scikits.learn's SVM on the
MNIST handwritten digits database [not the simplified one
that is shipped with scikits.learn]. It has 60000 28x28 gray-scale
images of digits [each a 784-dimensional feature vector] for training
and 10000 for classification.

Here is the code with some comparison graphics:
code: http://parati.dca.fee.unicamp.br/adesso/wiki/toolboxOPF/handwritten/view/

Like SVM, OPF has a default implementation
[LibOPF:http://www.ic.unicamp.br/~afalcao/libopf/]
which can be used as validation, but the underlying algorithm isn't very
complex and I imagine there would be no problem in implementing it in
Cython.

Also, we've found that for applications where we need many learning
iterations [like content-based image retrieval], OPF
performs better than other common classifiers:

http://dl.acm.org/citation.cfm?id=1730323
http://www.sciencedirect.com/science/article/pii/S0031320311001853

Victor Oliveira
Victor Oliveira
2011-09-13 12:11:03 UTC
Permalink
Hello everyone!

So, what was decided? Will there be a wiki for scikits.learn-compatible projects?

thanks.
Olivier Grisel
2011-09-16 08:56:47 UTC
Permalink
Post by Victor Oliveira
Hello everyone!
So, what was decided? Will there be a wiki for scikits.learn-compatible projects?
Hi Victor,

I have just created the wiki page here:
https://github.com/scikit-learn/scikit-learn/wiki/Related-Projects
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel