Discussion:
Vectorizer issue
Olivier Grisel
2011-01-07 13:17:12 UTC
I forward this conversation to the mailing list since it is of general
interest. It is about the issue:

http://sourceforge.net/apps/trac/scikit-learn/ticket/162
I kind of agree with the reporter: why would we hardcode stop word removal
by default for English only? Furthermore, the max_df feature is able to deal
with stop words quite effectively according to some grid_search tests.
I would like to support English with minimal hassle, and stop word
removal is a very sane default choice in my opinion. max_df can be a
nice language-independent way to do it, but it's difficult to estimate
a good default value for the parameter, so it would probably require
grid search.
I was thinking of moving stop word removal to the preprocessor. This
way, we can have RomanPreprocessor and EnglishPreprocessor. But stop
word removal implies that words are tokenized...
Another idea is to merge the current analyzer and preprocessor objects
into one object with two public methods: tokenize and preprocess. This
way, we can have an object for Roman languages in general and a more
specialized object for English. Also this way, people can inherit the
class and override preprocess without necessarily overriding tokenize,
and vice versa. This also makes the hierarchy of objects a little bit
simpler.
I am +1 for this. I will try to find the time to do that before my
pycon tutorial (unless off-course you or someone else wants to do it
sooner).
By the way, I don't like the names WordNGramAnalyzer and
CharNGramAnalyzer so much. The verb "analyze" is not very clear. I
don't have good alternatives though. Maybe WordTokenizer and
CharacterTokenizer? (It's not necessary to include "NGram" in the
class name in my opinion)
I reused the naming convention of the Lucene project, which is a
reference in the domain: an analyzer combines string preprocessing (e.g.
charset decoding, lowercasing, HTML cleaning...), the tokenizer itself
(e.g. splitting on whitespace and punctuation) and token-based
post-processing (n-grams of tokens, stop word filtering, ...).

But I agree to change the names to tokenizer if people find it more intuitive.
This discussion should probably go to the ML.
I agree: done :)
I would also remove the regexp-based HTML / XML tag stripping, as
it might remove sections of text documents that happen to have "<" and
">" on the same line. Tag stripping is best done using either the lxml
HTML or XML parser and the text_content method.
But it's nice to be able to strip tags without lxml. The regexp needs
to be updated to not consider strings containing spaces between < and >
as tags then.
But most HTML tags legitimately contain whitespace. One could write a more
advanced regexp, but I agree with:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Furthermore, lxml is very, very, very fast and is able to handle
invalid, broken HTML. If we want to do real HTML support we should
further implement heuristics such as
https://code.google.com/p/boilerpipe/ but this is probably out of the
scope of scikit-learn.

So my opinion is: let us make the text feature extractor work well and
deterministically for pure "text" and give pointers on how to
preprocess HTML, PDF, JSON tweets and so on in the documentation.

We could also start a new python project for text preprocessing tools
that work well with the scikit.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2011-01-07 15:17:17 UTC
On Fri, Jan 7, 2011 at 10:17 PM, Olivier Grisel
Post by Olivier Grisel
I am +1 for this. I will try to find the time to do that before my
pycon tutorial (unless off-course you or someone else wants to do it
sooner).
Cool! Do you plan to talk about analyzers in your tutorial?
Post by Olivier Grisel
I reused the naming convention of the Lucene project which is a
reference in the domain: an analyzer combines string preprocessing (e.g.
charset decoding, lowercasing, HTML cleaning...), the tokenizer itself
(e.g. split on whitespaces and punctuation) and token based
post-processing (n-grams of tokens, stop words filtering, ...).
Didn't know about that. Let's stick to "analyzer" then :)
Post by Olivier Grisel
Furthermore lxml is very, very, very fast and is able to handle
invalid, broken HTML. If we want to do real HTML support we should
https://code.google.com/p/boilerpipe/ by this is probably out of the
scope of scikit-learn.
Complex parsing of the HTML structure is definitely out of the scope
of the scikit but a simple utility function to strip tags would still
be valuable (I'm thinking of simple tags such as bold, italic, etc).
Like you said people who want to do more serious HTML parsing will
probably exclude the boilerplate or extract the content structure.

Mathieu
Olivier Grisel
2011-01-07 16:06:59 UTC
Post by Mathieu Blondel
On Fri, Jan 7, 2011 at 10:17 PM, Olivier Grisel
Post by Olivier Grisel
I am +1 for this. I will try to find the time to do that before my
pycon tutorial (unless off-course you or someone else wants to do it
sooner).
Cool! Do you plan to talk about analyzers in your tutorial?
Yes, among other things. I also plan to talk about classification and
clustering of sound, scene / object images, and faces.
Post by Mathieu Blondel
Post by Olivier Grisel
I reused the naming convention of the Lucene project which is a
reference in the domain: an analyzer combines string preprocessing (e.g.
charset decoding, lowercasing, HTML cleaning...), the tokenizer itself
(e.g. split on whitespaces and punctuation) and token based
post-processing (n-grams of tokens, stop words filtering, ...).
Didn't know about that. Let's stick to "analyzer" then :)
Post by Olivier Grisel
Furthermore lxml is very, very, very fast and is able to handle
invalid, broken HTML. If we want to do real HTML support we should
https://code.google.com/p/boilerpipe/ by this is probably out of the
scope of scikit-learn.
Complex parsing of the HTML structure is definitely out of the scope
of the scikit but a simple utility function to strip tags would still
be valuable (I'm thinking of simple tags such as bold, italic, etc).
But in real life you will always have stuff like:

<span class="action"
onclick="javacsript:callFunctionWithParameters('arg1', 'arg2');">some
interesting content words</span>

and CDATA payloads in RSS feeds. So it is really non-trivial to do
even simple XML / HTML cleaning with regexps. Hence I would rather not
have any such hack enabled in the default analyzer.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2011-01-07 17:01:08 UTC
Post by Olivier Grisel
and CDATA payloads in RSS feeds. So it is really non-trivial to do
even simple XML / HTML cleaning with regexps. Hence I would rather not
have any such hack enabled in the default analyzer.
OK for not activating by default. A user who merely wants to strip
tags can easily waste an hour or two on this so it would still be nice
to provide a utility function (lxml-based?) or a link to a snippet in
the documentation.
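For the documentation snippet, something along these lines would do it (a
rough sketch; strip_tags is just an illustrative name, not an existing
scikit-learn helper):

# Hypothetical utility, not part of scikit-learn: strip markup from an HTML
# fragment using lxml's text_content(), as suggested earlier in the thread.
from lxml import html

def strip_tags(document):
    # fromstring copes with invalid / broken HTML; text_content drops all markup
    return html.fromstring(document).text_content()

print(strip_tags('<span onclick="f(1);">some interesting content words</span>'))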

Mathieu
Fabian Pedregosa
2011-01-10 10:37:09 UTC
Post by Olivier Grisel
and CDATA payloads in RSS feeds. So it is really non-trivial to do
even simple XML / HTML cleaning with regexps. Hence I would rather not
have any such hack enabled in the default analyzer.
OK for not activating by default.  A user who merely wants to strip
tags can easily waste an hour or two on this so it would still be nice
to provide an utility function (lxml-based?) or a link to a snippet in
the documentation.
Sounds good. Could you (Mathieu, Olivier) also take a look into issue #163 ?

http://sourceforge.net/apps/trac/scikit-learn/ticket/163

Thanks,

Fabian.
Olivier Grisel
2011-01-10 10:50:45 UTC
Post by Fabian Pedregosa
Post by Olivier Grisel
and CDATA payloads in RSS feeds. So it is really non-trivial to do
even simple XML / HTML cleaning with regexps. Hence I would rather not
have any such hack enabled in the default analyzer.
OK for not activating by default.  A user who merely wants to strip
tags can easily waste an hour or two on this so it would still be nice
to provide an utility function (lxml-based?) or a link to a snippet in
the documentation.
Sounds good. Could you (Mathieu, Olivier) also take a look into issue #163 ?
http://sourceforge.net/apps/trac/scikit-learn/ticket/163
I am on it.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2011-01-10 10:53:13 UTC
On Mon, Jan 10, 2011 at 7:37 PM, Fabian Pedregosa
Post by Fabian Pedregosa
Sounds good. Could you (Mathieu, Olivier) also take a look into issue #163 ?
http://sourceforge.net/apps/trac/scikit-learn/ticket/163
Segmenting Chinese or Japanese sentences is probably out of the scope
of the scikit. The solution is to create one's own analyzer object and
to pass it as an argument to the vectorizer.
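As a rough sketch of what that looks like (the import path and the analyzer
argument follow the current scikits.learn API as discussed in this thread;
fit_transform is assumed to exist, and the character-level split below is
only a stand-in for a real CJK word segmenter):

# Illustrative only: plug a custom CJK analyzer into the vectorizer.
from scikits.learn.feature_extraction.text import CountVectorizer

class ChineseAnalyzer(object):
    def analyze(self, text):
        # stand-in segmentation: emit single characters; a real implementation
        # would call a proper Chinese / Japanese word segmenter here
        return [char for char in text if not char.isspace()]

documents = [u"我喜欢机器学习", u"机器学习很有趣"]
counts = CountVectorizer(analyzer=ChineseAnalyzer()).fit_transform(documents)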

Mathieu
Mathieu Blondel
2011-01-10 11:26:53 UTC
Post by Mathieu Blondel
Segmenting Chinese or Japanese sentences is probably out of the scope
of the scikit. The solution is to create one's own analyzer object and
to pass it as an argument to the vectorizer.
I just gave it some thought. For the new analyzer object, we should
introduce 3 new public methods (preprocess, tokenize and postprocess)
and analyze should be implemented in terms of those 3. For
Chinese/Japanese, one will be able to inherit from the base analyzer
and override preprocess and tokenize. This way, the user can still
benefit from n-gram output without reimplementing postprocess.

If I'm not mistaken, only tokenize will differ between a word and a
character analyzer. Thinking of the future online models, for
preprocess/tokenize/postprocess, we may want to return iterators
rather than lists.

For the classes, I'm thinking of this configuration:

Analyzer (implements postprocess)
  RomanAnalyzer (implements preprocess)
    RomanWordAnalyzer (implements tokenize)
    RomanCharacterAnalyzer (implements tokenize)
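A minimal sketch of what that hierarchy could look like (purely illustrative;
none of these classes exist in scikits.learn yet, and the method bodies are
placeholders):

# Illustrative sketch of the proposed analyzer API, not existing code.
# analyze() is built from the three overridable steps.
class Analyzer(object):
    def preprocess(self, text):
        return text                      # e.g. charset handling, lowercasing

    def tokenize(self, text):
        raise NotImplementedError        # word vs. character splitting

    def postprocess(self, tokens):
        return tokens                    # e.g. n-grams, stop word filtering

    def analyze(self, text):
        return self.postprocess(self.tokenize(self.preprocess(text)))

class RomanAnalyzer(Analyzer):
    def preprocess(self, text):
        return text.lower()

class RomanWordAnalyzer(RomanAnalyzer):
    def tokenize(self, text):
        return text.split()              # whitespace / punctuation splitting

class RomanCharacterAnalyzer(RomanAnalyzer):
    def tokenize(self, text):
        return list(text)                # character n-grams built in postprocess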

It's amazing how many iterations we're having over this API.

Mathieu
Olivier Grisel
2011-01-10 12:45:19 UTC
Post by Mathieu Blondel
Post by Mathieu Blondel
Segmenting Chinese or Japanese sentences is probably out of the scope
of the scikit. The solution is to create one's own analyzer object and
to pass it as an argument to the vectorizer.
I just gave it some thoughts. For the new analyzer object, we should
introduce 3 new public methods (preprocess, tokenize and postprocess)
and analyze should be implemented in terms of those 3. For
Chinese/Japanese, one will be able to inherit from the base analyzer
and override preprocess and tokenize. This way, the user can still
benefit from n-gram output without reimplementing postprocess.
If I'm not mistaken, only tokenize will differ between a word and a
character analyzer. Thinking of the future online models, for
preprocess/tokenize/postprocess, we may want to return iterators
rather than lists.
Analyzer (implements postprocess)
 RomanAnalyzer (implements preprocess)
   RomanWordAnalyzer (implements tokenize)
   RomanCharacterAnalyzer (implements tokenize)
I am +1 for this suggestion. Can you please open a ticket on github
with your proposal?
Post by Mathieu Blondel
It's amazing how many iterations we're having over this API.
http://xkcd.com/844/ :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Mathieu Blondel
2011-01-10 14:50:28 UTC
On Mon, Jan 10, 2011 at 9:45 PM, Olivier Grisel
Post by Olivier Grisel
I am +1 for this suggestion. Can you please open a ticket on github
with your proposal?
Done!

https://github.com/scikit-learn/scikit-learn/issues/issue/37

Mathieu
xinfan meng
2011-01-10 15:31:11 UTC
I am computing the pairwise similarity between words. The first step is to
extract the context of each word and construct a feature vector. I used NLTK
to read text from a corpus and used CountVectorizer to create a context-word
matrix. The CountVectorizer could be implemented with ProbDist and
ConditionalProbDist in NLTK, but it is better to have such a method
ready to use whenever I want it, since this task is so common in NLP. There
is still one issue that I am concerned about: efficiency. I think I
will try the sparse vectors later.

BTW, I want to bring your attention to this point: NLTK and scikits.learn
complement each other nicely. NLTK has been developed for a couple of
years and contains many convenient text processing mechanisms. However, NLTK
does not provide many machine learning algorithms. By contrast,
scikits.learn has many machine learning algorithms but currently lacks
routines for dealing with raw text. Thus I think it would be helpful to take
a look at what has been implemented in NLTK and try to make NLTK and
scikits.learn work together. I also think it would be great to introduce
scikits.learn to NLP researchers on the NLTK users mailing list. I have been
using NLTK for two years but did not know about scikits.learn until
recently. I am sure there are a lot of people who want to try all kinds of
machine learning algorithms on text and would benefit from this project.

Last but not least, great work, really.
#162: Vectorizer classes in feature_extraction.text is confusing
-------------------------+--------------------------------------------------
Reporter: fannix | Owner: ogrisel
Type: Enhancement | Status: assigned
Priority: minor | Milestone: 0.6
Keywords: |
-------------------------+--------------------------------------------------
BTW fannix, would you like to join the mailing list and share with us your
feedback on scikit-learn: what have you used it for? What do you plan to use
it for? What is missing?
--
Best Wishes
--------------------------------------------
Meng Xinfan蒙新泛
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China
Olivier Grisel
2011-01-10 15:47:49 UTC
Post by xinfan meng
I am computing the pairwise similarity between words. The first step is to
extract the context of each word and construct a feature vector. I used NLTK
to read text from corpus and used CountVectorizer to create a context-word
matrix. The CountVectorizer can be implemented with ProbDist and
CondtionalProbDist in NLTK, but it would be better to have one such method
ready to use whenever I want to, since this task is so common in NLP. There
are still one issue that I am concerned about: the efficiency.
In terms of CPU or memory? Also, as a first processing step for your context
word matrix you should try the scikits.learn.pca.RandomizedPCA
decomposition, which should scale to hundreds of thousands of
sparse features for tens of thousands of samples, provided you only
want comparatively few singular vectors. Then you can pass the
results to a clustering algorithm that works on dense data, such as
k-means or mean shift. If RandomizedPCA is still too slow for your
problem, then try the gensim implementation of SVD, which is more
scalable (it can work with data that does not fit in memory).
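A rough sketch of that pipeline (class locations follow the thread, but
parameter names are assumptions and have changed across versions; in modern
scikit-learn the equivalents are sklearn.decomposition.TruncatedSVD / PCA
and sklearn.cluster.KMeans(n_clusters=...)):

# Sketch only: project a sparse context-word matrix to a dense low-dimensional
# space with RandomizedPCA, then cluster the projection with k-means.
import numpy as np
import scipy.sparse as sp
from scikits.learn.pca import RandomizedPCA
from scikits.learn.cluster import KMeans

rng = np.random.RandomState(0)
X = sp.csr_matrix(rng.binomial(1, 0.01, size=(1000, 5000)).astype(np.float64))

pca = RandomizedPCA(n_components=100, whiten=True)  # keep ~100 singular vectors
X_reduced = pca.fit(X).transform(X)                 # dense (1000, 100) array
labels = KMeans(k=20).fit(X_reduced).labels_        # cluster in the reduced space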
Post by xinfan meng
I think I will try the sparse vector later.
Indeed you probably should. The API should be the same.
Post by xinfan meng
BTW, I want to bring your attentions to this issue:  NLTK and scikits.learn
are nice complement to each other. NLTK has been developed for a couple of
years and consists many convenient text processing mechanisms. However, NLTK
does not provide many machine learning algorithms. By contrast,
scikits.learn has many machine learning algorithms but now lacks routines
dealing with raw text. Thus I think it will be helpful to take a look at
what has been implemented in NLTK and try to make NLTK and scikits.learn
work together. Also I think it is awesome to introduce scikits.learn to NLP
researchers in the mailing list of NLTK user. I have been using NLTK for two
years but did not know scikits.learn until recently. I am sure there are a
lot of people that want to try all kinds of machine learning algorithms on
text will benefit from this project.
I will do a talk + tutorial on text classification (among other
things) at PyCon in March. That will give me the opportunity to write
more documentation for the text feature parts of scikit-learn. I will
keep the NLTK integration in focus (especially as they already provide so
many annotated corpora). If you want to share some insights or code
snippets for this, please do.

You can comment directly on this issue as I think this is relevant for
the upcoming refactoring:

https://github.com/scikit-learn/scikit-learn/issues/issue/37
Post by xinfan meng
Last but not least, great works, really.
Thanks very much :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2011-01-10 16:01:22 UTC
While I am at it, you should also check out this other project I am working on:

https://github.com/ogrisel/pignlproc

Those are Apache Pig utilities to build training corpora for machine
learning / NLP out of public Wikipedia and DBpedia dumps.

Also Peter is doing interesting python / NLP stuff in nut:

https://github.com/pprett/nut
--
Olivier
Gael Varoquaux
2011-01-10 16:44:16 UTC
Post by xinfan meng
BTW, I want to bring your attentions to this issue:  NLTK and
scikits.learn are nice complement to each other. NLTK has been developed
for a couple of years and consists many convenient text
processing mechanisms. However, NLTK does not provide many machine
learning algorithms. By contrast, scikits.learn has many machine learning
algorithms but now lacks routines dealing with raw text. Thus I think it
will be helpful to take a look at what has been implemented in NLTK and
try to make NLTK and scikits.learn work together.
Hi Xinfan,

I am really happy that you are finding the scikit useful for your work.

With regard to your comments on linking the scikit and NLTK, I am sure
that it would be of great benefit to people doing natural language
processing, but we have to balance that with the fact that
scikit-learn is a general-purpose machine learning package. People with
many different backgrounds use it and contribute to it (the team I am in
works on learning from images and time series in the context of
neuroscience research, for instance).

I feel that we shouldn't try to put application-specific code in the
scikit, the reason being that it would be code that could be understood
and tested by only a fraction of the users and developers. The
research group I am in feels a strong urge to combine the scikit with
nipy (neuroimaging in Python). It would really give great solutions to our
day-to-day problems. However, I have been resisting adding any
neuroimaging-related code to the scikit. For somebody like you, for
instance, it would be a net loss, as it would increase the maintenance
burden for the scikit with no benefit to you. So far, we have simply
started a series of tutorials ('NeuroImaging with the scikit-learn':
http://nisl.github.com/). We might go beyond that if we find the man-power.

As far as where to set the limit between what can go in the scikit, and
what should be hosted elsewhere, I can't give a final answer, but my gut
feeling is that any code that imports from a package that is not among
scikit-learn's current dependencies (numpy, scipy, matplotlib) should not
be in the scikit, as it means that it will be untested and untestable for
a variety of developers and users. I can't count the number of times that
I have fended off one of my colleagues who wanted to add a neuroimaging
example requiring a neuroimaging IO library.

On a side note, I personally find that a domain-agnostic learning package
is incredibly useful. Discussing with people from different backgrounds on
the scikit-learn mailing list has introduced me to notions and techniques
that I didn't know.

That said, I hope that we will see a package emerge that uses
scikit-learn and other packages (NLTK for instance) to do NLP. Whether
it should live in NLTK or in a third package, I can't judge.

My 2 cents,

Gael
xinfan meng
2011-01-11 02:39:08 UTC
Thanks for your replies. I agree with you that we should not introduce yet
another dependency. My point is that since Grisel plans to refactor the
text feature extraction methods, I think it is better to first take a look
at NLTK's implementations, which contain many "user stories". I know
nothing about neuroimaging, but as far as NLP is concerned, there are too
many ways to extract useful information from text. Thus I would not expect
the feature extraction modules of scikits.learn to be comprehensive in
functionality; instead I think it would be more appropriate to let
scikits.learn talk to other packages in this case. What would be the
interchange format they use to talk? Maybe numpy or scipy matrices or
vectors. Scipy provides a sparse matrix type which is very suitable for
text representation.

So, in summary, the text feature extraction methods in scikits.learn should
be flexible and friendly in interface and efficient in representation.

--
Best Wishes
--------------------------------------------
Meng Xinfan蒙新泛
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China
Mathieu Blondel
2011-01-11 04:45:52 UTC
Post by xinfan meng
Thanks for your replies. I did agree with you that we should introduce yet
another dependency. My point is that since Grisel want plan to refactor the
text feature extraction methods, I thinks it is better to first take a look
at NLTK's implementations, which contained many "user stories". I know
nothing about neuroimaging,  but as far as  NLP is  concern, there are two
many ways to extract useful information from text. Thus I would not expect
What constitutes an instance is the user's responsibility. For
example, for you it is a word context. For others it might be an
entire HTML page or a database field. Users who have very specific
needs may have to construct the matrix by themselves.

What kind of new functionality do you have in mind that could make your
life easier?

Mathieu
xinfan meng
2011-01-11 05:19:30 UTC
I think I will not rush to give my list of expected new functionality,
since I am not sure whether it is realistic, and I would like to first try
Grisel's suggestions. So I will try to become more familiar with this
framework in order to provide some more mature suggestions in the future.
But I do have some questions. Since scikits.learn.feature_extraction.text
has both sparse and dense versions, does that mean the algorithms are going
to support both kinds of matrix as arguments? And as far as I know, the
matrix expected by affinity propagation and spectral clustering is
different in nature from an instance-feature matrix: one is a similarity
matrix and the other an instance-feature matrix. So do you plan to add some
routines to convert between them? Thanks.
--
Best Wishes
--------------------------------------------
Meng Xinfan蒙新泛
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China
Mathieu Blondel
2011-01-11 05:50:12 UTC
Post by xinfan meng
I think I will not rush to give my list of expected new functionalities,
since I am not sure if they are realistic and I would like to first try
Grisel's suggestions. So I will try to be more familiar with this framework
in order tod provide some more mature suggestions in the future. But I do
have some questions. Since scikits.learn.feature_extraction.text. have both
sparse and dense versions, so does it means that the algorithms are going to
support both kinds of matrix as arguments?
Some algorithms have both dense and sparse implementations, but not
all. Usually the sparse versions are API-compatible with the dense
versions, with the exception that they take scipy sparse matrices as
argument rather than numpy arrays. For example, you can import SVC
from scikits.learn.svm or scikits.learn.svm.sparse.
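Concretely (a small sketch using the import paths named above; modern
scikit-learn later dropped the separate sparse module and lets
sklearn.svm.SVC accept sparse input directly):

# Sketch: the sparse SVC takes a scipy.sparse matrix where the dense one
# takes a numpy array; otherwise the two share the same API.
import numpy as np
import scipy.sparse as sp
from scikits.learn.svm import SVC                        # dense version
from scikits.learn.svm.sparse import SVC as SparseSVC    # sparse version

X = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.]])
y = np.array([0, 1, 1, 0])

SVC().fit(X, y)                        # dense numpy array input
SparseSVC().fit(sp.csr_matrix(X), y)   # scipy.sparse CSR input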
Post by xinfan meng
provided to affinity propagation and spectral clustering are actually
different in nature, one is similarity matrix and the other is an
instance-feature matrix. So do you plan to add some routines to convert a
matrix between them? Thanks.
Indeed, though you can do clustering with the matrix output by
CountVectorizer with K-means and Mean Shift.

There are plans to add a module for similarity/distance matrices as well
as kernel matrix computations (dense and sparse). Scipy also has a
module for that, but only for dense data. See the function pdist in
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
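In the meantime, a dense instance-feature matrix can be turned into the kind
of square matrix affinity propagation expects with pure scipy; the Gaussian
distance-to-similarity mapping below is just one common choice, not
something prescribed here:

# Sketch: build a square similarity matrix from an instance-feature matrix
# using scipy's pdist (dense data only).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.RandomState(0).rand(30, 5)       # 30 instances x 5 features
D = squareform(pdist(X, metric='euclidean'))   # condensed -> square distances
S = np.exp(-D ** 2 / (2 * D.std() ** 2))       # one common similarity mapping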

Mathieu
Alexandre Gramfort
2011-01-11 14:42:32 UTC
Post by xinfan meng
provided to affinity propagation and spectral clustering are actually
different in nature, one is similarity matrix and the other is an
instance-feature matrix. So do you plan to add some routines to convert a
matrix between them?
I think it would be great to have a sparse version of affinity propagation and
meanshift for example. That would allow both methods to scale with the help
of neighbors.kneighbors_graph. Any volunteer?
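For reference, neighbors.kneighbors_graph already produces a scipy.sparse
connectivity matrix that such sparse variants could consume; a tiny sketch
(the argument name is an assumption and may differ between versions):

# Sketch: build a sparse k-nearest-neighbors graph that a sparse-aware
# clustering algorithm could work on instead of a dense similarity matrix.
import numpy as np
from scikits.learn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(100, 10)
A = kneighbors_graph(X, n_neighbors=5)   # 100 x 100 scipy.sparse matrix
print(A.nnz, "stored entries out of", A.shape[0] * A.shape[1])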

Alex
Olivier Grisel
2011-01-11 14:49:26 UTC
Post by Alexandre Gramfort
Post by xinfan meng
provided to affinity propagation and spectral clustering are actually
different in nature, one is similarity matrix and the other is an
instance-feature matrix. So do you plan to add some routines to convert a
matrix between them?
I think it would be great to have a sparse version of affinity propagation and
meanshift for example. That would allow both methods to scale with the help
of neighbors.kneighbors_graph. Any volunteer?
That would be great. I will probably start a Cython implementation of
sequential k-means soon. I might do the CSR-sparse representation as
well.

In the meantime, performing a RandomizedPCA with whiten=True on the
raw sparse data to get a dense projection onto the reduced dimensional
space (say 100 components) might be a good way to preprocess
large-scale sparse data.

Also, reducing the dimension with PCA has experimentally been seen to
improve the quality of clustering algorithms such as k-means (at
least on image patch data): it helps find a good area before rerunning
on less reduced data (a bit like curriculum learning):
https://sites.google.com/site/kmeanslearning/random-results
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Alexandre Gramfort
2011-01-11 14:54:40 UTC
for k-means there is also this pointer given previously on the ML:

Peter Gehler has a re-implementation in C with Python bindings of
Elkan's ICML'03 paper here:
http://mloss.org/software/view/48/
(Apache License, we could ask him to relicense for re-implementation
in Cython in scikits.learn).

I tried and it's about 10 times faster on digits compared to what we have now.

any volunteer to ask the author for BSD licensing and adding it to the
scikit?

Alex

Nicolas Pinto
2011-01-12 15:24:03 UTC
Hey guys,
Post by Alexandre Gramfort
any volunteer to ask the author for BSD licensing and adding it the
scikit?
I just did, hopefully we'll get a positive answer ;-)

Cheers,

N
--
Nicolas Pinto, PhD
Research Scientist in Brain and Computer Sciences
The Rowland Institute at Harvard
McGovern Institute for Brain Research at MIT
http://web.mit.edu/pinto
Nicolas Pinto
2011-01-12 15:25:29 UTC
FYI

(cc'ing Peter Gehler)
Hi Nico, Thanks for the email. I believe I can re-license, let me look at it
and come back to you next week. There is also a new version with weighted
kmeans and kmeans++ initialization.
Happy new year and all the best, Peter
Hey Peter,
Hope all is well.
Your k-means code would be a great addition to scikits.learn. Would
you mind re-licensing it to BSD (see below) ?
Happy new year!
Cheers,
Nicolas
--
Nicolas Pinto, PhD
Research Scientist in Brain and Computer Sciences
The Rowland Institute at Harvard
McGovern Institute for Brain Research at MIT
http://web.mit.edu/pinto
Olivier Grisel
2011-01-12 15:53:06 UTC
Post by Nicolas Pinto
Hey guys,
Post by Alexandre Gramfort
any volunteer to ask the author for BSD licensing and adding it the
scikit?
I just did, hopefully we'll get a positive answer ;-)
Hi Nicolas, nice to see you around here. I suppose you are aware of
the use of convolutional triangle kmeans for feature extraction. A
discussion about A. Coates' paper has spontaneously appeared here. I
would appreciate it if you could share your opinion.

https://sites.google.com/site/kmeanslearning/
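(For readers who haven't seen the paper: the "triangle" encoding maps a
whitened patch x to f_k(x) = max(0, mean_j(z_j) - z_k), where z_k is the
Euclidean distance from x to the k-th k-means centroid. Below is a small
numpy sketch of my reading of that mapping, not code from any of the
branches mentioned here.)

# Illustrative sketch of the "triangle" k-means feature mapping
# (Coates, Lee & Ng, 2011); not code from the branches in this thread.
import numpy as np

def triangle_features(patches, centroids):
    # patches: (n_patches, n_dims) whitened patches; centroids: (k, n_dims)
    dists = np.sqrt(((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(-1))
    mean_dist = dists.mean(axis=1, keepdims=True)
    return np.maximum(0.0, mean_dist - dists)   # keep centroids closer than average

rng = np.random.RandomState(0)
features = triangle_features(rng.rand(10, 108), rng.rand(50, 108))  # 6x6x3 patches, 50 centroids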

Also James and I recently started working on reproducing those results
in the scikit in those branches:

https://github.com/ogrisel/scikit-learn/tree/image-patches
https://github.com/jaberg/scikit-learn/tree/ogrisel_image-patches

I also CC Zak as he might be interested as well.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
James Bergstra
2011-01-12 17:37:31 UTC
I think I got pretty far yesterday reproducing everything but I don't get
the awesome performance yet :/

I'll send an update later today once I get the latest code pushed. I'll be
crowdsourcing you guys to try and figure out what's the missing secret
sauce...

(Are Nicolas and Zak on this mailing list?)

James

--
http://www-etud.iro.umontreal.ca/~bergstrj
Olivier Grisel
2011-01-12 17:44:42 UTC
Post by James Bergstra
I think I got pretty far yesterday reproducing everything but I don't get
the awesome performance yet :/
By looking at the filters I get with whitened k-means on 6x6 patches
from the CIFAR dataset, I don't get filters as clean as the ones
reported by A. Coates et al. or by Andrej Karpathy's matlab code.

The gray-level filters look beautiful though. The problem only arises
when clustering with the uncollapsed color dimensions.

=> There might be an issue with my code.
Post by James Bergstra
I'll send an update later today once I get the latest code pushed.  I'll be
crowdsourcing you guys to try and figure out what's the missing secret
sauce...
(Are Nicolas and Zak on this mailing list?)
I am not 100% sure, I'll CC them both in this mail to check :)
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Nicolas Pinto
2011-01-12 18:29:35 UTC
Post by Olivier Grisel
Post by James Bergstra
(Are Nicolas and Zak on this mailing list?)
Yep!
Post by Olivier Grisel
Post by James Bergstra
I think I got pretty far yesterday reproducing everything but I don't get
the awesome performance yet :/
On CIFAR-10, NORB, or both? I'm still wondering why these papers jump
from dataset to dataset all the time (e.g. Lee et al. ICML'09 Best
Paper used Caltech101 but not here anymore, and for some reason the
authors don't even compare against established baselines that Honglak
reproduced in 2008... weird). Are the difficulty of these datasets and
their suitability for deep learning algorithms clearly understood?
Post by Olivier Grisel
By looking at the filters I get with whitened k-means on 6x6 patches
from the CIFAR dataset, I don't get as clean filters as the one
reported by A. Coates et al or by Andrej Karpathy matlab code.
Clean filters don't necessarily perform better, do they? Should we
aim to get "clean filters" or reproduce the performance on the same
benchmarks? Actually, if you take many (e.g. thousands of) random
filters, you can sometimes do much better than many learning
algorithms, since the architecture parameters sometimes matter *much*
more (number of layers, number of filters, normalization schemes,
etc.) and these are hard to learn (you may need derivative-free
"black-box" methods, or apply decision trees like Yoshua Bengio).

I'll try to prepare better answers soon, hopefully with unpublished
insights we got since we started in 2006 as they may be interesting to
some of you.

Let me get back to you as soon as possible on
https://sites.google.com/site/kmeanslearning

Now back to the huge backlog to process ;-)

Cheers,

N
--
Nicolas Pinto
http://web.mit.edu/pinto
Zak Stone
2011-01-12 22:04:27 UTC
Hello all,

I am indeed on this list, and thank you for cc'ing me to be sure! I
just joined the new, more specific mailing list below as well.
(Nicolas, have you joined that discussion too?)

https://groups.google.com/forum/#!forum/kmeanslearning

I'm very interested in features for image recognition (especially for
face identification), though I am currently working on a project that
builds on top of any reasonable features. I look forward to following
these threads!

Zak
Olivier Grisel
2011-01-13 01:31:25 UTC
I've made some progress in reproducing these results.  I took Olivier's
scikits branch as a point of departure and added
- colour support
I had already implemented this Friday evening in my branch:
https://github.com/ogrisel/scikit-learn/commit/38d1ac1d99e048d645bb5eeb1e15662529df3588

However, as the patches did not look as good in color mode as they did in
gray-level mode, I might have a bug somewhere in my reshapes...
- local contrast normalization
- speed improvements to kmeans algo in scikits (pending pull request to trunk)
- support for dropping leading PCA dimensions prior to clustering (makes striped filters)
- the convolutional pooling feature extraction (triangle version)
- a classification testing code
Great!
https://github.com/jaberg/scikit-learn/commits/ogrisel_image-patches
examples/applications/plot_image_classification_convolutional_features.py
(sorry Olivier if I make merging annoying, I'm happy to work in a different
file going forward).
No problem, merging issues are nothing compared to the joy of working
collaboratively on this :)
But I broke down the end-to-end algorithm into three stages that you can run:
1. python plot_image_classification_convolutional_features.py train_kmeans
2. python plot_image_classification_convolutional_features.py features_from_saved_extractor
3. python plot_image_classification_convolutional_features.py classify_features
Each of these commands takes arguments too; you can read the source file to
see what they are.
The default settings of these scripts should produce something like the image
attached in the "kernels.png" file (created in cwd).
Woa ! :)

How many kmeans iterations does it take?
On how many input images / patches do you train the kmeans?
For how long does it run?
The filters are looking sort of like those in the paper, but I still can't
figure out how to make them so localized. Also, the classification results
are [...] which is nowhere near the 73% that the paper reports.
I will probably give it a try this weekend too, after merging your branch.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
James Bergstra
2011-01-13 02:37:30 UTC
Post by Olivier Grisel
I've made some progress in reproducing these results. I took Olivier's
scikits branch as a point of departure and added
- colour support
https://github.com/ogrisel/scikit-learn/commit/38d1ac1d99e048d645bb5eeb1e15662529df3588
However as the patch did not look as good in color mode as they did in
gray levels mode I might have a bug somewhere in my reshapes...
You're right I think it was actually working in colour, and the only thing
broken was the imshow in the main experiment script. :P

I also *thought* something was a mistake and it turns out maybe it wasn't...
when pre-processing each image patch you subtract off the mean pixel and
divide by the standard deviation, right? I thought the idea was to subtract
off the mean intensity *per channel* but I just switched to subtracting off
the mean pixel *for all channels* (which I've seen in a few other people's
code now) and that gives much sharper filters from the algorithm, more like
what others report. So I'm leaving in the option to do either
(center_mode='all' or 'channel') but it seems like the simple thing is the
better one.
Post by Olivier Grisel
How many kmeans iteration does it take?
Qualitatively, probably around 20? I let it run for 60, it had still not
converged.
Post by Olivier Grisel
On how many input images / patches do you train the kmeans?
I think it was 160000.
Post by Olivier Grisel
For how long does it run?
Maybe 5 minutes?
Post by Olivier Grisel
The filters are looking sortof like in the paper, but I still can't
figure
out how to make them so localized. Also, the classification results are
which is nowhere near the 73% that the paper reports.
I will probably give it a try this WE too after merging you branch.
Cool I just fixed a few more bugs, so make sure to merge latest.

James
--
http://www-etud.iro.umontreal.ca/~bergstrj