Discussion:
Tutorial or tutorials
Gael Varoquaux
2011-07-02 19:25:36 UTC
I am preparing my scikit-learn tutorial for the scipy conference. I have
cloned the current scikit-learn tutorial and started modifying it for my
purposes.

I created a branch for my changes, as I wanted to put the focus on
different things in the scikit (the scipy crowd is less interested in
text-mining than in inference problems). However, it is becoming clear to
me that there is no one-size-fits-all tutorial and that trying to merge
my scipy tutorial with the current one will just give a big beast that is
suboptimal in terms of teaching. Here is what I propose:

1. Rename the current tutorial to 'text-mining tutorial'.
2. Create a second tutorial, maybe 'statistical-learning for scientific
data processing'.

What do people think?

By the way, I have changed the layout of the HTML to be more
screen-friendly when used as slides. This might be useful to merge into the
existing tutorial.

Gaël

PS: My Internet access is very irregular right now, as I am 'airport
surfing'.
Alexandre Gramfort
2011-07-02 19:43:36 UTC
Hi Gael,
 1. Rename the current tutorial to 'text-mining tutorial'.
 2. Create a second tutorial, maybe 'statistical-learning for scientific
   data processing'.
can you be more specific about what you want to put in 2?
what would be your outline and what application do you have in mind
for illustration?

maybe text can be only one application section and you can start
by isolating the text specific tuto?
PS: My Internet access is very irregular right now, as I am 'airport
surfing'.
same for me :)

Alex
Gael Varoquaux
2011-07-02 19:46:49 UTC
Post by Alexandre Gramfort
 1. Rename the current tutorial to 'text-mining tutorial'.
 2. Create a second tutorial, maybe 'statistical-learning for scientific
   data processing'.
can you be more specific about what you want to put in 2?
what would be your outline and what application do you have in mind
for illustration?
What is listed in
http://conference.scipy.org/scipy2011/tutorials.php#gael
Post by Alexandre Gramfort
maybe text can be only one application section and you can start
by isolating the text specific tuto?
That's what I thought in the beginning, but it seems to me that it will
be quite awkward and, as I said, suboptimal in terms of learning
experience. The advantage of a tutorial over full documentation is that it
is focused and thus shorter. We would be losing this.

Gaël
Alexandre Gramfort
2011-07-02 19:55:23 UTC
Post by Gael Varoquaux
Post by Alexandre Gramfort
can you be more specific about what you want to put in 2?
what would be your outline and what application do you have in mind
for illustration?
What is listed in
http://conference.scipy.org/scipy2011/tutorials.php#gael
Post by Alexandre Gramfort
maybe text can be only one application section and you can start
by isolating the text specific tuto?
That's what I thought in the beginning, but it seems to me that it will
be quite awkward and, as I said, suboptimal in terms of learning
experience. The advantage of a tutorial over full documentation is that it
is focused and thus shorter. We would be losing this.
seems fair to me.

I'll watch your fork to give some feedback.

Alex
Gael Varoquaux
2011-07-02 19:58:08 UTC
Post by Alexandre Gramfort
I'll watch your fork to give some feedback.
I have no wifi on the plane; you'll have to wait until we meet up in
Munich :P

G
Vlad Niculae
2011-07-03 12:45:06 UTC
A clean sphinx build on my system is broken, because the template has
hardcoded "_static" and "_images" paths, and the sphinxtoghpages
extension strips the underscores.

I found that the github underscore problem can be bypassed by putting
a file named ".nojekyll" in the root dir.
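
As a minimal sketch of the ".nojekyll" trick: GitHub Pages runs the published files through Jekyll, which skips directories whose names start with an underscore; an empty ".nojekyll" file at the root of the published tree turns that processing off. The helper name add_nojekyll and the "_build/html" path in the comment are assumptions for illustration, not part of the tutorial's Makefile.

```python
from pathlib import Path


def add_nojekyll(html_dir):
    """Create an empty .nojekyll marker file in an existing HTML output directory."""
    marker = Path(html_dir) / ".nojekyll"
    # touch() creates the file if it is missing and leaves an existing one alone,
    # so this is safe to call after every build.
    marker.touch()
    return marker


# Hypothetical usage, assuming Sphinx wrote its HTML output to _build/html:
#     add_nojekyll("_build/html")
```

With the marker in place, the "_static" and "_images" directories can keep their underscores.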

Vlad

------------------------------------------------------------------------------
All of the data generated in your IT infrastructure is seriously valuable.
Why? It contains a definitive record of application performance, security
threats, fraudulent activity, and more. Splunk takes this data and makes
sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-d2d-c2
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Olivier Grisel
2011-07-03 13:11:29 UTC
Post by Vlad Niculae
A clean sphinx build on my system is broken, because the template has
hardcoded "_static" and "_images" paths, and the sphinxtoghpages
extension strips the underscores.
I found that the github underscore problem can be bypassed by putting
a file named ".nojekyll" in the root dir.
Interesting. Please feel free to do the change if you can make it work.

I use the following tool to do the import to github itself:

http://pypi.python.org/pypi/ghp-import

There should be a Makefile target to do the upload.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Vlad Niculae
2011-07-03 13:22:30 UTC
Since the last mail, I've been trying to hack and fix the
"sphinxtoghpages.py" script. I find it horrible and written in a very
enterprisey style, with tons of 3-line objects and factories, which
makes it hell on earth to track where something actually gets done.

I can't figure out for the life of me why, on my (win32) system, it
walks through all the html and js files, but it fails to replace the
relative urls inside these files, like it should.

The script also has an unfortunate bug/feature: if one runs "make
html" twice in a row without "make clean", it throws an exception
because there are no folders to rename, since it renamed them on the
last pass.
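
For illustration only, here is one way a post-build rename step could be made safe to run twice. This is a hypothetical sketch and not the code of sphinxtoghpages.py; the function name rename_once and the directory names in the comment are assumptions.

```python
import os


def rename_once(build_dir, old_name, new_name):
    """Rename build_dir/old_name to build_dir/new_name if old_name is still there.

    Returns True when a rename happened and False when there was nothing to do.
    """
    src = os.path.join(build_dir, old_name)
    dst = os.path.join(build_dir, new_name)
    if not os.path.isdir(src):
        # Already renamed on an earlier pass (or never generated): skip quietly
        # instead of raising.
        return False
    os.rename(src, dst)
    return True


# Hypothetical usage after "make html":
#     rename_once("_build/html", "_static", "static")
```

Checking for the source directory first means a second "make html" without "make clean" becomes a no-op for this step rather than an error.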

If I can't figure it out in 5 minutes I won't waste my sanity on that
file. IMHO the .nojekyll trick is a cleaner solution than hard
renaming all the files after building, as the sphinxtoghpages script
attempts.

P.S. I hope I didn't offend anyone, however I'm pretty sure that
script couldn't have been written by any of you guys.
Olivier Grisel
2011-07-03 13:26:04 UTC
Post by Vlad Niculae
Since the last mail, I've been trying to hack and fix the
"sphinxtoghpages.py" script. I find it horrible and written in a very
enterprisey style, with tons of 3-line objects and factories which
makes it hell on earth to track where something actually gets done.
I can't figure out for the life of me why, on my (win32) system, it
walks through all the html and js files, but it fails to replace the
relative urls inside these files, like it should.
The script also has an unfortunate bug/feature: if one runs "make
html" twice in a row without "make clean", it throws an exception
because there are no folders to rename, since it renamed them on the
last pass.
If I can't figure it out in 5 minutes I won't waste my sanity on that
file. IMHO the .nojekyll trick is a cleaner solution than hard
renaming all the files after building, as the sphinxtoghpages script
attempts.
No problem. I don't even know who the author of this sphinxtoghpages
utility is, and I wasn't aware of the .nojekyll trick. I just used it by
copy / pasting the setup from another Python project that used the
sphinx / github pages combo.

Still http://pypi.python.org/pypi/ghp-import is very useful for the upload part.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Vlad Niculae
2011-07-03 13:47:02 UTC
Gael indeed uses ghp-import and has configured a target in the makefile.

I managed to fix the sphinxtoghpages script, at least on windows. I
also made it not throw the exception by default. I sent Gael a pull
request.
Gael Varoquaux
2011-07-03 14:22:09 UTC
Post by Vlad Niculae
Gael indeed uses ghp-import and has configured a target in the makefile.
That's because I just forked everything from Olivier :).

If we can use '.nojekyll', I am in favor of using it and removing
sphinxtoghpages: extra code means more maintenance.

G
Olivier Grisel
2011-07-03 14:40:04 UTC
Post by Gael Varoquaux
Post by Vlad Niculae
Gael indeed uses ghp-import and has configured a target in the makefile.
That's because I just forked everything from Olivier :).
If we can use '.nojekyll', I am in favor of using it and removing
sphinxtoghpages: extra code means more maintenance.
+1 for .nojekyll instead of sphinxtoghpages if it works as
expected (has anyone tried it?).

I just want to make clear that ghp-import and sphinxtoghpages are two
separate tools. ghp-import is only used to manage the gh-pages branch
automatically:

sudo pip install ghp-import
git checkout master
cd tutorial
make clean html
make ghp-import

ghp-import does the gh-pages branch checkout / merge / push automatically.
It's much faster than doing it manually and less error prone too.
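
The same upload step can be sketched from Python. This assumes a recent ghp-import release whose command line accepts -n (include a .nojekyll file in the branch), -m (commit message) and -p (push after committing); the release available at the time of this thread may differ. The function names and the "_build/html" path are illustrative assumptions.

```python
import subprocess


def ghp_import_command(html_dir, message="Update documentation", push=False):
    """Build the ghp-import argument list for publishing html_dir."""
    # -n asks ghp-import to add a .nojekyll file to the gh-pages branch.
    cmd = ["ghp-import", "-n", "-m", message]
    if push:
        # -p pushes the gh-pages branch after the commit is created.
        cmd.append("-p")
    cmd.append(html_dir)
    return cmd


def publish(html_dir, push=False):
    """Run ghp-import; raises CalledProcessError if the command fails."""
    subprocess.check_call(ghp_import_command(html_dir, push=push))


# Hypothetical usage after "make clean html":
#     publish("_build/html", push=True)
```

Keeping the command construction separate from the call makes it easy to print or inspect the command before anything is pushed.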
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel
2011-07-03 13:41:04 UTC
Post by Gael Varoquaux
I am preparing my scikit-learn tutorial for the scipy conference. I have
cloned the current scikit-learn tutorial and started modifying it for my
purposes.
I created a branch to do my changes, as I wanted to put the focus on
different things in the scikit (the scipy crowd is less interested in
text-mining than in inference problems). However, it is becoming clear to
me that there is no one-size-fits-all tutorial and that trying to merge
my scipy tutorial with the current one will just give a big beast that is
suboptimal in terms of teaching. Here is what I propose:
 1. Rename the current tutorial to 'text-mining tutorial'.
 2. Create a second tutorial, maybe 'statistical-learning for scientific
   data processing'.
What do people think?
By the way, I have changed the layout of the HTML to be more
screen-friendly when used as slides. This might be useful to merge in the
existing tutorial.
I think we could make the chapter "Working with text data" [1] more
self contained.

[1] http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html

By more self-contained, I mean by not having to read the "Machine
Learning 101" [2] chapter if you just want to get started writing a
text document classification tool.

[2] http://scikit-learn.github.com/scikit-learn-tutorial/general_concepts.html

Later, after the end of Vlad's GSoC, we might have enough material in the
scikit to write a similar tutorial on basic computer vision stuff,
called "Image Classification and Denoising" with scikit-learn, for
instance.

Maybe we could merge the content of the scikit-learn-tutorial back
into the original scikit-learn documentation. We could introduce a
subsection called "Tutorials" (plural).

In the "Tutorials" folder we could have:

- General Introduction to Machine Learning Concepts
- Statistical Learning for Discovery and Inference on Numerical Data
  (or whatever Gael wants to call it)
- Text Classification and Clustering
- Image Classification and Denoising
- Some rules of thumb for choosing the right scikit-learn algorithm
  for the task

Each tutorial would come with its own set of exercises and helpers to
download real-life datasets and turn them into workable scikit-learn
formatted input (e.g. extract the text from an online archive of PDF
files or from Wikipedia articles).

The main difference I see between tutorials and the rest of the
documentation & examples is that in tutorials we deliberately choose
to ignore / overlook alternative scikit-learn models, classes and
utilities, and important information on the various implementations, to
favour teachability and quick results over comprehensiveness.
Of course, we should add links from the end of each tutorial to the
reference documentation (e.g. in a section called "Learn more
about..." with a bullet list entry for each subtopic and a one-sentence
overview of the linked content).

Furthermore, tutorials could add dependencies on other tools /
libraries for data preprocessing that we might not want to introduce
as dependencies of the scikit-learn project itself.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2011-07-03 14:20:18 UTC
Post by Olivier Grisel
By more self-contained, I mean by not having to read the "Machine
Learning 101" [2] chapter if you just want to get started writing a
text document classification tool.
Right, but in a sense, in the tutorial that I'll be giving, I want a
'machine learning 101' too. I just want a different flavor to it, because
I am interested in putting the emphasis elsewhere.
Post by Olivier Grisel
The main difference I see between tutorials and the rest of the
documentation & examples is that in tutorials we deliberately choose
to ignore / overlook alternative scikit-learn models, classes and
utilities and important information on the various implementations to
favour teachability and quick results over comprehensiveness.
Right. In addition, I still believe that each tutorial should
clearly state what the entry point is (requirements) and where it wants
to get to.

Gael
Olivier Grisel
2011-07-03 14:50:42 UTC
Post by Gael Varoquaux
Post by Olivier Grisel
By more self-contained, I mean by not having to read the "Machine
Learning 101" [2] chapter if you just want to get started writing a
text document classification tool.
Right, but in a sense, in the tutorial that I'll be giving, I want a
'machine learning 101' too. I just want a different flavor to it, because
I am interested in putting the emphasis elsewhere.
Don't you think it's possible to make it application agnostic (by just
quickly introducing the feature extraction concepts without diving too
much into application specific details?) so that it can be used as a
common introduction to all the application-centric tutorials?
Post by Gael Varoquaux
Post by Olivier Grisel
The main difference I see between tutorials and the rest of the
documentation & examples is that in tutorials we deliberately choose
to ignore / overlook alternative scikit-learn models, classes and
utilities and important information on the various implementations to
favour teachability and quick results over comprehensiveness.
Right. In addition, I still believe that each tutorial should
clearly state what the entry point is (requirements) and where it wants
to get to.
+1
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2011-07-03 15:06:22 UTC
Post by Olivier Grisel
Don't you think it's possible to make it application agnostic (by just
quickly introducing the feature extraction concepts without diving too
much into application specific details?) so that it can be used as a
common introduction to all the application-centric tutorials?
I think so, but it has a cost in terms of teaching. For instance, I don't
discuss feature extraction until way down the line: for scientists, feature
extraction appears necessary only after they have understood that they
need to reduce the dimensionality of their problem. So I progressively
introduce the high-dimensional estimation problem first, and only
very late do I start talking about feature selection. Teaching feature
extraction in itself to scientists is often irrelevant, as it is
domain-specific, and they know it better than us.

There is a compromise between a tutorial that is targeted at an audience
and factoring out code to make it more easily maintainable. Given that I
am going to speak and give the tutorial, I want it to be well targeted.

Gael
Olivier Grisel
2011-07-03 15:36:49 UTC
Post by Gael Varoquaux
Post by Olivier Grisel
Don't you think it's possible to make it application agnostic (by just
quickly introducing the feature extraction concepts without diving too
much into application specific details?) so that it can be used as a
common introduction to all the application-centric tutorials?
I think so, but it has a cost in terms of teaching. For instance, I don't
discuss feature extraction until way down the line: for scientists, feature
extraction appears necessary only after they have understood that they
need to reduce the dimensionality of their problem. So I progressively
introduce the high-dimensional estimation problem first, and only
very late do I start talking about feature selection. Teaching feature
extraction in itself to scientists is often irrelevant, as it is
domain-specific, and they know it better than us.
There is a compromise between a tutorial that is targeted at an audience
and factoring out code to make it more easily maintainable. Given that I
am going to speak and give the tutorial, I want it to be well targeted.
OK, write your tutorial the way you want, without any maintenance
constraints for now, and we will see about merging later, probably
after scipy (maybe during the scikit-learn sprint if you have time to
attend it?). I still think that we can have a common general intro, but
I would be curious to read your version before arguing about the
details.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel