Post by Gael Varoquaux
I am preparing my scikit-learn tutorial for the scipy conference. I have
cloned the current scikit-learn tutorial and started modifying it for my
I created a branch to do my changes, as I wanted to put the focus on
different things in the scikit (the scipy crowd is less interested in
text-mining than in inference problems). However, it is becoming clear to
me that there is no one-size-fits-all tutorial and that trying to merge
my scipy tutorial with the current one will just give a big beast that is
1. Rename the current tutorial to 'text-mining tutorial'.
2. Create a second tutorial, maybe 'statistical-learning for scientific
What do people think?
By the way, I have changed the layout of the HTML to be more
screen-friendly when used as slides. This might be useful to merge in the
I think we could make the chapter "Working with text data" more
By more self-contained, I mean not having to read the "Machine
Learning 101" chapter if you just want to get started writing a
text document classification tool.
Later, after the end of Vla's GSoC, we might have enough material in the
scikit to write a similar tutorial on basic computer vision stuff
called "Image Classification and Denoising" with scikit learn for
Maybe we could merge the content of the scikit-learn-tutorial back
into the original scikit-learn documentation. We could introduce a
subsection called "Tutorials" (plural).
In the "Tutorials" folder we could have:
- General Introduction to Machine Learning Concepts
- Statistical Learning for Discovery and Inference on Numerical Data
(or whatever Gael wants to call it).
- Text Classification and Clustering
- Image Classification and Denoising
- Some rules of thumb for choosing the right scikit-learn algorithm
for the task
Each tutorial would come with its own set of exercises and helpers to
download real-life datasets and turn them into workable scikit-learn
formatted input (e.g. extracting the text from an online archive of PDF
files or from Wikipedia articles).
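To make the idea of such a helper a bit more concrete, here is a rough sketch of what one could look like for text data. It only uses the standard library; `load_text_folder` is a hypothetical name (not an existing scikit-learn function), and the one-subfolder-per-category layout and the `data` / `target` / `target_names` keys are assumptions made for illustration.

```python
import os


def load_text_folder(root):
    """Read root/<category>/<file> into parallel lists of texts and labels.

    Hypothetical helper: assumes one subfolder per category, each holding
    plain-text files that have already been downloaded or extracted.
    """
    # Category names are the subfolder names, sorted for stable label ids.
    target_names = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    data, target = [], []
    for label, category in enumerate(target_names):
        folder = os.path.join(root, category)
        for filename in sorted(os.listdir(folder)):
            path = os.path.join(folder, filename)
            if not os.path.isfile(path):
                continue
            # Replace undecodable bytes rather than failing on messy files.
            with open(path, encoding="utf-8", errors="replace") as f:
                data.append(f.read())
            target.append(label)
    return {"data": data, "target": target, "target_names": target_names}
```

The returned lists of raw documents and integer labels would then be the starting point for whatever feature extraction step the tutorial teaches.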
The main difference I see between tutorials and the rest of the
documentation & examples is that in tutorials we deliberately choose
to ignore / overlook alternative scikit-learn models, classes and
utilities, and important information on the various implementations, to
favour teachability and quick results over comprehensiveness.
Of course we should add links from the end of each tutorial to the
reference documentation (e.g. in a section called "Learn more
about..." with a bullet-point list entry for each subtopic, giving a
one-sentence overview of the linked content).
Furthermore, tutorials could add dependencies on other tools /
libraries for data preprocessing that we might not want to introduce
as dependencies of the scikit-learn project itself.
http://twitter.com/ogrisel - http://github.com/ogrisel