Interesting discussion... It clearly shows that different people have
different points of view.
Here is the way I think of it:
I'd like sub-package names to reflect user goals, rather than
optimisation methods, or abstract classes of problems.
Obviously, this philosophy cannot really work for everything: we have to
balance it against using well-known problem and solution names.
However, I must say that I am not terribly enthusiastic about Olivier's
suggestion of breaking up the GLM into packages named after the
optimisation strategy used to solve a regression problem: for a
non-specialist, it will not be obvious that coordinate_descent.Lasso is
the same thing as least_angle.Lasso, and that they both solve a
regression problem. In addition, having them far apart in the import code
means that the user is less likely to 'guess' that he could/should change
optimisation algorithm depending on his data, for the same kind of task.
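To make the discoverability point concrete, here is a minimal runnable sketch (the layouts and the `solver` parameter are hypothetical illustrations, not the actual scikits.learn API) contrasting an optimizer-named layout with a task-named one:

```python
# Hypothetical sketch, not the actual scikits.learn API.

# Optimizer-named layout: the two solvers for the same regression task
# live in unrelated namespaces, so nothing hints they are interchangeable.
class _CoordinateDescentLasso:
    """Stand-in for coordinate_descent.Lasso in the optimizer-named layout."""

class _LeastAngleLasso:
    """Stand-in for least_angle.Lasso in the optimizer-named layout."""

# Task-named layout: one entry point, and the optimisation algorithm is
# just a parameter, so users naturally discover they can switch solvers
# depending on their data.
class Lasso:
    """regression.Lasso in the task-named layout (hypothetical)."""
    def __init__(self, solver="coordinate_descent"):
        if solver not in ("coordinate_descent", "least_angle"):
            raise ValueError("unknown solver: %r" % solver)
        self.solver = solver

print(Lasso(solver="least_angle").solver)  # -> least_angle
```

Under the task-named layout, switching algorithms is a one-argument change rather than a different import path.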
Grouping into user-oriented classes of problems seems preferable to
me. And while 'flat is better than nested', the problem is where to put
the branching. I'd prefer having fairly full sub-packages, as long as
they are named with a name that a user can identify. For instance, I
could see someone importing 'scikits.learn.cluster' and tab-completing on
it to see what clustering algorithms are available. This would be a bit
similar to the organisation of scipy. Also, I would favor having fewer
packages to import from, with more content in them, this content being
imported directly in the __init__ of the sub-packages.
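A small runnable sketch of that re-export pattern (the algorithm names are illustrative, and the sub-package is simulated here with an in-memory module):

```python
# Sketch of a task-oriented sub-package whose __init__ re-exports its
# content, so one import plus tab-completion reveals the algorithms.
# In a real cluster/__init__.py this would be lines like:
#     from .k_means_ import KMeans
# Simulated here with an in-memory module (names are hypothetical).
import types

# Stand-ins for algorithm classes that would live in their own files.
class KMeans: pass
class MeanShift: pass
class AffinityPropagation: pass

# Simulate `scikits.learn.cluster` after its __init__ has run.
cluster = types.ModuleType("cluster")
for algo in (KMeans, MeanShift, AffinityPropagation):
    setattr(cluster, algo.__name__, algo)

# Tab-completing on `cluster.` now shows what is available:
print(sorted(n for n in dir(cluster) if not n.startswith("_")))
# -> ['AffinityPropagation', 'KMeans', 'MeanShift']
```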
On the other hand, too much generality is dangerous. Just after
suggesting linear_model, it struck me that it might be too general.
While it is true that PCA can be thought of as a manifold learning
problem, it is also a latent factor analysis problem, a dictionary
learning problem, a matrix factorisation problem... We shouldn't require
our users to understand the 'big picture' of machine learning to use the
library.
Here are a few suggestions/gut feelings (I am giving them numbers to
make them easier to refer to):
1. 'glm' becomes 'regression' with the same content
2. pca + fastica go in a 'decomposition' sub-package, into which NMF,
sparse PCA and dictionary learning will go.
3. I agree with 'hmm' -> 'hidden_markov'
4. I am wondering if gmm.py should go in a sub-package called 'mixture',
and be called 'gaussian' in it.
5. I don't know what to do with qda and lda. My gut feeling tells me they
should go together. Any suggestions?
6. svm is the name of an optimisation algorithm, not a class of problems.
On the other hand, it is such a well known algorithm that people
expect to find it where it currently is.
7. I don't know what to do with sgd. It's a really cool optimization
technique, useful in many places, but I can't think of where to fit
it in a task-oriented view. (That's why I vote to simply not change
it.)
8. I am a bit worried by the profusion of the word 'Gaussian': we could
have 'gaussian_process', 'gaussian_mixture', 'gaussian_graphs'. It
seems that we can avoid the last two with 'mixture.gaussian' and by
using 'covariance' instead of ggm. However, 'Gaussian' and 'linear'
now raise warning flags for me, as being fairly non-informative words.
9. Do Gaussian processes belong to regression? I would tend to think that
they don't, as they are most often not used to solve the same problems
as what is in the glm package; but with a global view of the field, they
are certainly a regression method.
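Taken together, the suggestions above would give an import surface along these lines (a sketch of the proposal only, not the actual package tree):

```python
# Hypothetical top-level layout implied by suggestions 1-8
# (a proposal sketch, not the real scikits.learn tree):
#
# scikits.learn.regression        # 1. formerly glm
# scikits.learn.decomposition     # 2. pca, fastica; later NMF,
#                                 #    sparse PCA, dictionary learning
# scikits.learn.hidden_markov     # 3. formerly hmm
# scikits.learn.mixture.gaussian  # 4. formerly gmm
# scikits.learn.svm               # 6. kept under its well-known name
# scikits.learn.covariance        # 8. instead of ggm
```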
'Tis about time that I grab some sleep. How do people feel about the
suggestions above?
Post by Olivier Grisel
Post by Matthieu Brucher
Post by Mathieu Blondel
Some modifications are more difficult. What to do of fastica, pca,
PCA could go inside the (soon available?) manifold module, as it is
used to reduce dimension. Everything ICA related perhaps also?
As I explained on the linear models case, I would rather have more
top-level modules of moderate sizes and complexities than a few big
modules with many independent algorithms inside.
We can and we must use the documentation to introduce related
algorithms together and explain how their implementations differ while
having similar purposes, rather than using a deep package / module
hierarchy to achieve that goal. Flat is better than nested :)
Research Fellow, INSERM
Associate researcher, INRIA
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-78-35
Mobile: ++ 33-6-28-25-64-62