So, I am losing Internet connection in a few hours. I couldn't finish my
review. Sorry, it takes a while to dig into the code, and amidst a busy
conference it is a challenge.
Here are my notes on the code. To recap my general feeling: the code is
really going in the right direction. The examples look right, although a
bit slow. I am still worried that there is a fairly high level of
indirection in the code: it is nested in terms of API, and has functions
calling functions, ... Finally, another general remark is that there are
many, many places where the code looks fairly suboptimal (sparse matrices
with no optimisation on the sparse matrix type, no in-place operations in
core loops), and the code is still much, much slower than nipy's manifold
learning code. The combination of the remaining complexity of the code
due to its abstraction and the slowness makes me think we shouldn't
integrate it right now, but rather iterate a bit more on it.
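To illustrate the in-place remark with a toy sketch (not code from the branch): out-of-place operations allocate a fresh array on every pass through a loop, whereas in-place operations reuse the buffer.

```python
import numpy as np

x = np.ones(1000)

# Out-of-place: allocates a new array for the result.
y = x * 2.0

# In-place: reuses the existing buffer, no allocation;
# this is the form to prefer inside core loops.
x *= 2.0
```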
I am offline for 3 weeks starting now!
Post by Matthieu Brucher
As Alexandre and Gaël suggested, I've created a branch with just a few
techniques:
- the Barycenter Mapping
- Laplacian Eigenmaps
- Diffusion Maps
- Hessian Eigenmaps
Cool, that is indeed the core functionality that we want to get right:
easy to use, and fast.
Post by Matthieu Brucher
Two tests fail because of the missing parts.
I am not sure what that means; I didn't find time to run the tests.
The code is clearly making huge progress in terms of being accessible to
(semi) naive coders. I am starting to get the big picture of the code.
PEP8 says that keyword arguments in function calls should have no spaces
around '='.
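That is, with a made-up function for illustration:

```python
# PEP8: no spaces around '=' for keyword arguments and defaults.
def fit(data, n_neighbors=9):     # good: n_neighbors=9
    return n_neighbors

fit([1, 2], n_neighbors=5)        # good
# fit([1, 2], n_neighbors = 5)    # bad: spaces around '='
```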
I would prefer to avoid using numpy.linalg and favor scipy.linalg
instead: numpy.linalg is often compiled without proper BLAS support, and
is thus very slow.
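Concretely, the preferred pattern would look like this (a small symmetric eigenvalue problem just as an illustration):

```python
import numpy as np
from scipy import linalg  # preferred over numpy.linalg

# scipy.linalg is always built against LAPACK/BLAS, whereas
# numpy.linalg may fall back to a slow unoptimized build.
a = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues = linalg.eigvalsh(a)  # symmetric eigensolver
```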
numpy should be imported as 'np', and I would prefer to avoid the 'from
numpy import ...'. This is just a convention, but when it is used by
everybody, it makes life much easier, because anyone can dig in the
codebase without having to find out how symbols are imported and what
each symbol refers to.
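In other words:

```python
# Preferred: one standard alias, used consistently everywhere.
import numpy as np

x = np.zeros(3)

# Avoid: 'from numpy import zeros' obscures where 'zeros' comes
# from when reading an unfamiliar file.
```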
A lot of the docstrings are not properly formatted as in the numpy
docstring standard.
It's a pity that a lot of the functionality cannot work on a similarity
graph, but only on samples. I guess this is a general problem that
appears in other places in the scikit, and that the best option is to set
it apart, but we will need to address this in the long term.
Any reason why the Barycenter is not in the neighbors code? IMHO, it
should be a regression version of the Neighbors classifiers, and could be
named 'NeighborsBarycenter'. The best way to do this would probably be to
split Neighbors into a base class that has no predict, and to derive from
this base class the current version of the KNN and your Barycenter. That
would do away with the need to pass in neigh_alternate_arguments.
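Roughly the split I have in mind (a sketch only, the names and bodies here are suggestions, not the scikit's actual classes):

```python
class NeighborsBase:
    """Shared fitting logic; note: no predict here."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X):
        self._X = X
        return self


class NeighborsClassifier(NeighborsBase):
    """The current KNN classifier would derive from the base."""

    def predict(self, X):
        ...  # majority vote among the nearest neighbors


class NeighborsBarycenter(NeighborsBase):
    """Barycenter as the regression variant of the same base."""

    def predict(self, X):
        ...  # weighted barycenter of the nearest neighbors
```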
Can we just get rid of this? It seems to me like an extra indirection,
and I am not sure why it is there.
Besides, you are explicitly testing for isinstance(BaseEstimator). We
want to avoid this, because it means that it is impossible to use objects
in the scikit that have not been explicitly built for that purpose.
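The alternative is to rely on duck typing and check for the interface rather than the class, e.g. (a hypothetical helper, just to show the idea):

```python
def looks_like_estimator(obj):
    # Check for the fit/predict interface, not the base class:
    # any conforming object works, scikit-built or not.
    return hasattr(obj, 'fit') and hasattr(obj, 'predict')


class ThirdPartyModel:
    """Not derived from BaseEstimator, yet usable."""

    def fit(self, X):
        return self

    def predict(self, X):
        return X
```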
If we do this, and move the barycenter to where the nearest neighbor is,
we can completely do away with the mapping subdirectory, and thus reduce
the depth of the tree by one, which will make it easier to apprehend.
embedding.tools.create_neighborer should not exist. It is an almost empty
function that is doing some hacking around the neighbors class.
embedding.tools.centered_normalized should be merged with the code that
is in scaler.
embedding.tools.dist2hd should be merged with
mean_shift.euclidian_distance, and both should be moved somewhere shared.
The 'barycenter' function should be renamed to 'barycenter_weights',
because right now it is not clear what the function does from its name.
The word 'barycenter' is very much overloaded in the codebase. Besides,
it takes a 'kwargs' that is unused. Finally, this function should be
moved to the same file as the Barycenter object.
I am not happy with the name of this file. It confuses me: I am not sure
what is in the embedding directory anymore, if it is not in the
embedding file. Actually, after reading the code a bit more, I have
realised that this is just a base class, and actually an abstract base
class. The name should reflect this, because right now it takes a lot of
code reading to realise it. Maybe embedding.py -> base_embedding.py and
`class Embedding` -> `class BaseEmbedding`
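Something along these lines, sketched with made-up details (the real constructor arguments are of course those of the branch):

```python
from abc import ABC, abstractmethod


class BaseEmbedding(ABC):
    """Abstract base: the name now says it is not usable directly."""

    def __init__(self, n_coords, n_neighbors=9):
        self.n_coords = n_coords
        self.n_neighbors = n_neighbors

    @abstractmethod
    def fit(self, X):
        """Concrete embeddings must implement fit."""


class LaplacianEigenmap(BaseEmbedding):
    """A concrete embedding derives from the base."""

    def fit(self, X):
        return self
```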
A small remark on default values for keyword arguments:
def __init__(self, n_coords, n_neighbors = None, neigh = None,
self.n_neighbors = n_neighbors if n_neighbors is not None else 9
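If no None sentinel is actually needed, a plainer spelling would put the default directly in the signature (sketch only, the class name is reused just for illustration):

```python
class Embedding:
    # Default in the signature, instead of 'None, then fill in 9'.
    def __init__(self, n_coords, n_neighbors=9, neigh=None):
        self.n_coords = n_coords
        self.n_neighbors = n_neighbors
        self.neigh = neigh
```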
In my opinion, a function like 'laplacian_maps' that takes a 'method'
function as an input argument is tedious to use. Any chance that you can
give it a default entry? This is in 'similarities.py', but it is not used
there, only in similarities_mds. I am not sure whether it is supposed to
be an internal-use function (in which case, why is it not where it is
used?) or a function to be used by the user (in which case, why is it not
exposed as such?).
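What I mean by a default entry, with made-up names (this is not the branch's actual signature):

```python
import numpy as np


def heat_kernel(distances, sigma=1.0):
    # A common choice for turning distances into similarities.
    return np.exp(-distances ** 2 / (2.0 * sigma ** 2))


def laplacian_maps(distances, method=heat_kernel):
    # 'method' has a sensible default, so the common case needs no
    # extra argument; callers can still pass their own kernel.
    return method(distances)
```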
kwargs in most functions are impossible to use without delving into the
code.