On Wed, Sep 15, 2010 at 5:31 PM, Gael Varoquaux wrote:
Post by Gael Varoquaux
Post by Olivier Grisel
Fun. However, isn't the whole value of map-reduce the ability to move the
algorithms rather than the data, to avoid data IO bottlenecks? I don't
really see how mincemeatpy addresses this problem.
To build a fully-fledged heavy-duty data processing framework, one would need:
1. On-data/chunk-based processing
2. Cluster-enabled parallel computing
3. A data-management framework (such as locating the algorithms with the data)
I would really like to separate these features from the scikit, for
multiple reasons, one being that I don't believe that they are specific
to machine learning.
It might be interesting to look at how the MDP pipeline (which is really
awesome in terms of dataflow programming) can be adapted to these needs
(I think it already has 1, and IMHO 2 should be added once IPython
parallel computing settles down). The MDP guys have relicensed their code
to BSD, they are collaborating closely with the scikit (Hi Pietro :>),
and they are already making efforts to integrate the scikit in their
dataflow framework.
Indeed, MDP supports 1, but there is no magic here: the base class for
algorithms, Node, splits the equivalent of 'fit' in two: 'train' and
'stop_training'. 'train' receives data chunks during learning;
'stop_training' does most of the heavy lifting and finalizes the fitting
of the parameters. Algorithms that naturally support chunk learning
override 'train': for example, PCA updates the covariance matrix and
average with the new data in 'train', and computes the principal
components in 'stop_training'. Other algorithms, like FastICA, always
need the whole data for every iteration. In that case, 'train' simply
stores the chunks in memory, and 'stop_training' does all the fitting.
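To make the train/stop_training split concrete, here is a minimal toy sketch of a chunk-based PCA node in that style. This is not MDP's actual PCANode implementation; the class name and attributes are hypothetical, and it simply accumulates the sufficient statistics (sample count, sum, and X^T X) per chunk, then finalizes the components:

```python
import numpy as np

class ChunkPCA:
    """Toy sketch of MDP's train/stop_training split (hypothetical
    class, NOT the real mdp.nodes.PCANode).

    'train' only accumulates per-chunk sufficient statistics;
    'stop_training' finalizes the fit, like a scikit 'fit' would.
    """

    def __init__(self, n_components):
        self.n_components = n_components
        self._n = 0        # samples seen so far
        self._sum = None   # running sum of samples
        self._xtx = None   # running sum of X^T X

    def train(self, chunk):
        chunk = np.asarray(chunk, dtype=float)
        if self._sum is None:
            d = chunk.shape[1]
            self._sum = np.zeros(d)
            self._xtx = np.zeros((d, d))
        self._n += chunk.shape[0]
        self._sum += chunk.sum(axis=0)
        self._xtx += chunk.T @ chunk

    def stop_training(self):
        # cov = E[x x^T] - mean mean^T, from the accumulated statistics
        self.mean_ = self._sum / self._n
        cov = self._xtx / self._n - np.outer(self.mean_, self.mean_)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]       # descending variance
        self.components_ = eigvecs[:, order[:self.n_components]]

    def execute(self, x):
        # project centered data onto the principal components
        return (np.asarray(x) - self.mean_) @ self.components_
```

Usage would be several calls to train (one per chunk) followed by one stop_training; the result matches a single full-batch eigendecomposition of the covariance, since the statistics add exactly across chunks.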
I don't think 1 can be solved separately from the algorithms... At the
moment, the MDP-scikits wrappers support chunk-based learning only by
accumulating the data in memory.
In its current release, MDP supports 2 in the sense that multiple
chunks of data can be sent to different processors, mostly for nodes
that support 1. We have been talking about writing a pipeline based on
a general graph, in which case it should be possible to fit different
algorithms in parallel by splitting the graph in the algorithms that
are equally far from the input data. Could this become a common effort?
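As a toy illustration of sending chunks to different processors, here is a sketch that farms out per-chunk sufficient statistics to a pool of workers and merges them afterwards. The helper names are hypothetical and this is not MDP's actual parallel scheduler; threads stand in for cluster workers (NumPy releases the GIL, and a real deployment would use something like IPython's parallel machinery instead):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_stats(chunk):
    # Sufficient statistics of one chunk: (n, sum, X^T X).
    # Hypothetical helper, standing in for a node's 'train' step.
    chunk = np.asarray(chunk, dtype=float)
    return chunk.shape[0], chunk.sum(axis=0), chunk.T @ chunk

def parallel_covariance(chunks, max_workers=2):
    # Dispatch each chunk to a worker, then merge the statistics;
    # this merge step plays the role of 'stop_training'.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        stats = list(pool.map(chunk_stats, chunks))
    n = sum(s[0] for s in stats)
    mean = sum(s[1] for s in stats) / n
    second_moment = sum(s[2] for s in stats) / n
    return second_moment - np.outer(mean, mean)
```

Because the per-chunk statistics add exactly, the merged covariance equals the full-batch one, which is what makes this kind of data-parallel split safe for nodes that support chunk learning.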
I'm not sure what you mean by 3, but it doesn't sound like something
that MDP has...
Post by Gael Varoquaux
Just an idea!
Scikit-learn-general mailing list