Discussion:
[Scikit-learn-general] Embeddable pure Python MapReduce runtime for multi-machine clusters
Olivier Grisel
2010-09-15 13:46:14 UTC
Interesting: http://remembersaurus.com/mincemeatpy/

"""
mincemeat.py is a Python implementation of the MapReduce distributed
computing framework.

mincemeat.py is:
- Lightweight - All of the code is contained in a single Python file
  (currently weighing in at <13kB) that depends only on the Python
  Standard Library. Any computer with Python and mincemeat.py can be a
  part of your cluster.
- Fault tolerant - Workers (clients) can join and leave the cluster at
  any time without affecting the entire process. (Master checkpointing
  coming in future versions)
- Secure - mincemeat.py authenticates both ends of every connection,
  ensuring that only authorized code is executed. (TLS support coming in
  future versions)
- Open source - mincemeat.py is distributed under the MIT License, and
  consequently is free for all use, including commercial, personal, and
  academic, and can be modified and redistributed without restriction.
"""

We should keep that in mind in case we need a lightweight replacement
for Hadoop. There is no distributed filesystem in mincemeat.py, though.
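
For reference, a minimal word-count sketch in the style of the example
from the project README (the Server attributes and the -p client flag
are taken from that README; treat the details as approximate):

import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again"]

def mapfn(key, value):
    # Emit (word, 1) for every word in one line of input.
    for word in value.split():
        yield word, 1

def reducefn(key, values):
    # Sum the counts emitted for each word.
    return sum(values)

server = mincemeat.Server()
server.datasource = dict(enumerate(data))  # any dict-like key -> value mapping
server.mapfn = mapfn
server.reducefn = reducefn

# Blocks until workers have joined and the whole job is done.
results = server.run_server(password="changeme")
print(results)

Workers would join from any machine that has the script, with something
like: python mincemeat.py -p changeme <server-host>
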
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Gael Varoquaux
2010-09-15 21:31:00 UTC
Post by Olivier Grisel
Interesting: http://remembersaurus.com/mincemeatpy/
Fun. However, isn't the whole value of MapReduce the ability to move the
algorithms rather than the data, to avoid data I/O bottlenecks? I don't
really see how mincemeatpy addresses this problem.

To build a fully-fledged, heavy-duty data processing framework, it is
clear that, at some point in the future, the following features would be handy:

1. On-data/chunk-based processing
2. Cluster-enabled parallel computing
3. Data-management framework (such as co-locating the algorithms with
   the data).

I would really like to separate these features from the scikit, for
multiple reasons, one being that I don't believe that they are specific
to machine learning.

It might be interesting to look at how the MDP pipeline (which is really
awesome in terms of dataflow programming) can be adapted to these needs
(I think it already has 1, and IMHO 2 should be added once IPython
parallel computing settles down). The MDP guys have relicensed their code
to BSD, they are collaborating closely with the scikit (Hi Pietro :>),
and they are already making efforts to integrate the scikit in their data
management framework.

Just an idea!

Gaël
Pietro Berkes
2010-09-15 22:47:41 UTC
On Wed, Sep 15, 2010 at 5:31 PM, Gael Varoquaux
Post by Gael Varoquaux
Post by Olivier Grisel
Interesting: http://remembersaurus.com/mincemeatpy/
Fun. However, isn't the whole value of MapReduce the ability to move the
algorithms rather than the data, to avoid data I/O bottlenecks? I don't
really see how mincemeatpy addresses this problem.
To build a fully-fledged, heavy-duty data processing framework, it is
clear that, at some point in the future, the following features would be handy:
   1. On-data/chunk-based processing
   2. Cluster-enabled parallel computing
   3. Data-management framework (such as co-locating the algorithms with
      the data).
I would really like to separate these features from the scikit, for
multiple reasons, one being that I don't believe that they are specific
to machine learning.
It might be interesting to look at how the MDP pipeline (which is really
awesome in terms of dataflow programming) can be adapted to these needs
(I think it already has 1, and IMHO 2 should be added once IPython
parallel computing settles down). The MDP guys have relicensed their code
to BSD, they are collaborating closely with the scikit (Hi Pietro :>),
and they are already making efforts to integrate the scikit in their data
management framework.
Indeed, MDP supports 1, but there is no magic here: the base class for
algorithms, Node, splits the equivalent of 'fit' in two: 'train' and
'stop_training'. 'train' receives data chunks during learning;
'stop_training' does most of the heavy lifting and finalizes the fitting
of the parameters. Algorithms that naturally support chunk-based learning
override 'train': for example, PCA updates the covariance matrix and
the mean with the new data in 'train', and computes the principal
components in 'stop_training'. Other algorithms, like FastICA, always
need the whole data set for every iteration. In that case, 'train' simply
stores the chunks in memory, and 'stop_training' does all the fitting.
I don't think 1 can be solved separately from the algorithms... At the
moment, the MDP-scikits wrappers support chunk-based learning only by
accumulating the data in memory.
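
To make the split concrete, here is a rough sketch of the pattern
(illustration only, not MDP's actual Node API): accumulate sufficient
statistics in 'train' and finalize in 'stop_training'.

import numpy as np

class IncrementalPCANode:
    """Sketch of the train / stop_training split described above."""

    def __init__(self, n_components):
        self.n_components = n_components
        self._sum = None    # running sum of the samples
        self._outer = None  # running sum of x x^T
        self._n = 0         # number of samples seen so far

    def train(self, chunk):
        # Accumulate sufficient statistics from one (n_samples, n_features) chunk.
        chunk = np.asarray(chunk, dtype=float)
        if self._sum is None:
            self._sum = np.zeros(chunk.shape[1])
            self._outer = np.zeros((chunk.shape[1], chunk.shape[1]))
        self._sum += chunk.sum(axis=0)
        self._outer += np.dot(chunk.T, chunk)
        self._n += chunk.shape[0]

    def stop_training(self):
        # Finalize: turn the accumulated statistics into principal components.
        mean = self._sum / self._n
        cov = self._outer / self._n - np.outer(mean, mean)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1][:self.n_components]
        self.mean_ = mean
        self.components_ = eigvecs[:, order].T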

In its current release, MDP supports 2 in the sense that multiple
chunks of data can be sent to different processors, mostly for nodes
that support 1. We have been talking about writing a pipeline based on
a general graph, in which case it should be possible to fit different
algorithms in parallel by grouping the algorithms that are equally far
from the input data. Could this become a common project?
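
To illustrate the idea, a rough sketch (hypothetical names, nothing
MDP-specific): group the nodes by their distance from the input via a
breadth-first search and fit each group of independent nodes concurrently.

from collections import defaultdict, deque
from concurrent.futures import ProcessPoolExecutor

def levels_from_input(graph, source):
    """BFS over {node: [successor, ...]}; group nodes by distance from source."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in depth:
                depth[succ] = depth[node] + 1
                queue.append(succ)
    levels = defaultdict(list)
    for node, d in depth.items():
        levels[d].append(node)
    return [levels[d] for d in sorted(levels)]

def fit_node(node):
    node.fit()  # hypothetical: each node knows how to fetch its inputs and fit
    return node

def fit_graph_in_parallel(graph, nodes, source):
    # Nodes at the same distance from the input are independent given their
    # inputs, so each level can be fitted concurrently before moving on.
    with ProcessPoolExecutor() as pool:
        for level in levels_from_input(graph, source):
            names = [name for name in level if name in nodes]
            fitted = pool.map(fit_node, [nodes[n] for n in names])
            for name, node in zip(names, fitted):
                nodes[name] = node  # collect the fitted copies back from the workers
    return nodes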

I'm not sure what you mean by 3, but it doesn't sound like something
that MDP has...

Pietro
j***@gmail.com
2010-09-15 23:50:17 UTC
Post by Pietro Berkes
Indeed, MDP supports 1, but there is no magic here: the base class for
algorithms, Node, splits the equivalent of 'fit' in two: 'train' and
'stop_training'. 'train' receives data chunks during learning;
'stop_training' does most of the heavy lifting and finalizes the fitting
of the parameters. Algorithms that naturally support chunk-based learning
override 'train': for example, PCA updates the covariance matrix and
the mean with the new data in 'train', and computes the principal
components in 'stop_training'. Other algorithms, like FastICA, always
need the whole data set for every iteration. In that case, 'train' simply
stores the chunks in memory, and 'stop_training' does all the fitting.
I don't think 1 can be solved separately from the algorithms... At the
moment, the MDP-scikits wrappers support chunk-based learning only by
accumulating the data in memory.
A question about this chunked processing of data.

An example from regression: suppose chunked processing makes it
possible, for example, to estimate some parameters during training, but
there are no sufficient statistics that can be accumulated for
additional results, e.g. if we want a fit statistic on the original
data set.

Would the common pattern then be to go over the data (in chunks) a
second time to get the missing summary statistics for the follow up
analysis?

I was reading about the advantages of SAS for huge (but storable on a
single computer) data sets, so I'm curious what the usual policies are
in these cases.

Thanks,

Josef
Pietro Berkes
2010-09-16 13:59:19 UTC
I'm not sure if I understand you correctly, but I think what you are
asking is: what if the algorithm needs to collect some statistics from
the data that depend on *some other* statistics of the same data?

I forgot to mention it, but an algorithm can define multiple training
phases. For example, Fisher Discriminant Analysis needs to go through
the data twice: once to collect the mean and the number of points for
each class, and a second time to compute the overall and within-class
covariance matrices. The pipeline object, Flow, takes care of going
through the chunks twice (by rewinding the iterator that generates the
data).
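
Schematically, the pattern looks like this (a minimal sketch, not MDP
code): pass 1 collects the per-class means and counts, pass 2
accumulates the within-class scatter, and the driver "rewinds" simply
by asking for a fresh chunk iterator for each phase.

import numpy as np

class TwoPassFDASketch:
    """Sketch of multi-phase training: pass 1 -> class means, pass 2 -> scatter."""

    def __init__(self):
        self.sums, self.counts = {}, {}
        self.within = None

    def train_phase1(self, x, labels):
        # Accumulate per-class sums and counts from one chunk.
        for c in np.unique(labels):
            xc = x[labels == c]
            self.sums[c] = self.sums.get(c, 0) + xc.sum(axis=0)
            self.counts[c] = self.counts.get(c, 0) + len(xc)

    def finish_phase1(self):
        self.means = {c: self.sums[c] / self.counts[c] for c in self.sums}

    def train_phase2(self, x, labels):
        # Needs the class means, hence the second pass over the same chunks.
        if self.within is None:
            self.within = np.zeros((x.shape[1], x.shape[1]))
        for c in np.unique(labels):
            centered = x[labels == c] - self.means[c]
            self.within += centered.T.dot(centered)

def fit(node, make_chunk_iter):
    # The driver rewinds by requesting a fresh (x, y) chunk iterator per phase.
    for x, y in make_chunk_iter():
        node.train_phase1(x, y)
    node.finish_phase1()
    for x, y in make_chunk_iter():
        node.train_phase2(x, y)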

For the applications that we were developing, the alternative of
storing the whole data set at once was not feasible: most of the time
the data had several thousand data points and a dimensionality on the
order of 10^3-10^4. Storing the data in memory would already be a
problem, and computing x^T x is very slow if done with all the data at once.

P.
j***@gmail.com
2010-09-16 14:29:28 UTC
Post by Pietro Berkes
I'm not sure if I understand you correctly, but I think what you are
asking is: what if the algorithm needs to collect some statistics from
the data that depend on *some other* statistics of the same data?
I forgot to mention it, but an algorithm can define multiple training
phases. For example, Fisher Discriminant Analysis needs to go through
the data twice: once to collect the mean and the number of points for
each class, and a second time to compute the overall and within-class
covariance matrices. The pipeline object, Flow, takes care of going
through the chunks twice (by rewinding the iterator that generates the
data).
That's one example of what I have in mind, and it would apply to
standard regression: one pass to estimate the regression parameters,
one extra pass to get the residual variance.
Most of the other results (e.g. statistical tests on the parameters)
could be calculated from the summary statistics collected in these two
passes. We would have to pay more attention to what the sufficient
statistics are for the various results and statistical tests we have.
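
For concreteness, a minimal sketch of that two-pass pattern for OLS
(illustrative code, not from any package): pass 1 accumulates X'X and
X'y in chunks and solves for the coefficients; pass 2 accumulates the
residual sum of squares, from which the residual variance and the
standard errors follow.

import numpy as np

def chunked_ols(chunks):
    """Two passes over a factory `chunks()` that yields (X, y) chunks."""
    # Pass 1: accumulate X'X and X'y, then solve the normal equations.
    xtx, xty, n = 0.0, 0.0, 0
    for X, y in chunks():
        xtx = xtx + X.T.dot(X)
        xty = xty + X.T.dot(y)
        n += len(y)
    beta = np.linalg.solve(xtx, xty)

    # Pass 2: accumulate the residual sum of squares for the fit statistics.
    rss = 0.0
    for X, y in chunks():
        resid = y - X.dot(beta)
        rss += resid.dot(resid)

    k = len(beta)
    sigma2 = rss / (n - k)                                 # residual variance
    bse = np.sqrt(np.diag(sigma2 * np.linalg.inv(xtx)))   # standard errors
    return beta, sigma2, bse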

Thanks, I think I get the main idea,

Josef