Discussion:
[Scikit-learn-general] Introducing spark-sklearn, a scikit-learn integration package for Spark
Tim Hunter
2016-02-10 18:14:15 UTC
Hello community,
I would like to introduce a new package that should be of interest to
scikit-learn users who work with the Spark framework, or with a
distributed system.

It provides the following, among other tools:
- train and evaluate multiple scikit-learn models in parallel;
- convert Spark DataFrames seamlessly into numpy arrays;
- (experimental) distribute scipy sparse matrices as a dataset of
sparse vectors.

Spark-sklearn focuses on problems with a small amount of data whose
workload can be run in parallel. Note that this package distributes
simple, independent tasks such as grid-search cross-validation. It does
not distribute individual learning algorithms (unlike Spark MLlib).
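To make the grid-search use case concrete, here is a minimal sketch of the
workload this package parallelizes: an exhaustive search over hyperparameter
combinations, where each fit is independent. The runnable part below uses
plain scikit-learn on one machine; the commented-out lines show the
spark-sklearn drop-in as described in the package's README, which takes a
SparkContext (assumed to exist as `sc`) as its first argument. The dataset
and parameter values are illustrative choices, not from the announcement.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Each (n_estimators, max_depth) pair is an independent fit -- exactly
# the kind of simple task that can be farmed out to Spark workers.
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}

# Plain scikit-learn: all 4 combinations are fit on a single machine.
gs = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
gs.fit(X, y)
print(len(gs.cv_results_["params"]))  # 4 parameter combinations evaluated

# With spark-sklearn, the search itself is distributed (requires a
# running Spark cluster, so it is shown here only as a comment):
# from spark_sklearn import GridSearchCV
# gs = GridSearchCV(sc, RandomForestClassifier(), param_grid, cv=3)
# gs.fit(X, y)  # each cross-validation fit runs as a Spark task
```

The model that ultimately gets trained is unchanged; only the scheduling of
the independent fits moves from one machine to the cluster.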

If you want to use it, see instructions on the package page:
https://github.com/databricks/spark-sklearn

This blog post contains more details:
https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html

Let us know if you have any questions. Documentation and code
contributions are also very welcome (Apache 2.0 license).

Cheers

Tim and Joseph
