How to Apply scikit-learn to Spark machine learning?

Source: Internet
Author: User
Tags: pyspark, databricks, spark mllib
I recently wrote a machine learning program on Spark using the RDD programming model, but the machine learning algorithm APIs that Spark provides are too limited. Is there a way to use scikit-learn within Spark's programming model?

Reply: Unlike the other answers, I think scikit-learn can be used from PySpark, but it cannot simply be transplanted wholesale; instead, the data structures of the two environments have to be converted into one another.

We know that the core data structure in scikit-learn is the NumPy ndarray, while the core storage abstraction in Spark is the RDD. Bluntly put, an RDD is MapReduce organized as a directed acyclic graph (DAG); the point of the graph is to reduce the data shuffled between the map and reduce stages, which makes it well suited to machine learning workloads with repeated iterations. PySpark provides useful APIs for map, reduce, join, filter, and other functional operations, but it has no native notion of a local NumPy ndarray.
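To make the mismatch concrete, here is a minimal sketch in plain PySpark and NumPy (none of the libraries discussed below, and assuming only a local Spark installation): NumPy operates on a whole in-memory array with axis semantics, while the RDD only offers element-wise functional operations over whatever records you hand it.

```python
# A minimal sketch in plain PySpark + NumPy to show the gap between the two models.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="rdd-vs-ndarray")

# Local NumPy: one in-memory ndarray with axis-aware operations.
local = np.arange(12).reshape(3, 4)
col_sums_local = local.sum(axis=0)               # shape (4,), single process

# PySpark: the "array" is just a bag of records; only functional ops exist.
rows = sc.parallelize([local[i] for i in range(local.shape[0])])
col_sums_dist = rows.reduce(lambda a, b: a + b)  # NumPy add inside the closure

assert (col_sums_local == col_sums_dist).all()
sc.stop()
```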

So we just need to dress the wolf (the distributed RDD) in sheep's clothing (the local ndarray) so that the flock (scikit-learn) will accept it: in other words, turn the RDD into an "ndarrayRDD". Put simply, the RDD's key-value pairs are used to represent different parts of a multi-dimensional array, while also recording the shape of each sub-array and how it changes during computation. For example, we can choose a subset of an array's axes to serve as the key: keying a five-dimensional array on its first two axes (axis=(0, 1)) expresses it as key-value pairs in which each key is a 2-tuple of indices and each value is a three-dimensional sub-array, and that is the ndarrayRDD. The ndarrayRDD can then be repeatedly transposed and reshaped to keep the computation parallel. With this in place we can use map, filter, reduce, and other functional operations from Python, plus Spark's cache and unpersist methods to control RDD caching, so we do not waste Spark's speed and still take advantage of Python and scikit-learn.
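The following toy sketch shows that idea in plain PySpark; the ndarray_rdd name is only an illustration of the description above, not a real class from any library.

```python
# Toy sketch of the "ndarrayRDD" idea in plain PySpark.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="ndarray-rdd-sketch")

arr = np.random.rand(4, 5, 6, 7, 8)           # a five-dimensional local array

# Key on axes (0, 1): keys are (i, j) tuples, values are 3-D blocks of shape (6, 7, 8).
pairs = [((i, j), arr[i, j]) for i in range(arr.shape[0]) for j in range(arr.shape[1])]
ndarray_rdd = sc.parallelize(pairs).cache()   # cache keeps the blocks around for reuse

# A functional operation applied per block, e.g. normalising each 3-D block.
normalized = ndarray_rdd.map(lambda kv: (kv[0], kv[1] / kv[1].max()))

# "Reshaping" the ndarrayRDD: move value axis 0 into the key, so keys become
# (i, j, k) and values become 2-D blocks, which means more splits and more parallelism.
reshaped = ndarray_rdd.flatMap(
    lambda kv: [((kv[0][0], kv[0][1], k), kv[1][k]) for k in range(kv[1].shape[0])]
)

print(normalized.count(), reshaped.count())   # 20 blocks vs 120 blocks
ndarray_rdd.unpersist()
sc.stop()
```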

For a more concrete understanding you really have to read the source code. In fact, attempts in this direction were made long ago, and some of them are quite well developed.

GitHub-bolt-project/bolt: unified interface for local and distributed ndarrays
Recommended! This is the first local-to-distributed multi-dimensional array conversion scheme I came across. The key to its design is a swap method, which is exactly the repeated re-keying of the ndarrayRDD described above: value axes can be moved into the key axes (and back again) one at a time. Moving value axes into the key increases the number of splits, so the available concurrency naturally grows.
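Here is a rough illustration of the swap idea in plain PySpark rather than bolt's own API, so none of bolt's class or method names are assumed: moving a value axis into the key multiplies the number of chunks, and folding a key axis back into the value reverses it.

```python
# Plain-PySpark illustration of the swap idea; this mimics bolt's design but
# deliberately does not use bolt's own classes or method names.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="swap-sketch")

arr = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Start with key axis (0,) and value axes (1, 2): 2 chunks of shape (3, 4).
rdd = sc.parallelize([((i,), arr[i]) for i in range(arr.shape[0])])

# value -> key: split each (3, 4) value along its first axis; keys become (i, j)
# and values become 1-D rows of length 4, so 2 chunks turn into 6.
more_keys = rdd.flatMap(
    lambda kv: [(kv[0] + (j,), kv[1][j]) for j in range(kv[1].shape[0])]
)

# key -> value: fold the last key axis back into the value by grouping on the
# remaining key and stacking the rows in order, so 6 chunks turn back into 2.
fewer_keys = (
    more_keys
    .map(lambda kv: (kv[0][:-1], (kv[0][-1], kv[1])))
    .groupByKey()
    .mapValues(lambda rows: np.stack([r for _, r in sorted(rows, key=lambda t: t[0])]))
)

print(more_keys.count(), fewer_keys.count())  # 6 and 2
sc.stop()
```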

GitHub-thunder-project/thunder: scalable analysis of images and time series
Thunder is a package for analysing massive amounts of image and time-series data. Its distributed part builds on the bolt.spark backend mentioned above.

GitHub-lensacom/sparkit-learn: PySpark + Scikit-learn = Sparkit-learn
I think splearn is a promising package, because it provides three distributed data structures (ArrayRDD, SparseRDD, and DictRDD) and wraps scikit-learn so that it can be applied to the transformed RDDs; see the sketch below.
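The sketch is modelled on the project's README; the names used here (ArrayRDD, DictRDD, SparkLinearSVC, the bsize and classes arguments, and the Z[:, 'X'] indexing) are taken from that README and may differ between versions.

```python
# Rough sketch based on sparkit-learn's README; class and argument names are
# assumptions taken from that README and may differ between versions.
import numpy as np
from pyspark import SparkContext
from splearn.rdd import ArrayRDD, DictRDD
from splearn.svm import SparkLinearSVC

sc = SparkContext(appName="splearn-sketch")

X = np.random.rand(100, 10)
y = np.random.randint(0, 2, 100)

# Wrap plain RDDs into block-distributed, ndarray-aware RDDs.
X_rdd = ArrayRDD(sc.parallelize(X, 4), bsize=25)              # blocks of 25 rows
Z = DictRDD((sc.parallelize(X, 4), sc.parallelize(y, 4)),
            columns=('X', 'y'), bsize=25)                     # paired (X, y) blocks

# A scikit-learn-style estimator fitted block-wise on the distributed data.
clf = SparkLinearSVC()
clf.fit(Z, classes=np.unique(y))
predictions = clf.predict(Z[:, 'X'])

sc.stop()
```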

GitHub-databricks/spark-sklearn: Scikit-learn integration package for Spark
Finally, let's talk about spark-sklearn, developed by Databricks. It is not very far along and its functionality is quite limited: it only parallelises grid search over parameters for cross-validation (that is, scikit-learn's GridSearchCV) when the dataset fits in memory, rather than parallelising each learning algorithm the way MLlib does; when the dataset is too big for memory you are back to Spark MLlib.
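To show how narrow that scope is, here is a sketch modelled on the spark-sklearn README; the GridSearchCV(sc, ...) signature comes from that README and may vary by version. Only the parameter grid is distributed; every individual fit is plain scikit-learn on a single executor with the whole dataset in memory.

```python
# Sketch modelled on databricks/spark-sklearn's README; the GridSearchCV(sc, ...)
# signature is taken from that README and may vary between versions.
from sklearn import svm, datasets
from pyspark import SparkContext
from spark_sklearn import GridSearchCV

sc = SparkContext(appName="spark-sklearn-sketch")

iris = datasets.load_iris()                     # small: fits in memory everywhere
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Each (kernel, C) combination is fitted by plain scikit-learn on some executor;
# only the grid search itself is distributed, not the learning algorithm.
clf = GridSearchCV(sc, svm.SVC(), param_grid)
clf.fit(iris.data, iris.target)

print(clf.best_params_)
sc.stop()
```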
Reply: Thanks for the invite. The simple answer is: no. The core of Spark is the RDD, which is a DAG-based version of MapReduce. Single-machine and parallel implementations of machine learning algorithms are completely different things, so sklearn, as a single-machine algorithm library, cannot simply be transplanted onto Spark.

Updated 2016.2.10:
I found a project on GitHub, GitHub-databricks/spark-sklearn: Scikit-learn integration package for Spark, which tries to integrate sklearn with Spark seamlessly, but its functionality still seems fairly limited.

Reply: It provides the same API as scikit, but the internal implementation is completely different: the underlying data structures are different, and the upper-layer algorithm logic is different too. How could it simply be transplanted?

Reply: I have never used the scikit package, but I agree with the conclusion that the Spark API is too restrictive; many of the parameters and usage patterns we are used to are hard to reproduce in Spark. My impression of Spark MLlib, though, is that while its algorithms are basic, its handling of hashing and network throughput is a highlight, a big highlight. If you have the time and energy, you can build the basic algorithms up into the form you want. As for the last point, introducing scikit into PySpark, whether distributed efficiency can actually be achieved depends on how scikit is implemented. Hope this helps.
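For comparison, this is roughly what the "basic but scalable" native API the answer above refers to looks like, using the standard RDD-based pyspark.mllib interface from the question: fewer knobs than scikit-learn, but the training itself runs on the cluster.

```python
# A taste of the RDD-based pyspark.mllib API: fewer knobs than scikit-learn,
# but the training itself runs distributed.
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext(appName="mllib-baseline")

# Tiny toy dataset; in practice this RDD would be loaded from distributed storage.
points = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(0.0, [0.5, 0.5]),
    LabeledPoint(1.0, [2.0, 0.1]),
])

model = LogisticRegressionWithLBFGS.train(points, iterations=10)
print(model.predict([1.5, 0.2]))

sc.stop()
```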