I've recently been writing a machine learning program on Spark using the RDD programming model, and the machine learning algorithm API that ships with Spark has too many limitations. I'd like to ask the experts here: can Scikit-learn be used under the Spark programming model?
Reply content:
Unlike the viewpoints above, I think Scikit-learn can be used under PySpark, just not directly and crudely: you first have to convert between the data structures of the two environments accordingly.
We know that the most important data structure in Scikit-learn is the NumPy ndarray, while the most important storage abstraction in Spark is the RDD, which is essentially MapReduce built on a directed acyclic graph (DAG). The point of the DAG is to reduce the data passed between map and reduce stages, which makes Spark well suited to iterative machine learning scenarios. PySpark provides very handy APIs such as map, reduce, join, and filter, but it cannot operate on locally stored NumPy ndarrays.
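To make the mismatch concrete, here is a minimal sketch, assuming a running SparkContext `sc` (for example, the one provided by the pyspark shell) and toy data invented purely for illustration: once the data lives in an RDD, the only naive bridge to scikit-learn is collect(), which pulls everything back to the driver and throws away the distribution.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature rows and labels live in distributed RDDs (toy data, assumed `sc`)...
rows = sc.parallelize([np.random.rand(4) for _ in range(1000)])
labels = sc.parallelize(np.random.randint(0, 2, 1000).tolist())

# ...but scikit-learn only understands local ndarrays, so the naive bridge is
# collect(), which pulls everything onto the driver and gives up exactly the
# distribution that Spark provides.
X = np.array(rows.collect())
y = np.array(labels.collect())
print(LogisticRegression().fit(X, y).score(X, y))
```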
So we need to find a way to
dress the wolf (the distributed RDD)
in the sheep's (the local ndarray's)
skin (turning it into an NdarrayRDD),
so that it can mingle with the flock (Scikit-learn) and finally eat meat. In a nutshell, the idea is to use RDD key-value pairs to represent the different parts of a multidimensional array, and to record the shape of the resulting sub-arrays along with every change they undergo during computation. For example, if we choose a subset of an array's axes to serve as the key, a five-dimensional array keyed on its first two axes (axes (0, 1)) can be represented as key-value pairs in which each key is a 2-tuple of indices and each value is a three-dimensional sub-array; the result is an NdarrayRDD. The NdarrayRDD is then repeatedly transposed and reshaped, which is what keeps the computation parallelized. On top of this we can use map, filter, reduce, and similar operations in Python, plus Spark's cache and unpersist methods to control RDD caching, so we don't waste Spark's speed while still enjoying the advantages of Python and Scikit-learn.
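As a minimal sketch of this idea (assuming a running SparkContext `sc`; "NdarrayRDD" here is just my name for the resulting RDD of key-value pairs, not a class from any particular library), the five-dimensional example above could be built roughly like this:

```python
import itertools
import numpy as np

a = np.random.rand(2, 3, 4, 5, 6)   # a five-dimensional array
key_axes = (0, 1)                   # the axes that become the key

# Each key is a 2-tuple of indices along axes 0 and 1; each value is the
# remaining three-dimensional sub-array a[i, j, :, :, :].
pairs = [((i, j), a[i, j])
         for i, j in itertools.product(range(a.shape[0]), range(a.shape[1]))]

ndarray_rdd = sc.parallelize(pairs)     # 2 * 3 = 6 records to distribute

# Record the sub-array shape so later transposes/reshapes know how to
# re-split or reassemble the full array.
value_shape = a.shape[len(key_axes):]   # (4, 5, 6)
print(ndarray_rdd.count(), value_shape)
```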
That is only a rough description; for a more precise understanding you really have to read the source code directly. And yes, this is in fact an attempt that has already been made, and some of the attempts have been developed quite well.
GitHub - bolt-project/bolt: unified interface for local and distributed ndarrays
Recommended! This was the first place I saw a method for converting multidimensional arrays between single-machine and distributed form. The key design idea is a method called swap, which performs exactly the key-value transformation of the NdarrayRDD I described above: value axes are moved to key axes, and they can even be moved one at a time, so the splits become finer and finer and the computation more and more parallel.
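To make the swap idea concrete, here is a purely illustrative sketch that continues from the NdarrayRDD built in the previous snippet; it mimics the effect of moving one value axis into the key, and is not bolt's actual implementation (whose swap method is far more general).

```python
# Purely illustrative: move the first value axis (the original array's axis 2)
# into the key, so keys become 3-tuples and values become 2-D sub-arrays.
# This mimics the idea behind bolt's swap, not its real code.
def swap_first_value_axis(kv_rdd):
    def split(record):
        key, value = record              # key: (i, j), value: 3-D sub-array
        for k in range(value.shape[0]):  # peel one more axis off the value
            yield key + (k,), value[k]   # key: (i, j, k), value: 2-D sub-array
    return kv_rdd.flatMap(split)

finer_rdd = swap_first_value_axis(ndarray_rdd)
# More keys means more records, hence finer splits and more parallelism.
print(finer_rdd.count())   # 2 * 3 * 4 = 24 records instead of 6
```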
GitHub - thunder-project/thunder: scalable analysis of images and time series
Thunder is a package for scalable analysis of massive image and time series data; its distributed part relies on the bolt.spark arrays mentioned above.
GitHub - lensacom/sparkit-learn: PySpark + Scikit-learn = Sparkit-learn
This splearn is one of the most promising packages I have seen. It provides three distributed data structures, ArrayRDD, SparseRDD, and DictRDD, and correspondingly rewrites scikit-learn so that it works on these modified RDDs.
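As a rough illustration of how these wrappers are constructed, here is a hedged sketch based on the project's README as I remember it (assuming a running SparkContext `sc`); treat the import paths and the DictRDD constructor signature as assumptions to verify against the repository.

```python
import numpy as np
# Assumed import path, following the README from memory.
from splearn.rdd import ArrayRDD, DictRDD

X = np.random.rand(100, 10)
y = np.random.randint(0, 2, 100)

# ArrayRDD wraps an RDD of row blocks so the rewritten estimators can treat it
# like a distributed ndarray; DictRDD pairs features with labels.
X_rdd = ArrayRDD(sc.parallelize(X, 4))
Z_rdd = DictRDD((sc.parallelize(X, 4), sc.parallelize(y, 4)),
                columns=('X', 'y'))   # assumed constructor signature
```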
GitHub - databricks/spark-sklearn: Scikit-learn integration package for Spark
Finally, there is spark-sklearn, developed by Databricks itself. It is not very far along and its functionality is quite limited: it can only parallelize grid-search cross-validation (i.e. scikit-learn's GridSearchCV) under the premise that the dataset fits in memory. It does not parallelize each learning algorithm the way MLlib does, so when memory cannot hold a large dataset you still have to use Spark MLlib.
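For reference, the usage pattern looks roughly like the following hedged sketch, adapted from the project's README as I recall it (assuming a running SparkContext `sc`; treat the exact GridSearchCV signature, in particular passing the SparkContext as the first argument, as an assumption to verify):

```python
from sklearn import svm, datasets
from spark_sklearn import GridSearchCV   # assumed import path

iris = datasets.load_iris()
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Each (parameter combination, CV fold) pair is trained as a separate Spark
# task, but the training data must fit in driver memory and the learner is
# still plain single-machine scikit-learn.
clf = GridSearchCV(sc, svm.SVC(), param_grid)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
```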
O(︶^︶)O Thanks for the invite! The simple answer is: no. The core of Spark is the RDD, a DAG version of MapReduce. The single-machine and parallelized implementations of a machine learning algorithm are completely different, so sklearn, as a single-machine algorithm library, cannot simply be ported to Spark.
Update 2016.2.10:
I found a project on GitHub, GitHub - databricks/spark-sklearn: scikit-learn integration package for Spark, whose purpose is to integrate sklearn and Spark seamlessly. For now, though, its functionality looks fairly basic; it is still a private plot, so don't expect it to feed the production team. The gain is that it offers the same API as scikit, but the internal implementation is completely different: the underlying data structures are not the same and the upper-layer algorithm logic differs too, so how would you simply port it? Wishful thinking.
I haven't used this scikit package, but I agree with your conclusion that the Spark API is far more limited. Many of the parameters and usages we are accustomed to simply aren't available in Spark. My impression of Spark MLlib, though, is that while its algorithms are basic, its handling of hashing and network throughput is a bright spot, a great highlight. If you have the time and energy, you can absolutely extend the basic algorithms into the form you want. As for the earlier claim that Scikit-learn can supposedly be brought in under PySpark, whether you actually get distributed efficiency depends on how Scikit-learn is used there. Just my humble opinion; I hope it helps.