Spark is an open-source cluster computing framework built for fast, distributed data processing. Its MLlib library defines a variety of data structures and algorithms for machine learning, and PySpark exposes the Spark API to Python. It is important to note that in Spark, all data is handled as RDDs (resilient distributed datasets).
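Since everything in MLlib flows through RDDs, a minimal sketch of creating and transforming one may help before the real examples (the app name and values here are just for illustration):

# coding: utf-8
from pyspark import SparkContext

sc = SparkContext(appName="RDDBasics", master='local')  # local mode, single machine
nums = sc.parallelize([1, 2, 3, 4])   # turn a local Python list into an RDD
squares = nums.map(lambda x: x * x)   # transformations are lazy
print squares.collect()               # an action triggers the computation: [1, 4, 9, 16]
sc.stop()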
Let's start with a detailed application example of K-means clustering:
The following code walks through the basic steps: reading external data, preprocessing it as an RDD, training the model, and making predictions.
# coding: utf-8
from numpy import array
from math import sqrt
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel

if __name__ == "__main__":
    sc = SparkContext(appName="KMeansExample", master='local')  # SparkContext

    # read and preprocess the data
    data = sc.textFile("./kmeans_data.txt")
    print data.collect()
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # train the model
    print parsedData.collect()
    clusters = KMeans.train(parsedData, k=2, maxIterations=10, runs=10,
                            initializationMode="random")

    # within-cluster sum of squared errors
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sqrt(sum([x ** 2 for x in (point - center)]))

    WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("Within Set Sum of Squared Error = " + str(WSSSE))

    # clustering result: map each point to the index of its cluster
    def sort(point):
        return clusters.predict(point)

    clusters_result = parsedData.map(sort)
    # Save and load model
    print 'cluster result: '
    print clusters_result.collect()
    sc.stop()
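The "save and load model" step above is only a comment. As a rough sketch (the directory name ./KMeansModel is made up for illustration), the trained KMeansModel can be persisted and read back before sc.stop() is called:

# persist the trained model and load it again (the path is illustrative)
clusters.save(sc, "./KMeansModel")
sameModel = KMeansModel.load(sc, "./KMeansModel")
print sameModel.predict(array([0.0, 0.0]))  # cluster index for a new point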
As you can see, when using Spark for machine learning I pulled in an external open-source package, NumPy, and used its array as the data structure. MLlib itself already defines a variety of data structures for machine learning; below is a brief introduction to the two that are used in classification and regression analysis.
Sparse vector (SparseVector): a sparse vector is a vector in which many of the elements are 0, so only the non-zero entries and their indices are stored.
Initialization and some simple operations are as follows:
# coding: utf-8
from pyspark.mllib.linalg import *

v0 = SparseVector(4, [1, 2], [2.0, 3.0])  # first argument is the dimension, second is the list of non-zero indices, third is the list of their values
v1 = SparseVector(4, {1: 3.0, 2: 4.0})    # first argument is the dimension, second is a dict mapping non-zero indices to values
print v0.dot(v1)               # dot product
print v0.size                  # vector dimension
print v0.norm(0)               # the 0-"norm", i.e. the number of non-zero elements
print v0.toArray()             # convert to a numpy array
print v0.squared_distance(v1)  # squared Euclidean distance
As shown above, sparse vectors in Spark can be initialized either with index and value lists or with a dict.
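Besides calling the SparseVector constructor directly, the Vectors factory in pyspark.mllib.linalg accepts the same two forms and also builds dense vectors; a small sketch (the values are chosen arbitrarily):

from pyspark.mllib.linalg import Vectors

sv1 = Vectors.sparse(4, [1, 2], [3.0, 4.0])  # dimension, list of non-zero indices, list of values
sv2 = Vectors.sparse(4, {1: 3.0, 2: 4.0})    # dimension, dict of index -> value
dv = Vectors.dense([0.0, 3.0, 4.0, 0.0])     # a dense vector stores every element explicitly
print sv1.dot(dv)                            # mixed sparse/dense dot product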
Labeled point (LabeledPoint): a labeled point is the combination of a vector and a label. In classification the label serves as the class, while in regression it serves as the actual target value.
from pyspark.mllib.regression import LabeledPoint

data = [
    LabeledPoint(1.0, [1.0, 1.0]),
    LabeledPoint(4.0, [1.0, 3.0]),
    LabeledPoint(8.0, [2.0, 3.0]),
    LabeledPoint(10.0, [3.0, 4.0]),
]
print data[0].features
print data[0].label
Here are some basic uses of MLlib for regression analysis (linear regression and ridge regression):
# coding: utf-8
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.context import SparkContext

# ---------------- linear regression ----------------
import numpy as np

sc = SparkContext(master='local', appName='Regression')

# training set
data = [
    LabeledPoint(1.0, [1.0, 1.0]),
    LabeledPoint(2.0, [1.0, 1.4]),
    LabeledPoint(4.0, [2.0, 1.9]),
    LabeledPoint(6.0, [3.0, 4.0]),
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data), iterations=100,
                                    initialWeights=np.array([1.0, 1.0]))
print lrm.predict(np.array([2.0, 1.0]))  # predict with the trained regression model

import os, tempfile
from pyspark.mllib.regression import LinearRegressionModel
from pyspark.mllib.linalg import SparseVector

path = tempfile.mkdtemp()
lrm.save(sc, path)                                # save the model to external storage
sameModel = LinearRegressionModel.load(sc, path)  # read the model back
print sameModel.predict(SparseVector(2, {0: 100, 1: 150}))  # a single prediction, using a sparse vector as input

test_set = []
for i in range(100):        # the grid bounds here are illustrative
    for j in range(100):
        test_set.append(SparseVector(2, {0: i, 1: j}))
print sameModel.predict(sc.parallelize(test_set)).collect()  # predict many values at once; returns an RDD
print sameModel.weights  # the fitted parameters

# ---------------- ridge regression ----------------
from pyspark.mllib.regression import RidgeRegressionWithSGD

data = [
    LabeledPoint(1.0, [1.0, 1.0]),
    LabeledPoint(4.0, [1.0, 3.0]),
    LabeledPoint(8.0, [2.0, 3.0]),
    LabeledPoint(10.0, [3.0, 4.0]),
]
train_set = sc.parallelize(data)
rrm = RidgeRegressionWithSGD.train(train_set, iterations=100,
                                   initialWeights=np.array([1.0, 1.0]))
test_set = []
for i in range(100):        # illustrative bounds again
    for j in range(100):
        test_set.append(np.array([i, j]))
print rrm.predict(sc.parallelize(test_set)).collect()
print rrm.weights
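To get a sense of how well the ridge model fits, the usual next step is to score it on its own training set; a minimal sketch using plain RDD operations (the variable names are mine):

# mean squared error of the ridge model on its training data (illustrative)
values_and_preds = train_set.map(lambda p: (p.label, rrm.predict(p.features)))
mse = values_and_preds.map(lambda (v, p): (v - p) ** 2).mean()
print "training MSE = " + str(mse)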
The code above is only meant to demonstrate the basic operations; the preprocessing of the data is not actually done on RDDs.
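In a more realistic workflow the training set would come from a file and be turned into LabeledPoints by RDD transformations; a rough sketch, assuming a made-up file regression_data.txt whose lines look like "label,feature1 feature2":

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parse_line(line):
    # assumed line format: "label,feature1 feature2"
    label, features = line.split(',')
    return LabeledPoint(float(label), [float(x) for x in features.split(' ')])

raw = sc.textFile("./regression_data.txt")  # file name and format are assumptions
train_rdd = raw.map(parse_line)             # preprocessing done entirely on the RDD
model = LinearRegressionWithSGD.train(train_rdd, iterations=100)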
Here are some basic examples of the classification algorithms (logistic regression, SVM, and naive Bayes):
# coding: utf-8
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

print '------- logistic regression -------'
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="LRWSGD", master='local')

dataset = []
for i in range(100):        # the grid bounds here are illustrative
    for j in range(100):
        dataset.append([i, j])
dataset = sc.parallelize(dataset)  # parallelize the data into an RDD

data = [
    LabeledPoint(0.0, [0.0, 100.0]),
    LabeledPoint(1.0, [100.0, 0.0]),
]
lrm = LogisticRegressionWithSGD.train(sc.parallelize(data), iterations=10)  # the second argument is the number of iterations
print lrm.predict(dataset).collect()

lrm.clearThreshold()
print lrm.predict([0.0, 1.0])

# ----------------------------------------------------------
from pyspark.mllib.linalg import SparseVector
from numpy import array

sparse_data = [
    LabeledPoint(0.0, SparseVector(2, {0: 0.0, 1: 0.0})),
    LabeledPoint(1.0, SparseVector(2, {1: 1.0})),
    LabeledPoint(0.0, SparseVector(2, {0: 1.0})),
    LabeledPoint(1.0, SparseVector(2, {1: 2.0})),
]
train = sc.parallelize(sparse_data)
lrm = LogisticRegressionWithSGD.train(train, iterations=10)
print lrm.predict(array([0.0, 1.0]))          # predict a single array
print lrm.predict(SparseVector(2, {1: 1.0}))  # predict a single sparse vector

print '------ SVM -------'
from pyspark.mllib.classification import SVMWithSGD
svm = SVMWithSGD.train(train, iterations=10)
print svm.predict(SparseVector(2, {1: 1.0}))

print '------ naive Bayes ------'
from pyspark.mllib.classification import NaiveBayes
nb = NaiveBayes.train(train)
print nb.predict(SparseVector(2, {1: 1.0}))
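As with regression, any of these classifiers can be scored on their own training data with ordinary RDD operations; a small sketch for the logistic regression model (the variable names are mine):

# fraction of training points the logistic regression model misclassifies (illustrative)
labels_and_preds = train.map(lambda p: (p.label, lrm.predict(p.features)))
train_err = labels_and_preds.filter(lambda (l, p): l != p).count() / float(train.count())
print "training error = " + str(train_err)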
The copyright is all mine, (*^__^*) hahaha ~