Spark MLlib Linear Regression Source Code Analysis


1. Theoretical Basis

Linear regression belongs to the category of supervised learning (also called classification or inductive learning); in this type of analysis, the class labels of the data in the training dataset are known. The goal of machine learning is, given a training dataset, to continually analyze and learn from it and produce a classification function or prediction function that maps a set of associated attributes to a class label. This function is called a classification model or prediction model. The model obtained through learning can be a decision tree, a rule set, a Bayesian model, or a hyperplane, and it can then be used to predict values for the feature vectors of new input objects or to classify objects by their class labels.

In regression problems, we usually use the least squares method to iteratively optimize the weight of each attribute in the feature vector. A loss function (or error function) is defined to determine the convergence condition; in other words, it is the quantity that the gradient descent algorithm uses to approximate the optimal parameters.
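For reference, the squared-error loss that least squares minimizes, and the gradient descent update rule derived from it, can be written as follows (a standard textbook formulation rather than a quote from the MLlib source; m is the number of training examples and α the step size):

  J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2, \qquad h_\theta(x) = \theta^{T}x

  \theta_k := \theta_k - \alpha\,\frac{\partial J(\theta)}{\partial \theta_k} = \theta_k - \frac{\alpha}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_k^{(i)}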

2. Introduction to matrix vector library jblas

Because Spark MLlib uses the jblas linear algebra library, learning the basic operations in jblas is helpful for analyzing and understanding many of the MLlib algorithms in Spark. The following demonstrates basic operations on the jblas DoubleMatrix class:

val matrix1 = DoubleMatrix.ones(10, 1)    // create a 10*1 matrix with all values 1
val matrix2 = DoubleMatrix.zeros(10, 1)   // create a 10*1 matrix with all values 0
matrix1.put(1, -10)
val absSum = matrix1.norm1()              // L1 norm (sum of absolute values)
val euclideanNorm = matrix1.norm2()       // Euclidean distance (L2 norm)
val matrix3 = matrix1.addi(matrix2)
val matrix4 = new DoubleMatrix(1, 10, (1 to 10).map(_.toDouble): _*) // create a Double vector object
println("print init value: matrix3=" + matrix3)
println("print init value: matrix4=" + matrix4)
println("matrix sub matrix:" + matrix3.sub(matrix4) + "," + matrix4.sub(10)) // subtraction
println("matrix add matrix:" + matrix3.add(matrix4) + "," + matrix4.add(10)) // addition
println("matrix mul matrix:" + matrix3.mul(matrix4) + "," + matrix4.mul(10)) // multiplication
println("matrix div matrix:" + matrix3.div(matrix4) + "," + matrix4.div(10)) // division
println("matrix dot matrix:" + matrix3.dot(matrix4)) // vector dot product
val matrix5 = DoubleMatrix.ones(10, 10)
println("N*M Vector Matrix sub OP:\n" + matrix5.subRowVector(matrix4) + "\n" + matrix5.subColumnVector(matrix4)) // row/column vector subtraction
println("N*M Vector Matrix add OP:\n" + matrix5.addRowVector(matrix4) + "\n" + matrix5.addColumnVector(matrix4)) // row/column vector addition
println("N*M Vector Matrix mul OP:\n" + matrix5.mulRowVector(matrix4) + "\n" + matrix5.mulColumnVector(matrix4)) // row/column vector multiplication
println("N*M Vector Matrix div OP:\n" + matrix5.divRowVector(matrix4) + "\n" + matrix5.divColumnVector(matrix4)) // row/column vector division

3. Gradient Descent Algorithm

The gradient descent algorithm iteratively moves along the direction of steepest descent, continually updating the feature weight vector until it approaches (or fits) the optimal weight vector. There are two variants. The first is the Batch Gradient Descent algorithm, which accumulates the gradients over the whole training set and then updates the weights in one batch; this approach is generally not suitable for large-scale datasets. The other is the Stochastic Gradient Descent (SGD) algorithm, which computes and updates the weights for each object of the training set individually; in some cases it can easily converge to a local optimum. The Spark MLlib library mainly uses the stochastic gradient descent algorithm. To better understand how the stochastic gradient algorithm is implemented in the MLlib library (all classes whose names end with the SGD suffix), here is a demo of linear fitting using stochastic gradient descent:

def sgdDemo(): Unit = {
  val featuresMatrix: List[List[Double]] = List(List(1, 4), List(2, 5), List(5, 1), List(4, 2)) // feature matrix
  val labelMatrix: List[Double] = List(19, 26, 19, 20) // vector of true values
  var theta: List[Double] = List(0, 0)
  var loss: Double = 10.0
  for {
    i <- 0 until 1000 // number of iterations
    if loss > 0.01    // convergence condition: stop updating once loss <= 0.01
  } {
    var error_sum = 0.0 // error of the current sample
    val j = i % 4       // index of the sample used in this iteration
    var h = 0.0
    for (k <- 0 until 2) {
      h += featuresMatrix(j)(k) * theta(k) // predicted label of object j in the training set
    }
    error_sum = labelMatrix(j) - h // difference between the true label and the predicted label
    var cacheTheta: List[Double] = List()
    for (k <- 0 until 2) {
      val updaterTheta = theta(k) + 0.001 * error_sum * featuresMatrix(j)(k)
      cacheTheta = cacheTheta :+ updaterTheta // append so the weight order is preserved
    } // update the weight vector
    cacheTheta.foreach(t => print(t + ","))
    print(error_sum + "\n")
    theta = cacheTheta
    // recompute the loss over the whole training set
    var currentLoss: Double = 0
    for (j <- 0 until 4) {
      var sum = 0.0
      for (k <- 0 until 2) {
        sum += featuresMatrix(j)(k) * theta(k)
      }
      currentLoss += (sum - labelMatrix(j)) * (sum - labelMatrix(j))
    }
    loss = currentLoss
    println("loss->>>>" + loss / 4 + ", i>>>>>" + i)
  }
}

4. MLlib linear regression source code analysis

The Linear Regression Algorithms available in MLlib include LinearRegressionWithSGD, RidgeRegressionWithSGD, and LassoWithSGD. Main classes involved in MLlib regression analysis include GeneralizedLinearAlgorithm and GradientDescent. The following mainly analyzes the implementation of LinearRegressionWithSGD.

Step 1: before using the LinearRegressionWithSGD algorithm, parse the input data into an RDD (resilient distributed dataset) of LabeledPoint objects, each containing a class label and a feature vector.
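As a minimal sketch (assuming the 0.9-era MLlib API used throughout this article, in which LabeledPoint takes an Array[Double]; the file path and the comma/space separators are only illustrative), the parsing step might look like this:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD
  import org.apache.spark.mllib.regression.LabeledPoint

  // Parse lines of the form "label,f1 f2 f3 ..." into LabeledPoint objects.
  def loadLabeledPoints(sc: SparkContext): RDD[LabeledPoint] = {
    sc.textFile("data/lpsa.data").map { line =>   // the path is only an example
      val parts = line.split(',')                 // "label,features"
      val features = parts(1).split(' ').map(_.toDouble).toArray
      LabeledPoint(parts(0).toDouble, features)   // (class label, feature vector)
    }
  }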

Step 2: call the train method of the LinearRegressionWithSGD companion object, passing in the RDD created in step 1 and the maximum number of iterations. Inside train, a new LinearRegressionWithSGD object is created; it initializes the gradient descent computation with the squared-error gradient SquaredGradient and updates the weight vector with SimpleUpdater. The run method of the parent class GeneralizedLinearAlgorithm is then executed to compute the weight vector and the intercept, and the trained model (its weight vector and intercept) is returned.

Implementation of the train method in the LinearRegressionWithSGD companion object


def train(
    input: RDD[LabeledPoint],
    numIterations: Int,
    stepSize: Double,           // step size, default 1.0
    miniBatchFraction: Double)  // fraction of the data sampled for each descent step, default 1.0
  : LinearRegressionModel = {
  new LinearRegressionWithSGD(stepSize, numIterations, miniBatchFraction).run(input)
}
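For example, with the step size and mini-batch fraction noted in the comments above, a call to this overload might look like the following sketch (parsedData is a hypothetical name for the RDD[LabeledPoint] built in step 1):

  // parsedData: RDD[LabeledPoint] from step 1 (hypothetical name);
  // 100 iterations, stepSize = 1.0 and miniBatchFraction = 1.0 mirror the defaults noted above.
  val model = LinearRegressionWithSGD.train(parsedData, 100, 1.0, 1.0)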

Implementation of the run method in LinearRegressionWithSGD

def run(input: RDD[LabeledPoint], initialWeights: Array[Double]): M = {
  // Check the data properties before running the optimizer.
  // validators holds all of the validation functions; pre-validate the input data.
  if (validateData && !validators.forall(func => func(input))) {
    throw new SparkException("Input validation failed.")
  }

  // Prepend an extra variable consisting of all 1.0's for the intercept.
  val data = if (addIntercept) { // decide whether to add an intercept term
    input.map(labeledPoint => (labeledPoint.label, 1.0 +: labeledPoint.features))
  } else {
    input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
  } // convert each object into a tuple of (class label, features)

  val initialWeightsWithIntercept = if (addIntercept) {
    0.0 +: initialWeights
  } else {
    initialWeights
  } // initialize the weight vector

  val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept) // return the optimized weights

  val (intercept, weights) = if (addIntercept) {
    (weightsWithIntercept(0), weightsWithIntercept.tail)
  } else {
    (0.0, weightsWithIntercept)
  }

  logInfo("Final weights " + weights.mkString(","))
  logInfo("Final intercept " + intercept)

  createModel(weights, intercept) // create the model from the computed weight vector and intercept
}


optimizer.optimize(data, initialWeightsWithIntercept) is the core of the implementation and deserves close attention. The optimizer here is of type GradientDescent, and its optimize method mainly calls the runMiniBatchSGD method of the GradientDescent companion object, returning the optimal feature weight vector produced by the iterations.

Implementation of the optimize method in the GradientDescent object

def optimize(data: RDD[(Double, Array[Double])], initialWeights: Array[Double]): Array[Double] = {
  // returns the optimized weight vector together with the loss history
  val (weights, stochasticLossHistory) = GradientDescent.runMiniBatchSGD(
    data, gradient, updater, stepSize, numIterations, regParam, miniBatchFraction, initialWeights)
  weights
}

Implementation of the runMiniBatchSGD method in the GradientDescent companion object

def runMiniBatchSGD(
    data: RDD[(Double, Array[Double])],
    gradient: Gradient,          // SquaredGradient: squared-error gradient
    updater: Updater,            // SimpleUpdater
    stepSize: Double,            // 1.0
    numIterations: Int,          // 100
    regParam: Double,            // 0.0
    miniBatchFraction: Double,   // 1.0
    initialWeights: Array[Double]): (Array[Double], Array[Double]) = {

  val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
  val nexamples: Long = data.count()
  val miniBatchSize = nexamples * miniBatchFraction

  // Initialize weights as a column vector: the first argument is the number of rows,
  // the second the number of columns, and the rest are the initial values.
  var weights = new DoubleMatrix(initialWeights.length, 1, initialWeights: _*)
  var regVal = 0.0

  for (i <- 1 to numIterations) {
    /**
     * Apply the squared-error gradient to the sampled mini-batch.
     * gradientSum: sum of the gradients over the sampled examples
     * lossSum: sum of the losses over the sampled examples
     */
    val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i).map {
      case (y, features) => // (label, features)
        val featuresCol = new DoubleMatrix(features.length, 1, features: _*)
        val (grad, loss) = gradient.compute(featuresCol, y, weights) // (features, label, current weight vector)
        /**
         * class SquaredGradient extends Gradient {
         *   override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix): (DoubleMatrix, Double) = {
         *     val diff: Double = data.dot(weights) - label // difference between the predicted and the actual label
         *     val loss = 0.5 * diff * diff                 // squared-error loss of the current example
         *     val gradient = data.mul(diff)
         *     (gradient, loss)
         *   }
         * }
         */
        (grad, loss) // gradient and loss of the current training example
    }.reduce((a, b) => (a._1.addi(b._1), a._2 + b._2)) // sum the gradients and losses of the examples sampled in this iteration

    /**
     * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
     * and regVal is the regularization value computed in the previous iteration as well.
     */
    stochasticLossHistory.append(lossSum / miniBatchSize + regVal) // miniBatchSize = number of examples * batch fraction; regVal is the regularization value

    // weights: current weight vector, regParam: regularization parameter,
    // stepSize: step size, i: current iteration number
    val update = updater.compute(weights, gradientSum.div(miniBatchSize), stepSize, i, regParam)
    /**
     * class SimpleUpdater extends Updater {
     *   // weightsOld: weight vector from the previous iteration
     *   // gradient: averaged gradient of this iteration
     *   // stepSize: step size
     *   // iter: current iteration number
     *   // regParam: regularization parameter
     *   override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix, stepSize: Double,
     *       iter: Int, regParam: Double): (DoubleMatrix, Double) = {
     *     val thisIterStepSize = stepSize / math.sqrt(iter) // shrink the step size by the square root of the current iteration number
     *     val normGradient = gradient.mul(thisIterStepSize)
     *     (weightsOld.sub(normGradient), 0) // return the weight vector updated by this descent step
     *   }
     * }
     */
    weights = update._1
    regVal = update._2 // 0 when SimpleUpdater is used
  }

  logInfo("GradientDescent finished. Last 10 stochastic losses %s".format(
    stochasticLossHistory.takeRight(10).mkString(", ")))

  (weights.toArray, stochasticLossHistory.toArray)
}

In runMiniBatchSGD, the input dataset is sampled in every iteration; SquaredGradient is used as the gradient computation and SimpleUpdater as the update rule, and the sampled data is iterated over repeatedly to find the optimal feature weight vector.
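Putting the quoted SquaredGradient and SimpleUpdater code together, one iteration i of runMiniBatchSGD amounts to the following update (my notation, derived from the code above: α is stepSize and B_i is the mini-batch sampled in iteration i):

  w_{i+1} = w_i - \frac{\alpha}{\sqrt{i}}\cdot\frac{1}{|B_i|}\sum_{(x,y)\in B_i}\bigl(w_i^{T}x - y\bigr)\,x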

The official test code is as follows:

def linearRegressionAPITest(sc: SparkContext) {
  val url = "/Users/yangguo/hadoop/spark/mllib/data/ridge-data/lpsa.data"
  val data = sc.textFile(url)
  val parseData = data.map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
  }
  val numIterations = 20
  val model = LinearRegressionWithSGD.train(parseData, numIterations)
  val valuesAndPreds = parseData.map { point =>
    val prediction = model.predict(point.features)
    (point, prediction)
  }
  valuesAndPreds.foreach { case (v, p) =>
    print("[" + v.label + "," + p + "]")
    v.features.foreach(base => print(base + "--"))
    println("\n")
  }
  val isSuccessed = valuesAndPreds.map { case (v, p) => math.pow((p - v.label), 2) }.reduce(_ + _) / valuesAndPreds.count
  println(isSuccessed)
}











