**You are welcome to reprint it. Please indicate the source, huichiro.**

Summary

This article briefly describes the implementation of the linear regression algorithm in Spark mllib, involves the theoretical basis of the linear regression algorithm itself and linear regression parallel processing, and then reads the code implementation part.

Linear Regression Model

The main purpose of the machine learning algorithm is to find the model that best interprets the data. This model is a hypothetical function. The step-by-step derivation basically follows this idea.

- Assume that the function
- In order to find the best hypothetical function, we need to find a reasonable evaluation standard. In general, the loss function is used as the evaluation standard.
- Launch target functions based on Loss Functions
- Now the problem is transformed into how to find the optimal solution of the target function, that is, the optimization of the target function.

Specifically to linear regression, the above is converted

Gradient Descent Method

The gradient descent method can be used for Least Squares to obtain the optimal solution of the loss function.

Algorithm Implementation

Random Gradient Descent

Regularization

How can we solve these problems? The shrinkage method can be used, which is also called regularization ). Ridge Regression and lasso regression. By adding a penalty constraint to the Least Squares estimation, the estimation of some coefficients is 0.

Code Implementation of Linear Regression

The above describes some mathematical basics. When we use code to implement these mathematical theories, the most important thing is to grasp the corresponding hypothetical functions and optimization algorithms, are there any corresponding regularization rules.

For linear regression, these are all clear, respectively

- Y = a * x + B hypothesis Function
- Random Gradient Descent Method
- Ridge Regression or lasso method, or nothing

The spark mllib code implementation for linear regression is also based on this step. The class diagram is as follows:

Function call path

Train-> Run, Run function processing logic

- Use the optimization algorithm to obtain the optimal solution, optimizer. Optimize
- Create a regression model based on the optimal solution, createmodel

Runminibatchsgd is the place where gradient and loss are truly computed.

`def runMiniBatchSGD( data: RDD[(Double, Vector)], gradient: Gradient, updater: Updater, stepSize: Double, numIterations: Int, regParam: Double, miniBatchFraction: Double, initialWeights: Vector): (Vector, Array[Double]) = { val stochasticLossHistory = new ArrayBuffer[Double](numIterations) val numExamples = data.count() val miniBatchSize = numExamples * miniBatchFraction // if no data, return initial weights to avoid NaNs if (numExamples == 0) { logInfo("GradientDescent.runMiniBatchSGD returning initial weights, no data found") return (initialWeights, stochasticLossHistory.toArray) } // Initialize weights as a column vector var weights = Vectors.dense(initialWeights.toArray) val n = weights.size /** * For the first iteration, the regVal will be initialized as sum of weight squares * if it‘s L2 updater; for L1 updater, the same logic is followed. */ var regVal = updater.compute( weights, Vectors.dense(new Array[Double](weights.size)), 0, 1, regParam)._2 for (i (c, v) match { case ((grad, loss), (label, features)) => val l = gradient.compute(features, label, bcWeights.value, Vectors.fromBreeze(grad)) (grad, loss + l) }, combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) => (grad1 += grad2, loss1 + loss2) }) /** * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration * and regVal is the regularization value computed in the previous iteration as well. */ stochasticLossHistory.append(lossSum / miniBatchSize + regVal) val update = updater.compute( weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam) weights = update._1 regVal = update._2 } logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format( stochasticLossHistory.takeRight(10).mkString(", "))) (weights, stochasticLossHistory.toArray) }`

The most important part in the above Code is the use of the aggregate function. Let's take a look at the definition of the aggregate function.

`def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = { // Clone the zero value since we will also be serializing it as part of tasks var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance()) val cleanSeqOp = sc.clean(seqOp) val cleanCombOp = sc.clean(combOp) val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp) val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult) sc.runJob(this, aggregatePartition, mergeResult) jobResult }`

The aggregate function has three input parameters: zerovalue, seqop, and combop.

- Seqop is executed in parallel, and tasks on executors are used for computing.
- Combop is executed in serial mode. The combop operation is called in the tasksucceeded function of jobwaiter.

To further deepen our understanding of the aggregate function, let's take a small example. After spark-shell is started, run the following code:

`Val z = SC. parallelize (List (1, 2, 3, 4, 5, 6), 2) Z. aggregate (0) (math. max (_, _), _ + _) // The result is 9res0: Int = 9.`

Take a closer look at the log output at runtime. The job submitted by aggregate is composed of a stage (stage0). Because the entire dataset is divided into two partitions, two tasks are created for stage0 for parallel processing.

Leastsquaregradient

After finishing the execution of the aggregate function, let's go back and continue with the gradient. compute function that makes up seqop.

Leastsquaregradient is used to calculate the gradient and error. Note that the cumgraident in cmopute will return the changed result. Here, the formula is based on Q (w) in cost-function)

`Class leastsquaresgradient extends gradient {override def compute (data: vector, label: Double, weights: vector): (vector, double) = {Val brzdata = data. tobreeze Val brzweights = weights. tobreeze Val diff = brzweights. DOT (brzdata)-label Val loss = diff * diff Val gradient = brzdata * (2.0 * diff) (vectors. frombreeze (gradient), loss)} override def compute (data: vector, label: Double, weights: V Ector, cumgradient: vector): Double = {Val brzdata = data. tobreeze Val brzweights = weights. tobreeze // dot indicates the dot product. It is a binary operation that accepts two vectors on the real number R and returns a real number scalar. The result is the standard inner product of the Euclidean space. // The dot product of two vectors is written as a · B. The result of dot multiplication is called dot product, also known as the number product Val diff = brzweights. DOT (brzdata)-label // The following sentence completes y + = A * x brzaxpy (2.0 * diff, brzdata, cumgradient. tobreeze) diff * diff }}`

The breeze-related functions frequently appear in the above Code, so you will be curious about what it is for you.

The real point is not odd. It is calculated by a large number of moment arrays and vectors, to better support and encapsulate these computations, the breeze library is introduced.

Breeze, epic and Puck are three columns in scalanlp. The specific parameters are www.scalanlp.org.

Regularization process

The weight coefficient is updated based on the gradient and error obtained from this iteration. At this time, the regularization rule is used. That is, the following statement triggers the update of the weight coefficient.

` val update = updater.compute( weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)`

Take ridge regression as an example to check the code implementation in the update process.

`class SquaredL2Updater extends Updater { override def compute( weightsOld: Vector, gradient: Vector, stepSize: Double, iter: Int, regParam: Double): (Vector, Double) = { // add up both updates from the gradient of the loss (= step) as well as // the gradient of the regularizer (= regParam * weightsOld) // w‘ = w - thisIterStepSize * (gradient + regParam * w) // w‘ = (1 - thisIterStepSize * regParam) * w - thisIterStepSize * gradient val thisIterStepSize = stepSize / math.sqrt(iter) val brzWeights: BV[Double] = weightsOld.toBreeze.toDenseVector brzWeights :*= (1.0 - thisIterStepSize * regParam) brzAxpy(-thisIterStepSize, gradient.toBreeze, brzWeights) val norm = brzNorm(brzWeights, 2.0) (Vectors.fromBreeze(brzWeights), 0.5 * regParam * norm * norm) }}`

Result Prediction

After calculating the weights and intercept, you can create a linear regression model linearregressionmodel and use the predict function of the model to predict the observed values.

`class LinearRegressionModel ( override val weights: Vector, override val intercept: Double) extends GeneralizedLinearModel(weights, intercept) with RegressionModel with Serializable { override protected def predictPoint( dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double = { weightMatrix.toBreeze.dot(dataMatrix.toBreeze) + intercept }}`

Note that the linearregression constructor requires the weight (weights) and intercept as the input parameter. to make predictions for new variables, you must call predictpoint.

A complete sample program

Run the following statement in spark-shell to try it out.

`import org.apache.spark.mllib.regression.LinearRegressionWithSGDimport org.apache.spark.mllib.regression.LabeledPointimport org.apache.spark.mllib.linalg.Vectors// Load and parse the dataval data = sc.textFile("mllib/data/ridge-data/lpsa.data")val parsedData = data.map { line => val parts = line.split(‘,‘) LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(‘ ‘).map(_.toDouble)))}// Building the modelval numIterations = 100val model = LinearRegressionWithSGD.train(parsedData, numIterations)// Evaluate model on training examples and compute training errorval valuesAndPreds = parsedData.map { point => val prediction = model.predict(point.features) (point.label, prediction)}val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()println("training Mean Squared Error = " + MSE)`

Summary

Again, find the corresponding**Assume that the function**, Used for evaluation**Loss Function**,**Optimal Solution**,**Regularization rules**