Classification algorithm: Parallel logistic regression _ Data Mining


Logistic regression (LR) is a very common classification algorithm in machine learning and is used widely across the Internet industry: CTR estimation in advertising systems, conversion-rate prediction in recommendation systems, spam detection in anti-spam systems, and so on. LR is favored for its simple principle and broad applicability. In practice, because of the limited processing capacity and efficiency of a single machine, solving an LR problem on large-scale sample data usually has to be parallelized. This article discusses the implementation of LR from the perspective of parallelization.

1. LR's basic principle and solution method

In the LR model, a feature weight vector is used to weight the values of the different dimensions of a sample's feature vector, and the weighted sum is compressed into the range 0 to 1 by a logistic function; this value is taken as the probability that the sample is a positive sample. The logistic function is the curve shown in Figure 1.


Fig. 1 Logistic function curve

Given m training samples, where x_j = {x_{ji} | i = 1, 2, ..., n} is an n-dimensional real vector (the feature vector; unless stated otherwise, all vectors in this article are column vectors), and y_j takes the value +1 or -1 as the category label, +1 for positive samples and -1 for negative samples. In the LR model, the probability that the j-th sample is a positive sample is:

    P(y_j = 1 | x_j) = 1 / (1 + exp(-w^T x_j))

where w is the n-dimensional feature weight vector, i.e. the model parameter to be solved for in the LR problem.
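For concreteness, the probability above can be written as a short NumPy function; the names and the assumption that w and x are one-dimensional arrays of length n are purely illustrative, not code from the original article:

    import numpy as np

    def prob_positive(w, x):
        """P(y = +1 | x) = 1 / (1 + exp(-w^T x)): the logistic function of the weighted sum."""
        return 1.0 / (1.0 + np.exp(-np.dot(w, x)))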

To solve the LR problem, we look for a suitable feature weight vector w such that for the positive samples in the training set the value P(y_j = 1 | x_j) is as large as possible, and for the negative samples it is as small as possible (equivalently, 1 - P(y_j = 1 | x_j) is as large as possible). Expressed as a joint probability over all samples, this means maximizing

    prod_{j=1..m} 1 / (1 + exp(-y_j * w^T x_j))

Taking the logarithm of the expression above and negating it gives the equivalent problem of minimizing

    f(w) = sum_{j=1..m} log(1 + exp(-y_j * w^T x_j))        Formula (1)

Formula (1) is the objective function to be minimized when solving LR.
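A minimal sketch of Formula (1), assuming (for illustration only) that X is the m*n sample matrix, y the vector of +1/-1 labels, and w the weight vector:

    import numpy as np

    def lr_objective(w, X, y):
        """f(w) = sum_j log(1 + exp(-y_j * w^T x_j)), the LR objective of Formula (1)."""
        margins = y * (X @ w)                     # y_j * w^T x_j for every sample
        # log1p(exp(-m)) is a numerically safer form of log(1 + exp(-m))
        return np.sum(np.log1p(np.exp(-margins)))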

Finding the w that minimizes the objective function f(w) is an unconstrained optimization problem. The common approach is to start from a randomly chosen initial point w_0 and iterate: in each iteration, compute the descent direction of the objective function at the current point and update w, until the objective function stabilizes at a minimum. The basic steps are shown in Figure 2.


Figure 2 Basic steps for finding the optimum of the objective function
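The loop of Figure 2 can be sketched generically as follows; the fixed step size and the stopping rule on the gradient norm are simplifying assumptions (real solvers use a line search or a trust region), and all names are illustrative:

    import numpy as np

    def iterate(grad_fn, direction_fn, w0, step=0.1, tol=1e-6, max_iter=100):
        """Generic descent loop: compute a descent direction d_t at the current w, update w."""
        w = np.asarray(w0, dtype=float).copy()
        for t in range(max_iter):
            g = grad_fn(w)                      # gradient of the objective at w_t
            if np.linalg.norm(g) < tol:         # objective has (nearly) stabilized
                break
            d = direction_fn(g, w)              # descent direction; differs per algorithm
            w = w + step * d                    # update the weights
        return w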

The difference between optimization algorithms lies in how the descent direction d_t of the objective function is computed. The descent direction is obtained from the first derivative (the gradient) and the second derivative (the Hessian matrix) of the objective function at the current w. The common algorithms are the gradient descent method, Newton's method, and quasi-Newton methods.

(1) Gradient descent method (gradient descent)

The gradient descent method directly uses the negative gradient of the objective function at the current w as the descent direction:

    d_t = -g_t = -grad f(w_t)

where g_t = grad f(w_t) is the gradient of the objective function, computed as:

    grad f(w) = -sum_{j=1..m} [ 1 / (1 + exp(y_j * w^T x_j)) ] * y_j * x_j        Formula (2)
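A direct sketch of Formula (2) in NumPy (again with illustrative names); the gradient descent direction is then simply the negative of this vector, and gd_direction below could be passed as direction_fn to the loop sketched after Figure 2:

    import numpy as np

    def lr_gradient(w, X, y):
        """Gradient of Formula (1) per Formula (2):
        grad f(w) = -sum_j [1 / (1 + exp(y_j * w^T x_j))] * y_j * x_j."""
        margins = y * (X @ w)                     # y_j * w^T x_j
        coeffs = -y / (1.0 + np.exp(margins))     # the scalar multiplying each x_j
        return X.T @ coeffs                       # sum over samples of coeff_j * x_j

    def gd_direction(g, w):
        """Gradient descent direction d_t = -grad f(w_t)."""
        return -g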

(2) Newton's method (Newton Method)

Newton's method approximates the objective function at the current w with a second-order Taylor expansion, and then uses this approximation to solve for the descent direction:

    d_t = -B_t^{-1} * grad f(w_t)

where B_t is the Hessian matrix of the objective function f(w) at w_t. This search direction is also called the Newton direction.
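For illustration, obtaining the Newton direction amounts to solving one linear system with the Hessian; a sketch (solving rather than explicitly inverting B_t):

    import numpy as np

    def newton_direction(B_t, g_t):
        """Newton direction d_t from B_t * d_t = -g_t (B_t: Hessian at w_t, g_t: gradient)."""
        return np.linalg.solve(B_t, -g_t)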

(3) Quasi-Newton methods (Quasi-Newton Methods)

Quasi-Newton methods only require the gradient of the objective function to be computed in each iteration; an approximate Hessian matrix is constructed by fitting and used to compute the Newton direction. The earliest quasi-Newton method was DFP (proposed by W. C. Davidon in 1959 and later refined by R. Fletcher and M. J. Powell). DFP inherits the fast convergence of Newton's method and, to avoid recomputing the Hessian matrix in every iteration, updates the previous iteration's Hessian approximation using the gradients; its drawback is that the inverse of the approximate Hessian still has to be computed in each iteration to obtain the Newton direction.

BFGS is a method invented independently by C. G. Broyden, R. Fletcher, D. Goldfarb and D. F. Shanno. It incrementally maintains the inverse of the Hessian matrix, H_t = B_t^{-1}, which avoids inverting a matrix in every iteration. The Newton direction in BFGS is expressed as:

    d_t = -H_t * grad f(w_t)

where H_t is updated incrementally from H_{t-1} using only the change in w and the change in the gradient, i.e. with s = w_t - w_{t-1}, y = grad f(w_t) - grad f(w_{t-1}) and rho = 1 / (y^T s):

    H_t = (I - rho * s * y^T) H_{t-1} (I - rho * y * s^T) + rho * s * s^T

L-BFGS (Limited-memory BFGS) removes the need to store the n*n inverse Hessian approximation after each iteration of BFGS; only two sets of vectors and one set of scalars have to be kept for the most recent iterations:

    s_k = w_{k+1} - w_k,   y_k = grad f(w_{k+1}) - grad f(w_k),   rho_k = 1 / (y_k^T s_k)

In the t-th iteration, L-BFGS needs only a two-loop recursion over these stored quantities to compute the Newton direction incrementally; a sketch of this recursion follows.
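This is the textbook two-loop form of the recursion, written in NumPy; none of the names come from the article:

    import numpy as np

    def lbfgs_direction(g, s_hist, y_hist, rho_hist):
        """Two-loop recursion: returns the L-BFGS direction -H_t * g from the stored
        s_k = w_{k+1} - w_k, y_k = grad_{k+1} - grad_k, rho_k = 1 / (y_k . s_k)."""
        q = np.array(g, dtype=float)
        alphas = []
        # first loop: walk the history from newest to oldest
        for s, y, rho in zip(reversed(s_hist), reversed(y_hist), reversed(rho_hist)):
            a = rho * np.dot(s, q)
            q -= a * y
            alphas.append(a)
        # scale by an initial diagonal Hessian approximation gamma * I
        if s_hist:
            gamma = np.dot(s_hist[-1], y_hist[-1]) / np.dot(y_hist[-1], y_hist[-1])
        else:
            gamma = 1.0
        r = gamma * q
        # second loop: walk the history from oldest to newest
        for s, y, rho, a in zip(s_hist, y_hist, rho_hist, reversed(alphas)):
            b = rho * np.dot(y, r)
            r += (a - b) * s
        return -r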

2. Implementation of parallel LR

From the solution methods above, it can be seen that computing the gradient of the objective function is the most basic step for the gradient descent method, Newton's method, and quasi-Newton methods alike, and that L-BFGS computes the Newton direction with the two-loop recursion, avoiding the Hessian matrix entirely. Parallelizing logistic regression is therefore mainly a matter of parallelizing the gradient computation of the objective function. From Formula (2), the gradient computation only requires dot products and additions of vectors, so each iteration is easy to split into independent computation steps carried out by different nodes, with the results merged afterwards.

The labels of the m samples form an m-dimensional label vector, and the m n-dimensional feature vectors form an m*n sample matrix, as shown in Figure 3. Each row of the sample matrix is one sample's feature vector (m rows), and each column corresponds to one feature dimension (n columns).


Fig. 3 Sample label vector & sample feature vectors

If the sample matrix is split by rows and the sample feature vectors are distributed to different compute nodes, each node completes the dot products and sums for its own samples and the results are then merged; this is LR "parallelized by rows". Row-parallel LR solves the problem of a large number of samples, but in practice there are also logistic regressions on very high-dimensional feature vectors (for example, the feature dimension in an ad system can reach billions), and row-parallel processing alone cannot meet the demands of such scenarios. Therefore, the high-dimensional feature vector also needs to be split into several smaller vectors, i.e. the problem must also be partitioned by columns.
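A single-process sketch of the row-parallel gradient: each sample block stands in for one compute node, and the final sum stands in for the merge step (in a real system, e.g. an MPI reduction). The names and the use of plain NumPy are assumptions for illustration:

    import numpy as np

    def row_parallel_gradient(w, X_blocks, y_blocks):
        """Formula (2) split by rows: each (X_r, y_r) block would live on one compute node."""
        partials = []
        for X_r, y_r in zip(X_blocks, y_blocks):      # one loop iteration == one node
            margins = y_r * (X_r @ w)
            coeffs = -y_r / (1.0 + np.exp(margins))
            partials.append(X_r.T @ coeffs)           # partial gradient from this node's samples
        return np.sum(partials, axis=0)               # merge step: sum of the partial gradients

The blocks could be produced, for example, with np.array_split(X, M) and np.array_split(y, M).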

(1) Data segmentation

Suppose all compute nodes are arranged into M rows and N columns (M*N compute nodes in total). The samples are divided by rows, so each compute node is assigned m/M sample feature vectors together with their class labels; the feature vectors are split by columns, so each node is assigned n/N of the feature dimensions. As shown in Figure 4, the features of the same sample reside on nodes with the same row number, and the same features of different samples reside on nodes with the same column number.


Fig. 4 Data segmentation in parallel LR

The feature vector of one sample is thus split across the nodes in the different columns of the same row, namely:

    x_{r,k} = < X_{(r,1),k}, X_{(r,2),k}, ..., X_{(r,N),k} >

where x_{r,k} denotes the k-th sample vector on the r-th row of nodes, and X_{(r,c),k} denotes the component of x_{r,k} held by the node in column c. Similarly, W_c denotes the component of the feature weight vector w held by the column-c nodes, namely:

    w = < W_1, W_2, ..., W_N >
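A minimal single-machine sketch of this partitioning, using np.array_split to play the role of distributing rows and columns to the M*N node grid (names are illustrative):

    import numpy as np

    def split_blocks(X, y, w, M, N):
        """Cut the m*n sample matrix into M row blocks and N column blocks.
        block_X[r][c] is what node (r, c) would hold; w_parts[c] corresponds to W_c."""
        row_idx = np.array_split(np.arange(X.shape[0]), M)   # samples for each node row
        col_idx = np.array_split(np.arange(X.shape[1]), N)   # features for each node column
        block_X = [[X[np.ix_(ri, ci)] for ci in col_idx] for ri in row_idx]
        block_y = [y[ri] for ri in row_idx]                  # labels only depend on the row
        w_parts = [w[ci] for ci in col_idx]                  # W_1, ..., W_N
        return block_X, block_y, w_parts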


(2) Parallel computing

Looking at the gradient formula of the objective function (Formula (2)), the computation relies on two results: the dot product of the feature weight vector w_t with the feature vector x_j, and the multiplication of a scalar with the feature vector x_j. Accordingly, the gradient computation can be divided into two parallel computation steps and two result-merging steps:

① Parallel dot-product computation on each node: node (r, c) computes d_{(r,c),k,t} = W_{c,t}^T X_{(r,c),k} for k = 1, 2, ..., m/M, i.e. the dot product of the k-th feature-vector component on node (r, c) with the corresponding feature weight component in iteration t, where W_{c,t} is the component of the feature weight vector on the column-c nodes in iteration t.

② Dot-product merging across nodes with the same row number:

    d_{r,k,t} = w_t^T x_{r,k} = sum_{c=1..N} d_{(r,c),k,t}

The merged dot products then need to be sent back to all compute nodes in that row, as shown in Figure 5.

Fig. 5 Merging of the dot-product results


③ Independent scalar-vector multiplication on each node:

    G_{(r,c),t} = -sum_{k=1..m/M} [ 1 / (1 + exp(y_{r,k} * d_{r,k,t})) ] * y_{r,k} * X_{(r,c),k}

G_{(r,c),t} can be understood as the column-c component of the gradient contribution computed from the partial set of samples held by the row-r nodes.

④ Gradient merging across nodes with the same column number:

    G_{c,t} = sum_{r=1..M} G_{(r,c),t}

G_{c,t} is the component of the objective-function gradient vector g_t on the column-c nodes; concatenating these components yields the full gradient vector:

    g_t = < G_{1,t}, G_{2,t}, ..., G_{N,t} >

This process is shown in Figure 6, and a single-machine sketch combining all four steps is given after the figure.


Fig. 6 Merging of the gradient computation results
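Putting steps ① through ④ together, here is a single-process simulation of the block-parallel gradient; the loops over r and c stand in for computation on separate nodes, and the two summations stand in for the row-wise and column-wise merges (e.g. MPI reductions). This is an illustrative sketch, not the article's MPI implementation:

    import numpy as np

    def block_parallel_gradient(w, X, y, M, N):
        """Gradient of Formula (1) computed block-wise following steps 1-4."""
        row_idx = np.array_split(np.arange(X.shape[0]), M)   # sample split across node rows
        col_idx = np.array_split(np.arange(X.shape[1]), N)   # feature split across node columns

        grad = np.zeros_like(w, dtype=float)
        for r in range(M):
            # Step 1: each node (r, c) computes its partial dot products W_c^T X_(r,c),k
            partial = [X[np.ix_(row_idx[r], col_idx[c])] @ w[col_idx[c]] for c in range(N)]
            # Step 2: merge along the node row to get the full dot products w^T x_{r,k}
            dots = np.sum(partial, axis=0)
            # Step 3: each node scales its feature block by the per-sample coefficients
            coeffs = -y[row_idx[r]] / (1.0 + np.exp(y[row_idx[r]] * dots))
            for c in range(N):
                # Step 4: merge along the node column by accumulating into G_c
                grad[col_idx[c]] += X[np.ix_(row_idx[r], col_idx[c])].T @ coeffs
        return grad

On small inputs this returns the same vector as the un-split Formula (2), which is a convenient sanity check when porting the logic to MPI.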

Combining the above steps, the overall parallel LR computation flow is shown in Figure 7. Comparing Figure 2 and Figure 7, parallel LR is in fact the process of solving for the optimum of the loss function with the gradient computation parallelized; the step that turns the gradient into a descent direction can also be parallelized (for example, the two-loop recursion used by L-BFGS to obtain the Newton direction).


Figure 7 Parallel LR calculation flow

3. Experiment and results

Using MPI, parallel LR was implemented based on the gradient descent method (MPI_GD) and on L-BFGS (MPI_L-BFGS), and the training efficiency of the three methods (MPI_GD, MPI_L-BFGS and LIBLINEAR) was compared. LIBLINEAR is an open-source library that includes a TRON-based LR solver (LIBLINEAR's developer Chih-jen Lin proposed the TRON method in 1999 and showed in the paper that TRON is more efficient than L-BFGS in the single-machine case). Since LIBLINEAR is not parallel (although in fact it can be modified to be), the experiment was carried out on a single machine, with MPI_GD and MPI_L-BFGS each using 10 processes.

The experimental data consist of 2 million training samples with a feature dimension of 2,000 and a positive-to-negative sample ratio of 3:7. The classification performance of MPI_GD, MPI_L-BFGS and LIBLINEAR was compared using 10-fold cross-validation. The results, shown in Figure 8, are almost indistinguishable.


Fig. 8 Comparison of classification performance

The training data were then increased from 100,000 to 2 million samples and the training time of the three methods was compared; the results are shown in Figure 9. MPI_GD converges slowly: even with 10 processes its performance is still weaker than single-machine LIBLINEAR, since it needs roughly 30 iterations to converge. MPI_L-BFGS needs only 3 to 5 iterations to converge (close to LIBLINEAR); although each iteration requires extra overhead to compute the Newton direction, its convergence is much faster than MPI_GD's, and thanks to the multi-process parallelism its training time is also far lower than LIBLINEAR's.


Figure 9 Comparison of training time
