Methodology for Training a Logistic Regression Model on Big Data

Source: Internet
Author: User
Keywords: algorithm, function

In today's society of data inflation, the value of data is becoming more and more prominent. How to effectively mine the useful information hidden in massive data has become a common problem in every field. Driven by the practical demands of Internet enterprises, technology companies have begun to extract the information contained in massive data using algorithms from machine learning, data mining, and artificial intelligence, and have achieved good results.

Today's society has moved from the information scarcity of the past to an era of information flooding. With the popularization of the network and related applications, network data increasingly exhibits a "massive, high-dimensional" character, and how to use existing machine learning or data mining algorithms to obtain useful information has become a focus of both academia and industry. Percent, a domestic big data technology service provider, has applied machine learning techniques to big data analysis. For one of Percent's group-purchase client sites, we selected ten features based on the characteristics of goods and users and, combined with a classification algorithm from machine learning, built a classifier for user recommendation. In actual use, the site's average click-through rate rose by 19%, the order rate rose by 42%, and the direct order rate nearly doubled, achieving the goal of improving the recommendation effect.

In this article, we will use the classic machine learning algorithm, the logistic regression model, as the predictive model and, taking the classification model Percent developed for a group-purchase website as a concrete example, explain how to train a model effectively on "massive, high-dimensional" data.

What is a logistic regression model?

The logistic regression model (Logistic Regression, LR), hereinafter referred to as the LR model, is a machine learning algorithm widely used in real-world scenarios. This article deals with the binary logistic regression prediction model, i.e., a classifier whose class labels take values in $\{0, 1\}$. Suppose the training set is $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$; the training set can then be regarded as an $N \times d$ matrix. Because this article mainly targets high-dimensional data, and because the presence of dummy variables means the data contains a large number of 0/1 values, the whole training set can be considered a high-dimensional sparse matrix.

Before introducing how to train the model, we first describe the logistic regression model briefly. Logistic regression is a discriminant-based method: it assumes the class instances are linearly separable and obtains the final prediction model by directly estimating the parameters of the discriminant. The logistic regression model does not model the class-conditional densities, but the ratio of the class conditionals. Suppose the class-conditional log-likelihood ratio is linear:

$$\log \frac{p(x \mid y = 1)}{p(x \mid y = 0)} = w^{T} x + w_0$$

Using the Bayesian formula, we have:

$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \log \frac{p(x \mid y = 1)}{p(x \mid y = 0)} + \log \frac{P(y = 1)}{P(y = 0)} = w^{T} x + w_0'$$

So we obtain the logistic regression model:

$$P(y = 1 \mid x) = \frac{1}{1 + \exp\!\big(-(w^{T} x + w_0')\big)}$$

which serves as the estimate of $P(y = 1 \mid x)$.
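To make the final formula concrete, here is a minimal R sketch (not from the original article) that computes $P(y = 1 \mid x)$ for a matrix of instances, assuming the weights $w$ and intercept $w_0$ have already been estimated:

```r
# Minimal sketch (illustrative only): logistic regression prediction in R.
# Assumes the weight vector `w` and intercept `w0` have already been estimated.
sigmoid <- function(z) 1 / (1 + exp(-z))

# Predicted probability P(y = 1 | x) for a design matrix X (rows = instances).
predict_lr <- function(X, w, w0) {
  as.numeric(sigmoid(X %*% w + w0))
}

# Toy usage: 3 instances, 2 features, arbitrary weights.
X  <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 3, byrow = TRUE)
w  <- c(0.8, -0.5)
w0 <- 0.1
predict_lr(X, w, w0)  # probabilities in (0, 1)
```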

Training Logistic regression model

Once we have decided to use the LR model and selected the initial feature set, the next step is to obtain the best parameter estimates, so that the trained LR model achieves the best classification performance. This process can also be seen as a search: within the solution space of the LR model, how do we find the solution that best fits the model we have designed? To obtain the best LR model, we need to design a search strategy and decide by what criterion to choose the optimal model.

How do we choose the best LR model? The intuitive idea is to evaluate a prediction model by how well its predictions match the true values. In machine learning, a loss function (or cost function) is used to measure the degree of match between the predicted result and the true value. The loss function is a non-negative real-valued function, and different loss functions can be designed according to different requirements. In this article the loss function is written $L(Y, f(X))$, where $f(X)$ is the value predicted by the model $f$ for a test instance $X$, and $Y$ is the true class label of $X$.

The loss functions commonly used in machine learning include the following:

0-1 loss function: $L(Y, f(X)) = 1$ if $Y \ne f(X)$, and $0$ if $Y = f(X)$

Squared loss function: $L(Y, f(X)) = (Y - f(X))^2$

Absolute loss function: $L(Y, f(X)) = |Y - f(X)|$

Logarithmic loss function (log-likelihood loss): $L(Y, P(Y \mid X)) = -\log P(Y \mid X)$
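As a small, illustrative sketch (not part of the original experiment), these losses can be written directly in R; averaging them over the training sample gives the empirical risk discussed below:

```r
# Minimal sketch (illustrative only): the common loss functions listed above,
# evaluated on vectors of true labels and predictions.
zero_one_loss <- function(y, y_hat) as.numeric(y != y_hat)
squared_loss  <- function(y, y_hat) (y - y_hat)^2
absolute_loss <- function(y, y_hat) abs(y - y_hat)
# Log loss uses the predicted probability of the true class.
log_loss      <- function(y, p_hat) -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

# Toy usage: the empirical risk is simply the mean loss over the training sample.
y     <- c(1, 0, 1, 1)
p_hat <- c(0.9, 0.2, 0.6, 0.4)
mean(log_loss(y, p_hat))             # average logarithmic loss
mean(zero_one_loss(y, p_hat > 0.5))  # average 0-1 loss at a 0.5 threshold
```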

Since the input and output of the model, $(X, Y)$, are random variables following the joint distribution $P(X, Y)$, the expectation of the loss function is:

$$R_{\exp}(f) = E_P\big[L(Y, f(X))\big] = \int L\big(y, f(x)\big)\, P(x, y)\, dx\, dy$$

The expectation above represents the loss of the prediction model over the joint distribution $P(X, Y)$ in the mean sense, and is called the risk function or expected loss. The loss function and the risk function both measure the classification ability of a prediction model, but the former does so at the microscopic (single instance) level and the latter at the macroscopic (mean) level. Similarly, we can compute the average loss over the training data set, called the empirical risk or empirical loss, denoted:

$$R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big)$$

Here $R_{\exp}(f)$ is the expected loss of the prediction model with respect to the joint distribution, while $R_{\mathrm{emp}}(f)$ is the average loss of the model over the training sample. By the law of large numbers, the empirical loss can be used in place of the expected loss when the sample size is large. However, during model training, because of noisy data or data drift, the generalization ability of the trained model can be very poor; this is the famous overfitting problem in machine learning. To address it, regularization is applied: a constraint is imposed artificially by adding to the empirical risk function a regularization term (regularizer) or penalty term that represents the complexity of the model. This is called structural risk minimization (SRM) and can be expressed with the following formula:

$$R_{\mathrm{srm}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + \lambda J(f)$$

Here $J(f)$ penalizes the complexity of the model: the more complex the model $f$, the larger $J(f)$. The coefficient $\lambda \ge 0$ weighs the empirical risk against the complexity of the model.
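As a hedged illustration (the article ultimately omits the penalty term for its own model), a structural risk of this form could be written in R as the empirical log loss plus an L2 penalty:

```r
# Minimal sketch (illustrative only): structural risk = empirical log loss + L2 penalty.
# `w` is the weight vector, `w0` the intercept, `lambda` the regularization coefficient.
structural_risk <- function(w, w0, X, y, lambda = 0) {
  p <- 1 / (1 + exp(-(X %*% w + w0)))
  empirical_risk <- -mean(y * log(p) + (1 - y) * log(1 - p))
  empirical_risk + lambda * sum(w^2)   # J(f) taken here as the squared L2 norm of w
}
```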

In machine learning, three types of methods are generally used to estimate the parameters of the corresponding empirical risk function:

When the designed model is very simple and the data volume is very large, the model parameters can be obtained by maximum likelihood estimation (Maximum Likelihood Estimation, MLE).

When the designed model is very complex and contains hidden (latent) variables, the EM algorithm can be used to estimate the model parameters. It generally proceeds in two steps: first, with the parameters fixed, take the expectation over the latent variables to obtain a likelihood function free of latent variables; second, use MLE to estimate and update the corresponding parameter values.

When the model is not very complex but the data is scarce and some prior knowledge is available, Bayesian statistical methods can be used to estimate the model parameters, i.e., maximum a posteriori estimation (Maximum A Posteriori, MAP). First, based on the prior knowledge, a prior distribution is given for the parameters to be estimated; then the posterior distribution of the parameters is derived from the Bayesian formula; finally the posterior probability is maximized to obtain the corresponding parameter values.
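For concreteness (a standard formulation, not taken verbatim from the article), the MLE and MAP objectives for a parameter vector $w$ can be written as:

$$\hat{w}_{\mathrm{MLE}} = \arg\max_{w} \sum_{i=1}^{N} \log P(y_i \mid x_i; w), \qquad \hat{w}_{\mathrm{MAP}} = \arg\max_{w} \Big[ \sum_{i=1}^{N} \log P(y_i \mid x_i; w) + \log p(w) \Big]$$

With a Gaussian prior $p(w)$, MAP reduces to MLE with an L2 penalty, which connects these estimation methods to the regularization term discussed above.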

Because this article deals with "high-dimensional, massive" training data and uses the relatively simple LR model as the prediction model, we use MLE to construct the empirical risk function when training the model. In addition, because the training data is sufficient, no model-complexity penalty (regularization term) is added to the empirical risk function. The specific risk function of our model is:

$$R(w, w_0) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big], \qquad p_i = \frac{1}{1 + \exp\!\big(-(w^{T} x_i + w_0)\big)}$$

The problem is thus transformed into an unconstrained optimization problem. When training a model on massive data, we must consider how to train it efficiently. In the actual development process, in my view the efficiency of model training can be improved in two ways. The first is to optimize the in-memory storage structure of the data, especially for "high-dimensional, sparse" matrices; in this experiment we used the sparse matrix format from the Matrix package in R, which greatly improved the computational efficiency of the algorithm (see the sketch after the list of algorithms below). The second is to select a suitable iterative algorithm to speed up the convergence of the empirical risk function. Some common iterative algorithms are listed below:

Newton-Raphson iterative algorithm: at each iteration the algorithm needs to compute the Hessian matrix, so each iteration is expensive and the overall iteration time is long.

Quasi-Newton iterative algorithms: these use approximations of the Hessian matrix, reducing the time per iteration and improving the efficiency of the algorithm. Two classic quasi-Newton algorithms are BFGS and L-BFGS. BFGS uses all of the historical results to approximate the Hessian matrix; although this improves the efficiency of the whole algorithm, the need to store a large amount of history means the algorithm is limited by memory size, which restricts its range of application. L-BFGS addresses exactly this memory consumption of BFGS: it keeps only a limited number of recent results, greatly reducing the algorithm's dependence on memory.

In practice the iterative algorithm is chosen according to the actual requirements and the characteristics of the data itself. In this experiment we chose the Newton-Raphson algorithm and the L-BFGS algorithm as the iterative algorithms for the LR model.
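The following R sketch (illustrative only, not the article's actual code) brings the two points above together: it stores a high-dimensional 0/1 design matrix in the sparse format of the Matrix package and minimizes the LR risk function from the previous section with the L-BFGS implementation in optim:

```r
# Minimal sketch (illustrative only): sparse storage + L-BFGS for the LR risk function.
library(Matrix)

set.seed(1)
n <- 1000; d <- 50
# A sparse 0/1 design matrix, as produced by dummy variables.
X <- rsparsematrix(n, d, density = 0.05, rand.x = function(k) rep(1, k))
y <- rbinom(n, 1, 0.5)

# Negative log-likelihood and its gradient; theta = (w0, w).
nll <- function(theta) {
  w0 <- theta[1]; w <- theta[-1]
  eta <- as.numeric(X %*% w + w0)
  sum(log1p(exp(eta))) - sum(y * eta)
}
grad <- function(theta) {
  w0 <- theta[1]; w <- theta[-1]
  p <- 1 / (1 + exp(-as.numeric(X %*% w + w0)))
  c(sum(p - y), as.numeric(crossprod(X, p - y)))
}

fit <- optim(rep(0, d + 1), nll, grad, method = "L-BFGS-B")
head(fit$par)  # estimated intercept and first few weights
```

Because the design matrix is stored as a sparse Matrix object, the products `X %*% w` and `crossprod(X, p - y)` only touch the non-zero entries, which is where most of the efficiency gain on "high-dimensional, sparse" data comes from.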

Attribute Selection

When the iterations of the learning algorithm have finished, we obtain the weights of the corresponding attributes. The next task is to check the significance of each attribute with respect to the response variable, verify the attribute set of the trained model, and remove the features whose significance does not meet the threshold. Because MLE was used to construct the risk function, the Wald test can be used to validate the estimated parameters. Before using the Wald test, make sure that the difference between the estimate and the hypothesized value is approximately normally distributed. The general form of the Wald statistic is:

$$W = \frac{(\hat{\theta} - \theta_0)^2}{\mathrm{Var}(\hat{\theta})}$$

Here $\hat{\theta}$ is the estimate, $\theta_0$ is the hypothesized value, and $\mathrm{Var}(\hat{\theta})$ is the variance of the estimate. In this experiment the null hypothesis is $w_j = 0$, meaning that the attribute is unrelated to the response variable, so the Wald statistic of this experiment can be expressed as:

$$W_j = \frac{\hat{w}_j^{\,2}}{\mathrm{SE}(\hat{w}_j)^{2}}$$

Here $\hat{w}_j$ is the estimated parameter value and $\mathrm{SE}(\hat{w}_j)$ is its standard error. Because the Wald statistic follows a chi-square distribution, the p-value can be computed from the chi-square distribution. If the p-value is greater than the specified threshold, the null hypothesis cannot be rejected; that is, the attribute is not significantly related to the response variable and should be removed, otherwise the variable is kept. In the actual training process, each time attribute significance is checked, only the largest p-value is compared with the manually set threshold. If that p-value is not greater than the threshold, the model is fully trained; otherwise the attribute with the largest p-value is deleted and the prediction model is updated. The updated model is then re-trained, the corresponding weights are re-estimated, and each attribute is validated again with the Wald test. This procedure is repeated until the Wald-test p-value of every variable is below the manually set threshold; the training process of the whole model is then complete.
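As an illustrative sketch (not the article's code), R's glm reports Wald statistics and p-values directly, and the backward-elimination loop described above could look like this, assuming a data frame `dat` with a 0/1 response column `y` and an arbitrary significance threshold:

```r
# Minimal sketch (illustrative only): backward elimination driven by Wald p-values.
backward_wald <- function(dat, threshold = 0.05) {
  repeat {
    fit   <- glm(y ~ ., data = dat, family = binomial)
    coefs <- summary(fit)$coefficients               # Estimate, Std. Error, z value, Pr(>|z|)
    coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
    pval  <- coefs[, "Pr(>|z|)"]
    if (length(pval) == 0 || max(pval) <= threshold) return(fit)
    worst <- rownames(coefs)[which.max(pval)]         # attribute with the largest p-value
    dat   <- dat[, setdiff(names(dat), worst), drop = FALSE]
  }
}

# Toy usage with simulated data: x3 carries no signal and should be dropped.
set.seed(2)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
dat$y <- rbinom(200, 1, 1 / (1 + exp(-(1.5 * dat$x1 - 0.8 * dat$x2))))
summary(backward_wald(dat))
```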
