First, the expression forms of L1 regularization
In machine learning, almost everyone knows L1 and L2 regularization. Both act as constraints on the model parameters and help prevent overfitting. But L1 and L2 also differ: L1 is more likely to produce sparse solutions, driving some parameters exactly to 0, whereas L2 does not have this advantage and only pushes parameters close to 0. This property gives L1 a feature-selection capability: if the coefficient of a feature is 0, that feature has no effect on the model and can be discarded.
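For instance, a minimal sketch of this feature-selection effect (the weight values and the use of NumPy are purely illustrative and not from the original article):

```python
import numpy as np

# A weight vector of the kind L1 regularization tends to produce
# (the values are made up for illustration).
w = np.array([0.0, 1.7, 0.0, -0.3, 0.0])

selected = np.flatnonzero(w)   # indices of features with nonzero coefficients
print(selected)                # -> [1 3]; features 0, 2 and 4 can be discarded
```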
Although L1 regularization has these advantages over L2, an objective with an L1 term is harder to optimize than one with an L2 term. Because the L2 term is differentiable, gradient-based optimization algorithms such as gradient descent, Newton's method, and quasi-Newton methods (the DFP, BFGS, and L-BFGS algorithms) can be applied directly to problems with L2 regularization. The L1 term, however, is non-differentiable, so these algorithms cannot be used directly and must be modified before they can solve L1-constrained optimization problems. There are two main ways to express a problem with an L1 term:
1. Convex-constraint formulation

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{T} L(w, z_i) \quad \text{s.t.} \quad \|w\|_1 \leq s$$

where $L(w, z_i)$ denotes the loss function on sample $z_i$ and $s$ is a constant bounding the L1 norm.
2. Soft-regularization formulation

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{T} L(w, z_i) + g \|w\|_1$$

where $g \geq 0$ is the regularization parameter. When the parameters $s$ and $g$ are chosen appropriately, the two formulations are equivalent.
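To fix ideas, here is a small sketch of the soft-regularization objective, assuming a linear model with squared loss; both modeling choices (and the function name) are illustrative assumptions, since the article keeps the loss $L$ generic.

```python
import numpy as np

def l1_soft_objective(w, X, y, g):
    """Soft-regularization objective: the summed loss plus g * ||w||_1.
    Squared loss on a linear model is an illustrative choice."""
    residuals = X @ w - y
    return 0.5 * np.sum(residuals ** 2) + g * np.sum(np.abs(w))
```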
Second, methods for processing big data

When the amount of data is very large, it may exceed the size of memory; in that case all the data cannot be loaded into memory to participate in the computation. There are two main ways to deal with this big-data problem:
- Parallel batch learning on many machines
- Stream-based online learning
1. The flow of stream-based online learning

The truncated gradient method (Truncated Gradient) introduced in this article adopts the second strategy. The flow of an online learning algorithm is roughly as follows (a minimal sketch of this loop is given after the list):
- A sample $z_t = (x_t, y_t)$ arrives;
- The output $\hat{y}_t$ corresponding to the sample is computed using the current weight vector $w^{(t)}$;
- Given the actual label $y_t$ of the sample, the loss $L(w^{(t)}, z_t)$ under the current weights is computed;
- The current weights are updated according to some strategy: $w^{(t+1)} = \text{update}(w^{(t)}, z_t)$.
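To make the loop concrete, here is a minimal Python sketch of this generic skeleton, assuming a linear model with squared loss (the modeling choice and the function names are illustrative assumptions, not part of the original article). Each of the update strategies discussed below can be plugged in as the `update` argument.

```python
import numpy as np

def online_learning(stream, update, w0):
    """Generic online learning loop: for each arriving sample z_t = (x_t, y_t),
    predict with the current weights, measure the loss, then update them."""
    w = np.array(w0, dtype=float)
    total_loss = 0.0
    for t, (x, y) in enumerate(stream, start=1):
        y_hat = w @ x                          # output under the current weights w^(t)
        total_loss += 0.5 * (y_hat - y) ** 2   # squared loss (illustrative choice)
        w = update(w, x, y, t)                 # w^(t+1) = update(w^(t), z_t)
    return w, total_loss
```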
2. Stochastic gradient descent

Stochastic gradient descent (SGD) is the simplest online learning algorithm, and its basic update strategy is:

$$w^{(t+1)} = w^{(t)} - \eta \nabla_w L(w^{(t)}, z_t)$$
where $\eta$ denotes the learning rate, which is usually taken to be a constant. It can also be taken as a function of the iteration count, for example:

$$\eta_t = \frac{1}{\sqrt{t}}$$

where $t$ denotes the current iteration.
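A minimal sketch of this update, written so it can be passed to the `online_learning` loop above; the squared-loss linear model and the parameter names are illustrative assumptions.

```python
import numpy as np

def sgd_update(w, x, y, t, eta0=0.1, decay=False):
    """SGD step w^(t+1) = w^(t) - eta_t * grad L(w^(t), z_t),
    for a linear model with squared loss (illustrative choice)."""
    eta = eta0 / np.sqrt(t) if decay else eta0   # constant or 1/sqrt(t) learning rate
    grad = (w @ x - y) * x                       # gradient of 0.5 * (w @ x - y)^2
    return w - eta * grad
```

For example, `online_learning(stream, sgd_update, w0)` would run plain SGD over the stream.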
Third, the truncated gradient method (Truncated Gradient)

As mentioned above, L1 regularization can drive the coefficients of certain features to 0, which gives it a feature-selection capability; this property is called sparsity, and L1 is able to produce sparse solutions. In order to produce sparse solutions while learning online, the most direct idea is truncation: a threshold is used to control the size of the coefficients, and a coefficient whose magnitude falls below the threshold is set to 0. This is the meaning of simple truncation.

1. Simple truncation (simple coefficient rounding)

In simple truncation, given a threshold $\theta$, a truncation is performed every $K$ steps of online learning; truncation means that coefficients whose magnitude is below the threshold are set directly to 0. The specific form is as follows:

$$w^{(t+1)} = \begin{cases} T_0\!\left(w^{(t)} - \eta \nabla_w L(w^{(t)}, z_t),\ \theta\right) & t \bmod K = 0 \\ w^{(t)} - \eta \nabla_w L(w^{(t)}, z_t) & \text{otherwise} \end{cases}$$
where $\theta$ denotes the threshold. The function $T_0$ is applied elementwise and has the following form:

$$T_0(v_j, \theta) = v_j \, I\!\left(|v_j| > \theta\right)$$

where $I(\cdot)$ is an indicator function:

$$I(\text{condition}) = \begin{cases} 1 & \text{if the condition holds} \\ 0 & \text{otherwise} \end{cases}$$
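A minimal sketch of simple coefficient rounding under the same illustrative assumptions (linear model, squared loss), usable as an `update` strategy in the loop above:

```python
import numpy as np

def simple_truncation_update(w, x, y, t, eta=0.1, theta=0.01, K=10):
    """Simple coefficient rounding: an ordinary SGD step, and every K steps
    every coefficient whose magnitude is at most theta is set to 0."""
    w_new = w - eta * (w @ x - y) * x       # SGD step (squared loss assumed)
    if t % K == 0:                          # truncate only every K steps
        w_new = np.where(np.abs(w_new) <= theta, 0.0, w_new)   # T_0(v, theta)
    return w_new
```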
The main disadvantages of this method are, first, that it is difficult to choose a suitable value of $\theta$, and second, that simple truncation is a little too aggressive.

2. L1-regularized subgradient

The concept of the subgradient will be covered in another article. The L1-regularized subgradient update is also quite intuitive; its specific form is as follows:

$$w^{(t+1)} = w^{(t)} - \eta \nabla_w L(w^{(t)}, z_t) - \eta g \, \mathrm{sgn}\!\left(w^{(t)}\right)$$
where $g$ is the regularization parameter and $\mathrm{sgn}(\cdot)$ is the sign function, applied elementwise:

$$\mathrm{sgn}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases}$$
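A minimal sketch of the L1-regularized subgradient update under the same illustrative assumptions:

```python
import numpy as np

def l1_subgradient_update(w, x, y, t, eta=0.1, g=0.01):
    """L1-regularized subgradient step: the SGD step on the loss plus
    an eta * g * sgn(w) shrinkage term (squared loss assumed)."""
    grad = (w @ x - y) * x
    return w - eta * grad - eta * g * np.sign(w)
```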
The main disadvantage of the subgradient method is that it produces sparse solutions only in very rare cases, mainly because the probability that the terms being added and subtracted cancel exactly to 0 is very small.

3. Truncated gradient method (Truncated Gradient)

In the simple truncation method, the direct truncation is too aggressive. In the truncated gradient method, the truncation step is moderated appropriately, and its specific update formula is as follows:

$$w^{(t+1)} = T_1\!\left(w^{(t)} - \eta \nabla_w L(w^{(t)}, z_t),\ \eta g^{(t)},\ \theta\right)$$
where $g^{(t)}$ is called the gravity parameter. The truncation function $T_1$ is applied elementwise and has the following form:

$$T_1(v_j, \alpha, \theta) = \begin{cases} \max\!\left(0,\ v_j - \alpha\right) & v_j \in [0, \theta] \\ \min\!\left(0,\ v_j + \alpha\right) & v_j \in [-\theta, 0] \\ v_j & \text{otherwise} \end{cases}$$
Similar to simple truncation, the truncation is only applied every $K$ steps, with the gravity parameter taking the following form:

$$g^{(t)} = \begin{cases} K g & t \bmod K = 0 \\ 0 & \text{otherwise} \end{cases}$$

where $g \geq 0$. The degree of sparsity can be controlled by adjusting the parameters $g$ and $\theta$: the larger $g$ and $\theta$ are, the sparser the solution.
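A minimal sketch of the truncated gradient update under the same illustrative assumptions; like the previous strategies, it can be passed as the `update` argument of the `online_learning` loop sketched earlier.

```python
import numpy as np

def truncated_gradient_update(w, x, y, t, eta=0.1, g=0.01, theta=0.01, K=10):
    """Truncated Gradient step: an SGD step followed, every K steps, by the
    gentler T_1 truncation with gravity alpha = eta * K * g."""
    v = w - eta * (w @ x - y) * x                # SGD step (squared loss assumed)
    if t % K == 0:                               # g^(t) = K*g only every K steps
        alpha = eta * K * g
        inside = np.abs(v) <= theta              # only coordinates within [-theta, theta]
        shrunk = np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)  # pull toward 0, stop at 0
        v = np.where(inside, shrunk, v)
    return v
```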
References
[1] Sparse Online Learning via Truncated Gradient
[2] Online Optimization (Part 2): The Truncated Gradient Method (TG)