FTRL in Detail: An Online Learning Algorithm Widely Used by Major Companies
Online learning for CTR prediction commonly uses logistic regression, while traditional batch algorithms cannot handle very large datasets or online data streams efficiently. Over roughly three years (2010-2013), Google took the FTRL (Follow-The-Regularized-Leader) algorithm from theoretical research to a practical engineering implementation. It performs very well on convex optimization problems such as logistic regression with non-smooth regularization terms (e.g. the L1 norm, used to control model complexity and induce sparsity), and major Chinese internet companies reportedly adopted it in their products soon after the papers appeared; our system uses this algorithm as well. This post introduces the background of FTRL and some guidance for the engineering implementation; the theoretical details of convex optimization are not covered in depth. Interested readers can consult the corresponding papers, which are listed at the end of the post. Machine learning was not my major at school, but my fundamentals are not too bad, and many of these ideas are interconnected, so the core meaning can be understood with some digging. Of course, corrections to any inaccuracies are welcome.
This article is divided into three parts; if you are not interested in the theoretical background, you can skip directly to Part 3 on the engineering implementation (which Google's 2013 engineering paper describes in great detail):
- Related background: the general problem formulation, batch algorithms, traditional online learning algorithms, etc.
- A brief introduction to Truncated Gradient, FOBOS and RDA (Regularized Dual Averaging), which are closely related to FTRL.
- FTRL's theoretical formulation and engineering implementation (readers not interested in the background and theory can jump straight to this section).
First, related background
"Problem description"
The structural risk minimization problem of loss function plus regularization (logistic regression also takes this form) has two equivalent descriptions. Taking the L1 norm as an example, they are:
a) the unconstrained optimization form, i.e. the soft regularization formulation:
b) the convex optimization problem with a convex constraint, i.e. the convex constraint formulation:
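The original post showed these two forms as images; a minimal reconstruction in my own notation (z_i = (x_i, y_i) a training sample, l the per-sample loss, g and G the regularization weight and constraint radius) is:

```latex
% a) soft regularization (unconstrained) form
\hat{w} = \arg\min_{w} \; \sum_{i=1}^{n} \ell(w, z_i) \;+\; g\,\|w\|_1

% b) convex constraint form
\hat{w} = \arg\min_{w} \; \sum_{i=1}^{n} \ell(w, z_i)
\quad \text{s.t.} \quad \|w\|_1 \le G
```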
When G is chosen appropriately, the two are equivalent. The reason this post describes both forms is that the algorithms below work on either unconstrained or constrained optimization problems, and each form of the description has its own family of related algorithms.
"Batch algorithms"
Each iteration of a batch algorithm processes the entire training dataset (for example, computing the global gradient). The advantages are accuracy and convergence; the disadvantages are that large datasets cannot be handled efficiently (the global gradient computation becomes too expensive) and that batch algorithms cannot be applied to data streams for online learning. Below is a brief introduction to some traditional batch algorithms, again split into the unconstrained and constrained forms (corresponding to the problem descriptions above).
a) Unconstrained optimization form: (1) global gradient descent, a very common algorithm that needs no elaboration: each step computes the global gradient of the objective and iterates with a non-increasing learning rate; (2) Newton's method (a second-order, tangent approximation), L-BFGS (a secant quasi-Newton method that approximates the inverse of the Hessian matrix from previous iterations; BFGS appears to be the initials of its inventors' names), and so on. Newton and quasi-Newton methods generally work well with smooth regularization terms (e.g. the L2 norm); they are said to be among the best methods for solving L2-regularized logistic regression and are widely used. However, when the objective contains an L1 term, which is non-smooth and non-differentiable at some points, Newton-type methods struggle and need theoretical modifications. Interested readers can look up numerical optimization texts on unconstrained optimization; I have not studied these details further, so I will not dwell on them here.
b) Inequality-constrained convex optimization form: (1) traditional inequality-constrained optimization algorithms such as the interior point method; (2) projected gradient descent (for constrained optimization), where g_t is a subgradient. The intuitive meaning is that after each iteration the result may lie outside the constraint set, so it is projected back onto the constrained convex set to obtain the new iterate (the operator in the second formula denotes projection onto the constraint set):
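The projected-gradient formulas were also images in the original; a reconstruction (C the constraint set, \Pi_C the Euclidean projection, \eta_t the step size, g_t a subgradient) is roughly:

```latex
w_{t+1} = \Pi_C\big( w_t - \eta_t\, g_t \big),
\qquad
\Pi_C(v) = \arg\min_{x \in C} \|x - v\|_2
```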
"Online Algorithm"
As mentioned above, batch algorithms have their limitations. Online learning algorithms are characterized by updating the model once per training sample, using the loss and gradient produced by that single sample, training on one data point at a time, so they can handle very large training sets and online training. Commonly used methods are online gradient descent (OGD) and stochastic gradient descent (SGD); in essence these perform gradient descent on the single-sample loss l(w, z_i) from the problem description above. Because the direction of each step is not globally optimal, the overall trajectory looks like a seemingly random descent path. A typical iteration formula is as follows:
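The iteration formula was shown as an image; a reconstruction matching the description below (my notation: mixed regularizer \lambda_1\|w\|_1 + \tfrac{\lambda_2}{2}\|w\|_2^2, whose smooth part contributes \lambda_2 w_t to the gradient, and C an L1-norm ball handled by projection) would be:

```latex
w_{t+1} = \Pi_C\Big( w_t - \eta_t \big( g_t + \lambda_2\, w_t \big) \Big),
\qquad
C = \{\, w : \|w\|_1 \le R \,\}
```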
Here the regularization is a mixture: for example, it might mix an L1 norm with a strongly convex L2 norm (you will see that many of the methods below use such mixed regularizers, and they have an intuitive meaning). In the iteration formula, g_t is the subgradient of the loss function (the loss at a single point, not the sum); the term added to g_t is the gradient of the second, smooth part of the mixed regularizer; and the projection set C is the constraint space (for example, an L1-norm ball), similar to the projected gradient approach described above.
The advantage of online gradient descent is that its accuracy is genuinely good, but the related papers mainly point out two shortcomings:
1. Plain online gradient descent rarely produces truly sparse solutions, and sparsity matters a great deal in machine learning, especially in engineering applications: sparse features greatly reduce the memory and complexity of prediction. This is actually easy to understand. Put plainly, even with an L1 norm added (a simple example of how the L1 norm induces sparse solutions can be found in Chapter 2 of PRML, and I also touch on it in the slides linked from another post), floating-point arithmetic means the trained weight vector w rarely contains exact zeros. You might say: that's easy, when a computed coordinate of w is very small, just force it to zero and we get sparsity. Right, and in fact many people do exactly that; the truncated gradient and FOBOS methods below apply similar ideas.
2. Iterates at non-differentiable points cause some problems. What problems exactly? One paper puts it as: "the iterates of the subgradient method are very rarely at the points of non-differentiability." I stared at this for a long while without fully understanding it; anyone familiar with the issue is welcome to explain.
Second, Truncated Gradient, FOBOS and RDA (Regularized Dual Averaging)
As mentioned above, sparsity is very important in machine learning. Here are three common ways to obtain sparse solutions:
1) Simply add an L1 norm.
– Limitation: as mentioned above, the difference of two floating-point numbers is rarely exactly zero, so this cannot produce truly sparse feature weights.
2) Truncate on top of the L1 norm. The most intuitive, no-frills idea: set a threshold and truncate to guarantee sparsity; this can be combined with the L1 norm. Simple truncation: train online on K data points at a time and truncate the OGD iterate, i.e. every K steps set small weights to zero (a sketch follows below). But simple truncation has a problem: a small weight may belong to a useless feature, or it may simply be a feature that has only just been updated (for example at the beginning of training, or when very few training samples contain that feature). Rounding such weights to zero is too aggressive and breaks the theoretical guarantees of the online training algorithm. A less aggressive version of simple truncation is the truncated gradient (work from 2009); in fact the FOBOS method below can also be placed in this category:
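As an illustration of the simple-truncation idea only (not the gentler truncated-gradient rule from the 2009 paper), here is a minimal sketch: OGD on the logistic loss that zeroes out small weights every K steps. The threshold theta, window K and learning-rate schedule are assumptions for the example.

```python
import numpy as np

def ogd_with_simple_truncation(samples, dim, K=10, theta=1e-4, eta0=0.1):
    """Online gradient descent on the logistic loss with naive truncation:
    every K updates, weights with |w_j| <= theta are forced to zero."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(samples, start=1):   # x: np.ndarray, y in {0, 1}
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))     # predicted CTR
        g = (p - y) * x                              # gradient of the single-sample log loss
        eta = eta0 / np.sqrt(t)                      # non-increasing learning rate
        w -= eta * g
        if t % K == 0:
            w[np.abs(w) <= theta] = 0.0              # the "too aggressive" truncation step
    return w
```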
3) Black-box wrapper approaches:
– The black-box method removes some features and then retrains to see whether the removed features were useful. – It requires running the algorithm on the dataset many times, so it is not practical.
The next two methods are FOBOS (Forward-Backward Splitting; strictly it should be called FOBAS, the name is historical) and RDA, because FTRL below is essentially equivalent to combining the advantages of these two algorithms:
a) FOBOS, work by Google and Berkeley from 2009:
– It can be viewed as a special form of truncated gradient. The basic idea is similar to the projected subgradient method, but each per-sample iteration is decomposed into two parts: a gradient descent step on the empirical loss, followed by an optimization problem. That second optimization problem has two terms: the first, an L2-norm term, keeps the result from moving too far from the first step's loss-gradient iterate; the second is a regularization term that limits model complexity to suppress overfitting and to induce sparsity. This optimization problem has special properties that guarantee the sparsity and theoretical soundness of the final result; readers interested in the specifics can see the corresponding paper. I focus more on the intuitive meaning and the engineering implementation and skip the theoretical content.
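The two-step FOBOS iteration, reconstructed in my notation (\Psi the regularizer, e.g. \lambda\|w\|_1; \eta_t, \eta_{t+\frac{1}{2}} the step sizes), looks roughly like:

```latex
w_{t+\frac{1}{2}} = w_t - \eta_t\, g_t,
\qquad
w_{t+1} = \arg\min_{w}\Big\{ \tfrac{1}{2}\,\|w - w_{t+\frac{1}{2}}\|_2^2
          + \eta_{t+\frac{1}{2}}\,\Psi(w) \Big\}
```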
b) RDA (Regularized Dual Averaging), Microsoft's work from 2010. It is more theoretical, so I will only briefly describe its characteristics here:
– It is not a gradient-descent method; it belongs to a more general primal-dual algorithmic scheme. – It overcomes a weakness of SGD-type methods by exploiting the structure of the problem, especially for problems with explicit regularization. – It achieves a better trade-off between accuracy and sparsity.
OK, with the background and groundwork finally out of the way, let us get to the key part: FTRL...
Third, FTRL (Follow-The-Regularized-Leader)
"Development History"
For the theoretical development and engineering application of FTRL we should first of all thank H. Brendan McMahan of Google, who kept at this problem for three years until the 2013 engineering paper came out. The development process and basic description are as follows:
– The 2010 theoretical paper did not explicitly support regularization in the iteration; a 2011 paper proved a regret bound and introduced generic regularization terms; another 2011 paper revealed the relationships between OGD, FOBOS, RDA and FTRL; the 2013 paper gave the engineering implementation, with detailed pseudocode, after which large-scale adoption began.
– FTRL can be seen as a hybrid of RDA and FOBOS, but under an L1 norm or other non-smooth regularizers, FTRL is more effective than either.
"Basic thought and iterative formula"I drew a simple diagram: A comparison of iterative formulas with other on-line algorithms (in fact, ogd how to step to a similar form of the iterative formula of the process, limited to the time, here does not elaborate, and finally I will attach an article to do a share of the PPT, there is, interested can download to see), Different methods in this uniform descriptive form, the difference points only in the second and third items of treatment:-First: gradient or cumulative gradient; – Second: L1 regularization; The third item: This cumulative addition limits the new iteration result x not too far from the iterated solution (also known as the meaning of proximal in ftrl-proximal), or too far from 0 (central), which is actually a requirement of low regret.
"Engineering Implementation"Everyone on the above that a big lump of the cause and effect and the formula are not interested, OK, it's okay, Google very intimate in 13 years gave a very strong engineering paper, in fact, most companies use Ftrl, do not care about the above that a large section of things, directly according to the pseudo-code to write, tune, It's good to see the results. Our company began to do so, haha, but people always want a little curiosity is not, dig into the cause and the basic theoretical formula feeling is quite different. Logic regression under the pseudo-code of the Per-coordinate Ftrl_proximal as follows, in the expression of the formula on the basis of a number of transformations and implementation of the trick, the details of paper, we do in their own implementation, you can in the actual data set on the parallel acceleration: The four parameters are set together with the guidance in the paper and repeated experiments to find a set of parameters appropriate to their own problems. Here I would like to mention that the above-mentioned
per-coordinate, which means
Ftrl is a separate training update for each dimension of W, with different learning rates for each dimension, which is the one before lamda2 in the above code. Compared to all feature dimensions in W using a unified learning rate,
This method considers the heterogeneity of the training sample's distribution on different characteristics., if the training sample containing a dimension feature of W is rare and each sample is precious, the training rate corresponding to the feature dimension can be maintained on its own, and each sample containing that feature can take a significant step forward in the sample's gradient without forcing the same pace as the other feature dimensions.
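Since the paper's pseudocode was shown as an image in the original post, here is a minimal Python sketch of per-coordinate FTRL-Proximal for logistic regression following Algorithm 1 of the KDD'13 paper; the sparse-dict sample format and the default parameter values are my assumptions.

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Minimal per-coordinate FTRL-Proximal for logistic regression,
    following the pseudocode in the KDD'13 engineering paper."""

    def __init__(self, alpha=0.1, beta=1.0, lambda1=1.0, lambda2=1.0):
        self.alpha, self.beta = alpha, beta            # per-coordinate learning-rate params
        self.lambda1, self.lambda2 = lambda1, lambda2  # L1 / L2 regularization
        self.z = defaultdict(float)                    # accumulated "z" statistic per coordinate
        self.n = defaultdict(float)                    # accumulated squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.lambda1:                     # L1 threshold -> exact zero (sparsity)
            return 0.0
        return -(z - math.copysign(self.lambda1, z)) / \
               ((self.beta + math.sqrt(self.n[i])) / self.alpha + self.lambda2)

    def predict(self, x):
        """x is a sparse sample: dict {feature_index: value}."""
        wx = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(wx, 35.0), -35.0)))

    def update(self, x, y):
        """One online step; y is the label in {0, 1}."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v                            # log-loss gradient for this coordinate
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g
```

Here alpha and beta control the per-coordinate learning rate eta_{t,i} = alpha / (beta + sqrt(sum of squared gradients on coordinate i)), lambda1 is what produces exact zeros in w, and lambda2 adds L2 smoothing; these are the four parameters to tune as the paper suggests.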
"Memory saving strategy in engineering implementation"Here is an introduction to some of the memory-saving implementation details that Google has to mention
– The L1 norm itself already helps: the trained w is very sparse, which saves memory when w is used for prediction. This is intuitive and needs no elaboration.
1. Probabilistic feature inclusion. Features that appear very rarely in the training data are discarded, but in an online setting this is tricky: pre-processing the full dataset to see which features occur too rarely to be useful is very expensive. So if you want to do this kind of pruning during training, you need online methods (recall that FTRL updates each dimension of w separately, with a different step size per coordinate, i.e. per-coordinate):
1) Poisson inclusion: when a feature not yet in the model appears in a training sample, it is accepted (and from then on updated) with probability p;
2) Bloom filter inclusion: use a Bloom filter so that a feature only starts being updated after it has been seen roughly K times (the filter is probabilistic, so occasionally a feature is admitted earlier). A sketch of the Poisson variant follows below.
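A minimal sketch of the Poisson-inclusion idea as a gate in front of the update (the gate class, the probability p, and the interplay with the FTRLProximal sketch above are illustrative assumptions, not the paper's code):

```python
import random

class PoissonInclusionGate:
    """Admit a never-before-seen feature into the model with probability p;
    once admitted, it is always updated."""

    def __init__(self, p=0.1, seed=0):
        self.p = p
        self.admitted = set()
        self.rng = random.Random(seed)

    def filter(self, x):
        """x: sparse sample {feature_index: value}; keep only admitted coordinates."""
        kept = {}
        for i, v in x.items():
            if i in self.admitted or self.rng.random() < self.p:
                self.admitted.add(i)
                kept[i] = v
        return kept

# usage: model.update(gate.filter(x), y) instead of model.update(x, y)
```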
2. Re-encode floating-point numbers:
1) Feature weights do not need to be stored as 32-bit or 64-bit floating-point numbers; that wastes space. 2) A 16-bit fixed-point encoding can be used instead, but attention must be paid to the effect of rounding on regret, which is why the paper uses randomized rounding (a sketch follows below).
3. Train several similar models:
1) For the same series of training data, train several similar models at once. 2) Each model has some features of its own and some shared features. 3) The motivation: some feature dimensions can be unique to each model, while the shared features only need to be trained once from the same data.
4. Single value structure (some companies are said to do this in practice; with large data volumes it can still achieve a good AUC):
1) Multiple models share one common feature-weight store (for example in CBase or Redis), and every model updates this shared structure. 2) For a given model and a given dimension of the feature vector it is training, compute the iteration result directly and average it with the old stored value.
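A minimal sketch of randomized rounding onto a coarse fixed-point grid in the spirit of the paper's q2.13 encoding (the exact bit layout and helper names are my assumptions):

```python
import math
import random

def encode_q2_13(w, rng=random):
    """Randomized rounding of a weight onto a 13-bit-fractional fixed-point grid.
    Rounding up with probability equal to the fractional remainder keeps the
    encoding unbiased in expectation, which is what protects the regret."""
    scaled = w * (1 << 13)
    floor = math.floor(scaled)
    frac = scaled - floor
    return int(floor + (1 if rng.random() < frac else 0))

def decode_q2_13(q):
    """Map the stored small integer back to a float weight."""
    return q / (1 << 13)
```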
5. Use the counts of negative (N) and positive (P) samples to approximate the gradient statistics (the squared-gradient sums used for the per-coordinate learning rates), so that all the models share the same N and P instead of each storing its own accumulated gradients.
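As I recall from the paper, the count-based approximation (N negatives, P positives, each prediction taken to be roughly p = P/(N+P)) is:

```latex
\sum_t g_{t,i}^2
= \sum_{\text{positive}} (1 - p_t)^2 + \sum_{\text{negative}} p_t^2
\;\approx\; P\Big(1 - \tfrac{P}{N+P}\Big)^2 + N\Big(\tfrac{P}{N+P}\Big)^2
= \frac{P\,N}{N+P}
```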
6. Subsample the training data:
1) In practice the CTR is far below 50%, so positive samples are more valuable; subsampling the training data can significantly reduce the size of the training set. 2) Positive samples are all kept (query data where at least one ad was clicked), while negative samples (query data where no ad was clicked) are sampled at a rate r. But training directly on this subsample produces a substantially biased prediction. 3) Solution: during training, multiply each sample by a weight. The weight multiplies the loss directly, and therefore the gradient is multiplied by the same weight.
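A minimal sketch of the subsampling-plus-weight idea on a plain logistic-regression SGD loop (SGD is used here just to keep the example self-contained; the weight would multiply the gradient the same way inside the FTRL update):

```python
import math
import random

def train_with_subsampling(stream, dim, r=0.1, eta=0.05, rng=random):
    """Logistic-regression SGD with negative subsampling at rate r and an
    importance weight of 1/r on the kept negatives (a sketch, not the paper's code)."""
    w = [0.0] * dim
    for x, y in stream:                          # x: dict {index: value}, y in {0, 1}
        if y == 1:
            weight = 1.0                         # positives are all kept
        elif rng.random() < r:
            weight = 1.0 / r                     # kept negative: compensate the bias
        else:
            continue                             # dropped negative
        wx = sum(w[i] * v for i, v in x.items())
        p = 1.0 / (1.0 + math.exp(-wx))
        for i, v in x.items():
            w[i] -= eta * weight * (p - y) * v   # the weight multiplies the gradient
    return w
```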
Subsampling to reduce the number of negative samples, and then compensating with weights during training: a very nice idea.
"References"
I have roughly noted the main contribution of each paper; take a selective look if interested. If you only care about the engineering implementation, the entries marked in red in the original post are enough:
[1] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. JMLR, 10, 2009. (the truncated gradient paper)
[2] H. B. McMahan. Follow-the-regularized-leader and mirror descent: equivalence theorems and L1 regularization. In AISTATS. (comparison of FOBOS, RDA, FTRL and related methods)
[3] L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. In NIPS. (the RDA method)
[4] J. Duchi and Y. Singer. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems, pages 495-503. (the FOBOS method)
[5] H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, Jeremy Kubica. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). (the engineering paper)
[6] H. Brendan McMahan. A unified analysis of regularized dual averaging and composite mirror descent with implicit updates. Submitted. (FTRL theory development: regret bound and generic regularization terms)
[7] H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010. (the initial theoretical paper)
Attached below is the slide deck I shared with my group, for those interested: http://pan.baidu.com/s/1eQvfo6e
Original: http://www.cnblogs.com/EE-NovRain/p/3810737.html