The most common optimization algorithms for machine learning

Source: Internet
Author: User

1. Gradient Descent

Gradient descent is the simplest and most commonly used optimization method. It is easy to implement, and when the objective function is convex, the solution found by gradient descent is the global optimum. In general, however, the solution is not guaranteed to be the global optimum, and gradient descent is not necessarily the fastest method. The idea behind gradient descent is to use the negative gradient direction at the current position as the search direction; because this is the direction of fastest descent at the current point, the method is also called the "steepest descent method". The closer steepest descent gets to the target value, the smaller the steps become and the slower the progress.
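To make the idea of stepping along the negative gradient concrete, here is a minimal sketch (not part of the original article) of plain gradient descent on a simple two-variable convex function; the objective, step size, and stopping rule are all illustrative choices.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=1000):
    """Repeatedly step along the negative gradient until the update becomes tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - step * grad(x)       # move in the steepest-descent direction
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: minimize f(x, y) = (x - 3)^2 + 2*(y + 1)^2, whose gradient is (2(x-3), 4(y+1)).
grad_f = lambda v: np.array([2 * (v[0] - 3), 4 * (v[1] + 1)])
print(gradient_descent(grad_f, [0.0, 0.0]))   # converges to roughly (3, -1)
```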

In machine learning, two variants of the basic gradient descent method have been developed: the stochastic gradient descent method and the batch gradient descent method.

For example, consider a linear regression model. Suppose h(x) below is the function to be fitted and J(θ) is the loss function; θ is the parameter to be solved for iteratively, and once θ is determined we obtain the final fitted function h_θ(x). Here m is the number of samples in the training set and n is the number of features. In their standard form these are:

  h_θ(x) = θ_0 x_0 + θ_1 x_1 + ... + θ_n x_n = Σ_{j=0..n} θ_j x_j     (with x_0 = 1)

  J(θ) = 1/(2m) Σ_{i=1..m} (y^(i) − h_θ(x^(i)))²

Batch Gradient Descent (BGD)

(1) Take the partial derivative of J(θ) with respect to each parameter θ_j to obtain the gradient component for that parameter:

  ∂J(θ)/∂θ_j = −(1/m) Σ_{i=1..m} (y^(i) − h_θ(x^(i))) x_j^(i)

(2) To minimize the loss, move each θ_j along its negative gradient direction:

  θ_j := θ_j + (1/m) Σ_{i=1..m} (y^(i) − h_θ(x^(i))) x_j^(i)

Alternatively, the factor 1/m can be replaced with a step size α.

(3) From the above formula it can be seen that this yields a global optimal solution, but every iteration step uses all of the data in the training set. If m is large, the iteration speed of this method will clearly be quite slow. This motivates another approach.
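As a rough illustration of the batch update above, the following sketch implements batch gradient descent for linear regression with NumPy; the synthetic data, step size α, and iteration count are assumptions made for this example, not values from the original.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for linear regression: every update uses all m samples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        error = y - X @ theta                 # residuals for all m samples
        theta += alpha * (X.T @ error) / m    # theta_j += alpha * (1/m) * sum_i (y_i - h(x_i)) * x_ij
    return theta

# Synthetic data: y = 1 + 2x plus noise; a column of ones serves as the intercept feature.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=200)
print(batch_gradient_descent(X, y))           # close to [1.0, 2.0]
```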

Stochastic Gradient Descent (SGD), also called Incremental Gradient Descent

(1) The loss above can be rewritten as an average of per-sample losses:

  J(θ) = (1/m) Σ_{i=1..m} (1/2) (y^(i) − h_θ(x^(i)))²

(2) Instead of summing over all m samples, each update uses the gradient of a single sample's loss, looping through the samples one at a time:

  θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)     (for every j, for each sample i in turn)

(3) Stochastic gradient descent performs one update per sample as it iterates through them. If the sample size is very large (say, hundreds of thousands), then θ may already have been iterated close to the optimal solution after using only tens of thousands, or even thousands, of samples. Compare this with batch gradient descent above, where a single iteration needs all of the hundreds of thousands of training samples, one iteration is unlikely to reach the optimum, and 10 iterations require traversing the training set 10 times. However, a problem with SGD is that it is noisier than BGD, so not every iteration moves in the direction of the overall optimum.

  

Stochastic gradient descent uses only one sample per iteration, so the computational cost of one iteration is on the order of n², and when the number of samples is large, stochastic gradient descent iterates much faster than batch gradient descent. The relationship between the two can be understood as follows: stochastic gradient descent trades a small loss in accuracy and an increase in the number of iterations for an overall gain in optimization efficiency. The increase in the number of iterations is far smaller than the number of samples.
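For comparison, here is a minimal stochastic gradient descent version of the same kind of fit, updating the parameters one sample at a time; the learning rate, number of epochs, and random shuffling are illustrative choices.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=20, seed=0):
    """SGD for linear regression: each update uses the gradient of a single sample."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):          # visit the samples in a random order
            error = y[i] - X[i] @ theta       # residual of one sample
            theta += alpha * error * X[i]     # theta_j += alpha * (y_i - h(x_i)) * x_ij
    return theta

# Synthetic data of the same form as in the batch example above.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=200)
print(stochastic_gradient_descent(X, y))      # noisy, but near [1.0, 2.0]
```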

Summary of the batch gradient descent method and the stochastic gradient descent method:

  Batch gradient descent --- minimizes the loss function over all training samples, so the final solution is the global optimum; that is, the solved parameters minimize the risk function. However, it is inefficient for large-scale sample problems.

  Stochastic gradient descent --- minimizes the loss function of each individual sample. Although not every iteration moves the loss toward the global optimum, the overall direction is toward the global optimal solution, and the final result is usually near the global optimum. It is suitable for large-scale training samples.

2. Newton's Method and Quasi-Newton Methods

In essence, Newton's method has second-order convergence, while gradient descent has first-order convergence, so Newton's method converges faster. Put more plainly: suppose you want to find the shortest path to the bottom of a basin. Gradient descent only chooses, at each step, the steepest downhill direction from your current position; Newton's method, when choosing a direction, considers not only whether the slope is steep enough, but also whether the slope will become steeper after you take the step. So it can be said that Newton's method sees a little farther than gradient descent and can reach the bottom faster.

Generalizing to the vector case, the Newton update becomes

  θ := θ − H^(-1) ∇J(θ)

where ∇J(θ) denotes the vector of partial derivatives of J(θ) with respect to θ, and H is an n×n matrix called the Hessian matrix. The entries of the Hessian matrix are

  H_ij = ∂²J(θ) / (∂θ_i ∂θ_j)

The advantages and disadvantages of Newton's method are summarized as follows:

Advantages: Second order convergence, fast convergence speed;

Disadvantages: Newton's method is an iterative algorithm, and every step requires solving for the inverse of the objective function's Hessian matrix, so the computation is relatively complex.
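Below is a minimal sketch of the Newton update θ := θ − H^(-1) ∇J(θ) on a simple two-variable objective; the objective, its hand-written gradient and Hessian, and the stopping rule are assumptions made for the example. In practice one solves the linear system H d = ∇J(θ) rather than forming the inverse explicitly.

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: each step solves H d = grad(x) and moves to x - d."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.solve(hess(x), grad(x))   # Newton direction H^(-1) * gradient
        x = x - d
        if np.linalg.norm(d) < tol:
            break
    return x

# Example objective: J(a, b) = (a - 1)^2 + exp(b) - b, minimized at (1, 0).
grad_J = lambda v: np.array([2 * (v[0] - 1), np.exp(v[1]) - 1])
hess_J = lambda v: np.array([[2.0, 0.0],
                             [0.0, np.exp(v[1])]])
print(newton_method(grad_J, hess_J, [3.0, 3.0]))   # approaches (1, 0)
```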

Quasi-Newton Methods

The essence of the quasi-Newton method is to remove Newton's method's need to solve for the inverse of a complex Hessian matrix at every iteration: it uses a positive-definite matrix to approximate the inverse of the Hessian, thereby simplifying the computation. Like the steepest descent method, the quasi-Newton method only requires the gradient of the objective function at each iteration. By measuring the change in the gradient, it constructs a model of the objective function that is good enough to produce superlinear convergence. This class of methods is far better than the steepest descent method, especially on difficult problems. In addition, because quasi-Newton methods do not need second-derivative information, they are sometimes more efficient than Newton's method. Today, optimization software contains a large number of quasi-Newton algorithms for solving unconstrained, constrained, and large-scale optimization problems.

  The commonly used quasi-Newton methods are the DFP algorithm and the BFGS algorithm (http://blog.csdn.net/qq_27231343/article/details/51791138).
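As a quick illustration, SciPy's built-in BFGS implementation can be used directly; the Rosenbrock-style test function and starting point below are assumptions for the example, not from the original.

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: hard for steepest descent, handled well by quasi-Newton methods.
def f(v):
    a, b = v
    return (1 - a) ** 2 + 100 * (b - a ** 2) ** 2

def grad_f(v):
    a, b = v
    return np.array([-2 * (1 - a) - 400 * a * (b - a ** 2),
                     200 * (b - a ** 2)])

# BFGS builds an approximation to the inverse Hessian from successive gradient differences,
# so only first-order information has to be supplied.
result = minimize(f, x0=np.array([-1.2, 1.0]), method="BFGS", jac=grad_f)
print(result.x)    # close to the minimizer (1, 1)
```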

3. Conjugate Gradient Method

The conjugate gradient method lies between the steepest descent method and Newton's method. It needs only first-derivative information, but it overcomes the slow convergence of steepest descent while avoiding Newton's method's need to store and compute the Hessian matrix and its inverse. The conjugate gradient method is not only one of the most useful methods for solving large systems of linear equations, it is also one of the most effective algorithms for large-scale nonlinear optimization. Among the various optimization algorithms, the conjugate gradient method is very important. Its advantages are that it requires little storage, converges in a finite number of steps, has high stability, and does not require any external parameters.

(Example code in MATLAB: https://en.wikipedia.org/wiki/Conjugate_gradient_method#Example_code_in_MATLAB)
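Below is a Python/NumPy rendering of the kind of example the Wikipedia page gives in MATLAB: the standard conjugate gradient iteration for solving A x = b with a symmetric positive-definite A. The small test system is an illustrative assumption.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A with the conjugate gradient method."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    max_iter = n if max_iter is None else max_iter     # exact arithmetic finishes in at most n steps
    r = b - A @ x                                      # residual
    p = r.copy()                                       # first search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)                      # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p                  # new direction, conjugate to the previous ones
        rs_old = rs_new
    return x

# Small symmetric positive-definite test system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))                        # matches np.linalg.solve(A, b)
```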

4. Heuristic Optimization Methods

Heuristic methods are the methods people devise, guided by empirical rules, when solving problems. Their characteristic is that they draw on past experience, selecting approaches that have already proved effective, rather than systematically following determined steps to derive the answer. There are many kinds of heuristic optimization methods, including the classical simulated annealing method, genetic algorithms, ant colony algorithms, particle swarm optimization, and so on.
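As one concrete instance of this family, here is a minimal simulated annealing sketch for a one-dimensional objective; the objective function, cooling schedule, and proposal distribution are all illustrative assumptions.

```python
import math
import random

def simulated_annealing(f, x0, temp=1.0, cooling=0.995, steps=5000, seed=0):
    """Simulated annealing: accept worse moves with a probability that shrinks as the temperature drops."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    for _ in range(steps):
        candidate = x + rng.gauss(0, 0.5)              # random neighbouring point
        fc = f(candidate)
        # Always accept improvements; accept worse points with probability exp(-delta / T).
        if fc < fx or rng.random() < math.exp(-(fc - fx) / temp):
            x, fx = candidate, fc
            if fx < best_fx:
                best_x, best_fx = x, fx
        temp *= cooling                                # cool down gradually
    return best_x, best_fx

# A bumpy objective with many local minima; the global minimum is at x = 0.
objective = lambda x: 0.1 * x * x + math.sin(3 * x) ** 2
print(simulated_annealing(objective, x0=8.0))          # typically ends near x = 0
```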

There is also a special class of optimization algorithms called multi-objective optimization algorithms, which target problems where multiple objectives (two or more) must be optimized simultaneously. The more classical algorithms of this kind include the NSGA-II algorithm, the MOEA/D algorithm, and artificial immune algorithms.
