Preface:
FISTA (A Fast Iterative Shrinkage-Thresholding Algorithm) is a fast variant of ISTA (the Iterative Shrinkage-Thresholding Algorithm). Both FISTA and ISTA are based on the idea of gradient descent, but make a smarter choice of iterate at each step so as to converge faster. Theory shows that the convergence rates of FISTA and ISTA are O(1/k²) and O(1/k), respectively.
This post begins with the traditional way of solving an optimization problem, gradient descent, then introduces ISTA, builds up from there to FISTA, and finally turns to applications (mainly image deblurring and feature matching). The main references are:
[1] A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.
[2] Proximal gradient descent for L1 regularization.
[3] Linear regression and gradient.
1. Gradient Descent Method
Consider the following linear inverse problem:

b = Ax + w    (1)

For example, in the image deblurring problem, A is the blur operator (the matrix form of the blur kernel), b is the blurred image obtained from the sharp image through this transform, and w is noise. A and b are known, and x is the unknown to be solved for.
The traditional way to solve this problem is least squares. The idea is simple and direct: minimize the reconstruction error ||Ax − b||². That is, take the derivative of f(x) = ||Ax − b||², which gives f'(x) = 2Aᵀ(Ax − b). Since f(x) is convex, its minimum can be found by setting this derivative to zero.
1) If A is a non-singular matrix, i.e. A is invertible, the problem has the exact solution x = A⁻¹b.
2) If A is a singular matrix, i.e. A is not invertible, the problem has no exact solution. The next best thing is an approximate solution satisfying ||Ax − b||² ≤ ε.
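As a small illustration (a minimal NumPy sketch; the matrix sizes, the noise level, and the random seed are made up for the example), the least-squares solution can be computed directly:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 20))                    # forward operator (full column rank here)
    x_true = rng.standard_normal(20)
    b = A @ x_true + 0.01 * rng.standard_normal(100)      # b = Ax + w

    # Setting the derivative 2 A^T (Ax - b) to zero gives the normal equations A^T A x = A^T b.
    x_ls = np.linalg.solve(A.T @ A, A.T @ b)

    # If A (or A^T A) is singular there is no exact solution; the pseudo-inverse
    # returns a least-squares solution instead.
    x_pinv = np.linalg.pinv(A) @ b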
To pick out a particular solution, a penalty term is added:

min ||x||₁   s.t.   ||Ax − b||² ≤ ε    (3)

Here ||x||₁ is the penalty term used to regularize the parameter x. This example uses the L1 norm as the penalty and expects x to be as sparse as possible (as many zero elements as possible), i.e. x is a sparse representation of b. The constraint ||Ax − b||² ≤ ε keeps the reconstruction error small. Problem (3) can also be written in the penalized form:

min ||Ax − b||² + λ||x||₁    (4)
Equation (4) is the general form of a sparse-representation optimization problem: we want the reconstruction error to be as small as possible while using as few nonzero parameters as possible.
Note: the penalty term can also be the L2 norm or another norm.
1.1 Defects of the gradient descent method
Now consider the more general situation and turn to the gradient descent method. The unconstrained optimization problem is:

min f(x)    (5)
The gradient descent method is based on the observation that if the real-valued function f(x) is differentiable at a point a, then f(x) decreases fastest along the direction opposite to the gradient at a, i.e. along −∇f(a).
Based on this, assume f(x) is continuously differentiable. If there is a small enough step t > 0 such that x₂ = x₁ − t·∇f(x₁), then:
f(x₁) ≥ f(x₂)
The core of the gradient descent method is to build the sequence {x_k} by

x_k = x_{k-1} − t_k·∇f(x_{k-1})    (6)

so that f(x_{k-1}) ≥ f(x_k).
The process of gradient descent is illustrated in the figure: different initial points can lead to different minimum values. Because gradient descent finds a local minimum, the initial value has a large influence. If f(x) is convex, however, any local minimum is also the global minimum; in that case the initial point only affects the speed of the iteration.
Looking back at equation (6): the step size t_k and the gradient ∇f(x_{k-1}) together control how much x changes at each iteration. Looking at the colored contour figure above, at every iteration we want the value of f(x) to drop as fast as possible, so that we reach the minimum of the function sooner. The choice of the step size t_k is therefore important.
If the step size t_k is too small, it takes a very large number of iterations to reach the minimum, i.e. the iteration is very slow; if the step size is too large, the iterates overshoot the minimum and keep hovering around it, jumping back and forth, as shown:
In the end, t_k acts on x_{k-1} to produce x_k. So the more natural way to put it is: we want the sequence {x_k} to be as short as possible, i.e. each iteration should take as large a step as possible and reduce the function value as much as possible. The question then becomes how to choose each point x_k so that the function value approaches its minimum faster.
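To make the role of the step size concrete, here is a minimal gradient-descent sketch for f(x) = ||Ax − b||² (the fixed step size, the default number of iterations, and the zero initial point are simplifying assumptions of this sketch):

    import numpy as np

    def gradient_descent(A, b, t=None, iters=500):
        """Minimize f(x) = ||Ax - b||^2 with x_k = x_{k-1} - t * grad f(x_{k-1})  (eq. 6)."""
        if t is None:
            # A safe fixed step: t = 1 / L(f), where L(f) = 2 * lambda_max(A^T A).
            t = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            grad = 2.0 * A.T @ (A @ x - b)    # f'(x) = 2 A^T (Ax - b)
            x = x - t * grad                  # too small t: slow; too large t: overshoots the minimum
        return x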
The idea behind ISTA and FISTA for solving the minimization problem is based on gradient descent; their improvement lies in how the sequence {x_k} is chosen. Here we only present the ideas, not the proofs; for the proofs, see reference [1].
2. ISTA Algorithm
ISTA stands for Iterative Shrinkage-Thresholding Algorithm.
Start from the unconstrained optimization problem, i.e. equation (5) above: min f(x).
This time we also assume that the gradient of f(x) is Lipschitz continuous, i.e. there is a constant that bounds how fast ∇f can change; the smallest such constant is called the Lipschitz constant L(f):

||∇f(x) − ∇f(y)|| ≤ L(f)·||x − y||   for all x, y.

Then, for any L ≥ L(f), we have the quadratic upper bound:

f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)·||x − y||²
Based on this, the function value can be approximated near the point x_{k-1} by the quadratic model above. In each iteration of gradient descent, the minimizer of the approximate function built at x_{k-1} is taken as the next iterate x_k, which is the so-called proximal regularization view of gradient descent (with t_k = 1/L):

x_k = argmin_x { f(x_{k-1}) + ⟨∇f(x_{k-1}), x − x_{k-1}⟩ + (L/2)·||x − x_{k-1}||² } = x_{k-1} − (1/L)·∇f(x_{k-1})
The method above is only suitable for unconstrained smooth problems. To solve optimization problems with a penalty, ISTA introduces a regularization function g(x) that constrains the parameter x, giving the composite problem:

min F(x) := f(x) + g(x)
Using a more general quadratic approximation model to solve this problem: at a point y, the quadratic approximation of F(x) := f(x) + g(x) is:

Q_L(x, y) := f(y) + ⟨∇f(y), x − y⟩ + (L/2)·||x − y||² + g(x)
Its minimizer is written p_L(y), where p_L is shorthand for the proximal operator:

p_L(y) := argmin_x Q_L(x, y)
Ignoring the terms that are constant in x, namely f(y) and the parts involving only ∇f(y) and y (they have no effect on the minimizer), and completing the square, p_L(y) can be written as:

p_L(y) = argmin_x { g(x) + (L/2)·|| x − ( y − (1/L)·∇f(y) ) ||² }
Clearly, the basic iterative step when using ISTA to solve this penalized optimization problem is x_k = p_L(x_{k-1}).
Fixed-step ISTA simply repeats this update from an initial point x₀, with step size t = 1/L(f).
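In code, one fixed-step ISTA iteration is a gradient step on f followed by the proximal step of g. A minimal sketch (grad_f and prox_g are placeholder callables supplied by the caller; prox_g(v, t) is assumed to return argmin_u { g(u) + (1/(2t))·||u − v||² }):

    def ista_fixed_step(grad_f, prox_g, L, x0, iters=300):
        """Fixed-step ISTA: x_k = p_L(x_{k-1}) = prox_g(x_{k-1} - grad_f(x_{k-1}) / L, 1/L)."""
        x = x0
        for _ in range(iters):
            x = prox_g(x - grad_f(x) / L, 1.0 / L)   # gradient step on f, then proximal step on g
        return x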
However, the drawback of fixed-step ISTA is that the Lipschitz constant L(f) is not always known or easy to compute. For example, in the L1-regularized least-squares problem above, the Lipschitz constant depends on the largest eigenvalue of AᵀA, and for large-scale problems computing it is very expensive. Therefore, the following ISTA with backtracking is used:
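A sketch of the backtracking rule (L0 = 1 and eta = 2 are illustrative choices; f, grad_f, g, prox_g are caller-supplied callables as above, operating on NumPy vectors):

    def ista_backtracking(f, grad_f, g, prox_g, x0, L0=1.0, eta=2.0, iters=300):
        """ISTA with backtracking: grow L until F(p_L(x)) <= Q_L(p_L(x), x), then take the step."""
        x, L = x0, L0
        for _ in range(iters):
            gx = grad_f(x)
            while True:
                z = prox_g(x - gx / L, 1.0 / L)                  # candidate p_L(x)
                d = z - x
                Q = f(x) + gx @ d + 0.5 * L * (d @ d) + g(z)     # Q_L(z, x)
                if f(z) + g(z) <= Q:                             # the model majorizes F at z: accept
                    break
                L *= eta                                         # otherwise increase L and retry
            x = z
        return x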
Theory shows that the convergence rate of ISTA is O(1/k), while that of FISTA is O(1/k²); in practical applications FISTA is also clearly faster than ISTA. The proof can again be found in [1].
3. FISTA
FISTA (A Fast Iterative Shrinkage-Thresholding Algorithm) is a fast variant of ISTA.
The difference between FISTA and ISTA lies in the choice of the point y at which the approximate function is built in each iteration. ISTA uses the minimizer x_{k-1} from the previous iteration, while FISTA computes y differently; theory shows that this raises the convergence rate to O(1/k²). The basic iterative steps of fixed-step FISTA are (with y₁ = x₀ and t₁ = 1):

x_k = p_L(y_k);
t_{k+1} = (1 + √(1 + 4t_k²)) / 2;
y_{k+1} = x_k + ((t_k − 1)/t_{k+1})·(x_k − x_{k-1}).
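A minimal fixed-step FISTA sketch following these steps (grad_f and prox_g as in the ISTA sketch above):

    def fista_fixed_step(grad_f, prox_g, L, x0, iters=300):
        """Fixed-step FISTA: the ISTA step is taken at an extrapolated point y_k instead of x_{k-1}."""
        x_prev, y, t = x0, x0, 1.0
        for _ in range(iters):
            x = prox_g(y - grad_f(y) / L, 1.0 / L)               # x_k = p_L(y_k)
            t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0    # t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2
            y = x + ((t - 1.0) / t_next) * (x - x_prev)          # y_{k+1}: combination of x_k and x_{k-1}
            x_prev, t = x, t_next
        return x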
Of course, FISTA has the same issue as ISTA: when the problem is large, the Lipschitz constant that determines the step size is expensive to compute. Like ISTA, FISTA therefore also has a backtracking version, and on this point there is no difference between the two. This again shows that the only difference between FISTA and ISTA is the choice of the starting point of the approximate function at each iteration. Put more succinctly: on top of the gradient-descent-style iteration, FISTA chooses the sequence {x_k} in a smarter way, so that it approaches the minimum of the objective faster.
FISTA with backtracking combines the backtracking step-size rule described above with the same update of y_k: at each iteration, the backtracking search is simply carried out at y_k instead of at x_{k-1}.
It is worth noting that, in each iteration, when computing the starting point of the approximate function, FISTA takes the results of the previous two iterations, x_{k-1} and x_{k-2}, and forms a simple linear combination of them to generate the starting point y_k of the next iteration. The method is very simple, but the effect is very good, and of course it is backed by theory.
4. Application of ISTA & FISTA: image deblurring
The LASSO is a classical objective in image processing:

min ||Ax − b||² + λ||x||₁

The second term, the L1 norm, enforces the sparsity of x, which has been discussed above and is not repeated here.
For example, in the image deblurring problem, we know the blurred image b and the blur kernel r, and we want to recover the sharp image i. The relationship between these variables is i * r = b, where * denotes convolution. In the ideal case, b contains no noise and the problem is very simple: by the convolution theorem, convolution in the spatial domain equals multiplication in the frequency domain, so we only need to take the Fourier transforms of b and r, divide one by the other to obtain the Fourier transform of i, and transform back to the spatial domain.
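A sketch of this ideal, noise-free frequency-domain recovery (assuming circular convolution with the kernel origin stored at index (0, 0); the small eps guards against division by zero and already hints at why noise makes this division unstable):

    import numpy as np

    def naive_deconvolution(b, r, eps=1e-8):
        """Recover i from b = i * r (circular convolution, no noise) via the convolution theorem."""
        B = np.fft.fft2(b)                  # Fourier transform of the blurred image
        R = np.fft.fft2(r, s=b.shape)       # Fourier transform of the blur kernel (zero-padded)
        I = B / (R + eps)                   # division in the frequency domain
        return np.real(np.fft.ifft2(I))     # back to the spatial domain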
In practice, however, the blurred image b contains noise, which makes this frequency-domain operation unstable. So more often we look for i by solving:

min_I ||R·I − B||² + λ·p(I)

where the blur operator R is represented as a matrix, I and B are the vectorized (1-dimensional) images, and p is the regularization term. We expand I in a wavelet basis, I = Wx, where W is the wavelet basis and x are the wavelet coefficients. Since the wavelet representation of an image is sparse, the objective takes the LASSO form:

min ||Ax − B||² + λ||x||₁
where A = RW. The problem now is that, because of the L1 norm, this objective is not differentiable everywhere; if we used the subgradient method, convergence would be very slow.
4.1 Solving the LASSO with ISTA
So we use ISTA (the Iterative Shrinkage-Thresholding Algorithm). ISTA can solve minimization problems of the f + g form above, and is suited to problems with the following properties:
1. The objective has the form f + g;
2. f and g are both convex; f is differentiable, g need not be;
3. g must be simple enough (separable, so that the proximal minimization splits into independent coordinate-wise problems).
First, take a plain gradient-descent step on f; combining it with g as in the formulas above, each iteration solves:

x_k = argmin_x { g(x) + (L/2)·|| x − ( x_{k-1} − (1/L)·∇f(x_{k-1}) ) ||² }
If g is a separable function (such as the L1 norm), this n-dimensional minimization splits along the coordinates into n one-dimensional minimizations, each of which has an analytic solution. Every iteration can therefore be written as:

x_k = shrink( x_{k-1} − (1/L)·∇f(x_{k-1}),  λ/L )
where shrink(·, α) is the shrinkage (soft-thresholding) operator, acting element-wise: shrink(u, α)_i = sign(u_i)·max(|u_i| − α, 0).
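A sketch of the shrinkage operator and the resulting ISTA update for the LASSO (A, b, λ and L = 2·λ_max(AᵀA) are the quantities defined above; the function names are just for this sketch):

    import numpy as np

    def shrink(u, alpha):
        """Soft-thresholding: shrink(u, alpha)_i = sign(u_i) * max(|u_i| - alpha, 0)."""
        return np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)

    def ista_lasso_step(x, A, b, lam, L):
        """One ISTA iteration for min ||Ax - b||^2 + lam * ||x||_1."""
        grad = 2.0 * A.T @ (A @ x - b)          # gradient of the smooth part f
        return shrink(x - grad / L, lam / L)    # gradient step followed by shrinkage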
4.2 Solving the LASSO with FISTA
FISTA is in fact ISTA with Nesterov acceleration applied. One iteration of ordinary Nesterov-accelerated gradient descent consists of the following steps:

1. t_{k+1} = (1 + √(1 + 4t_k²)) / 2
2. y_{k+1} = x_k + ((t_k − 1)/t_{k+1})·(x_k − x_{k-1})
3. x_{k+1} = y_{k+1} − (1/L)·∇f(y_{k+1})
Applied to FISTA, the 3rd step is replaced by the ISTA iteration (the gradient step followed by shrinkage). It can be proved that FISTA achieves a convergence rate of O(1/t²), where t is the number of iterations. The experiment below shows that after the same 300 iterations, the left result (ISTA) has still not converged and the image is still blurred, while the right image (FISTA) has essentially recovered the sharp original.
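A toy version of this comparison can be sketched as follows (a small synthetic LASSO problem stands in for the deblurring experiment; the sizes, λ, and the 300 iterations are illustrative, so the printed numbers are not those of the figure):

    import numpy as np

    def shrink(u, a):
        return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

    def objective(x, A, b, lam):
        return np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 500))
    x_true = np.zeros(500)
    x_true[rng.choice(500, 20, replace=False)] = rng.standard_normal(20)
    b = A @ x_true + 0.01 * rng.standard_normal(200)
    lam = 0.1
    L = 2.0 * np.linalg.norm(A, 2) ** 2

    x_ista = np.zeros(500)                                          # ISTA
    for _ in range(300):
        x_ista = shrink(x_ista - 2.0 * A.T @ (A @ x_ista - b) / L, lam / L)

    x_prev = np.zeros(500); y = np.zeros(500); t = 1.0              # FISTA
    for _ in range(300):
        x_next = shrink(y - 2.0 * A.T @ (A @ y - b) / L, lam / L)
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x_prev)
        x_prev, t = x_next, t_next

    print("ISTA objective after 300 iterations :", objective(x_ista, A, b, lam))
    print("FISTA objective after 300 iterations:", objective(x_prev, A, b, lam))  # typically much smaller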
5. Application of ISTA & FISTA: feature matching
Let F be the transformation function between U and V. Then we can see that:
From the function above, we can get:
=========================
Referenced from:
1. Junhao_wu: http://www.cnblogs.com/JunhaoWu/p/Fista.html
2. Beyond algorithm is math: http://blog.csdn.net/iverson_49/article/details/38354961
Please credit pual-huang and include the source address when reprinting.
The origin of FISTA: from the gradient descent method to ISTA & FISTA