I. Sparse Coding
The sparse coding algorithm is an unsupervised learning method that searches for an "over-complete" set of basis vectors with which sample data can be represented more efficiently. The goal of the algorithm is to find a set of basis vectors φ_i so that an input vector x can be expressed as a linear combination of them.
That is:
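In the notation of the UFLDL notes referenced at the end of this post (coefficients a_i, basis vectors φ_i):

\[ x = \sum_{i=1}^{k} a_i \phi_i \]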
Although principal component analysis (PCA) lets us easily find a "complete" set of basis vectors, what we want here is an "over-complete" set of basis vectors to represent the input vectors (that is, k > n). The advantage of an over-complete basis is that it can capture the structures and patterns hidden in the input data more effectively. However, with an over-complete basis the coefficients a_i are no longer uniquely determined by the input vector. Therefore, in the sparse coding algorithm we add an extra criterion, "sparsity", to resolve the degeneracy.
Here we define "sparsity" as having only a few non-zero elements, or only a few elements that are far from zero. Requiring the coefficients a_i to be sparse means that, for any given input vector, we want as few coefficients as possible to be far from zero. There is a good reason to represent our input data with sparse components: most sensory data, such as natural images, can be described as the superposition of a small number of basic elements, which in an image might be surfaces or edges. The analogy with processing in the primary visual cortex has also been advanced as a justification.
For a set of m input vectors, the sparse coding cost function is defined as:
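A standard way to write this cost over m training examples x^{(j)} is:

\[ \min_{a,\phi} \; \sum_{j=1}^{m} \left( \left\| x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \right\|^2 + \lambda \sum_{i=1}^{k} S\!\left(a_i^{(j)}\right) \right) \]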
Here S(.) is a sparsity cost function that penalizes a_i for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term, which forces the algorithm to provide a linear representation that fits the input vector closely; the second term is the "sparsity penalty", which makes the representation sparse. The constant λ is a scaling factor that controls the relative importance of the two terms.
Although the most direct measure of "sparsity" is the L0 norm, it is not differentiable and is generally hard to optimize. In practice, the usual choice for the sparsity cost function S(.) is the L1 norm penalty (for details, refer to my previous blog post, Sparse Coding study notes).
In addition, it is possible to make the sparsity penalty arbitrarily small by scaling down the coefficients a_i while scaling up the basis vectors φ_i by some large constant. To prevent this, we constrain the squared norm of each basis vector to be less than a constant C. The complete sparse coding cost function, including this constraint, is as follows:
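Putting the reconstruction term, the sparsity penalty, and the norm constraint together:

\[
\begin{aligned}
\min_{a,\phi} \quad & \sum_{j=1}^{m} \left( \left\| x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \right\|^2 + \lambda \sum_{i=1}^{k} S\!\left(a_i^{(j)}\right) \right) \\
\text{subject to} \quad & \left\| \phi_i \right\|^2 \le C, \quad \forall\, i = 1, \dots, k
\end{aligned}
\]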
II. Example Based on the Theory of Olshausen and Field
Take, as an example, learning edge detectors as bottom-level features for image feature extraction. The job here is to select small patches at random from natural images and use them to learn a "basis" that can describe them, namely the 64 basis patches of size 8*8 shown on the right (the method for learning the basis is described at http://blog.csdn.net/abcjennifer/article/details/7721834). Then, given a test patch, we can obtain it as a linear combination of the basis patches according to the formula above, with sparse coefficient vector a. The vector a has 64 dimensions, of which only 3 entries are non-zero; this is what is meant by "sparse".
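As a minimal numerical illustration of this reconstruction (the basis matrix, indices, and values below are made up so the snippet runs on its own; they are not the basis learned in the referenced post):

```python
import numpy as np

# Hypothetical learned basis: 64 basis "patches", each flattened to 8*8 = 64 values.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 64))   # column i plays the role of basis vector phi_i

# A sparse coefficient vector a: 64 entries, only 3 of them non-zero.
a = np.zeros(64)
a[[5, 23, 47]] = [0.8, -1.2, 0.5]     # indices and values chosen arbitrarily

# The test patch is a linear combination of just 3 basis patches: x = sum_i a_i * phi_i
x = (Phi @ a).reshape(8, 8)
print(np.count_nonzero(a), "non-zero coefficients out of", a.size)
```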
Here we may wonder: why use edge detectors as the bottom layer, and what does the upper layer look like? A simple explanation: edge detectors can describe the entire image because edges in different directions can compose any image, so edges in different directions form the basis of the image ...
Combinations of these bases, in turn, form the basis of the upper layer ... (for details, see the references below), as shown in the accompanying figure.
For other examples, see the text below (reference 2).
How is the basis for natural images chosen? And where does the sparse coding cost function come from? The following is the underlying theoretical analysis.
So far we have viewed sparse coding as finding a sparse, over-complete set of basis vectors to span our input data space. Now we can also look at the sparse coding algorithm, from a probabilistic perspective, as a "generative model".
We model a natural image x as the linear superposition of k independent source features φ_i plus additive noise ν:
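In symbols:

\[ x = \sum_{i=1}^{k} a_i \phi_i + \nu(x) \]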
Our goal is to find a set of feature basis vectors φ such that the distribution of images under the model, P(x|φ), is as close as possible to the empirical distribution of the input data, P*(x). One way to do this is to minimize the KL divergence between P*(x) and P(x|φ), which is defined as follows:
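The divergence being minimized is:

\[ D\!\left(P^*(x)\,\middle\|\,P(x \mid \phi)\right) = \int P^*(x)\, \log \frac{P^*(x)}{P(x \mid \phi)}\, dx \]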
Since the empirical distribution P*(x) is a constant no matter how we choose φ, minimizing the KL divergence amounts to maximizing the log-likelihood of P(x|φ). Assuming that ν is Gaussian white noise with variance σ², we have the following formula:
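With Z denoting a normalizing constant, the likelihood under this noise model takes the form:

\[ P(x \mid a, \phi) = \frac{1}{Z} \exp\!\left( -\frac{\left\| x - \sum_{i=1}^{k} a_i \phi_i \right\|^2}{2\sigma^2} \right) \]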
To determine the distribution P(x|φ), we also need to specify a prior distribution P(a). Assuming that our feature variables a_i are independent, we can factorize the prior probability as:
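That is, with k independent features:

\[ P(a) = \prod_{i=1}^{k} P(a_i) \]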
At this point we add the "sparsity" assumption: any given image is composed of a relatively small number of source features. We therefore want the probability distribution of a_i to be peaked around zero, with a high peak value. A convenient parametric prior distribution is:
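With β a positive constant and Z a normalizing constant, one standard way to write this prior is:

\[ P(a_i) = \frac{1}{Z} \exp\!\big( -\beta\, S(a_i) \big) \]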
Here S(a_i) is a function that determines the shape of the prior distribution.
Having defined P(x | a, φ) and P(a), we can write the probability distribution of the data under this model:
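Marginalizing out the coefficients a gives:

\[ P(x \mid \phi) = \int P(x \mid a, \phi)\, P(a)\, da \]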
Our problem then reduces to searching for:
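In the notation above:

\[ \phi^* = \arg\max_{\phi} \big\langle \log P(x \mid \phi) \big\rangle \]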
where ⟨ . ⟩ denotes the expected value over the input data.
Unfortunately, the integral over a is generally intractable to compute. Even so, we note that if the distribution P(x|φ) is sufficiently peaked (with respect to a), we can approximate the integral by its maximum value. The estimate is obtained as follows:
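Under this peakedness assumption, the approximate problem becomes:

\[ \phi^* = \arg\max_{\phi} \Big\langle \max_{a} \log\!\big( P(x \mid a, \phi)\, P(a) \big) \Big\rangle \]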
As before, we can increase this probability estimate by scaling down a_i while scaling up φ_i (because P(a_i) peaks around zero). Therefore, we again need to place a norm constraint on the feature vectors φ_i to prevent this.
Finally, by defining an energy function for this linear generative model, we can recover the original cost function:
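One consistent way to write this energy function is:

\[ E\big(x, a \mid \phi\big) = \left\| x - \sum_{i=1}^{k} a_i \phi_i \right\|^2 + \lambda \sum_{i=1}^{k} S(a_i) \]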
where λ = 2σ²β, and irrelevant constants have been absorbed. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we can re-express the original optimization problem as:
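Equivalently:

\[ \phi^* = \arg\min_{\phi} \Big\langle \min_{a} E\big(x, a \mid \phi\big) \Big\rangle \]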
Analyzing this from a probabilistic viewpoint, we find that choosing the L1 penalty as the sparsity function S(.) corresponds to using a Laplacian prior P(a_i) ∝ exp(−β|a_i|).
This completes the probabilistic interpretation.
III. Learning Algorithm
Learning a set of basis vectors with the sparse coding algorithm consists of two separate optimization processes: the first optimizes the coefficients a_i for each training sample individually; the second optimizes the basis vectors across many samples at once.
With the L1 norm as the sparsity penalty, learning the coefficients reduces to an L1-regularized least squares problem, which is convex in a_i; many techniques have been developed for solving it (convex optimization software such as CVX can handle the L1-regularized least squares problem). If S(.) is differentiable, for example the log penalty, gradient-based methods such as conjugate gradient can also be used.
Learning the basis vectors under the L2 norm constraint likewise reduces to a least squares problem with quadratic constraints, which is convex in φ. Standard convex optimization software (such as CVX) or other iterative methods can be used to solve it, although more efficient methods, such as solving the Lagrange dual, are also available.
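A rough sketch of this alternating procedure, assuming the L1 penalty and using a simple ISTA update for the coefficients and a projected gradient step for the basis (function names, step size, and iteration counts are illustrative choices, not the exact methods from the references):

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_coding(X, k, lam=0.1, C=1.0, n_outer=50, n_ista=50):
    """Alternate between the sparse codes A and the basis Phi.

    X   : (n, m) matrix holding m input vectors of dimension n
    k   : number of basis vectors (over-complete if k > n)
    lam : weight of the L1 sparsity penalty
    C   : bound on ||phi_i||^2
    """
    n, m = X.shape
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((n, k))
    Phi /= np.linalg.norm(Phi, axis=0)        # unit columns, within the constraint
    A = np.zeros((k, m))

    for _ in range(n_outer):
        # Step 1: fix Phi, solve the L1-regularized least squares for A (ISTA).
        L = np.linalg.norm(Phi, 2) ** 2       # Lipschitz constant of the gradient
        for _ in range(n_ista):
            grad = Phi.T @ (Phi @ A - X)
            A = soft_threshold(A - grad / L, lam / L)

        # Step 2: fix A, take a small gradient step on Phi (step size is arbitrary),
        # then project each column back onto ||phi_i||^2 <= C.
        grad_Phi = (Phi @ A - X) @ A.T
        Phi -= 1e-3 * grad_Phi
        norms = np.linalg.norm(Phi, axis=0)
        Phi *= np.minimum(1.0, np.sqrt(C) / np.maximum(norms, 1e-12))
    return Phi, A

# Toy usage: 100 random 64-dimensional "patches", 128 basis vectors (over-complete).
X = np.random.default_rng(1).standard_normal((64, 100))
Phi, A = sparse_coding(X, k=128)
print(Phi.shape, A.shape)                     # (64, 128) (128, 100)
```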
... To be continued
References:
http://deeplearning.stanford.edu/wiki/index.php/%E7%A8%80%E7%96%8F%E7%BC%96%E7%A0%81
http://blog.csdn.net/abcjennifer/article/details/7804962
(The probabilistic interpretation above is based on the theory of Olshausen and Field, 1996.)