Sparse Coding:
This section briefly introduces Sparse Coding, since Sparse Coding is also an important branch of deep learning and can likewise extract good features from a dataset. The content of this article follows the Stanford deep learning tutorial pages Sparse Coding and Sparse Coding: Autoencoder Interpretation; for the Chinese tutorial, see the corresponding translated pages.
Before that, we need some background on convex optimization. According to Baidu Encyclopedia, "convex optimization" refers to a special class of optimization problems in which the objective function is convex and the feasible region defined by the constraints is a convex set; in other words, both the objective function and the constraints are convex.
Now, let's get to Sparse Coding. Sparse Coding decomposes each input sample x from the sample set X into a linear combination of basis vectors, and the coefficients on those bases serve as the features of the input sample. The decomposition formula is as follows:
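In the tutorial's notation, with basis vectors φ_i and coefficients a_i, a sample x is approximated as

$$ x \approx \sum_{i=1}^{k} a_i \phi_i $$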
In general, the number of bases k must be large, at least greater than the dimension n of x; such an over-complete set of bases makes it easier to capture the internal structure and features of the input data. In the familiar PCA algorithm we can also find a set of bases to decompose x, but the number of bases is small (at most n), so the decomposition coefficients are uniquely determined. In Sparse Coding, k is greater than n, so the decomposition coefficients a cannot be uniquely determined. The usual practice is to impose a sparsity constraint on the coefficients a, which is where the Sparse Coding algorithm gets its name. In this case, the corresponding cost function (called the loss function in the previous post; "cost function" is the more appropriate translation and will be used from now on) is expressed as follows:
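Following the UFLDL formulation, for m training samples x^(j) the objective is roughly

$$ \min_{a,\ \phi}\ \sum_{j=1}^{m} \left\lVert x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \right\rVert^2 + \lambda \sum_{j=1}^{m} \sum_{i=1}^{k} S\!\left(a_i^{(j)}\right) $$

where S(.) is a sparsity penalty such as the L1 penalty S(a) = |a|.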
The first term is the cost of reconstructing the input data x, and the second term S(.) is a sparsity penalty on the decomposition coefficients; lambda is a constant weighting the two costs. There is still a problem, however: we could scale the coefficients a down to very small values and scale each basis vector up correspondingly, so that the first term stays unchanged while the sparsity penalty in the second term becomes arbitrarily small. That defeats our goal, which is that only a few of the decomposition coefficients should be significantly different from 0, rather than most of them being nonzero but small. The usual way to solve this is to impose a constraint on the norms of the basis vectors. The cost function with this constraint added is:
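That is, the same objective as above is minimized subject to a norm constraint on each basis vector, roughly

$$ \lVert \phi_i \rVert^2 \le C, \quad \forall\, i = 1, \dots, k $$

for some constant C.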
Probabilistic interpretation of Sparse Coding:
This part explains the Sparse Coding method from a probabilistic perspective. Admittedly, this part is still not entirely clear to me, so what follows is only my rough understanding. Taking noise into account, the input sample x is expressed after the Sparse Coding decomposition as follows:
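Following the tutorial, the sample is modeled as the linear combination plus a noise term:

$$ x = \sum_{i=1}^{k} a_i \phi_i + \nu $$

where ν is (assumed Gaussian) reconstruction noise.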
Our goal is to find a set of basis vectors such that the distribution of the input data under the model is as close as possible to the empirical distribution of the input data. If we use the KL divergence to measure the similarity between the two, we want to minimize the following expression:
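With P*(x) denoting the empirical distribution and P(x|φ) the model distribution, the quantity to minimize is the KL divergence

$$ D\!\left(P^*(x)\,\Vert\,P(x\mid\phi)\right) = \int P^*(x)\,\log\frac{P^*(x)}{P(x\mid\phi)}\,dx $$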
Because the empirical distribution of the input data is fixed, minimizing the expression above is equivalent to maximizing the likelihood of the input data under the model P(x|φ).
After derivation steps such as placing a prior on the coefficients a and approximating the integral over a, this becomes equivalent to minimizing the following energy function:
This has essentially the same form as the Sparse Coding cost function given earlier.
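Up to constants absorbed into λ (assuming Gaussian noise and a Laplacian prior on the coefficients a, as in the tutorial), the energy function has the form

$$ E(x, a; \phi) = \left\lVert x - \sum_{i=1}^{k} a_i \phi_i \right\rVert^2 + \lambda \sum_{i=1}^{k} |a_i| $$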
At this point we can also see why Sparse Coding is slow to use in practice: even though the basis vectors are learned from the input dataset during the training phase, at test time we still have to run a convex optimization to obtain the feature values of each sample (that is, the coefficients on the basis vectors before they are combined). This is much slower than an ordinary feed-forward neural network, whose forward pass only needs a few matrix multiplications and additions followed by evaluating an activation function.
Autoencoder interpretation of Sparse Coding:
First, let's look at the Lk norm of a vector x, whose value is given by the formula below. From it we can see that the L1 norm is the sum of the absolute values of the elements, and the L2 norm is the Euclidean distance from the vector to the origin.
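For reference, the Lk norm of a vector x can be written as

$$ \lVert x \rVert_k = \left( \sum_i |x_i|^k \right)^{1/k} $$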
The cost function of Sparse Coding, expressed in matrix form, is as follows:
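In the notation of the autoencoder-interpretation page, with A the basis (weight) matrix and s the feature matrix, the cost is roughly

$$ J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2 $$

where x may be a matrix containing one sample per column, and the data term and the penalty on A are squared (Frobenius) norms.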
As shown above, the feature matrix s is subjected to a sparsity penalty through the L1 norm, while the basis matrix A is kept from growing too large through the squared L2 norm. However, the L1 norm is not differentiable at 0, so a gradient-descent-style method cannot be applied directly to the cost function above. To make it differentiable at 0, the formula is changed to the following:
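With the smooth approximation √(s² + ε) replacing |s| (applied elementwise and summed), the cost becomes roughly

$$ J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2 $$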
Topographic Sparse Coding:
Topographic Sparse Coding is meant to imitate the way adjacent neurons in the human cerebral cortex extract similar features, so in deep learning we also want the learned features to have this kind of "topographic order": if we arrange the features into a matrix, adjacent features in the matrix should be similar. Concretely, the L1 sparsity penalty on the feature coefficients is replaced by a sum of L1-style penalties over groups of features, where adjacent groups overlap. Because changing a value in an overlapping region changes the penalties of every group that contains it, adjacent features are pushed to behave similarly, which reflects the topographic organization of the cortex. The system's cost function is therefore:
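With the features partitioned into overlapping groups, the penalty becomes a sum of group norms, roughly

$$ J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{groups } g} \sqrt{\sum_{i \in g} s_i^2 + \epsilon} + \gamma \lVert A \rVert_2^2 $$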
Written in matrix form, this becomes:
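Using a grouping matrix V whose r-th row marks the features belonging to the r-th group, and writing s∘s for the elementwise square (my notation; the tutorial writes this slightly differently), this can be expressed as

$$ J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum \sqrt{V(s \circ s) + \epsilon} + \gamma \lVert A \rVert_2^2 $$

where the square root and the outer sum are taken elementwise.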
Summary:
In actual programming, the following tips help to write correct optimization code and to converge quickly to a good optimum:
- Divide the input sample set into several small mini-batches. This reduces the number of samples processed per iteration, so each iteration runs much faster, and it also speeds up overall convergence (I haven't figured out the reason for this yet).
- The initial value of s should not be assigned randomly. Generally, follow these steps:
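  As far as I recall from the tutorial, the suggested heuristic is:
  - Set s := Aᵀx, where x is the matrix of samples in the current mini-batch.
  - For each feature in s (i.e., each row of s), divide it by the norm of the corresponding basis vector (column) in A.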
Finally, the steps for actually optimizing this function are roughly as follows (a small code sketch is given after the list):
- Randomly initialize A.
- Repeat the following steps until convergence:
  - Randomly select a small mini-batch.
  - Initialize s using the method described above.
  - With A fixed, solve for the s that minimizes J(A, s).
  - With s fixed, solve for the A that minimizes J(A, s).
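To make the procedure concrete, here is a minimal NumPy sketch of this alternating loop, assuming the smoothed objective J(A, s) = ||As − x||² + λΣ√(s² + ε) + γ||A||² from above. The function name, hyperparameters, and the use of plain gradient steps are my own illustrative choices (since J is quadratic in A for fixed s, A could instead be updated in closed form):

```python
import numpy as np

def sparse_coding(X, k, lam=0.1, gamma=0.01, eps=1e-6,
                  n_iters=200, batch_size=256, lr=0.001, inner_steps=25):
    """Toy alternating optimization of J(A, s) = ||A s - x||^2
    + lam * sum(sqrt(s^2 + eps)) + gamma * ||A||^2.

    X: (n, m) data matrix, one sample per column.
    k: number of basis vectors (columns of A) to learn.
    """
    n, m = X.shape
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, k)) * 0.1      # random initialization of the basis matrix A

    for _ in range(n_iters):
        # Randomly select a small mini-batch of samples.
        idx = rng.choice(m, size=min(batch_size, m), replace=False)
        Xb = X[:, idx]

        # Initialize the features s with the heuristic above:
        # s = A^T x, then scale each row by the norm of the corresponding basis vector.
        s = A.T @ Xb
        s /= np.linalg.norm(A, axis=0)[:, None] + eps   # eps avoids division by zero

        # With A fixed, take gradient steps on s to decrease J(A, s).
        for _ in range(inner_steps):
            resid = A @ s - Xb
            grad_s = 2.0 * A.T @ resid + lam * s / np.sqrt(s ** 2 + eps)
            s -= lr * grad_s

        # With s fixed, take a gradient step on A (a closed-form least-squares
        # update is also possible; a gradient step keeps the sketch short).
        resid = A @ s - Xb
        grad_A = 2.0 * resid @ s.T + 2.0 * gamma * A
        A -= lr * grad_A

    return A
```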
References:
- Sparse Coding (UFLDL Tutorial)
- Sparse Coding: Autoencoder Interpretation (UFLDL Tutorial)
- Sparse Coding (Chinese translation)
- Sparse Coding: Autoencoder Interpretation (Chinese translation)