Reprint Deep Learning: Eight (Sparse Coding)

Source: Internet
Author: User

Reprint http://blog.sina.com.cn/s/blog_4a1853330102v0mr.html

Sparse Coding:

This section briefly introduces sparse coding, which is an important branch of deep learning and another way to extract good features from a data set. The content of this article refers to the Stanford Deep Learning Tutorial pages Sparse Coding and Sparse Coding: Autoencoder Interpretation; the corresponding Chinese lessons are Sparse Coding and Sparse Coding Self-Encoding Expression.

Before starting, we need some understanding of convex optimization. Baidu Encyclopedia explains it as follows: "convex optimization" refers to a special class of optimization problems in which the objective function is a convex function and the constraints define a convex set; that is, both the objective function and the constraint set are "convex".

OK, now let us start with a brief introduction to sparse coding. Sparse coding decomposes the input sample set X into a linear combination of multiple basis elements, and the coefficients in front of these bases then represent the features of the input samples. The decomposition formula is expressed as follows:
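
Written out in the notation of the UFLDL tutorial referenced above, each input vector x is approximated by k basis vectors φ_i with coefficients a_i:

    x \approx \sum_{i=1}^{k} a_i \phi_i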

In general, the number of bases k is required to be very large, at least larger than the number of elements n in x, because such a combination of bases can more easily capture the intrinsic structure and characteristics of the input data. In the familiar PCA algorithm we can also find a set of bases to decompose X, but there the number of bases is small, so the decomposition coefficients a are uniquely determined; in sparse coding, k is very large, much larger than n, so the decomposition coefficients a can no longer be uniquely determined. The usual approach is to place a sparsity constraint on the coefficients a, which is where the sparse coding algorithm gets its name. The corresponding cost function of the system (earlier posts used "loss function"; from here on "cost function" is used consistently, as that translation feels more appropriate) is:
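
In the form used by the UFLDL tutorial, with m training samples x^{(j)} and a sparsity penalty S(·), the unconstrained objective is roughly:

    \min_{a, \phi} \; \sum_{j=1}^{m} \Big\| x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \Big\|^2 \;+\; \lambda \sum_{j=1}^{m} \sum_{i=1}^{k} S\big(a_i^{(j)}\big)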

The first term is the cost of reconstructing the input data x; the second term S(.) is the sparsity penalty on the decomposition coefficients; and lambda is a constant weighting the two costs. There is still a problem, however: we could shrink the coefficients a to be very small and scale the values in each basis up to be very large, so that the cost of the first term stays essentially unchanged while the sparsity penalty of the second term becomes very small. That misses the goal we want, which is that only a few of the decomposition coefficients are far greater than 0, rather than most of the coefficients being slightly larger than 0. The common way to solve this problem is to constrain the values in the basis set; the system cost function after adding the constraint is:
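
Following the UFLDL tutorial, the constrained version keeps the same objective and bounds the norm of every basis vector by a constant C:

    \min_{a, \phi} \; \sum_{j=1}^{m} \Big\| x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \Big\|^2 \;+\; \lambda \sum_{j=1}^{m} \sum_{i=1}^{k} S\big(a_i^{(j)}\big)
    \quad \text{subject to} \quad \|\phi_i\|^2 \le C, \;\; i = 1, \dots, k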

This method is called sparse coding. In layman's terms, a signal is expressed as a linear combination of a set of bases, with the requirement that only a few bases are needed to represent the signal. "Sparsity" is defined as: only a few elements are non-zero, or only a few elements are far greater than 0. Requiring the coefficients a_i to be sparse means that, for a given set of input vectors, we want as few coefficients as possible to be far greater than 0. There is a good reason to choose a sparse representation of the input data: the vast majority of sensory data, such as natural images, can be represented as a superposition of a small number of basic elements, which in an image might be surfaces or edges. At the same time, this strengthens the analogy with the primary visual cortex (the human brain has a large number of neurons, but for a given image or edge only a few neurons are excited while the others are suppressed).

The sparse coding algorithm is an unsupervised learning method that seeks a set of "over-complete" basis vectors to represent the sample data more efficiently. Although principal component analysis (PCA) allows us to conveniently find a set of "complete" basis vectors, what we want to do here is find an "over-complete" set of basis vectors to represent the input vectors (that is, the number of basis vectors is greater than the dimension of the input vectors). The advantage of an over-complete basis is that it can capture the structures and patterns implicit in the input data more effectively. However, with an over-complete basis, the coefficients a_i are no longer uniquely determined by the input vector. Therefore, the sparse coding algorithm adds a criterion of "sparsity" to resolve the degeneracy caused by over-completeness. (For the detailed procedure, please refer to the UFLDL Tutorial: Sparse Coding.)

Probabilistic interpretation of Sparse Coding:

This part explains the sparse coding method from a probabilistic point of view, but the material is genuinely not very clear to me, so I can only describe my own understanding. Taking noise into account, the expression for the input sample x after sparse coding decomposition is as follows:
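
With Gaussian noise ν added to the reconstruction, the generative model is roughly:

    x = \sum_{i=1}^{k} a_i \phi_i + \nu(x), \qquad \nu \sim \mathcal{N}(0, \sigma^2)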

Our goal is to find a set of bases φ such that the probability distribution P(x | φ) they induce over the input data is as close as possible to the empirical distribution of the input data. If the KL divergence is used to measure this similarity, we want the KL divergence to be as small as possible, that is, the value of the following expression should be minimal:
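
With P'(x) denoting the empirical distribution and P(x | φ) the model distribution, the quantity being minimized is the KL divergence:

    D\big(P'(x) \,\|\, P(x \mid \phi)\big) = \int P'(x) \log \frac{P'(x)}{P(x \mid \phi)} \, dx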

Since the empirical distribution of the input data is fixed, minimizing this KL divergence is equivalent to maximizing the likelihood P(x | φ).

After placing a prior on the coefficients a and approximating the integral over a, the derivation finally reduces to minimizing the following energy function:
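
Up to constant factors that depend on the noise variance and the scale of the prior on a, the energy function has the familiar form:

    E(x, a; \phi) = \Big\| x - \sum_{i=1}^{k} a_i \phi_i \Big\|^2 + \lambda \sum_{i=1}^{k} |a_i|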

This links up nicely with the sparse coding cost function given earlier.

By now we should see that sparse coding is very slow in actual use: even though the bases φ of the input data set have been learned during the training phase, during the testing phase we still have to obtain the feature values of an input (i.e. the coefficients in front of the basis combination) by solving a convex optimization. This makes it much slower than an ordinary feed-forward neural network (the usual forward pass just does a matrix multiplication, adds a bias, and evaluates a function, which takes only a few steps).
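
A minimal sketch of why inference is expensive: assuming the bases Phi have already been learned, each new input still needs an iterative convex solver (here a simple ISTA / soft-thresholding loop on the L1-penalized objective, one common choice rather than the specific solver used in the UFLDL exercise), whereas a feed-forward layer needs only one matrix product.

```python
import numpy as np

def sparse_codes_ista(Phi, x, lam=0.1, n_iter=200):
    """Infer sparse coefficients a for one input x, given learned bases Phi.

    Minimizes ||x - Phi @ a||^2 + lam * ||a||_1 with ISTA
    (iterative soft-thresholding), one common convex solver."""
    k = Phi.shape[1]
    a = np.zeros(k)
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2        # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ a - x)       # gradient of the reconstruction term
        z = a - grad / L                         # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold step
    return a

rng = np.random.default_rng(0)
Phi = rng.standard_normal((64, 256))             # 64-dim inputs, 256 over-complete bases
x = rng.standard_normal(64)
a = sparse_codes_ista(Phi, x)                    # testing phase: an iterative optimization
h = np.maximum(Phi.T @ x, 0.0)                   # ordinary forward layer: one multiply
```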

Autoencoder interpretation of Sparse Coding:

First, look at the Lk norm of a vector x: the L1 norm is the sum of the absolute values of the elements, and the L2 norm is the Euclidean distance from the vector to the origin.
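
In symbols:

    \|x\|_1 = \sum_i |x_i|, \qquad \|x\|_2 = \sqrt{\sum_i x_i^2}, \qquad \|x\|_k = \Big( \sum_i |x_i|^k \Big)^{1/k}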

The cost function for sparse coding expressed in matrix form is as follows:
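
Following the notation of the UFLDL "Sparse Coding: Autoencoder Interpretation" page, with A the basis matrix, s the feature matrix and x the input, the matrix-form cost is roughly:

    J(A, s) = \| A s - x \|_2^2 + \lambda \| s \|_1 + \gamma \| A \|_2^2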

As mentioned earlier, the middle term is a sparsity penalty on the feature values s, enforced with the L1 norm, while the last term keeps the basis matrix A from becoming too large and is enforced with the square of the L2 norm. However, the L1 norm on the feature values is not differentiable at 0, so a gradient-descent-style method cannot be applied directly to the cost function above. To make it differentiable at 0, the formula is converted to the following:
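
A common smoothed form replaces |s| with \sqrt{s^2 + \epsilon} for a small constant ε, so the penalty is differentiable at 0:

    J(A, s) = \| A s - x \|_2^2 + \lambda \sum_{k} \sqrt{s_k^2 + \epsilon} + \gamma \| A \|_2^2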

Topographic Sparse Coding:

Topographic sparse coding mainly imitates the way adjacent neurons in the human cerebral cortex extract similar features, so in deep learning we want the learned features to have this kind of "topographic order". If we arrange the features into a matrix, we want adjacent features in that matrix to be similar. Concretely, the L1-norm sparsity penalty on the original feature coefficients is replaced by a sum of L1-norm penalties over different groups, and these neighboring groups overlap; as long as a value in an overlapping part changes, the penalty values of the groups it belongs to also change, which mirrors the behaviour of the human cerebral cortex. The cost function of the system then becomes:

In matrix form it is as follows:
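
One standard way to write the grouped penalty is with a grouping matrix V, where V_{g,i} = 1 if feature i belongs to group g and the groups overlap; the matrix-form cost then looks roughly like:

    J(A, s) = \| A s - x \|_2^2 + \lambda \sum_{g} \sqrt{ \big(V (s \circ s)\big)_g + \epsilon } + \gamma \| A \|_2^2

Here s \circ s denotes the element-wise square of the features, so each group's penalty is the square root of the summed squares of its (possibly shared) members.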

Summary:

In actual programming, in order to write correct optimization code that converges quickly and properly to the optimal value, the following techniques can be used:

    1. Divide the input sample set into several smaller mini-batches. The benefit is that the number of samples fed into the system at each iteration is smaller, each iteration runs much faster, and the overall convergence speed is improved. (I have not yet figured out the exact reason.)
    2. The initialization value of s cannot be given randomly. It is usually done in the following way (see the sketch after this list):
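
A rough sketch of that initialization in Python, following my reading of the UFLDL exercise and meant as illustrative only: project the inputs onto the current bases, then rescale each feature by the norm of its basis vector so that no feature starts out dominating.

```python
import numpy as np

def init_features(A, X):
    """Initialize s for a mini-batch X (columns are samples), given bases A."""
    s = A.T @ X                              # project the inputs onto the current bases
    norms = np.linalg.norm(A, axis=0)        # norm of each basis vector (column of A)
    return s / norms[:, np.newaxis]          # divide each feature row by its basis norm
```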

Finally, when actually optimizing the cost function, the steps are roughly as follows (a Python sketch of this loop comes after the list):

    1. Randomly initialize A.
    2. Repeat the following steps until convergence:
      1. Randomly select a small mini-batch.
      2. Initialize s following the approach described above.
      3. With the A given in the previous step, solve for the s that minimizes J(A, s).
      4. With the s obtained in the previous step, solve for the A that minimizes J(A, s).
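
A rough Python sketch of this alternating scheme, with plain gradient steps standing in for the convex solvers of the UFLDL exercise; the values of lam, gamma, eps, the learning rate and the iteration counts are illustrative placeholders.

```python
import numpy as np

def sparse_coding_train(X, k, lam=0.1, gamma=1e-2, eps=1e-6,
                        n_epochs=50, batch_size=256, lr=1e-2, inner_steps=50):
    """Alternating minimization of the smoothed cost
    J(A, s) = ||A s - x||^2 + lam * sum(sqrt(s^2 + eps)) + gamma * ||A||^2."""
    n, m = X.shape
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, k)) * 0.01        # 1. randomly initialize the bases
    for _ in range(n_epochs):                     # 2. repeat until convergence
        idx = rng.choice(m, size=min(batch_size, m), replace=False)
        x = X[:, idx]                             # 2.1 pick a mini-batch
        s = (A.T @ x) / np.linalg.norm(A, axis=0)[:, None]  # 2.2 initialize s as above
        for _ in range(inner_steps):              # 2.3 minimize J over s with A fixed
            grad_s = 2 * A.T @ (A @ s - x) + lam * s / np.sqrt(s**2 + eps)
            s -= lr * grad_s
        for _ in range(inner_steps):              # 2.4 minimize J over A with s fixed
            grad_A = 2 * (A @ s - x) @ s.T + 2 * gamma * A
            A -= lr * grad_A
    return A
```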
