Stanford UFLDL Tutorial: Sparse Coding (Autoencoder Interpretation)

Contents: 1 Sparse coding  2 Topographic sparse coding  3 Sparse coding in practice  3.1 Batching examples into "mini-batches"  3.2 A good initial value for s  3.3 The practical algorithm  4 Chinese-English glossary
Sparse Coding

In the sparse autoencoder algorithm, we try to learn a set of weight parameters W (and the corresponding intercepts b) that give us a sparse feature vector sigma(Wx + b) useful for reconstructing the input samples.


Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the feature set of the data directly. Together with a basis that transforms the learned features from the feature space back to the sample data space, we can then reconstruct the sample data from the learned feature set.


Concretely, in the sparse coding algorithm we have sample data x on which we would like to perform feature learning. In particular, we would like to learn a sparse feature set s that represents the sample data, and a basis A that transforms the feature set from the feature space to the sample data space. We can therefore construct the following objective function:

J(A, s) = \|As - x\|_2^2 + \lambda \|s\|_1

(\|x\|_k is the L_k norm of x, equivalent to (\sum_i |x_i|^k)^{1/k}. The L2 norm is the familiar Euclidean norm, and the L1 norm is the sum of the absolute values of the elements of the vector.)


The first term above is the error incurred when the basis is used to reconstruct the sample data from the feature set, and the second term is a sparsity penalty term that enforces the sparsity of the feature set s.
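To make the notation concrete, the objective can be written down directly in NumPy. This is a minimal sketch; the array shapes (one example per column) and the names sparse_coding_cost and lam are illustrative assumptions rather than part of the tutorial.

import numpy as np

def sparse_coding_cost(A, s, x, lam):
    """Basic sparse coding objective: reconstruction error plus L1 sparsity penalty.

    A   : (n, k) basis matrix, one basis vector per column
    s   : (k, m) feature matrix, one column of features per example
    x   : (n, m) data matrix, one example per column
    lam : weight of the sparsity penalty (lambda in the formula above)
    """
    reconstruction_error = np.sum((A @ s - x) ** 2)  # ||A s - x||_2^2
    sparsity_penalty = lam * np.sum(np.abs(s))       # lambda * ||s||_1
    return reconstruction_error + sparsity_penalty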


However, as written, the objective function is under-constrained: scaling A by a constant c and scaling s by 1/c leaves the reconstruction error unchanged but reduces the sparsity cost (the second term). Therefore an additional constraint needs to be added for each column A_j of A, and the problem becomes:

minimize_{A, s} \|As - x\|_2^2 + \lambda \|s\|_1
subject to \|A_j\|_2^2 \le 1 for all j
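The scaling degeneracy described above is easy to check numerically. In this small sketch all matrix sizes and the scaling constant are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))                  # basis
s = rng.standard_normal((5, 10))                 # features
x = A @ s + 0.1 * rng.standard_normal((8, 10))   # data to reconstruct

c = 10.0
A_scaled, s_scaled = c * A, s / c                # scale A up and s down by the same constant

print(np.allclose(A @ s, A_scaled @ s_scaled))       # True: reconstruction error unchanged
print(np.sum(np.abs(s_scaled)) < np.sum(np.abs(s)))  # True: L1 sparsity cost shrinks by a factor of c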


Unfortunately, the objective function is non-convex, so gradient-based methods cannot solve this optimization problem outright. However, given A, minimizing J(A, s) with respect to s is convex; similarly, given s, minimizing J(A, s) with respect to A is convex. This suggests solving for A and s by alternately optimizing one while holding the other fixed. Practice shows that this strategy works very well.


However, the formulation above poses another challenge: the constraint cannot be enforced with a simple gradient method. Therefore, in practical problems the constraint is weakened to a "weight decay" term that keeps the entries of A small. This gives a new objective function:

J(A, s) = \|As - x\|_2^2 + \lambda \|s\|_1 + \gamma \|A\|_2^2

(Note that the third term, \|A\|_2^2, is simply the sum of the squares of the entries of A, \sum_{r,c} A_{r,c}^2.)


This objective function brings one last problem: the L1 norm is not differentiable at 0, which prevents the straightforward use of gradient-based methods. Although the problem could be avoided with other, non-gradient-descent methods, here we solve it by "smoothing" the L1 norm with an approximation: we use \sqrt{x^2 + \epsilon} in place of |x|, where \epsilon is a "smoothing parameter" or "sparsity parameter" (if \epsilon is much greater than x^2, then x^2 + \epsilon is dominated by \epsilon and its square root is approximately \sqrt{\epsilon}). The smoothing will also come in handy for topographic sparse coding below.


Therefore, the final objective function is:

J(A, s) = \|As - x\|_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \|A\|_2^2

(where \sqrt{s^2 + \epsilon} is shorthand for \sum_k \sqrt{s_k^2 + \epsilon})
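For reference, the full smoothed objective translates into code almost verbatim. This is a sketch; the function name and the parameter names lam, gamma, and eps are assumptions for illustration.

import numpy as np

def sparse_coding_cost_full(A, s, x, lam, gamma, eps):
    """Final objective: ||A s - x||_2^2 + lambda * sum_k sqrt(s_k^2 + eps) + gamma * ||A||_2^2."""
    reconstruction = np.sum((A @ s - x) ** 2)          # reconstruction error
    smoothed_l1 = lam * np.sum(np.sqrt(s ** 2 + eps))  # smoothed L1 sparsity penalty
    weight_decay = gamma * np.sum(A ** 2)              # "weight decay" term replacing the norm constraint
    return reconstruction + smoothed_l1 + weight_decay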


The objective function can be optimized iteratively with the following procedure:

Initialize A randomly.
Repeat until convergence:
  1. Find the s that minimizes J(A, s) for the A found in the previous step.
  2. Find the A that minimizes J(A, s) for the s just found.


Observe that with the modified objective function, given s the objective simplifies to J(A; s) = \|As - x\|_2^2 + \gamma \|A\|_2^2 (the smoothed L1 term on s does not depend on A and can be ignored). The simplified objective is a simple quadratic in A, so its minimizer over A has an easily derived analytic solution; a quick way to carry out the derivation is matrix calculus. Unfortunately, given A, the objective admits no such analytic solution for s, so that minimization step can only be carried out with gradient descent or a similar optimization method.
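The two alternating steps can be sketched as follows. The closed-form update for A comes from setting the matrix gradient of the simplified objective to zero, and grad_s is the gradient a descent method for s would use; the function names are assumptions for illustration.

import numpy as np

def solve_A_given_s(s, x, gamma):
    """Minimize ||A s - x||_2^2 + gamma * ||A||_2^2 over A with s held fixed.

    Setting the matrix gradient 2 (A s - x) s^T + 2 gamma A to zero gives
    A = x s^T (s s^T + gamma I)^{-1}; np.linalg.solve is used instead of an
    explicit inverse for numerical stability.
    """
    k = s.shape[0]
    return np.linalg.solve(s @ s.T + gamma * np.eye(k), s @ x.T).T

def grad_s(A, s, x, lam, eps):
    """Gradient of the full smoothed objective with respect to s (A held fixed),
    for use inside gradient descent or a similar optimizer."""
    return 2 * A.T @ (A @ s - x) + lam * s / np.sqrt(s ** 2 + eps)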


In theory, the feature set (the basis vectors of A) obtained by solving the optimization problem with this iterative method is similar to the one learned with a sparse autoencoder. In practice, however, a few tricks are needed to make the algorithm converge well; these are described in detail in the later section on sparse coding in practice. Deriving the gradients needed for gradient descent is also a bit tricky; matrix calculus or backpropagation can help.

Topographic Sparse Coding

With sparse coding we can learn a set of features that represent the sample data. Drawing some inspiration from the brain, however, we would also like to learn features that have some kind of "order". Consider visual features: as mentioned earlier, V1 neurons in the cerebral cortex detect edges at particular orientations, and these neurons are (physiologically) organized into hypercolumns in which neighboring neurons detect edges at similar orientations. One neuron may detect a horizontal edge, its neighbors detect edges oriented slightly off the horizontal, and, moving further along the hypercolumn, the neurons detect edges at orientations increasingly different from the horizontal.


Inspired by this example, we would like to learn features with such a "topographic order". What does this mean for the features we want to learn? Intuitively, if "adjacent" features are "similar", then when one feature is activated, its adjacent features will tend to be activated as well.


Concretely, suppose we (arbitrarily) arrange the features into a square matrix. We would like adjacent features in the matrix to be similar. This is achieved by grouping adjacent features in the smoothed L1 penalty: if we group in 3x3 squares, the groups overlap, so the 3x3 region whose top-left corner is at row 1, column 1 is one group, the 3x3 region whose top-left corner is at row 1, column 2 is another group, and so on. The groups also wrap around, as if the matrix were the surface of a torus, so that every feature appears in the same number of groups. The smoothed L1 penalty is then replaced by the sum of the smoothed L1 penalties of all the groups, and the new objective function is:

J(A, s) = \|As - x\|_2^2 + \lambda \sum_{\text{all groups } g} \sqrt{\sum_{s \in g} s^2 + \epsilon} + \gamma \|A\|_2^2


In fact, "grouping" can be done by "grouping matrix" V, so the line R of the Matrix V identifies which features are divided into the R group, that is, if the R group contains feature c Vr,c = 1. By grouping matrices to make the calculation of gradients more intuitive, using this grouping matrix, the objective function is rewritten as:

(Order, equivalent to

Dr,c
R C

)
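A grouping matrix for overlapping 3x3 groups with wrap-around can be built explicitly. This sketch assumes the number of features is a perfect square, and it computes the grouped penalty as V applied to the element-wise squared features, which matches the "sum of the smoothed L1 penalties of all the groups" description above.

import numpy as np

def grouping_matrix(grid_dim, group_dim=3):
    """Grouping matrix V for topographic sparse coding.

    The k = grid_dim**2 features live on a grid_dim x grid_dim grid, and each
    group is a group_dim x group_dim window that wraps around the edges (a
    torus), so every feature appears in the same number of groups.
    V[r, c] = 1 if group r contains feature c.
    """
    k = grid_dim * grid_dim
    V = np.zeros((k, k))
    for i in range(grid_dim):              # top-left corner of each window
        for j in range(grid_dim):
            r = i * grid_dim + j           # one group per grid position
            for di in range(group_dim):
                for dj in range(group_dim):
                    c = ((i + di) % grid_dim) * grid_dim + ((j + dj) % grid_dim)
                    V[r, c] = 1
    return V

def topographic_penalty(V, s, eps):
    """Grouped, smoothed sparsity term: for every group (and every example),
    take the square root of the summed squared features in the group plus eps, then sum."""
    return np.sum(np.sqrt(V @ (s ** 2) + eps))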


This objective function can be solved with the iterative method described earlier. The features obtained by topographic sparse coding are similar to those obtained by plain sparse coding, except that the topographic features are arranged with some kind of "order".


Sparse Coding in Practice

As mentioned above, while the theory behind sparse coding is quite simple, some care is needed to write an implementation that is correct and converges reasonably quickly to a good optimum.


Recall the simple iterative algorithm mentioned earlier:

Initialize A randomly.
Repeat until convergence:
  1. Find the s that minimizes J(A, s) for the A found in the previous step.
  2. Find the A that minimizes J(A, s) for the s just found.


It turns out that running this algorithm as-is does not produce satisfactory results, even though it does produce some results. Two techniques give faster and better convergence: batching examples into "mini-batches", and a good initial value for s.


Batching Examples into "Mini-Batches"

If you run the simple iterative algorithm on a large data set (for example, 10,000 patches) at once, you will find that each iteration takes a long time, and the algorithm therefore takes a long time to converge. To increase the rate of convergence, run the algorithm on mini-batches instead. In each iteration, rather than running the algorithm on all 10,000 patches, use a mini-batch: randomly select 2,000 of the 10,000 patches and run the algorithm on just those. This accomplishes two things. First, it speeds up each iteration, since each iteration now operates on only 2,000 rather than 10,000 patches; second, and more importantly, it increases the rate of convergence (TODO: explain why).
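The mini-batch step itself is just random column selection. In the snippet below, the patch matrix is a random stand-in for real image patches; the sizes match the example in the text.

import numpy as np

rng = np.random.default_rng(0)
patches = rng.standard_normal((64, 10000))   # stand-in for an (n, 10000) matrix of patches

# One mini-batch: 2,000 patches drawn at random (without replacement) from the 10,000.
batch_idx = rng.choice(patches.shape[1], size=2000, replace=False)
x_batch = patches[:, batch_idx]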


A Good Initial Value for s

Another important technique for faster and better convergence is finding a good initial value for the feature matrix s before solving for s with gradient descent (or another method) given A. In practice, initializing s randomly in every iteration leads to poor convergence unless a good optimum for s is found before moving on to optimize A. A better way to initialize s is the following:

1. Set s <- A^T x (where x is the matrix of patches in the mini-batch).
2. For each feature in s (that is, each row of s), divide it by the norm of the corresponding basis vector in A. That is, if s_{r,c} denotes the rth feature of the cth example and A_r denotes the rth basis vector of A, set s_{r,c} <- s_{r,c} / \|A_r\|.


Needless to say, such initialization helps the algorithm, because the first step attempts to find a matrix s for which As is a reasonable reconstruction of x, and the second step "normalizes" s so as to keep the sparsity penalty small. It also turns out that initializing s with only one of these steps rather than both severely degrades the algorithm's performance in practice. (TODO: this link will give a more detailed explanation of why this initialization improves the algorithm.)
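The two initialization steps translate into a few lines of NumPy. This sketch assumes the same layout as before (basis vectors as columns of A, examples as columns of x); the name init_s is an illustrative choice.

import numpy as np

def init_s(A, x):
    """Initialize the feature matrix s before optimizing it for a given A:
    step 1 sets s <- A^T x, step 2 divides each feature (row of s) by the norm
    of the corresponding basis vector (column of A)."""
    s = A.T @ x                               # step 1
    basis_norms = np.linalg.norm(A, axis=0)   # ||A_r|| for every basis vector
    return s / basis_norms[:, np.newaxis]     # step 2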


The Practical Algorithm

With the above two techniques, the sparse coding algorithm becomes:

Initialize A randomly.
Repeat until convergence:
  1. Randomly select a mini-batch of 2,000 patches.
  2. Initialize s as described above.
  3. Find the s that minimizes J(A, s) for the A found in the previous step.
  4. Find the A that minimizes J(A, s) for the s just found.

With this procedure, a good local optimum can be reached relatively quickly.
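Putting the pieces together, the practical algorithm might look like the sketch below, which reuses the helper functions sketched earlier (init_s, grad_s, solve_A_given_s). The hyperparameter values and the plain gradient-descent inner loop are illustrative choices, not prescriptions from the tutorial.

import numpy as np

def sparse_coding(patches, k=121, lam=0.1, gamma=0.01, eps=1e-5,
                  n_outer=200, batch_size=2000, n_inner=50, lr=1e-3, seed=0):
    """Alternating optimization of J(A, s) with mini-batches and a good initial s."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((patches.shape[0], k))     # random initialization of A
    for _ in range(n_outer):                           # "repeat until convergence"
        idx = rng.choice(patches.shape[1], size=batch_size, replace=False)
        x = patches[:, idx]                            # random mini-batch of patches
        s = init_s(A, x)                               # good initial value for s
        for _ in range(n_inner):                       # minimize J(A, s) over s, A fixed
            s -= lr * grad_s(A, s, x, lam, eps)
        A = solve_A_given_s(s, x, gamma)               # minimize J(A, s) over A, s fixed
    return A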


Chinese-English Glossary

sparse coding, autoencoder, objective function, sparsity cost, backpropagation, gradient-based, non-convex, weight decay, topographic sparse coding, topographically ordered, smoothed L1 penalty, mini-batches, rate of convergence, gradient descent, local optima

From: http://ufldl.stanford.edu/wiki/index.php/%e7%a8%80%e7%96%8f%e7%bc%96%e7%a0%81%e8%87%aa%e7%bc%96%e7%a0%81%e8%a1%a8%e8%be%be
