Sparse Coding: Autoencoder Interpretation

Sparse Coding

In the sparse autoencoder algorithm, we try to learn a set of weight parameters W (and corresponding bias terms b). Through these parameters we obtain a sparse feature vector σ(Wx + b), and these features are useful for reconstructing the input samples.
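As a minimal illustration (not taken from any particular implementation), the feature vector σ(Wx + b) could be computed as follows in NumPy; the sizes of W, b, and x here are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical sizes: 64-dimensional inputs, 25 hidden features.
    W = np.random.randn(25, 64) * 0.01   # weight matrix W
    b = np.zeros(25)                     # bias (intercept) terms b
    x = np.random.randn(64)              # one input sample

    features = sigmoid(W @ x + b)        # sparse feature vector sigma(W x + b)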

Sparse coding can be seen as a variant of the sparse autoencoder method that attempts to learn the feature set of the data directly. Using the basis vectors that correspond to this feature set, we can map the learned features from feature space back to the sample-data space, and thereby reconstruct the samples from the learned features.

More precisely, in sparse coding we have sample data x from which we want to learn features. In particular, we learn a sparse feature set s that represents the sample data, together with a basis matrix A that maps the feature set from feature space back to the sample-data space. This gives the following objective function:

    J(A, s) = ||As - x||_2^2 + λ||s||_1

(||x||_k denotes the L_k norm of x, which is equivalent to (Σ_i |x_i|^k)^(1/k). The L2 norm is the familiar Euclidean norm, and the L1 norm is the sum of the absolute values of the vector's elements.)

The first term of the formula is the error incurred when the feature set is mapped back into sample data by the basis vectors (the reconstruction error); the second term is a sparsity penalty that encourages the learned feature set to be sparse.
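As a sketch, the two terms might be evaluated like this in NumPy (the names A, s, x, and lam are illustrative, not taken from a reference implementation):

    import numpy as np

    def sparse_coding_objective(A, s, x, lam):
        """Reconstruction error plus L1 sparsity penalty."""
        reconstruction_error = np.sum((A @ s - x) ** 2)   # ||As - x||_2^2
        sparsity_penalty = lam * np.sum(np.abs(s))        # lambda * ||s||_1
        return reconstruction_error + sparsity_penalty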

However, as written, this objective function is under-constrained: scaling A up by a constant while scaling s down by the reciprocal of that constant leaves the reconstruction error unchanged but shrinks the sparsity cost (the second term). Therefore an additional constraint is added on each column A_j of A, and the problem becomes:

    minimize_{A,s}  ||As - x||_2^2 + λ||s||_1
    subject to      ||A_j||^2 ≤ 1   for all j

Unfortunately, the objective function is not jointly convex, so it cannot be globally optimized with gradient methods alone. However, for a fixed A, minimizing J(A, s) with respect to s is a convex problem; likewise, for a fixed s, minimizing J(A, s) with respect to A is convex. This suggests alternately fixing one of A and s and optimizing the other. Practice shows that this strategy works remarkably well.

 

However, the constrained formulation brings another challenge: the constraint cannot be enforced with a simple gradient method. Therefore, in practice the constraint is replaced by a "weight decay" term that keeps every entry of A small. This yields a new objective function:

    J(A, s) = ||As - x||_2^2 + λ||s||_1 + γ||A||_2^2

(Note that the third term is simply the sum of the squares of the entries of A, that is, Σ_{r,c} A_{r,c}^2.)

This objective function presents one last problem: the L1 norm is not differentiable at 0, which gets in the way of gradient-based methods. Although the issue can be avoided with other, non-gradient optimization methods, here it is solved by "smoothing" the L1 norm: ||s||_1 is replaced by Σ_k sqrt(s_k^2 + ε), where ε is a "smoothing parameter" (also called a "sparsity parameter"). When ε is much larger than s_k^2, the value of s_k^2 + ε is dominated by ε, and its square root is approximately sqrt(ε). This smoothing will come in handy later in topographic sparse coding.
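A one-line sketch of the smoothed penalty (epsilon is a small assumed constant):

    import numpy as np

    def smooth_l1(s, epsilon=1e-2):
        """Smoothed L1 penalty: sum_k sqrt(s_k^2 + epsilon); differentiable at 0."""
        return np.sum(np.sqrt(s ** 2 + epsilon))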

Therefore, the final objective function is:

    J(A, s) = ||As - x||_2^2 + λ sqrt(s^2 + ε) + γ||A||_2^2

(Here sqrt(s^2 + ε) is shorthand for Σ_k sqrt(s_k^2 + ε).)
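Putting the three terms together, here is a sketch of the full objective and its gradients with respect to A and s, derived directly from the formula above (shapes and names are assumptions: A is n x k, s is k x m, x is n x m with examples as columns):

    import numpy as np

    def objective_and_grads(A, s, x, lam, gamma, epsilon=1e-2):
        """Smoothed sparse coding objective J(A, s) and its gradients."""
        residual = A @ s - x                      # As - x
        smooth = np.sqrt(s ** 2 + epsilon)        # element-wise sqrt(s^2 + eps)

        J = (np.sum(residual ** 2)                # ||As - x||_2^2
             + lam * np.sum(smooth)               # lambda * smoothed L1 penalty
             + gamma * np.sum(A ** 2))            # gamma * ||A||_2^2

        grad_A = 2 * residual @ s.T + 2 * gamma * A      # dJ/dA
        grad_s = 2 * A.T @ residual + lam * s / smooth   # dJ/ds
        return J, grad_A, grad_s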

This objective function can be optimized iteratively through the following process:

  1. Randomly initialize A
  2. Repeat the following steps until convergence:
    1. For the A found in the previous step, solve for the s that minimizes J(A, s)
    2. For the s found in the previous step, solve for the A that minimizes J(A, s)

Looking at the modified objective function J(A, s): given s, the objective reduces to a quadratic in A (the smoothed L1 term involves only s, not A), so the A that minimizes it has a closed-form solution that is easy to derive; a quick way to obtain it is matrix calculus. Unfortunately, given A, there is no such analytical solution for the minimizing s, so that minimization step has to be carried out with gradient descent or a similar optimization method.
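For instance, setting the gradient with respect to A to zero, 2(As - x)s^T + 2γA = 0, gives A = x s^T (s s^T + γI)^{-1}. A small sketch of this update (names as above; a real implementation would prefer np.linalg.solve over an explicit inverse):

    import numpy as np

    def solve_A(x, s, gamma):
        """Closed-form minimizer of ||As - x||^2 + gamma * ||A||^2 for fixed s."""
        k = s.shape[0]
        # A (s s^T + gamma I) = x s^T   =>   A = x s^T (s s^T + gamma I)^{-1}
        return x @ s.T @ np.linalg.inv(s @ s.T + gamma * np.eye(k))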

Topographic Sparse Coding

 

Through sparse coding we can obtain a set of features for representing the sample data. We would also like the learned features to be "topographically ordered." What does that mean for the features we want to learn? Intuitively, it means that "adjacent" features are "similar": if one feature is activated, its neighbouring features will tend to be activated as well.

Specifically, suppose we (arbitrarily) organize the features into a square grid. We want adjacent features in the grid to be similar. The way this is achieved is to penalize adjacent features in groups under the smoothed L1 penalty rather than individually: if adjacent features are grouped into 3x3 blocks, the groups usually overlap, so the 3x3 region starting at row 1, column 1 is one group, the 3x3 region starting at row 1, column 2 is another group, and so on. The groups also wrap around, as if the grid were the surface of a torus, so that every feature belongs to the same number of groups. The smoothed L1 penalty is then replaced by the sum of the smoothed L1 penalties of all the groups, giving the new objective function:

    J(A, s) = ||As - x||_2^2 + λ Σ_{groups g} sqrt( Σ_{features k in g} s_k^2 + ε ) + γ||A||_2^2

 

In practice, the grouping can be encoded with a "grouping matrix" V, where row r of V identifies which features are assigned to group r; that is, V_{r,c} = 1 if group r contains feature c. Expressing the grouping through this matrix makes the gradient computation more intuitive. With the grouping matrix, the objective function can be rewritten as:

    J(A, s) = ||As - x||_2^2 + λ Σ sqrt( V(s ∘ s) + ε ) + γ||A||_2^2

(where s ∘ s denotes the element-wise square of s, and the square root and sum are applied element-wise to the resulting vector).
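A sketch of how such a grouping matrix might be built for a square feature grid with overlapping 3x3 groups that wrap around like a torus, and how the topographic penalty is then evaluated; the grid layout and names here are assumptions for illustration:

    import numpy as np

    def build_grouping_matrix(side, group_size=3):
        """Grouping matrix V for a side x side grid of features.
        Row g of V marks the features in the group whose top-left corner
        is at grid position g, with wrap-around at the edges (torus)."""
        k = side * side                       # total number of features
        V = np.zeros((k, k))
        for r in range(side):
            for c in range(side):
                g = r * side + c              # one group per grid position
                for dr in range(group_size):
                    for dc in range(group_size):
                        rr = (r + dr) % side  # wrap around rows
                        cc = (c + dc) % side  # wrap around columns
                        V[g, rr * side + cc] = 1
        return V

    def topographic_penalty(V, s, lam, epsilon=1e-2):
        """lambda * sum over groups of sqrt(grouped sum of squared features + eps)."""
        return lam * np.sum(np.sqrt(V @ (s ** 2) + epsilon))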

This objective function can be minimized with the same alternating iteration described earlier. The features learned by topographic sparse coding are similar to those learned by ordinary sparse coding, except that they are now arranged in a (topographically) ordered fashion.

Sparse Coding in Practice

 

As suggested above, although the theory behind sparse coding is quite simple, writing an implementation that is correct and that converges quickly to a good optimum takes a bit of skill.

Let us look back at the simple iterative algorithm given earlier:

  1. Randomly initialize A
  2. Repeat the following steps until convergence:
    1. For the A found in the previous step, solve for the s that minimizes J(A, s)
    2. For the s found in the previous step, solve for the A that minimizes J(A, s)

Running this algorithm as-is turns out to converge slowly, and the results it does produce are not satisfactory. The following two techniques give faster and better convergence:

  1. Batch the samples into mini-batches
  2. Use a good initial value for s

Batching samples into mini-batches

If you run the simple iterative algorithm on a large dataset (say 10,000 patches) all at once, you will find that each iteration takes a long time, and as a result the algorithm takes a long time to converge. To speed up convergence, you can instead run the algorithm on mini-batches: in each iteration, rather than using all 10,000 patches, randomly select a mini-batch of, say, 2,000 patches from the 10,000 and run the algorithm on that mini-batch. This accomplishes two things. First, it speeds up each iteration, since every iteration now operates on 2,000 patches rather than 10,000. Second, and more importantly, it increases the speed of convergence.
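A minimal sketch of drawing such a mini-batch, assuming the patches are stored as the columns of a matrix X (an illustrative layout):

    import numpy as np

    def sample_minibatch(X, batch_size=2000, rng=None):
        """Randomly pick batch_size patches (columns) from the full dataset X."""
        if rng is None:
            rng = np.random.default_rng()
        idx = rng.choice(X.shape[1], size=batch_size, replace=False)
        return X[:, idx]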

 

A good initial value for s

Another important technique for achieving faster and better convergence is to find a good feature matrix s before using gradient descent (or other methods) to minimize the objective with respect to s for the given A. In practice, initializing s randomly at each iteration leads to poor convergence unless a near-optimal s is found before moving on to optimize A. A better way to initialize s is the following:

  1. Set s ← A^T x (where x is the matrix whose columns are the patches in the mini-batch)
  2. For each feature in s (that is, each row of s), divide it by the norm of the corresponding basis vector in A. That is, if s_{r,c} denotes the r-th feature of the c-th example and A_r denotes the r-th basis vector (column) of A, then divide s_{r,c} by ||A_r||.

This initialization helps the algorithm because the first step tries to find a matrix s such that As ≈ x, while the second step rescales s to keep the sparsity penalty small. It also turns out that performing only one of these two steps, instead of both, seriously degrades the algorithm's performance. (TODO: a more detailed explanation of why this initialization helps.)
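A sketch of this initialization, assuming x holds the mini-batch patches as columns and A holds the basis vectors as columns:

    import numpy as np

    def init_s(A, x):
        """Initialize the feature matrix s for a mini-batch.
        Step 1: s <- A^T x.
        Step 2: divide each row r of s by the norm of basis vector A[:, r]."""
        s = A.T @ x                          # step 1
        norms = np.linalg.norm(A, axis=0)    # ||A_r|| for each basis vector
        return s / norms[:, np.newaxis]      # step 2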

Runnable Algorithm

With the above two techniques, the sparse coding algorithm becomes:

  1. Randomly initialize A
  2. Repeat the following steps until convergence:
    1. Randomly select a mini-batch of 2,000 patches
    2. Initialize s as described above
    3. For the A found in the previous step, solve for the s that minimizes J(A, s)
    4. For the s found in the previous step, solve for the A that minimizes J(A, s)

With this procedure, a good local optimum can be reached relatively quickly.
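As a compact sketch, the whole procedure could be put together as follows, reusing the illustrative helpers defined above (sample_minibatch, init_s, objective_and_grads, solve_A); the step size and iteration counts are arbitrary assumptions rather than tuned values:

    import numpy as np

    def sparse_coding(X, num_features, lam=0.1, gamma=0.01, epsilon=1e-2,
                      num_iters=200, s_steps=50, lr=1e-3, rng=None):
        """Alternating minimization for sparse coding (untuned sketch)."""
        if rng is None:
            rng = np.random.default_rng()
        n = X.shape[0]
        A = 0.1 * rng.standard_normal((n, num_features))         # 1. random init of A

        for _ in range(num_iters):                                # 2. repeat
            x = sample_minibatch(X, batch_size=2000, rng=rng)     # 2.1 random mini-batch
            s = init_s(A, x)                                      # 2.2 initialize s

            for _ in range(s_steps):                              # 2.3 minimize J over s
                _, _, grad_s = objective_and_grads(A, s, x, lam, gamma, epsilon)
                s -= lr * grad_s                                  #     plain gradient steps

            A = solve_A(x, s, gamma)                              # 2.4 closed-form A update
        return A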
