Stanford UFLDL Tutorial: Sparse Coding

Contents: 1 Sparse coding · 2 Probabilistic interpretation [based on Olshausen and Field, 1996] · 3 Learning algorithms · 4 Chinese-English glossary

Sparse coding is an unsupervised learning method that finds a set of "over-complete" basis vectors to represent sample data more efficiently. The aim of sparse coding is to find a set of basis vectors φ_i that allow us to represent an input vector x as a linear combination of these basis vectors:
x = \sum_{i=1}^{k} a_i \phi_i

Although techniques such as principal component analysis (PCA) make it easy to learn a "complete" set of basis vectors, what we want here is an "over-complete" set of basis vectors to represent the input vectors x ∈ R^n (i.e., k > n). The advantage of an over-complete basis is that it can capture structures and patterns inherent in the input data more effectively. However, with an over-complete basis, the coefficients a_i are no longer uniquely determined by the input vector. Therefore, in sparse coding we introduce the additional criterion of "sparsity" to resolve the degeneracy caused by over-completeness.
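
As a minimal illustration of this degeneracy (my own sketch, not part of the tutorial; the dimensions n = 4, k = 8 and the coefficient values are arbitrary), the following Python snippet shows that with an over-complete basis many different coefficient vectors reproduce the same input exactly:

import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 8                        # input dimension n, number of basis vectors k > n
Phi = rng.standard_normal((n, k))  # over-complete basis, one column per phi_i

a = np.zeros(k)
a[[1, 5]] = [2.0, -1.0]            # a sparse coefficient vector
x = Phi @ a                        # the input it represents

# Any vector in the null space of Phi can be added to a without changing x.
_, _, Vt = np.linalg.svd(Phi)
null_vec = Vt[-1]                  # a direction with Phi @ null_vec ≈ 0
a_alt = a + 3.0 * null_vec         # a dense, equally valid coefficient vector

print(np.allclose(Phi @ a, Phi @ a_alt))   # True: both reproduce x

The sparsity criterion below is what lets us prefer the sparse solution a over the dense one.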


Here, we define "sparsity" as having only a few non-zero components, or only a few components that are far from zero. Requiring the coefficients a_i to be sparse means that, for a given set of input vectors, we want as few coefficients as possible to be significantly different from zero. There is a good reason to use a sparse representation for our input data: most sensory data, such as natural images, can be described as the superposition of a small number of basic elements, such as surfaces or edges in an image. Justifications based on the analogy with the primary visual cortex have also been advanced.


We define the sparse coding cost function on a set of m input vectors as:
\text{minimize}_{a^{(j)}_i,\, \phi_i} \; \sum_{j=1}^{m} \Bigl\| x^{(j)} - \sum_{i=1}^{k} a^{(j)}_i \phi_i \Bigr\|^2 + \lambda \sum_{i=1}^{k} S\bigl(a^{(j)}_i\bigr)

Here S(.) is a sparsity cost function that "penalizes" coefficients a_i that are far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term, which forces the algorithm to provide a linear representation that fits the input vector well, while the second term is the "sparsity penalty", which makes the representation sparse. The constant λ is a scaling factor that controls the relative importance of the two terms.


Although the most direct measure of "sparsity" is the "L0" norm (S(a_i) = 1(|a_i| > 0)), it is not differentiable and is generally difficult to optimize. In practice, common choices for the sparsity cost S(.) are the L1 penalty S(a_i) = |a_i| and the log penalty S(a_i) = log(1 + a_i²).
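
As a concrete sketch (not from the original tutorial; the random data, dimensions, and λ = 0.1 are illustrative assumptions), the cost function above can be written down directly in Python with either penalty:

import numpy as np

def sparse_coding_cost(X, Phi, A, lam, penalty="l1"):
    """Cost over m inputs: sum_j ||x_j - Phi a_j||^2 + lam * sum_i S(a_ij).

    X   : (n, m) input vectors as columns
    Phi : (n, k) basis vectors as columns
    A   : (k, m) coefficients, one column per input
    """
    reconstruction = np.sum((X - Phi @ A) ** 2)           # reconstruction term
    if penalty == "l1":
        sparsity = np.sum(np.abs(A))                      # S(a_i) = |a_i|
    else:
        sparsity = np.sum(np.log(1.0 + A ** 2))           # S(a_i) = log(1 + a_i^2)
    return reconstruction + lam * sparsity

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 50))
Phi = rng.standard_normal((16, 32))
A = rng.standard_normal((32, 50))
print(sparse_coding_cost(X, Phi, A, lam=0.1, penalty="l1"))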


In addition, it is possible to make the sparsity penalty arbitrarily small by scaling down the coefficients a_i and scaling up the basis vectors φ_i by some large constant. To prevent this from happening, we constrain the norm of each φ_i to be less than a constant C. The complete sparse coding cost function, including this constraint, is:
\text{minimize}_{a^{(j)}_i,\, \phi_i} \; \sum_{j=1}^{m} \Bigl\| x^{(j)} - \sum_{i=1}^{k} a^{(j)}_i \phi_i \Bigr\|^2 + \lambda \sum_{i=1}^{k} S\bigl(a^{(j)}_i\bigr)
\text{subject to} \; \|\phi_i\|^2 \le C, \; \forall i = 1, \dots, k

Probabilistic interpretation [based on Olshausen and Field, 1996]

So far, we have considered sparse coding as the problem of finding a sparse, over-complete set of basis vectors to span our input data space. Alternatively, we can view the sparse coding algorithm from a probabilistic perspective as a "generative model".


We model natural images as a linear superposition of k independent source features φ_i plus additive noise ν:
x = \sum_{i=1}^{k} a_i \phi_i + \nu(x)

Our goal is to find a set of basis feature vectors φ such that the distribution of images P(x|φ) is as close as possible to the empirical distribution of the input data P*(x). One way to achieve this is to minimize the KL divergence between P*(x) and P(x|φ), where the KL divergence is defined as:
D\bigl(P^*(x) \,\|\, P(x \mid \phi)\bigr) = \int P^*(x) \log \frac{P^*(x)}{P(x \mid \phi)} \, dx

Since the empirical distribution P*(x) is constant regardless of how we choose φ, minimizing the KL divergence is equivalent to maximizing the log-likelihood ⟨log P(x|φ)⟩. Assuming that ν is Gaussian white noise with variance σ², we have:
P(x \mid a, \phi) = \frac{1}{Z} \exp\!\Bigl( -\frac{\bigl\| x - \sum_{i=1}^{k} a_i \phi_i \bigr\|^2}{2\sigma^2} \Bigr)

To determine the distribution P(x|φ), we also need to specify the prior distribution P(a). Assuming that our source features are independent, we can decompose the prior probability as:
P(a) = \prod_{i=1}^{k} P(a_i)

At this point we incorporate the "sparsity" assumption: the assumption that any single image is composed of a relatively small number of source features. We therefore want the probability distribution of a_i to be peaked at zero with a high peak (high kurtosis). A convenient parameterization of the prior distribution is:
P(a_i) = \frac{1}{Z} \exp\bigl( -\beta\, S(a_i) \bigr)

Here S(a_i) is a function that determines the shape of the prior distribution.


Having defined P(x|a,φ) and P(a), we can write the probability of the data x under the model defined by φ as:
P(x \mid \phi) = \int P(x \mid a, \phi) \, P(a) \, da

Our problem then reduces to finding:
\phi^* = \arg\max_{\phi} \bigl\langle \log P(x \mid \phi) \bigr\rangle

Here ⟨.⟩ denotes the expectation over our input data.


Unfortunately, the integral over a needed to obtain P(x|φ) is generally intractable. Nevertheless, we note that if the distribution P(x|a,φ)P(a) is sufficiently peaked (with respect to a), we can approximate the integral by its maximum value. The approximation is as follows:
\phi^{*\prime} = \arg\max_{\phi} \Bigl\langle \max_{a} \log\bigl( P(x \mid a, \phi)\, P(a) \bigr) \Bigr\rangle

As before, we could increase the estimated probability by scaling down a_i and scaling up φ (since P(a_i) peaks sharply around zero). We therefore impose a norm constraint on the feature vectors φ to prevent this.

Finally, we can recover the original cost function by defining the energy function of this linear generative model:
E(x, a \mid \phi) = -\log\bigl( P(x \mid a, \phi)\, P(a) \bigr) = \Bigl\| x - \sum_{i=1}^{k} a_i \phi_i \Bigr\|^2 + \lambda \sum_{i=1}^{k} S(a_i)

where λ = 2σ²β and irrelevant constants have been hidden. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we can restate the original optimization problem as:
\phi^*, a^* = \arg\min_{\phi, a} E(x, a \mid \phi) = \arg\min_{\phi, a} \Bigl\| x - \sum_{i=1}^{k} a_i \phi_i \Bigr\|^2 + \lambda \sum_{i=1}^{k} S(a_i)

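To see where the constant λ = 2σ²β comes from (a short derivation consistent with the definitions above, added here for completeness rather than quoted from the tutorial), take the negative log of P(x|a,φ)P(a):

-\log\bigl( P(x \mid a, \phi)\, P(a) \bigr) = \frac{1}{2\sigma^2} \Bigl\| x - \sum_{i=1}^{k} a_i \phi_i \Bigr\|^2 + \beta \sum_{i=1}^{k} S(a_i) + \text{const}

Multiplying through by 2σ² does not change the minimizer, and gives the energy function above with λ = 2σ²β.
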
From this probabilistic analysis, we can see that choosing the L1 penalty |a_i| and the log penalty log(1 + a_i²) for S(.) corresponds to using a Laplacian prior P(a_i) ∝ exp(-β|a_i|) and a Cauchy prior P(a_i) ∝ β/(1 + a_i²), respectively.
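
Spelling this correspondence out (a brief expansion of the statement above, not text from the tutorial), up to additive constants:

-\log P(a_i) = \beta \lvert a_i \rvert + \text{const} \quad \text{for the Laplacian prior } P(a_i) \propto \exp(-\beta \lvert a_i \rvert)
-\log P(a_i) = \log(1 + a_i^2) + \text{const} \quad \text{for the Cauchy prior } P(a_i) \propto \frac{\beta}{1 + a_i^2}

so the sparsity penalty is exactly the negative log-prior on the coefficients.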


Learning Algorithms

Learning a set of basis vectors with sparse coding consists of two separate optimization processes: the first optimizes the coefficients a_i for each individual training sample, and the second optimizes the basis vectors φ across many training samples at once.
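
A minimal Python skeleton of this alternating scheme might look as follows (my own sketch; update_coefficients and update_basis are placeholder names for the two subproblem solvers discussed in the next two paragraphs, with example sketches given after each):

import numpy as np

def learn_sparse_code(X, k, lam, n_iters=50, seed=0):
    """Alternate between optimizing coefficients A and basis Phi.

    X : (n, m) matrix of m training vectors as columns.
    Returns Phi (n, k) and A (k, m).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Phi = rng.standard_normal((n, k))
    Phi /= np.linalg.norm(Phi, axis=0)            # start with unit-norm basis vectors
    A = np.zeros((k, m))
    for _ in range(n_iters):
        A = update_coefficients(X, Phi, A, lam)   # step 1: sparse coefficients, Phi fixed
        Phi = update_basis(X, A)                  # step 2: basis vectors, A fixed
    return Phi, A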


If the L1 norm is used as the sparsity penalty, learning the coefficients reduces to an L1-regularized least squares problem, which is convex in a_i; many techniques exist for solving it (for example, convex optimization software such as CVX can solve L1-regularized least squares). If S(.) is differentiable, as with the log penalty, gradient-based methods such as conjugate gradient can be used.
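
For example, one simple way to approximate the L1-regularized least squares subproblem is iterative soft-thresholding (ISTA); this is a sketch of mine under the L1 penalty, not the method prescribed by the tutorial (which points to CVX or conjugate gradient):

import numpy as np

def update_coefficients(X, Phi, A, lam, n_steps=100):
    """Approximately minimize ||X - Phi A||^2 + lam * sum|A| over A using ISTA."""
    L = 2.0 * np.linalg.norm(Phi, ord=2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    t = 1.0 / L                                         # step size
    A = A.copy()
    for _ in range(n_steps):
        grad = 2.0 * Phi.T @ (Phi @ A - X)              # gradient of the reconstruction term
        Z = A - t * grad
        A = np.sign(Z) * np.maximum(np.abs(Z) - t * lam, 0.0)   # soft-thresholding
    return A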


Learning the basis vectors under an L2 norm constraint likewise reduces to a least squares problem with quadratic constraints, which is convex in φ. Standard convex optimization software (such as CVX) or other iterative methods can be used to solve it, although more efficient methods exist, such as solving the Lagrange dual.
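
As a simple (if less efficient) alternative to the Lagrange dual, the basis step can be sketched as an unconstrained least squares solve followed by rescaling any column that violates the norm constraint; this projection heuristic is my own simplification, not the tutorial's method:

import numpy as np

def update_basis(X, A, C=1.0, eps=1e-8):
    """Least squares fit of Phi to X ≈ Phi A, then enforce ||phi_i||^2 <= C."""
    # Unconstrained minimizer of ||X - Phi A||^2:  Phi = X A^T (A A^T)^{-1}
    Phi = X @ A.T @ np.linalg.pinv(A @ A.T)
    norms = np.linalg.norm(Phi, axis=0) + eps
    scale = np.minimum(1.0, np.sqrt(C) / norms)   # shrink only columns with ||phi_i|| > sqrt(C)
    return Phi * scale

Together with the two sketches above, this makes the alternating skeleton in this section runnable end to end on small synthetic data.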


As described above, sparse coding has a significant limitation: even after a set of basis vectors has been learned, "encoding" a new data sample requires running the optimization again to obtain the required coefficients. This significant "runtime" cost means that sparse coding is computationally expensive even at test time, especially compared to typical feedforward architectures.


Chinese-English glossary: sparse coding, unsupervised learning, over-complete bases, principal component analysis (PCA), sparsity, degeneracy, cost function, reconstruction term, sparsity penalty, norm, generative model, linear superposition, additive noise, basis feature vectors, empirical distribution, KL divergence, log-likelihood, Gaussian white noise, prior distribution, prior probability, source features, energy function, regularized least squares, convex optimization software, conjugate gradient, quadratic constraints, Lagrange dual, feedforward architectures.

From: http://ufldl.stanford.edu/wiki/index.php/%e7%a8%80%e7%96%8f%e7%bc%96%e7%a0%81
