Stanford UFLDL Tutorial: Deep Networks Overview

Contents
1 Overview
2 The advantages of a deep network
3 Difficulties in training deep networks
3.1 Availability of data
3.2 Local optima
3.3 Diffusion of gradients
4 Greedy layer-wise training
4.1 Availability of data
4.2 Better local optima
5 Chinese-English glossary

In previous chapters, you built a three-layer neural network consisting of an input layer, a hidden layer, and an output layer. Although that network is quite effective on the MNIST handwritten digit dataset, it is a very "shallow" network. "Shallow" here means that the features (the activation values of the hidden layer) are computed using only a single layer of computational units (the hidden layer).


In this section, we begin to discuss deep neural networks, that is, neural networks containing multiple hidden layers. By using a deep network, we can compute much more complex features of the input. Because each hidden layer applies a nonlinear transformation to the output of the previous layer, a deep neural network has far greater expressive power than a "shallow" one (for example, it can learn significantly more complex functional relationships).


It is worth noting that when training a deep network, every hidden layer should use a nonlinear activation function. This is because a composition of linear functions is itself just a linear function and has no greater expressive power (for example, composing several linear equations simply yields another linear equation). Therefore, if the activation functions are linear, a deep network with multiple hidden layers is no more expressive than a neural network with a single hidden layer.
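
To make this concrete, here is a minimal NumPy sketch (the layer sizes and random weights are made up purely for illustration, not taken from the tutorial) that verifies numerically that two stacked linear layers are equivalent to a single linear layer whose weight matrix is the product of the two:

import numpy as np

rng = np.random.default_rng(0)

# Two "hidden layers" with purely linear activations (illustrative sizes).
W1 = rng.standard_normal((5, 3))   # layer 1: 3 inputs -> 5 units
W2 = rng.standard_normal((4, 5))   # layer 2: 5 units  -> 4 units
x = rng.standard_normal(3)

# Forward pass through the two linear layers.
h = W1 @ x          # first hidden layer (no nonlinearity)
y = W2 @ h          # second hidden layer (no nonlinearity)

# A single linear layer with weight matrix W2 @ W1 produces the same output,
# so the extra layer adds no expressive power.
y_single = (W2 @ W1) @ x
assert np.allclose(y, y_single)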


The advantages of a deep network

Why should we use a deep network? The main advantage of a deep network is that it can express a much larger set of functions far more compactly than a shallow network. Formally, there are functions that a k-layer network can represent compactly (here, "compactly" means that the number of hidden units needed is only polynomial in the number of inputs), but that a (k-1)-layer network cannot represent compactly unless it uses a number of hidden units that is exponential in the number of inputs.


To give a simple example, suppose we want to build a Boolean network (circuit) that computes the parity of its input bits (equivalently, an n-way XOR). Suppose each node in the network can compute either the logical OR of its inputs (or the OR of their negations), or their logical AND. If the network has only one input layer, one hidden layer, and one output layer, then the number of nodes needed to compute the parity function is exponential in the size of the input. But if we build a deeper network, its size can be just a polynomial function of the input size.
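
As an illustration of the "deep circuits are small" half of this argument, the sketch below (an assumed construction for illustration, not code from the tutorial) computes n-bit parity with a chain of two-input XOR gates, each built from AND/OR/NOT: roughly n gates arranged in n - 1 levels, in contrast to the exponentially many hidden units a single-hidden-layer AND/OR circuit would need.

from functools import reduce

def xor_gate(a: int, b: int) -> int:
    """A two-input XOR built from AND/OR/NOT gates: (a OR b) AND NOT (a AND b)."""
    return (a | b) & (1 - (a & b))

def parity_deep(bits):
    """Parity of the input bits via a chain of XOR gates:
    about len(bits) gates arranged in len(bits) - 1 levels."""
    return reduce(xor_gate, bits)

# Example: parity of an 8-bit input (four ones, so the parity is 0).
print(parity_deep([1, 0, 1, 1, 0, 0, 1, 0]))  # -> 0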


When the objects being processed are images, we can use a deep network to learn a "part-whole" decomposition. For example, the first layer can learn to combine the pixels of an image to detect edges (as we did in earlier exercises). The second layer can combine edges to detect longer contours or simple "object parts". Still deeper layers can combine these contours to detect even more complex features.


The last point worth mentioning is that the cerebral cortex also performs its computations in multiple layers. For example, visual input in the human brain is processed in a series of stages: it first enters the cortical area "V1", then passes on to area "V2", and so on.


Difficulties in training deep networks

Although the theoretical compactness and strong expressive power of deep networks were discovered decades ago, until recently researchers had little success training them. The problem is that the main learning approach used was to randomly initialize the weights of the deep network and then train it on a labeled training set with a supervised objective function, for example by using gradient descent to reduce the training error. However, this approach usually does not work well, for several reasons:


Availability of data

With the approach described above, we must rely on labeled data for training. However, labeled data is often scarce, so for many problems it is difficult to obtain enough examples to fit the parameters of a complex model. For example, given the strong expressive power of a deep network, training on insufficient data would lead to overfitting.


Local optima

Training a shallow network (one with a single hidden layer) using supervised learning usually allows the parameters to converge to reasonable values. However, when we use this approach to train a deep network, it does not work well. In particular, training a neural network with supervised learning involves solving a highly non-convex optimization problem (e.g., minimizing the training error as a function of the network weights to be optimized). For a deep network, this optimization problem is riddled with "bad" local optima, so gradient descent (or conjugate gradient, L-BFGS, and related methods) no longer works well.


Diffusion of gradients

The technical reason why gradient descent (and related methods such as L-BFGS) performs poorly on deep networks with randomly initialized weights is that the gradients become very small. Specifically, when backpropagation is used to compute the derivatives, the magnitude of the gradient propagated backwards (from the output layer toward the earliest layers of the network) shrinks rapidly as the depth of the network increases. As a result, the derivative of the overall cost function with respect to the weights of the earliest layers is very small. Consequently, when gradient descent is used, the weights of the earliest layers change so slowly that these layers cannot learn effectively from the data. This problem is often called the "diffusion of gradients" (also known as the vanishing gradient problem).
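
The effect is easy to observe numerically. The following NumPy sketch (illustrative layer sizes and naive small random weights, not code from the tutorial) backpropagates an error signal through a stack of sigmoid layers and prints the gradient norm at each layer; the norm typically shrinks by orders of magnitude toward the earliest layers.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers, width = 10, 50
# Naive random initialization with small weights.
weights = [0.1 * rng.standard_normal((width, width)) for _ in range(n_layers)]

# Forward pass: store the activations of every layer.
a = rng.standard_normal(width)
activations = [a]
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: propagate an error signal from the output layer back
# through the sigmoid derivatives and watch its norm shrink.
delta = np.ones(width) * activations[-1] * (1 - activations[-1])
print(f"layer {n_layers}: gradient norm = {np.linalg.norm(delta):.2e}")
for l in range(n_layers, 1, -1):
    a_prev = activations[l - 1]
    delta = (weights[l - 1].T @ delta) * a_prev * (1 - a_prev)
    print(f"layer {l - 1}: gradient norm = {np.linalg.norm(delta):.2e}")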


A problem closely related to the diffusion of gradients is the following: when the last few layers of a neural network contain enough neurons, they may be able to model the labeled data by themselves, without any help from the earliest layers. Hence, training all layers of the network at once from random initialization tends to give performance similar to that of a trained shallow network (the shallow network consisting only of the last few layers of the deep network).


Greedy layer-wise training

So how can we train a deep network? Greedy layer-wise training is one method that has achieved some success; we will elaborate on its details in later chapters. In brief, the main idea of greedy layer-wise training is to train one layer of the network at a time: we first train a network with a single hidden layer, and only after that training is finished do we train a network with two hidden layers, and so on. At each step, we fix the layers that have already been trained and add one new layer on top (that is, the output of the layers trained so far is used as the input to the new layer). Training each layer can be supervised (for example, using the classification error at each step as the objective function), but more often it is unsupervised (for example, with autoencoders, which we describe in later chapters). The weights obtained from training the individual layers are then used to initialize the weights of the final (full) deep network, and the whole network is then "fine-tuned" (that is, all layers are trained together to minimize the training error on the labeled training set).
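
A minimal sketch of this procedure, assuming sigmoid autoencoders trained with plain batch gradient descent (the function and variable names below are illustrative, not from the tutorial):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=50):
    """Train one sigmoid autoencoder layer on (unlabeled) data X and return
    its encoder weights. No sparsity penalty or weight decay; illustration only."""
    n_in = X.shape[1]
    W_enc = 0.1 * rng.standard_normal((n_in, n_hidden))
    W_dec = 0.1 * rng.standard_normal((n_hidden, n_in))
    for _ in range(epochs):
        H = sigmoid(X @ W_enc)            # encode
        X_hat = sigmoid(H @ W_dec)        # decode (reconstruction)
        err = X_hat - X                   # reconstruction error
        # Backpropagate the squared reconstruction error.
        d_dec = err * X_hat * (1 - X_hat)
        d_enc = (d_dec @ W_dec.T) * H * (1 - H)
        W_dec -= lr * H.T @ d_dec / len(X)
        W_enc -= lr * X.T @ d_enc / len(X)
    return W_enc

# Greedy layer-wise training: each new layer is trained on the output
# ("features") of the layers trained before it.
X_unlabeled = rng.random((500, 64))       # stand-in for unlabeled data
W1 = train_autoencoder(X_unlabeled, n_hidden=32)
H1 = sigmoid(X_unlabeled @ W1)
W2 = train_autoencoder(H1, n_hidden=16)

# W1 and W2 would then initialize the deep network, followed by supervised
# fine-tuning of all layers on the labeled training set.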


The success of greedy layer-wise training can be attributed to the following factors:


Availability of data

Although labeled data is expensive to obtain, large amounts of unlabeled data are easy to acquire. The promise of self-taught learning is that by exploiting large amounts of unlabeled data, we can learn much better models. Concretely, the method uses unlabeled data to learn good initial values for the weights of all layers except the final classification layer that predicts the labels. Compared with purely supervised learning, this self-taught approach can make use of far more data and can learn and discover the patterns present in the data; as a result, it generally improves the performance of the classifier.


Better local optima

After the network has been trained on unlabeled data, the initial weights of each layer lie in a better region of parameter space than random initialization would provide. We can then fine-tune the weights further, starting from these positions. Empirically, gradient descent is much more likely to converge to a good local optimum when started from these positions, because the unlabeled data has already provided prior information about the patterns contained in the large amount of input data.


In the next section, we will describe in detail how to perform greedy layer-wise training.


Chinese-English glossary

Deep Networks, Deep Neural Networks, non-linear transformation, activation function, represent compactly, part-whole decompositions, parts of objects, highly non-convex optimization problem, conjugate gradient, diffusion of gradients, greedy layer-wise training, Autoencoder, fine-tuned, self-taught learning

Source: http://ufldl.stanford.edu/wiki/index.php/%E6%B7%B1%E5%BA%A6%E7%BD%91%E7%BB%9C%E6%A6%82%E8%A7%88
