Gradient-Based Learning
1) Deep feedforward network, also known as a feedforward neural network or multilayer perceptron (MLP): "feedforward" means that information flows through the network in a single forward direction, from input to output, with no feedback connections.
2) Rectified linear unit (ReLU): it has several attractive properties and is better suited than the sigmoid function for use as a hidden unit. Its activation function is $g(z) = \max\{0, z\}$.
3) Cost function: the mean absolute error and mean squared error, which can be derived using the calculus of variations, often work poorly with gradient-based optimization. For example, when a neuron saturates, its gradient becomes very small. For this reason the cross-entropy cost function is generally used: $C = -\frac{1}{m} \sum [y \log a + (1 - y) \log(1 - a)]$, where $a$ is the output of the neuron, $y$ is the target, and the sum runs over the $m$ training examples. The cross-entropy cost avoids the gradient saturation problem described above and assigns a large cost when the prediction is wrong.
Output Units
4) Linear units: an output unit based on an affine transformation is called a linear unit. It is commonly used to produce the mean of a conditional Gaussian distribution. Because linear units do not saturate, they work well with gradient-based algorithms.
5) Sigmoid units for Bernoulli output distributions (binary classification):
Suppose we used a linear unit and clipped its value to obtain a valid probability: $P(y = 1 \mid x) = \max\{0, \min\{1, w^\top x + b\}\}$
We could not train this efficiently with gradient descent. Whenever $w^\top x + b$ strays outside the unit interval, the gradient of the model's output with respect to its parameters is 0. We want a method that guarantees a strong gradient whenever the model gives the wrong answer.
An output unit defined by the sigmoid function satisfies this requirement: $\hat{y} = \sigma(w^\top x + b)$, where $\sigma$ is the logistic sigmoid function. For the detailed derivation, see p. 158 of the book (the Chinese edition of Deep Learning).
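The contrast between the clipped linear unit and the sigmoid unit can be checked numerically. The following is a minimal sketch, assuming the sigmoid output is trained with the negative log-likelihood, whose derivative with respect to the pre-activation $z = w^\top x + b$ is $\sigma(z) - y$. The function names and the sample values are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clipped_linear_grad(z):
    """d/dz of max(0, min(1, z)): 1 inside the unit interval, 0 outside."""
    return ((z > 0.0) & (z < 1.0)).astype(float)

def sigmoid_nll_grad(z, y):
    """d/dz of -log P(y | z) with P(y=1 | z) = sigmoid(z); equals sigmoid(z) - y."""
    return sigmoid(z) - y

z = np.array([-5.0, 0.5, 5.0])   # pre-activations w^T x + b (made-up values)
y = np.array([1.0, 1.0, 0.0])    # targets; the first and last predictions are badly wrong

grad_clipped = clipped_linear_grad(z)   # zero for both confidently wrong examples
grad_sigmoid = sigmoid_nll_grad(z, y)   # close to -1 and +1 for the same examples

print(grad_clipped)
print(grad_sigmoid)
```

The clipped unit gives no learning signal exactly where the model is most wrong, while the sigmoid unit with the log-likelihood cost keeps a gradient of magnitude close to 1 there.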
6) Softmax units for Multinoulli output distributions: the Multinoulli distribution is an extension of the Bernoulli distribution and shares its characteristics, including the fact that its output unit can saturate. The sigmoid function has a single output, which saturates when its input is extremely negative or extremely positive. The softmax has multiple output values, and these can saturate when the differences between input values become extreme. The softmax function is defined as $\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
and its logarithm is $\log \mathrm{softmax}(z)_i = z_i - \log \sum_j \exp(z_j)$
Note that in the first term, the input $z_i$ always contributes directly to the cost function. Because this term cannot saturate, learning can proceed even if the contribution of $z_i$ to the second term becomes very small. When maximizing the log-likelihood, the first term encourages $z_i$ to be pushed up, while the second term encourages all of $z$ to be pushed down. To gain some intuition for the second term, $\log \sum_j \exp(z_j)$, observe that it can be roughly approximated by $\max_j z_j$. This approximation rests on the fact that $\exp(z_k)$ is insignificant for any $z_k$ that is noticeably less than $\max_j z_j$. The intuition we gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction.
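The approximation $\log \sum_j \exp(z_j) \approx \max_j z_j$ is easy to check on a small example. This is only a sketch with made-up inputs; `log_sum_exp` and `nll` are illustrative helper names, not functions from the book.

```python
import numpy as np

def log_sum_exp(z):
    """log(sum_j exp(z_j)), computed by factoring out max(z) to avoid overflow."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def nll(z, i):
    """Negative log-likelihood of class i: -(z_i - log sum_j exp(z_j))."""
    return log_sum_exp(z) - z[i]

z = np.array([1.0, 2.0, 10.0])   # made-up softmax inputs; index 2 is the most active

gap = log_sum_exp(z) - np.max(z)  # small, because exp(1) and exp(2) are negligible next to exp(10)
cost_if_correct = nll(z, 2)       # true class has the largest input: cost near 0
cost_if_wrong = nll(z, 0)         # true class is 0 but class 2 dominates: cost near 10 - 1 = 9

print(gap, cost_if_correct, cost_if_wrong)
```

When the correct class already has the largest input, the two terms roughly cancel and the example contributes little to the cost. When another class dominates, the cost is approximately the gap between that class's input and the correct one.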
Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a logarithm to undo the exponential in the softmax fail to learn when the argument of the exponential becomes very negative, because the gradient vanishes.
Observe that the softmax output is unchanged when the same constant is added to all of its inputs. From this we can derive a numerically stable variant: $\mathrm{softmax}(z) = \mathrm{softmax}(z - \max_i z_i)$. The reformulated version lets us evaluate the softmax with only small numerical errors, even when $z$ contains extremely large or extremely negative numbers.
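The effect of subtracting $\max_i z_i$ can be seen with inputs large enough to overflow a naive implementation. A minimal NumPy sketch, with function names chosen here for illustration:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / np.sum(e)

def softmax_stable(z):
    """softmax(z - max_i z_i): same result, but the largest exponent is exp(0) = 1."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf in float64, so the naive version computes inf / inf = nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(z)

stable = softmax_stable(z)
shifted_reference = softmax_naive(np.array([1.0, 2.0, 3.0]))  # same inputs shifted by -999

print(naive)
print(stable)
```

The stable version returns the same probabilities as the softmax of the shifted inputs `[1, 2, 3]`, which is the invariance described above.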
PS: the softmax function is more closely related to the argmax function than to the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable. The argmax function, whose result is represented as a one-hot vector (one element is 1 and the rest are 0), is neither continuous nor differentiable. The softmax function thus provides a "softened" version of the argmax. The corresponding soft version of the max function is $\mathrm{softmax}(z)^\top z$. It would perhaps be better to call the softmax function "softargmax", but the current name is an entrenched convention.
7) Heteroscedastic model: the model predicts a different variance of $y$ for different values of $x$. A typical way to implement this is to parameterize the Gaussian distribution by its precision rather than its variance. In the multivariate case, the most common choice is a diagonal precision matrix $\mathrm{diag}(\beta)$ (positive definite, with eigenvalues that are the reciprocals of the eigenvalues of the covariance matrix). This formulation works well with gradient descent because the log-likelihood of a Gaussian parameterized by $\beta$ involves only multiplication, addition, and the logarithm of $\beta_i$, and the gradients of these operations are well behaved. Parameterizing by the variance instead would require division, which can make the gradient unstable.
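As a rough illustration, the negative log-likelihood of a diagonal Gaussian can be written directly in terms of the precision vector $\beta$. The sketch below assumes the network emits an unconstrained value `raw_beta` and that a softplus is used to make the precision positive; that choice, the function names, and the sample numbers are assumptions made for this example.

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)   # log(1 + exp(a)), computed stably

def gaussian_nll_precision(y, mu, raw_beta):
    """-log N(y; mu, diag(beta)^-1) using only multiplication, addition and log."""
    beta = softplus(raw_beta)     # positive diagonal precision
    return 0.5 * np.sum(beta * (y - mu) ** 2 - np.log(beta) + np.log(2.0 * np.pi))

# Made-up target, predicted mean, and unconstrained precision outputs.
y = np.array([0.3, -1.2])
mu = np.array([0.0, -1.0])
raw_beta = np.array([0.5, 2.0])

nll_precision_form = gaussian_nll_precision(y, mu, raw_beta)

# Same quantity written with the variance; note the division by `variance`.
variance = 1.0 / softplus(raw_beta)
nll_variance_form = 0.5 * np.sum((y - mu) ** 2 / variance + np.log(variance) + np.log(2.0 * np.pi))

print(nll_precision_form, nll_variance_form)
```

Both forms give the same value; the difference is that the precision form avoids the division that appears in the variance form.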
8) Multimodal regression and mixture density networks: multimodal regression predicts real values from a conditional distribution $p(y \mid x)$ that can have several different peaks in $y$ space for the same value of $x$.
A neural network with a Gaussian mixture as its output is often called a mixture density network. The network must produce three outputs, each of which must satisfy certain constraints: the mixture components $p(c = i \mid x)$, the means $\mu^{(i)}(x)$, and the covariances $\Sigma^{(i)}(x)$. In the first, $c$ is a latent variable, and the components form a Multinoulli distribution over the $n$ different components. For the second, the negative log-likelihood weights each example's contribution to the loss for each component by the probability that the component produced the example. The third is typically restricted to a diagonal matrix for each Gaussian component.
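To make the three outputs concrete, here is a minimal sketch of the negative log-likelihood of a Gaussian mixture for a scalar target $y$. It assumes the network emits unconstrained `logits` (turned into mixing weights by a softmax), `means`, and `log_sigmas` (exponentiated so the standard deviations stay positive). These parameterizations and names are choices made for this example, not the only way to satisfy the constraints.

```python
import numpy as np

def log_softmax(a):
    a = a - np.max(a)
    return a - np.log(np.sum(np.exp(a)))

def mixture_nll(y, logits, means, log_sigmas):
    """-log sum_i p(c=i|x) N(y; mean_i, sigma_i^2) for a scalar target y."""
    log_pi = log_softmax(logits)                  # mixing weights are positive and sum to 1
    sigmas = np.exp(log_sigmas)                   # standard deviations are positive
    log_normal = (-0.5 * ((y - means) / sigmas) ** 2
                  - log_sigmas - 0.5 * np.log(2.0 * np.pi))
    joint = log_pi + log_normal                   # log p(c=i, y | x) for each component
    m = np.max(joint)
    return -(m + np.log(np.sum(np.exp(joint - m))))   # stable log-sum-exp over components

# Two equally weighted components with peaks at -2 and +2 (made-up values).
logits = np.array([0.0, 0.0])
means = np.array([-2.0, 2.0])
log_sigmas = np.array([0.0, 0.0])

nll_at_peak = mixture_nll(2.0, logits, means, log_sigmas)
nll_between_peaks = mixture_nll(0.0, logits, means, log_sigmas)

print(nll_at_peak, nll_between_peaks)
```

A target near either peak gets a lower cost than a target in the valley between them, which is the behavior a single Gaussian output cannot express.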
Gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable. One solution is gradient clipping; another is to scale the gradients heuristically.
Hidden Units
Some hidden units are not differentiable at every input point, yet gradient-based learning still works well in practice. This is partly because neural network training algorithms usually do not arrive at a local minimum of the cost function but only reduce its value significantly. In addition, when a computer evaluates $g(0)$, the underlying value is unlikely to have been exactly 0; more likely it was some small value that was rounded to 0.
PS: unless stated otherwise, most hidden units can be described as accepting an input vector $x$, computing an affine transformation $z = W^\top x + b$, and then applying an element-wise nonlinear function $g(z)$. Most hidden units differ from each other only in the choice of the activation function $g(z)$.
1) Rectified linear unit (ReLU): a unit that uses $g(z) = \max\{0, z\}$ as its activation function. It is generally used in hidden layers.
Characteristics:
1. The derivative of a rectified linear unit remains large whenever the unit is active;
2. The gradient is not only large but also consistent (it is the constant 1 in the active state);
3. The second derivative of the rectifying operation is 0 almost everywhere.
One drawback of ReLU is that it receives no gradient when its activation is 0, so it cannot learn through gradient-based methods from examples for which its input is less than 0.
2) ReLU extensions: to address some of the drawbacks of ReLU, the extensions use a nonzero slope $\alpha_i$ when $z_i < 0$, so that a gradient is available there: $g(z, \alpha)_i = \max(0, z_i) + \alpha_i \min(0, z_i)$. Because these units stay close to linear, they remain easy to optimize.
Absolute value rectification: fixes $\alpha_i = -1$ to obtain $g(z) = |z|$; mainly used for object recognition from images.
Leaky ReLU: fixes $\alpha_i$ to a small value such as 0.01.
Parametric ReLU (PReLU): learns $\alpha_i$ as a model parameter.
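All three variants are the same function with a different slope for negative inputs. A short sketch, where `generalized_relu` is an illustrative name and the input values are made up. Absolute value rectification corresponds to $\alpha_i = -1$, since $\max(0, z) - \min(0, z) = |z|$.

```python
import numpy as np

def generalized_relu(z, alpha):
    """g(z, alpha)_i = max(0, z_i) + alpha_i * min(0, z_i)."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])

relu = generalized_relu(z, 0.0)        # plain ReLU: negative inputs map to 0
abs_rect = generalized_relu(z, -1.0)   # absolute value rectification: |z|
leaky = generalized_relu(z, 0.01)      # leaky ReLU: small slope for z < 0

# Parametric ReLU uses the same formula, with alpha stored as a trainable array.
alpha_param = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical learned values
prelu = generalized_relu(z, alpha_param)

print(relu, abs_rect, leaky, prelu)
```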
3) Maxout units: divide $z$ into groups of $k$ values; each maxout unit outputs the maximum element of one of these groups.
Advantages:
1. A maxout unit can learn a piecewise linear convex function with up to $k$ pieces. With large enough $k$, a maxout unit can approximate any convex function with arbitrary fidelity. Maxout units can therefore be seen as learning the activation function itself rather than just the relationship between units.
2. In some cases, maxout units gain statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by $n$ different linear filters can be summarized without losing information by taking the max over each group of $k$ features, then the next layer can get by with $k$ times fewer weights.
3. Each unit is driven by multiple filters, which gives maxout units some redundancy that helps them resist a phenomenon called catastrophic forgetting.
Disadvantage: each unit is parameterized by $k$ weight vectors instead of one, so maxout units typically need more regularization. They can still work well without regularization if the training set is large and the number of pieces per unit is kept low.
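A maxout layer can be sketched in a few lines. The array layout below (one weight tensor of shape `(units, k, d_in)`) is an assumption made for this example. With $k = 2$ and filters $w$ and $-w$, a single maxout unit reproduces the absolute value function, one instance of a learned piecewise linear convex activation.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout layer.

    W has shape (units, k, d_in) and b has shape (units, k): each unit owns
    k linear filters and outputs the largest of their k responses.
    """
    z = np.einsum("ukd,d->uk", W, x) + b   # k pre-activations per unit
    return z.max(axis=1)

# One unit with k = 2 filters, w = 1 and w = -1: output is max(x, -x) = |x|.
W = np.array([[[1.0], [-1.0]]])   # shape (1, 2, 1)
b = np.zeros((1, 2))

outputs = [maxout(np.array([v]), W, b)[0] for v in (-3.0, 0.0, 2.5)]
print(outputs)
```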
4) Sigmoid and hyperbolic tangent functions
Sigmoid: $g(z) = \sigma(z) = \frac{1}{1 + \exp(-z)}$
Hyperbolic tangent: $g(z) = \tanh(z) = 2\sigma(2z) - 1$
The widespread saturation of sigmoidal units (they are strongly sensitive to their input only when $z$ is near 0) can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is discouraged. Their use as output units is compatible with gradient-based learning when an appropriate cost function undoes the saturation of the sigmoid. When a sigmoidal activation function must be used, the hyperbolic tangent typically performs better than the logistic sigmoid.
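The identity between the two functions and their saturation can both be checked numerically. A small sketch over an arbitrary range of inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-6.0, 7.0)   # -6, -5, ..., 6

# tanh(z) = 2 * sigmoid(2z) - 1
identity_holds = np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

# Derivatives: both peak at z = 0 and shrink toward 0 as |z| grows (saturation).
d_sigmoid = sigmoid(z) * (1.0 - sigmoid(z))   # maximum 0.25 at z = 0
d_tanh = 1.0 - np.tanh(z) ** 2                # maximum 1.0 at z = 0

print(identity_holds)
print(d_sigmoid.max(), d_tanh.max())
print(d_sigmoid[0], d_tanh[0])                # derivatives at z = -6
```

Near 0, tanh has slope 1 and behaves like the identity function, whereas the sigmoid's slope never exceeds 0.25. This is one reason tanh tends to be easier to train as a hidden unit.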
5) Other hidden units:
1. $h = \cos(Wx + b)$, which has been tried on the MNIST dataset;
2. Softmax units (the multi-class extension of the sigmoid, although they are usually used as output units);
3. Radial basis function (RBF) units: these become more active as $x$ approaches a template $W$. Because they saturate to 0 for most $x$, they can be difficult to optimize;
4. Softplus function: $g(a) = \log(1 + \exp(a))$, a smooth version of the rectified linear unit. In practice, however, it does not work as well as ReLU;
5. Hard hyperbolic tangent function (hard tanh): $g(a) = \max(-1, \min(1, a))$
Architecture Design
1) Architecture refers to the overall structure of the network: how many units it has and how these units are connected to each other. In chain-based architectures, the main architectural considerations are the depth of the network and the width of each layer.
2) Universal approximation properties: a feedforward network with hidden layers provides a universal approximation framework for approximating nonlinear functions.
The universal approximation theorem states that a feedforward neural network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid) can approximate any Borel measurable function from one finite-dimensional space to another to arbitrary precision, provided that the network is given enough hidden units.
That is, a large enough MLP is able to represent the function, but this does not guarantee that the training algorithm will be able to learn it: the optimization algorithm may fail to find the parameter values that correspond to the desired function, or the training algorithm may choose the wrong function because of overfitting.
Some families of functions can be approximated efficiently when the depth of the network is greater than some value $D$, but require a much larger model when the depth is restricted to be less than or equal to $D$.
A function represented by a deep rectifier network may require an exponential number of hidden units when represented by a shallow network (one hidden layer). More precisely, piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network.
The number of linear regions carved out by a deep rectifier network with $d$ inputs, depth $l$, and $n$ units per hidden layer is: $O\left(\binom{n}{d}^{d(l-1)} n^d\right)$
In maxout networks with $k$ filters per unit, the number of linear regions is: $O(k^{(l-1)+d})$
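The exponential growth with depth can be seen in a toy one-dimensional case. The sketch below stacks a "folding" layer $|2x - 1|$, written with two ReLUs, and counts the linear pieces of the composed function on $[0, 1]$. The folding map and the counting method are choices made for this illustration, not a general region-counting algorithm.

```python
import numpy as np

def fold(x):
    """One layer with two ReLU units: relu(2x - 1) + relu(1 - 2x) = |2x - 1|."""
    return np.maximum(0.0, 2.0 * x - 1.0) + np.maximum(0.0, 1.0 - 2.0 * x)

def count_linear_pieces(depth, grid_bits=12):
    """Count linear pieces of `depth` stacked fold layers on [0, 1].

    The grid spacing is 2**-grid_bits, so every kink (a multiple of 2**-depth)
    lies exactly on a grid point as long as depth <= grid_bits.
    """
    x = np.linspace(0.0, 1.0, 2 ** grid_bits + 1)
    y = x
    for _ in range(depth):
        y = fold(y)
    slope_sign = np.sign(np.diff(y))
    # A new piece starts wherever the slope changes sign.
    return 1 + int(np.sum(slope_sign[1:] != slope_sign[:-1]))

pieces = [count_linear_pieces(depth) for depth in range(1, 6)]
print(pieces)
```

Each extra layer of only two units doubles the number of linear pieces, so depth $l$ gives $2^l$ pieces. A single hidden layer would need on the order of $2^l$ units to produce as many kinks.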
3) Back-propagation: the core algorithm for computing the gradients needed to fit a neural network's parameters. It is used together with an optimizer such as SGD, which moves the parameters in the direction that reduces the cost function according to those gradients.
The input $x$ provides the initial information, which then propagates through the hidden units of each layer and finally produces the output $\hat{y}$. This is called forward propagation. During training, forward propagation can continue until it produces a scalar cost $J(\theta)$. Back-propagation lets the information from the cost flow backward through the network in order to compute the gradient (it is only the method for computing the gradient, not the whole learning algorithm).
For the principle of the algorithm and the derivation of its formulas, see https://my.oschina.net/findbill/blog/529001
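Forward and backward propagation can be sketched for a network with one hidden ReLU layer and a squared-error cost. The layer sizes, the cost, and all names below are assumptions chosen to keep the example small. The analytic gradient from the backward pass is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

def forward(params, x, y):
    """Forward propagation: input -> hidden ReLU layer -> linear output -> scalar cost J."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)
    y_hat = W2 @ h + b2
    J = 0.5 * np.sum((y_hat - y) ** 2)
    return J, (z1, h, y_hat)

def backward(params, x, y, cache):
    """Back-propagation: apply the chain rule from J back to every parameter."""
    W1, b1, W2, b2 = params
    z1, h, y_hat = cache
    d_yhat = y_hat - y                 # dJ/dy_hat
    dW2 = np.outer(d_yhat, h)
    db2 = d_yhat
    dh = W2.T @ d_yhat                 # gradient flowing back into the hidden layer
    dz1 = dh * (z1 > 0)                # ReLU passes gradient only where it was active
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(2, 4)), rng.normal(size=2)]
x, y = rng.normal(size=3), rng.normal(size=2)

J, cache = forward(params, x, y)
grads = backward(params, x, y, cache)

def numeric_grad_W1(eps=1e-6):
    """Central finite differences of J with respect to each entry of W1."""
    num = np.zeros_like(params[0])
    for idx in np.ndindex(*params[0].shape):
        original = params[0][idx]
        params[0][idx] = original + eps
        J_plus, _ = forward(params, x, y)
        params[0][idx] = original - eps
        J_minus, _ = forward(params, x, y)
        params[0][idx] = original
        num[idx] = (J_plus - J_minus) / (2.0 * eps)
    return num

max_abs_diff = np.max(np.abs(numeric_grad_W1() - grads[0]))
print(J, max_abs_diff)
```

An optimizer such as SGD would then update each parameter as `p -= learning_rate * g` using the gradients returned by `backward`.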
4) Cross-entropy cost function: the loss function used in logistic regression and in neural networks. Its general form is $C = -\frac{1}{n} \sum [y \ln a + (1 - y) \ln(1 - a)]$. It can be shown that with the quadratic cost function, the parameter update contains a factor equal to the derivative of the activation function. If the neuron is saturated, this derivative is close to 0, so even when the error is large the update is tiny and the parameters change slowly. With the cross-entropy cost, any misclassification makes the loss very large. Intuitively, when the parameters are updated under the cross-entropy cost, the size of the update step does not depend on the derivative of the neuron's activation and depends only on the difference between the predicted value and the true value, so the parameters converge faster during back-propagation.
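The difference between the two costs shows up in the gradient with respect to the pre-activation $z$ of a single sigmoid neuron with output $a = \sigma(z)$. For the quadratic cost $\frac{1}{2}(a - y)^2$ the gradient is $(a - y)\,\sigma'(z)$, while for the cross-entropy cost it is $a - y$. A minimal sketch with a made-up saturated neuron:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 5.0          # strongly positive pre-activation: the neuron is saturated
y = 0.0          # but the true label is 0, so the prediction is badly wrong
a = sigmoid(z)

sigmoid_prime = a * (1.0 - a)             # derivative of the activation, close to 0 here
quadratic_grad = (a - y) * sigmoid_prime  # d/dz of 0.5 * (a - y)^2
cross_entropy_grad = a - y                # d/dz of -[y ln a + (1 - y) ln(1 - a)]

print(quadratic_grad, cross_entropy_grad)
```

The error $a - y$ is nearly 1 in both cases, but the quadratic cost multiplies it by $\sigma'(z)$, which shrinks the gradient by more than two orders of magnitude for this input.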