http://www.deeplearningbook.org/
Chapter 6: Deep Feedforward Networks
Deep feedforward networks, also called feedforward neural networks or multilayer perceptrons (MLPs), are among the most important deep learning models. The goal of a feedforward network is to approximate some function f*. For a classifier, for example, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that gives the best approximation.
These models are called feedforward because information flows from the input x, through the intermediate computations that define f, to the output y. There are no feedback connections in which the output of the model is fed back into itself. When a feedforward neural network is extended to include feedback connections, it is called a recurrent neural network (covered in Chapter 10).
A feedforward neural network is associated with a directed acyclic graph (DAG) that describes how its functions are composed. For example, three functions f1, f2, and f3 connected in a chain form f(x) = f3(f2(f1(x))). Such chain structures are the most commonly used structures of neural networks. Here f1 is called the first layer of the network, f2 the second layer, and so on. The overall length of the chain is the depth of the model, which is where the term "deep learning" comes from. The final layer of a feedforward network is called the output layer. During training, we drive f(x) to match f*(x): each training example x comes with a label y ≈ f*(x).
The dimensionality of the hidden layers determines the width of the model. Rather than thinking of a layer as a single vector-to-vector function, we can think of it as many units acting in parallel, each representing a vector-to-scalar function. It is best to think of feedforward networks as function approximation machines.
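The chain structure, depth, and width described above can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random weights, and the choice of ReLU for the hidden layers are assumptions made for the example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs, two hidden layers of 5 units, 1 output.
layer_sizes = [4, 5, 5, 1]

# One weight matrix per layer; the random values are placeholders.
weights = [
    rng.standard_normal((n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def apply_layer(W, h, is_output):
    """One layer: a linear map, followed by ReLU unless it is the output layer."""
    z = W.T @ h
    return z if is_output else np.maximum(0.0, z)

def f(x):
    """Chain the layers: f(x) = f3(f2(f1(x)))."""
    h = x
    for i, W in enumerate(weights):
        h = apply_layer(W, h, is_output=(i == len(weights) - 1))
    return h

depth = len(weights)             # number of layers in the chain
width = max(layer_sizes[1:-1])   # size of the widest hidden layer
```

Each column of a weight matrix corresponds to one unit, which is the vector-to-scalar view of a layer.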
Linear models cannot represent nonlinear functions of x. To extend them, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. The question is then how to choose φ:
1. One option is to use a very generic φ, such as the infinite-dimensional φ implicitly used by models based on the RBF kernel. If φ(x) has high enough dimension, we always have enough capacity to fit the training set, but performance on the test set is often poor.
2. Another option is to engineer φ by hand. This requires a separate, manually designed φ for each task.
3. The strategy of deep learning is to learn φ. In this approach we have a model y = f(x; θ, w) = φ(x; θ)^T w. The parameters θ are used to learn φ from a broad class of functions, and the parameters w map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. We search for the θ that gives a good φ(x; θ), and because many families of φ(x; θ) can be used, the approach is very general.
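The third option, y = φ(x; θ)^T w, can be written down directly. The sketch below assumes one particular family for φ (a linear map followed by tanh); the text does not fix this choice, and the parameter values are hypothetical.

```python
import numpy as np

def phi(x, theta):
    """Learned feature map phi(x; theta). The tanh form is an assumed example family."""
    W, c = theta
    return np.tanh(W.T @ x + c)

def f(x, theta, w):
    """Model output y = phi(x; theta)^T w."""
    return phi(x, theta) @ w

# Hypothetical parameters: 2 inputs mapped to 3 hidden features.
theta = (np.array([[1.0, -1.0, 0.5],
                   [0.0,  2.0, 0.5]]), np.zeros(3))
w = np.array([1.0, 1.0, 1.0])

y = f(np.array([0.0, 0.0]), theta, w)
```

Training would adjust both θ (which shapes the features) and w (which reads them out); only the forward computation is shown here.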
Feedforward networks introduce the concept of a hidden layer, which requires us to choose the activation functions used to compute the hidden layer values. We must also design the architecture of the network: how many layers it has, how the layers are connected to each other, and how many units are in each layer. Finally, computing the gradients of such functions requires the back-propagation algorithm.
6.1 Example: Learning XOR
The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. It returns 1 when exactly one of the two values is 1, and 0 otherwise.
The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ), and our learning algorithm adapts θ to make f as similar as possible to f*.
In this example we are not concerned with statistical generalization. We only want the network to produce the correct output on the four points X = {[0,0]^T, [0,1]^T, [1,0]^T, [1,1]^T}, and we train on all four of them.
We can treat this as a regression problem and use a mean squared error (MSE) loss function. Other approaches that are more appropriate for modeling binary data appear later.
Next we must choose the form of the model. Suppose we choose a linear model, with θ consisting of w and b: f(x; w, b) = x^T w + b.
Solving the normal equations gives w = 0 and b = 1/2, so the linear model outputs 1/2 everywhere. This happens because a linear model cannot represent the XOR function.
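The linear fit can be checked numerically. The sketch below uses NumPy's least-squares solver, which gives the same solution as the normal equations here; the variable names are chosen for this example.

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the bias b is fitted together with w.
A = np.hstack([X, np.ones((4, 1))])

# Least-squares solution of A @ [w1, w2, b] = y.
params, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = params[:2], params[2]

predictions = A @ params
```

The fitted w is (numerically) zero and b is 1/2, so every prediction is 1/2.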
We therefore introduce a very simple feedforward network with one hidden layer containing two units.
The hidden layer uses the rectified linear unit (ReLU) as its activation function, g(z) = max{0, z}.
With suitable parameters, this network produces the correct output at all four points.
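One set of parameters that solves XOR exactly with this architecture is shown below. The specific values of W, c, w, and b are the hand-picked solution given in the Deep Learning book; they do not appear in the text above, and other solutions exist.

```python
import numpy as np

# Hand-picked parameters (from the book's worked example).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])   # input-to-hidden weights
c = np.array([0.0, -1.0])    # hidden biases
w = np.array([1.0, -2.0])    # hidden-to-output weights
b = 0.0                      # output bias

def relu(z):
    return np.maximum(0.0, z)

def xor_net(X):
    """f(x) = w^T max{0, W^T x + c} + b, applied to each row of X."""
    H = relu(X @ W + c)
    return H @ w + b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
outputs = xor_net(X)   # [0, 1, 1, 0]
```

The hidden layer maps the four inputs to the points [0,0], [1,0], [1,0], and [2,1], which a linear output layer can then separate.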
6.2 Gradient-based Learning
Designing and training a neural network is not much different from training any other machine learning model with gradient descent.
The biggest difference between linear models and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. Neural networks are therefore usually trained with iterative, gradient-based optimizers that merely drive the cost function to a very low value (the terms "cost" and "loss" are used interchangeably here). This contrasts with the linear equation solvers used to train linear regression models and the convex optimization algorithms used to train logistic regression or SVMs. Convex optimization converges from any initial parameters in theory; in practice it is very robust but can run into numerical problems. Stochastic gradient descent applied to a non-convex loss function has no such convergence guarantee and is sensitive to the initial parameter values. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. Iterative gradient-based optimization algorithms are used to train feedforward networks and almost all other deep models (see Section 8.4).
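The initialization advice above might look like the following in practice. The scale 0.01 and the bias value 0.1 are illustrative assumptions, as is the helper name `init_layer`; the text only says "small random values" and "zero or small positive values".

```python
import numpy as np

def init_layer(n_in, n_out, rng, weight_scale=0.01, bias_value=0.1):
    """Return (W, b) with small random weights and a small positive bias.

    weight_scale and bias_value are assumed defaults for this sketch.
    """
    W = weight_scale * rng.standard_normal((n_in, n_out))
    b = np.full(n_out, bias_value)
    return W, b

rng = np.random.default_rng(0)
W, b = init_layer(2, 3, rng)
```

Random weights break the symmetry between units, so that different units in the same layer can learn different functions.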
To apply gradient-based learning to any machine learning model, we must choose a cost function and decide how to represent the output of the model. We now revisit both design choices.
6.2.1 Cost Functions
An important aspect of deep neural network design is the choice of cost function.
In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.
The total cost function used to train a neural network often combines one of these primary cost functions with a regularization term.
6.2.1.1 Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood. The cost function is then the negative log-likelihood, equivalently the cross-entropy between the training data and the model distribution.
This cost function is given by J(θ) = -E_{x,y ~ p_data} log p_model(y | x), where p_data is the empirical distribution of the training data.
The specific form of the cost function changes from model to model, depending on the form of log p_model.
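For a classifier that outputs class probabilities, the negative log-likelihood reduces to the average of -log(probability assigned to the correct class). The sketch below assumes integer class labels and a matrix of predicted probabilities; the function name and the sample values are made up for the example.

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Average of -log p_model(y | x) over the examples.

    probs:  (n_examples, n_classes) predicted probabilities, rows sum to 1
    labels: (n_examples,) integer class indices
    """
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked))

# Hypothetical predictions for two examples, both of class 0.
probs = np.array([[0.5, 0.5],
                  [0.9, 0.1]])
labels = np.array([0, 0])

cost = negative_log_likelihood(probs, labels)
```

Changing the model family changes how `probs` is produced, which is the sense in which the form of the cost depends on log p_model.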
A recurring theme in neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. The negative log-likelihood helps here. Many output units involve an exp function, which can saturate when its argument is very negative, and the log in the negative log-likelihood undoes the exp of such output units.
One unusual property of the cross-entropy cost used for maximum likelihood estimation is that, for the models commonly used in practice, it usually does not have a minimum value. For discrete output variables, most models cannot represent a probability of exactly 0 or 1 but can come arbitrarily close; logistic regression is an example. For real-valued output variables, if the model can control the density of the output distribution, it can assign extremely high density to the correct training outputs, so that the cross-entropy approaches negative infinity. Regularization techniques prevent the model from reaping unlimited reward in this way.
6.2.1.2 Learning Conditional Statistics
Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x.
For example, we may have a predictor f(x; θ) that we want to use to predict the mean of y.
If the neural network is sufficiently powerful, we can think of it as being able to represent any function f. We can then view the cost function as a functional rather than a function. A functional is a mapping from functions to real numbers, so learning can be seen as choosing a function rather than merely choosing a set of parameters. Solving an optimization problem with respect to a function requires a mathematical tool called the calculus of variations, which yields two results:
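First result: solving f* = argmin_f E_{x,y ~ p_data} ||y - f(x)||^2 yields f*(x) = E_{y ~ p_data(y|x)} [y].
So if we could train on infinitely many samples from the true data-generating distribution, minimizing the MSE cost function would give a function that predicts the mean of y for each value of x.
Second result: solving f* = argmin_f E_{x,y ~ p_data} ||y - f(x)||_1
yields a function that predicts the median of y for each value of x. This cost function is commonly called mean absolute error.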
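Both results can be checked on a small sample by brute force: for a fixed x, search over constant predictions and see which one minimizes each cost. The sample values and the search grid below are arbitrary choices for the illustration.

```python
import numpy as np

# Hypothetical observed values of y for one fixed x (note the outlier).
y = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

# Candidate constant predictions on a fine grid.
candidates = np.linspace(0.0, 10.0, 1001)

# Cost of each candidate under squared error and absolute error.
errors = y[None, :] - candidates[:, None]
mse = (errors ** 2).mean(axis=1)
mae = np.abs(errors).mean(axis=1)

best_for_mse = candidates[np.argmin(mse)]   # lands on the sample mean
best_for_mae = candidates[np.argmin(mae)]   # lands on the sample median
```

The outlier pulls the MSE minimizer toward it, while the MAE minimizer stays at the middle value.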
Unfortunately, MSE and mean absolute error often give poor results when used with gradient-based optimization. This is one reason the cross-entropy cost function is more popular than MSE or mean absolute error, even when it is not necessary to estimate the full distribution p(y | x).
6.2.2 Output Units
The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.
Any kind of neural network unit that can be used as an output can also be used as a hidden unit. Throughout this section, we assume that the feedforward network provides a set of hidden features h = f(x; θ).
6.2.2.1 Linear Units for Gaussian Output Distributions
Given features h, a layer of linear output units produces a vector ŷ = W^T h + b.
Linear output layers are often used to produce the mean of a conditional Gaussian distribution: p(y | x) = N(y; ŷ, I).
Maximizing the log-likelihood is then equivalent to minimizing the MSE.
The maximum likelihood framework also makes it possible to learn the covariance of the Gaussian directly, or to make the covariance a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. Such constraints are difficult to satisfy with a linear output layer, so other output units are typically used to parametrize the covariance.
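The equivalence between maximum likelihood and MSE for a Gaussian with identity covariance can be seen by writing out the negative log-likelihood: it is half the squared error plus a constant that does not depend on ŷ. The function name and sample values below are chosen for this sketch.

```python
import numpy as np

def gaussian_nll(y, y_hat):
    """-log N(y; y_hat, I) for each row of y."""
    d = y.shape[-1]
    squared_error = np.sum((y - y_hat) ** 2, axis=-1)
    constant = 0.5 * d * np.log(2.0 * np.pi)
    return 0.5 * squared_error + constant

# Hypothetical targets and two different predictions.
y = np.array([[1.0, 2.0], [0.0, -1.0]])
y_hat_zero = np.zeros_like(y)
y_hat_exact = y.copy()

# The gap between the NLL and half the squared error is the same constant.
gap_zero = gaussian_nll(y, y_hat_zero) - 0.5 * np.sum((y - y_hat_zero) ** 2, axis=-1)
gap_exact = gaussian_nll(y, y_hat_exact) - 0.5 * np.sum((y - y_hat_exact) ** 2, axis=-1)
```

Because the constant does not depend on the prediction, the parameters that minimize the negative log-likelihood also minimize the squared error.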
6.2.2.2 Sigmoid Units for Bernoulli Output Distributions
Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.
The maximum likelihood approach is to define a Bernoulli distribution over y conditioned on x.
A Bernoulli distribution is defined by just a single number, so the neural network needs to predict only P(y = 1 | x).
Suppose we used a linear unit and thresholded its value to obtain a valid probability: P(y = 1 | x) = max{0, min{1, w^T h + b}}.
This does define a valid conditional distribution, but we cannot train it effectively with gradient descent. Whenever w^T h + b strays outside the unit interval, the gradient of the output with respect to the parameters is 0. A gradient of 0 is a problem because the learning algorithm no longer has any guide for how to improve the parameters.
Instead, we can use a sigmoid output unit combined with maximum likelihood.
A sigmoid output unit is defined by ŷ = σ(w^T h + b).
Here σ is the logistic sigmoid function described in Section 3.10.
We can think of the sigmoid output unit as having two components: a linear layer that computes z = w^T h + b, followed by a sigmoid activation function that converts z into a probability.
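The difference between the two output units shows up in their gradients. The sketch below compares the gradient of a clipped linear unit with the gradient of the negative log-likelihood of a sigmoid unit, both with respect to z = w^T h + b. The softplus form of the loss and the function names are additions for this example and are not taken from the text above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, written via tanh for numerical stability."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def bernoulli_nll(z, y):
    """-log P(y | z) for a sigmoid unit, equal to softplus((1 - 2y) z)."""
    return np.logaddexp(0.0, (1.0 - 2.0 * y) * z)

def bernoulli_nll_grad(z, y):
    """Derivative of bernoulli_nll with respect to z."""
    return sigmoid(z) - y

def clipped_linear_grad(z):
    """Derivative of max{0, min{1, z}} with respect to z."""
    return np.where((z > 0.0) & (z < 1.0), 1.0, 0.0)

# A confidently wrong prediction: the true label is 1 but z is very negative.
z_wrong, y_true = -20.0, 1.0
grad_sigmoid_nll = bernoulli_nll_grad(z_wrong, y_true)   # close to -1
grad_clipped = clipped_linear_grad(z_wrong)              # exactly 0
```

With the clipped linear unit the gradient vanishes exactly where the model is most wrong, whereas the sigmoid unit trained with maximum likelihood still gives a strong signal there.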
6.2.2.3 Softmax Units for Multinoulli Output Distributions
The multinoulli distribution generalizes the Bernoulli distribution. Any time we want to represent a probability distribution over a discrete variable with n possible values, we can use the softmax function.
Softmax functions are most often used as the output of a classifier. More rarely, they are used inside the model itself rather than at the output layer.
In the binary case, we wanted to produce a single number. This number had to lie between 0 and 1, and we wanted its logarithm to be well behaved for gradient-based optimization of the log-likelihood.
To generalize to a discrete variable with n values, we now need to produce a vector ŷ,
with ŷ_i = P(y = i | x). We require not only that each element of ŷ lies between 0 and 1, but also that the elements of ŷ sum to 1.
With softmax(z)_i = exp(z_i) / Σ_j exp(z_j), we want to maximize log P(y = i; z) = log softmax(z)_i = z_i - log Σ_j exp(z_j) (equation 6.30).
The first term on the right of equation 6.30 shows that the input z_i always contributes directly to the cost function. Because this term cannot saturate, learning can proceed even if the contribution of z_i to the second term becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down. To gain some intuition for the second term, note that log Σ_j exp(z_j) can be roughly approximated by max_j z_j. The approximation holds because exp(z_k) is insignificant for any z_k noticeably smaller than max_j z_j. The intuition is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, the two terms roughly cancel.
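The two terms of equation 6.30 and the max approximation can be checked numerically. The sketch below subtracts max(z) before exponentiating, a standard stabilization trick that is an addition here; the input vector is an arbitrary example.

```python
import numpy as np

def log_softmax(z):
    """log softmax(z)_i = z_i - log sum_j exp(z_j), computed stably."""
    shifted = z - np.max(z)   # does not change the result, avoids overflow
    return shifted - np.log(np.sum(np.exp(shifted)))

def softmax(z):
    return np.exp(log_softmax(z))

# Hypothetical inputs where the first entry clearly dominates.
z = np.array([10.0, 0.0, -5.0])

probabilities = softmax(z)

# Recover the second term of equation 6.30 from the first entry.
log_sum_exp = z[0] - log_softmax(z)[0]
approximation_error = log_sum_exp - np.max(z)
```

When the correct class is also the largest input, as for index 0 here, its log-probability is close to zero because z_i and log Σ_j exp(z_j) nearly cancel.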
Deep Learning (Yoshua Bengio, Ian Goodfellow, Aaron Courville) translation Part 2 the 6th Chapter