The last time I wrote one of these notes was back in 2013. At that time, between a busy internship, job hunting, graduation and so on, I stopped writing; now that I have been working for half a year and things have settled down, I will continue the notes. In fact I have read many more chapters but never wrote them up, so I will start from Chapter 5 (Chapters 2-4 are comparatively basic) and fill in the rest later.
Chapter 5 Neural Networks
In Chapters 3 and 4 we studied regression and classification models built from linear combinations of fixed basis functions. Such models have useful analytical and computational properties, but the curse of dimensionality (i.e. the difficulty of high-dimensional data) limits their practical applicability. To apply these models to large-scale problems, we must adapt the basis functions to the data.
One approach, the support vector machine (SVM), is discussed in Chapter 7; it is a well-known and effective classification method with its own body of theory. One of its important advantages is that, although it involves nonlinear optimization, the objective function of the SVM is still convex. It is not expanded on in this chapter; see Chapter 7 for the details.
Another option is to fix the number of basis functions in advance but allow them to be adaptive, i.e. to let their parameters be adjusted during training. In the field of pattern recognition, the most typical model of this kind is the feed-forward neural network (hereafter NN) discussed in this chapter, also known as the multilayer perceptron. (Note: here the nonlinearities of the multilayer model are continuous, such as the sigmoid function, whereas the original perceptron uses a discontinuous step function; the perceptron algorithm itself is not covered in the PRML book, and I will write it up separately from other material.) In many cases the trained NN model is more compact than an SVM with the same generalization ability (Note: my understanding is that it has fewer parameters), and is therefore faster to evaluate; the price is that the likelihood function of the NN is no longer a convex function of the training parameters. In practice it is often acceptable to spend substantial computational resources during training in order to obtain a compact model that can quickly process new data.
Next we will see that, to obtain the parameters of the neural network, we essentially perform maximum likelihood estimation, which leads to a nonlinear optimization problem. This requires the derivatives of the log-likelihood function with respect to the parameters, for which we will discuss the error backpropagation algorithm (BP), as well as some extensions of the BP algorithm.
5.1 Feed-forward Network Functions
The general theory of linear models in Chapters 3 and 4 is based on linear combinations of fixed basis functions, of the form

y(x, w) = f( Σ_{j=1}^{M} w_j φ_j(x) )        (5.1)
Here f(·) is a nonlinear activation function in the case of classification, and the identity function in the case of regression. Our goal is to make the basis functions in the model above depend on parameters, and to adjust these parameters, together with the coefficients w_j, during training. There are many ways to do this; the neural network uses basis functions of the same form as (5.1), so that each basis function is itself a nonlinear function of a linear combination of the inputs, where the coefficients of the linear combination are adjustable parameters. This is the basic idea of the neural network, which consists of a sequence of functional transformations: first we construct M linear combinations of the input variables x_1, ..., x_D,

a_j = Σ_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}        (5.2)
where j = 1, ..., M, and the superscript (1) indicates that these are the first-layer parameters of the network (the inputs are not counted as a layer). The parameters w_{ji}^{(1)} are called weights and the parameters w_{j0}^{(1)} are called biases. The quantities a_j are called activations; each is transformed by a differentiable nonlinear activation function h(·):

z_j = h(a_j)        (5.3)
These M values z_j correspond to the outputs of the basis functions in (5.1), and in the neural network model they are called hidden units. The nonlinear activation function h(·) is generally chosen to be the sigmoid function or the tanh function. Following (5.1), these values are again linearly combined to give the activations of the output units,

a_k = Σ_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}        (5.4)
where k = 1, ..., K, and K is the number of output units. This transformation is the second layer of the network, and the w_{k0}^{(2)} are bias parameters. Finally, the activations of the output units are transformed by an appropriate activation function to give the final outputs y_k. As mentioned above, for a regression problem we choose the identity as the output activation, i.e. y_k = a_k; for multiple binary classification problems we use the logistic sigmoid function:

σ(a) = 1 / (1 + exp(−a))        (5.6)
For multiclass classification problems we use the softmax function; see formula (4.62) in the PRML book.
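As a small sketch of my own (not from the book), the three output activation choices can be written in a few lines of numpy; the function names here are mine:

```python
import numpy as np

def identity(a):
    # Regression: the outputs are simply the activations, y_k = a_k.
    return a

def sigmoid(a):
    # Binary (or multiple independent binary) classification, PRML (5.6).
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Multiclass classification, PRML (4.62); subtract the max for numerical stability.
    e = np.exp(a - np.max(a))
    return e / np.sum(e)
```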
Putting all of these stages together, we obtain the overall neural network function (for sigmoid output units and a two-layer network, as shown in Figure 5.1 below):

y_k(x, w) = σ( Σ_{j=1}^{M} w_{kj}^{(2)} h( Σ_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} ) + w_{k0}^{(2)} )        (5.7)
The neural network model is therefore simply a nonlinear function from a set of input variables to a set of output variables, controlled by a vector w of adjustable parameters. The structure of the network can be seen in Figure 5.1; information propagates forward through the whole network.
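A minimal sketch of this forward propagation (my own illustration, assuming a tanh hidden layer and a sigmoid output as in (5.7); the names and dimensions below are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_layer_forward(x, W1, b1, W2, b2, h=np.tanh, f=sigmoid):
    # x: inputs (D,); W1/b1: first-layer weights (M, D) and biases (M,)
    # W2/b2: second-layer weights (K, M) and biases (K,)
    a_hidden = W1 @ x + b1   # (5.2): first-layer activations a_j
    z = h(a_hidden)          # (5.3): hidden-unit outputs z_j
    a_out = W2 @ z + b2      # (5.4): output-unit activations a_k
    return f(a_out)          # (5.7): network outputs y_k

# Toy example with D=2 inputs, M=3 hidden units, K=1 output:
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal(1)
y = two_layer_forward(np.array([0.5, -1.0]), W1, b1, W2, b2)
```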
We can introduce two extra variables x_0 = 1 and z_0 = 1 so that the bias (offset, intercept) terms are absorbed into the sums, simplifying the expressions; the first-layer activations become

a_j = Σ_{i=0}^{D} w_{ji}^{(1)} x_i        (5.8)
and the overall network function becomes:

y_k(x, w) = σ( Σ_{j=0}^{M} w_{kj}^{(2)} h( Σ_{i=0}^{D} w_{ji}^{(1)} x_i ) )        (5.9)
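A quick numerical check of this bias absorption (my own sketch; the weights are random and all names are made up) shows that folding the biases into the weight matrices as in (5.8)/(5.9) gives exactly the same output as keeping explicit bias terms:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
D, M, K = 2, 3, 1
W1, b1 = rng.standard_normal((M, D)), rng.standard_normal(M)
W2, b2 = rng.standard_normal((K, M)), rng.standard_normal(K)
x = np.array([0.5, -1.0])

# Explicit biases, as in (5.7)
y_explicit = sigmoid(W2 @ np.tanh(W1 @ x + b1) + b2)

# Absorb the biases: prepend x0 = 1 and z0 = 1 and fold b into the weight
# matrices, giving the form (5.8)/(5.9)
W1_aug = np.hstack([b1[:, None], W1])        # column 0 holds w_j0^(1)
W2_aug = np.hstack([b2[:, None], W2])        # column 0 holds w_k0^(2)
x_aug = np.concatenate([[1.0], x])           # x0 = 1
z_aug = np.concatenate([[1.0], np.tanh(W1_aug @ x_aug)])  # z0 = 1
y_absorbed = sigmoid(W2_aug @ z_aug)

assert np.allclose(y_explicit, y_absorbed)
```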
The derivations that follow will use the form (5.9). If you have read the introduction to the perceptron in Chapter 4, you will notice that the form above is equivalent to a two-layer perceptron model; for this reason the neural network model is also known as the multilayer perceptron (MLP). The difference is that the perceptron model uses a step function with outputs 0/1, whereas the NN uses continuous nonlinear functions such as the sigmoid in its hidden units, which means the NN output is differentiable with respect to the parameters; this is very important for training the NN model.
If the activation functions of the hidden units are linear, then the final model is a linear model no matter how many layers it has (a quick numerical check of this appears below). Moreover, if there are fewer hidden units than input or output units, information is lost, as if the hidden layer performed a dimensionality reduction of the data. At present few people study multilayer networks of linear units. Figure 5.1 above shows the most typical NN structure, and it can easily be extended: treat the output layer as another hidden layer and add a new layer on top, using the same kind of transformation as before. The way layers are counted in the literature is not unified; some people would call the network in Figure 5.1 a 3-layer network, but this book recommends calling it a 2-layer network, because only two layers have adjustable parameters.
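The check mentioned above (my own sketch with made-up shapes): composing two linear layers is still just one linear map, so stacking them adds nothing.

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.standard_normal((3, 5))   # "hidden" layer with a purely linear activation
W2 = rng.standard_normal((2, 3))
x = rng.standard_normal(5)

# Two linear layers collapse into the single linear map W2 @ W1; and since the
# hidden width (3) is smaller than the input width (5), the composition is at
# best a rank-3 map, i.e. information is lost, like a dimensionality reduction.
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)
```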
Another way to generalize the model is shown in Figure 5.2, where input nodes can be connected directly to output nodes without passing through a hidden layer. (Note: this kind of NN structure is more general, and BP can still handle the optimization, but how are such skip-layer connections produced? The book does not expand on this, and I do not know whether such models have applications in deep network architectures. If any reader knows, please leave a comment.)
Another very important property is that the NN model can be sparse; in fact the brain is the same, in that not all neurons are active, only a very small fraction, and neurons in different layers need not be fully connected. In Section 5.5.6 we will see an example of the sparse network structure used by convolutional neural networks.
We can naturally design more complex network structures, but in general we restrict the network structure to be feed-forward, that is, with no closed directed cycles; as in Figure 5.2, each hidden unit or output unit can then be computed as:

z_k = h( Σ_j w_{kj} z_j )        (5.10)

where the sum runs over all units j that send connections to unit k (a bias term can be included via a unit fixed at 1).
Thus, given the inputs, all the units in the network are updated in turn (and each may or may not become strongly activated). The neural network model has a powerful approximation ability, and for this reason such networks are also called universal approximators.
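A minimal sketch of this general feed-forward evaluation (my own illustration; the unit names, weights and the tiny skip-layer network below are made up, in the spirit of Figure 5.2), where units are visited in topological order and each applies (5.10):

```python
import numpy as np

def forward_dag(inputs, units, h=np.tanh):
    # inputs: dict of input values, e.g. {"x1": 0.5, "x2": -1.0}
    # units:  list of (name, {source_name: weight}) in topological order;
    #         the special source "bias" contributes a constant 1.
    z = dict(inputs, bias=1.0)
    for name, in_weights in units:
        a = sum(w * z[src] for src, w in in_weights.items())
        z[name] = h(a)   # (5.10): z_k = h(sum_j w_kj * z_j)
    return z

# Toy network with one hidden unit and a skip-layer connection x1 -> y:
units = [
    ("h1", {"x1": 0.8, "x2": -0.3, "bias": 0.1}),
    ("y",  {"h1": 1.5, "x1": 0.4, "bias": -0.2}),
]
values = forward_dag({"x1": 0.5, "x2": -1.0}, units)
print(values["y"])
```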
In fact, a two-layer NN model can approximate essentially any function, as long as there are enough hidden units and the parameters are trained well. Figure 5.3 below illustrates the fitting ability of the NN model; please see the description to the left of the figure.
5.1.1 Weight-space Symmetries
This is an interesting property of feed-forward networks. Take the typical two-layer network of Figure 5.1 and consider a single hidden unit. If we flip the signs of all of its input weights, then with the tanh function as an example the unit's value is negated, since tanh(−a) = −tanh(a). If we then also flip the signs of all of the weights on that unit's outgoing connections, we get exactly the same network output; that is, two different sets of weights produce the same mapping. With M hidden units there are therefore 2^M equivalent weight configurations of this kind.
In addition, if we swap the input and output weights of two hidden units with each other, the final output of the whole network is again unchanged; any permutation of the M hidden units gives an equivalent set of weights, and there are M! such permutations. The network above therefore has an overall weight-space symmetry factor of M!·2^M. Similar properties hold for many other activation functions, but in general we rarely need to care about this.
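A quick numerical check of both symmetries (my own sketch; the network and weights are random and made up):

```python
import numpy as np

def net(x, W1, b1, W2, b2):
    # Two-layer network with tanh hidden units and linear outputs.
    return W2 @ np.tanh(W1 @ x + b1) + b2

rng = np.random.default_rng(2)
D, M, K = 2, 4, 1
W1, b1 = rng.standard_normal((M, D)), rng.standard_normal(M)
W2, b2 = rng.standard_normal((K, M)), rng.standard_normal(K)
x = rng.standard_normal(D)
y = net(x, W1, b1, W2, b2)

# 1) Sign-flip symmetry: negate all weights (and the bias) into hidden unit 0
#    and all weights out of it; tanh(-a) = -tanh(a), so the output is unchanged.
W1_f, b1_f, W2_f = W1.copy(), b1.copy(), W2.copy()
W1_f[0, :] *= -1; b1_f[0] *= -1; W2_f[:, 0] *= -1
assert np.allclose(y, net(x, W1_f, b1_f, W2_f, b2))

# 2) Permutation symmetry: swap hidden units 0 and 1, carrying along both their
#    incoming and outgoing weights; the output is again unchanged.
perm = [1, 0] + list(range(2, M))
assert np.allclose(y, net(x, W1[perm], b1[perm], W2[:, perm], b2))
```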
These notes on Pattern Recognition and Machine Learning (PRML) cover Section 5.1, Neural Networks: feed-forward network functions.