The previous articles introduced two classic machine learning algorithms, linear regression and logistic regression, and the accompanying exercises showed that these two methods can achieve good results on some problems. Now let's look at another machine learning algorithm: the neural network. Linear regression and logistic regression cannot, even in theory, solve every regression and classification problem, so why do we need other kinds of machine learning algorithms, such as the neural network discussed here? The reason is simple. In the earlier exercises in this blog series, the input features of the sample points had very few dimensions (for example, 2 or 3). To solve such problems with logistic regression, the original features must first be mapped into a higher-dimensional space; if the features are 3-dimensional and polynomial terms up to degree 3 are included, the mapped feature vector already has 20 dimensions. Real-world data, however, usually has far more features. Even a modest 50*50 grayscale image already has 2,500 raw features, and if logistic regression with a polynomial feature mapping were used for object detection, the feature count could reach into the millions. This not only makes the computation expensive, but a function learned over such a huge feature dimension also tends to overfit. In short, linear regression and logistic regression alone are not enough for real-world problems, and the neural network was gradually developed because of its unique advantages here.
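As a quick sanity check on the numbers above (a worked count added here, not from the original post): a full polynomial feature map of degree $d$ over $n$ raw features produces $\binom{n+d}{d}$ monomial terms, constant term included.

```latex
% number of monomials of total degree <= d in n variables
\binom{n+d}{d}
% 3 raw features, cubic terms:      \binom{3+3}{3} = 20
% 2500 pixels, quadratic terms:     \binom{2502}{2} \approx 3.1 \times 10^{6}
```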
The structure of the neural network model is fairly clear to express: the input values are multiplied by their corresponding weights, summed, and a bias term is added to produce the output of a node. The mathematical notation, however, is rather cumbersome and easy to get wrong. Suppose layer j of the network has s_j nodes and layer j+1 has s_(j+1) nodes; then the parameters of layer j form a matrix of size s_(j+1) * (s_j + 1), where the extra column corresponds to the bias node whose output is fixed to 1 and is not counted among the s_j nodes. Clearly, to keep the formulas manageable, vectorized expressions are commonly used for neural networks. Why do neural networks have such strong learning ability? First, biologically speaking, they mimic the workings of the human brain, and the brain has a powerful learning mechanism. Second, looking at the network model itself, if we consider only the last layer, i.e., the connection between the output layer and the layer just before it, it is effectively a simple linear regression (or a logistic regression, if the output is squashed to between 0 and 1). In other words, all the layers before it are just learning new features, and these new features happen to be well suited to solving the problem. So, to put it bluntly, a neural network is a way of learning features that are better suited to the problem at hand.
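A minimal sketch of one vectorized forward step, assuming a sigmoid activation; the shapes mirror the s_(j+1) * (s_j + 1) parameter matrix described above (the function and variable names are illustrative, not from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(a_j, theta_j):
    """Propagate activations from layer j to layer j+1.

    a_j     : activations of layer j, shape (s_j,)
    theta_j : weights of layer j,     shape (s_{j+1}, s_j + 1)
              (the extra column multiplies the constant bias unit 1)
    """
    a_with_bias = np.concatenate(([1.0], a_j))   # prepend the bias unit
    z = theta_j @ a_with_bias                    # shape (s_{j+1},)
    return sigmoid(z)

# Example: layer j has 3 units, layer j+1 has 4 units.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3 + 1))
a_next = forward_layer(np.array([0.2, -0.5, 0.9]), theta)
print(a_next.shape)   # (4,)
```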
On the surface, the previous layer and the current layer of the neural network are directly connected, and a linear combination of the previous layer's outputs forms the current layer's output. If that were all, then no matter how many layers the network had, it could only learn linear combinations of the input features, so why can neural networks learn arbitrary nonlinear functions? In fact, the statement above contains an essential mistake: the linear combination of the previous layer's outputs is not used directly as the current layer's output; it generally passes through an activation function first, most commonly the logistic (sigmoid) function (others such as the hyperbolic tangent are also very common). Without this nonlinearity, the network could indeed learn only linear features. Neural networks are quite powerful: a single-layer network can learn the AND, OR, NOT, and NOR gates, a two-layer network can learn the XOR gate (built by combining basic gates such as AND, OR, and NOR; a numeric sketch follows below), and a 3-layer network is capable of learning arbitrary functions (the input and output layers are not counted here). There are many interesting stories about this in the history of neural networks. Of course, neural networks also extend very naturally to multi-class problems: for an N-class problem, simply put N nodes in the output layer of the network. If the classes are separable, there is always a network that can be learned such that, for a given input, exactly one of the N output nodes is 1, which achieves the goal of multi-class classification.
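A minimal numeric sketch of the two-layer idea, with hand-picked (not learned) weights; this particular composition builds XOR as AND(OR, NAND), which is just one of several equivalent gate combinations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_net(x1, x2):
    """Two-layer sigmoid network computing XOR with hand-set weights."""
    x = np.array([1.0, x1, x2])                              # bias + inputs
    h_or   = sigmoid(np.array([-10.0,  20.0,  20.0]) @ x)    # x1 OR x2
    h_nand = sigmoid(np.array([ 30.0, -20.0, -20.0]) @ x)    # NOT (x1 AND x2)
    h = np.array([1.0, h_or, h_nand])                        # bias + hidden units
    return sigmoid(np.array([-30.0, 20.0, 20.0]) @ h)        # h_or AND h_nand

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_net(a, b)))   # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```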
The loss function of a neural network is easy to determine; take a multi-class neural network as an example. Of course, the loss function discussed here lives within the framework of supervised learning, because only then do we know how much has been "lost" (frameworks developed more recently for unsupervised learning can also define loss functions, for example the autoencoder). Assume the parameters of the network have already been learned; then for each input sample we get an output value, and comparing this output with the sample's labeled output gives one loss term. Since the output of a multi-class network is a multidimensional vector, the loss must be computed over every dimension (and since it is a multi-class problem, the training labels should also be multidimensional, or at least be converted into a multidimensional form). Written this way, the loss function of the neural network looks very much like the loss function of logistic regression from the previous article, which makes it easy to understand.
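For concreteness, one common form of this loss is the regularized cross-entropy cost in Andrew Ng's course notation (here m, K, L, s_l denote the sample count, number of output classes, number of layers, and layer sizes; these symbols are my labels, not the original post's):

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}
  \Big[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k
      + \big(1 - y_k^{(i)}\big)\log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \Big]
  + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}
    \big(\Theta_{ji}^{(l)}\big)^{2}
```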
With the loss function written down, we can use gradient descent or Newton's method to find the parameters of the network. Whichever method is used, we need the partial derivative of the loss function with respect to each parameter, so the work centers on computing these partial derivatives. The most famous algorithm for doing this is the BP algorithm, also called the back-propagation algorithm. When using BP, it can be shown that the partial derivative of the loss with respect to a parameter in layer l equals the error of the node that parameter feeds into in layer l+1, multiplied by the output value of the node it comes from in layer l. The work then reduces to finding a way to compute the error of every node in every layer (the input layer, of course, has no error). It can also be shown theoretically that the error of each node can be computed from the errors of the nodes in the next layer, so the errors propagate backwards through the network (which is where the name "back-propagation" comes from). Summing up, when there are multiple training samples: feed in one sample, compute the output value of every node with a forward pass, then use the sample's label to compute the error of every node with a backward pass; the contribution of this sample to each parameter's partial derivative is the product of the node output and the node error, and these contributions are accumulated. When all the samples have been processed the same way, the final accumulated value gives the partial derivative of the loss function with respect to the corresponding parameter. The theoretical essence of the BP algorithm is that the error of a node is simply the error passed back from the layer in front of it, with the network weights acting as the transfer coefficients.
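A compact sketch of this accumulation loop for a single hidden layer, assuming sigmoid activations and the cross-entropy cost above (the function and variable names are mine, not the original's, and regularization is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, theta1, theta2):
    """Accumulate gradients over all samples for a 1-hidden-layer network.

    X      : (m, n) inputs,  Y : (m, K) one-hot labels
    theta1 : (s1, n + 1),    theta2 : (K, s1 + 1)
    Returns gradients with the same shapes as theta1 and theta2.
    """
    m = X.shape[0]
    grad1 = np.zeros_like(theta1)
    grad2 = np.zeros_like(theta2)

    for i in range(m):
        # forward pass: compute every node's output
        a1 = np.concatenate(([1.0], X[i]))              # input + bias
        a2 = np.concatenate(([1.0], sigmoid(theta1 @ a1)))
        a3 = sigmoid(theta2 @ a2)                       # network output

        # backward pass: output error, then hidden-layer error
        d3 = a3 - Y[i]                                  # error of output nodes
        d2 = (theta2.T @ d3)[1:] * a2[1:] * (1.0 - a2[1:])

        # gradient contribution = node error x node activation
        grad2 += np.outer(d3, a2)
        grad1 += np.outer(d2, a1)

    return grad1 / m, grad2 / m
```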
In general, it is very easy to make mistakes when solving a neural network with gradient descent, because computing the partial derivatives of the loss function with respect to the parameters involves several matrices, which is error-prone in code. If either the loss function or its partial derivatives is wrong, the subsequent iterations only go further astray and fail to converge, so it is necessary to check whether the partial derivatives are correct. In his course, Andrew Ng recommends using gradient checking for this: after computing a partial derivative of the loss function analytically at some parameter value, take two points close to that value on either side, divide the difference of the loss function at those two points by the distance between them (if the two points are close enough, this is essentially the definition of the derivative), and compare the two results. If they are nearly equal, then to a large extent the partial derivative has not been computed incorrectly, and the rest of the work can proceed with confidence. It is important to remember to turn gradient checking off during actual training, because running gradient checking together with the BP error computation for every layer is very time-consuming (although I feel that even without gradient checking, don't you still have to run the BP backward pass anyway?).
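A small sketch of that numerical check using a centered difference (the epsilon and the comparison tolerance are illustrative choices, and backprop_gradient is a hypothetical stand-in for your BP routine):

```python
import numpy as np

def numerical_gradient(loss, theta, eps=1e-4):
    """Approximate dJ/dtheta component-wise with a centered difference."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        bump = np.zeros_like(theta)
        bump[idx] = eps
        grad[idx] = (loss(theta + bump) - loss(theta - bump)) / (2.0 * eps)
    return grad

# Usage sketch: compare against the analytic gradient from back-propagation.
# analytic = backprop_gradient(theta)        # hypothetical BP routine
# numeric  = numerical_gradient(J, theta)    # J is the loss as a function of theta
# assert np.allclose(analytic, numeric, atol=1e-6)
```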
When training the network, do not set all the initial parameter values to the same number, because then the parameters learned in each layer all end up identical, that is, the hidden features learned are all the same, which is redundant and performs poorly. It is therefore advisable to initialize these parameters randomly, generally with mean 0 and small random values around 0.
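A minimal sketch of such symmetry-breaking initialization (the interval bound eps_init is an illustrative choice):

```python
import numpy as np

def init_weights(s_out, s_in, eps_init=0.12, rng=np.random.default_rng()):
    """Zero-centered uniform random weights in [-eps_init, eps_init]
    for a layer with s_in inputs (plus a bias column) and s_out outputs."""
    return rng.uniform(-eps_init, eps_init, size=(s_out, s_in + 1))
```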
If the same algorithm is used to learn the network parameters (for example, the BP algorithm), then the performance of the network depends on its structure, that is, on the number of hidden layers and the number of neurons in each hidden layer. The common default structure has only one hidden layer; if several hidden layers are needed, the number of neurons in each hidden layer is usually set to the same value. In general, the more hidden-layer neurons there are, the better the effect.
Tornadomeet. Source: http://www.cnblogs.com/tornadomeet. You are welcome to reprint or share, but please be sure to credit the source of the article.
Reprinted from: Deep Learning: Seven (Basic Knowledge_2)