Machine Learning Five: Neural Networks and the Backpropagation Algorithm


I. Limitations of logistic regression

In the logistic regression section, we used one-vs-all (multi-class) logistic regression to recognize the digits in 20*20 pixel images.

But we used only a first-order model and did not add polynomial features. Why?

Imagine adding second-, third-, or fourth-order polynomial terms to data samples that already have 400 features. What would happen?

Obviously, the number of features per training sample explodes as the polynomial order increases. More importantly, fitting that many features with a single formula is very difficult, and the training may not converge to an acceptable state.
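To get a sense of scale, here is a minimal sketch (the counts are for monomials of exactly the given degree; the numbers are meant only as an order-of-magnitude illustration):

n = 400;                       % original number of features
% nchoosek(n + d - 1, d) counts the monomials of degree exactly d in n variables
deg2 = nchoosek(n + 1, 2);     % about 80,200 second-order terms
deg3 = nchoosek(n + 2, 3);     % about 10.7 million third-order terms
fprintf('degree-2 terms: %d, degree-3 terms: %d\n', deg2, deg3);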

In other words, logistic regression cannot provide a very complex model.

Because it is essentially a linear classifier, it is only good at solving linearly separable problems.

So how do we solve problems that are not linearly separable?

Solution Ideas

Suppose there were a method that, through feature extraction, could turn a non-linearly-separable problem into one that is nearly linearly separable, and we then applied logistic regression. Could that solve the nonlinear problem?

This is the idea of neural networks.

II. Neural networks

1. Structure

The structure of a neural network is shown below:



Above is the simplest model, divided into three layers: the input layer, the hidden layer, and the output layer.

The hidden part can consist of multiple layers, and by extending the hidden layers you can build a more complex model, such as the following:



The output of each layer is the input of the next layer; the layers connect to form a network.

A node in the network is called a neuron. Each neuron in fact performs an operation similar to logistic regression (we say "similar" because logistic regression can be used, but other functions can be substituted; still, logistic regression is a good way to understand how it works).

Following the analysis in the introduction above, it should be clear that the hidden layers perform feature extraction, while the output layer is effectively a logistic regression.

Why do the hidden layers amount to feature extraction?

For ease of understanding, assume here that every neuron performs logistic regression.

A single logistic regression can split the plane once. In a neural network, n logistic regressions are performed, so the plane can be cut into many regions, which the output layer finally combines into a result.

If you focus only on the output layer, each region cut out by the preceding layers can be regarded as a feature, a higher-level feature extracted from the original sample. This is feature extraction.
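To make this concrete, here is a minimal sketch of a tiny 2-2-1 network whose hand-picked (not learned) weights solve the classic XOR problem, something a single logistic regression cannot do: the hidden layer extracts two higher-level features (roughly OR and NAND), and the output layer is just a logistic regression (an AND) over those features. The weights are illustrative only.

g  = @(z) 1 ./ (1 + exp(-z));          % sigmoid
X  = [0 0; 0 1; 1 0; 1 1];             % the four XOR inputs
Theta1 = [-10  20  20;                 % hidden unit 1 ~ (x1 OR x2)
           30 -20 -20];                % hidden unit 2 ~ NOT(x1 AND x2)
Theta2 = [-30  20  20];                % output unit ~ AND of the two hidden features
a1 = [ones(4,1) X];                    % add the bias column
a2 = g(a1 * Theta1');                  % hidden-layer "features"
a3 = g([ones(4,1) a2] * Theta2')       % outputs: approximately [0; 1; 1; 0]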

2. Calculation principle

2.1 Forward propagation: computing the output

The following works out how the final result is obtained when a sample enters at the input layer.

Assuming that each neuron performs a logistic regression calculation, the output of layer \(i\) of the network is:

\[a^{(i)} = g(z^{(i)}) = g(\theta^T a^{(i-1)}) \tag{1}\]

Take the following three-layer network as an example:



The input and output of each layer are as follows:

Input Layer:

\[a^{(1)} = x\]
Hidden Layer:

\[\begin{split}z^{(2)} &= \theta^{(1)}a^{(1)} \\ a^{(2)} &= g(z^{(2)})\end{split}\]
Output Layer:

\[\begin{split}z^{(3)} &= \theta^{(2)}a^{(2)} \\ a^{(3)} &= g(z^{(3)})\end{split}\]

The end result of the entire network is:

\[h_\theta (x) = a^{(3)}\]

In the process above, the output of one layer serves as the input of the next; the computation is stacked layer by layer until the final output is produced. This way of computing is called "forward propagation".
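For a single sample x (stored as a column vector), the computation of the three-layer network above can be sketched as follows; Theta1, Theta2 and the sigmoid function g are assumed to already exist, and the variable names are only illustrative:

a1 = [1; x];            % input layer, with the bias unit prepended
z2 = Theta1 * a1;       % hidden-layer pre-activation, as in equation (1)
a2 = [1; g(z2)];        % hidden-layer activation, plus bias unit
z3 = Theta2 * a2;
a3 = g(z3);             % h_theta(x): the output of the network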

2.2 Backpropagation: solving for the theta matrices

The purpose of the training algorithm is to find the parameter matrices that minimize the error function. To minimize the error with gradient descent, we need to compute the error function J and the partial derivatives of J with respect to theta.

2.2.1 Error Function J

\[J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log(h_\theta(x^{(i)}))_k + (1-y_k^{(i)})\log(1-h_\theta(x^{(i)}))_k\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\theta_{ji}^{(l)})^2 \tag{2}\]

where \(K\) is the number of units in the output layer, i.e. the number of categories; when computing the error, every category must be counted. The regularization term at the end is the sum of the squares of all the parameters \(\theta\) in the whole network.
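A minimal sketch of how equation (2), without the regularization term, might be evaluated for a batch of m samples; a3 is assumed to hold the network outputs (an m x K matrix) and y the integer labels. This mirrors the implementation shown later:

Y = zeros(m, num_labels);          % recode the labels as one-hot row vectors
for i = 1:m
    Y(i, y(i)) = 1;
end
J = (-sum(sum(Y .* log(a3))) - sum(sum((1 - Y) .* log(1 - a3)))) / m;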

2.2.2 Partial derivative of J with respect to theta

Here we give the results first, then the derivation:

\[\frac{\partial}{\partial\theta_{ij}^{(l)}}J(\theta) = \frac{1}{m}\sum_{t=1}^{m}\delta_i^{(t)(l+1)}a_j^{(t)(l)} + \frac{\lambda}{m}\theta_{ij}^{(l)} \tag{3}\]

where

\[\begin{cases}\delta^{(L)} = a^{(L)} - y \\ \delta^{(l)} = \delta^{(l+1)}(\theta^{(l+1)})^T \cdot g'(z^{(l)}) \\ \delta^{(1)} = 0\end{cases}\tag{4}\]

What the above formulas say:

The error of layer \(l\) can be calculated from the error of layer \(l+1\), and the error of the last layer is the difference between the value computed by forward propagation and the sample's \(y\) value.
In other words, starting from the output layer, the error of each layer can be obtained by iterating backward layer by layer. Once the errors are determined, the partial derivatives can be calculated, and the model can then be adjusted. This is the "backpropagation algorithm".

What is actually propagated backward is the error.

    • An intuitive understanding of the error:

The error of the output layer is the total error of the system;

The error of a middle layer is its contribution to the total error (hence the \(\theta\) matrix acts as feature weights in forward propagation, but as error weights in backpropagation);

The input layer outputs the original data, so it has no error.
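Before the derivation, here is a minimal vectorized sketch of the recursion (4) and of the unregularized part of the gradient (3) for the three-layer network; a1, a2, a3, Y, Theta1 and Theta2 are assumed to come from forward propagation over m samples, sigmoidGradient is the derivative of the sigmoid (defined in the implementation section below), and the names are illustrative:

z2 = a1 * Theta1';                                            % hidden-layer pre-activation
delta3 = a3 - Y;                                              % output-layer error
delta2 = (delta3 * Theta2(:, 2:end)) .* sigmoidGradient(z2);  % hidden-layer error (bias column excluded)
Theta2_grad = (delta3' * a2) / m;                             % gradients, regularization omitted
Theta1_grad = (delta2' * a1) / m;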

2.2.3 Derivation of the backpropagation algorithm

(1) Part one: deriving the partial derivative

The conclusions of backpropagation were stated above; they are derived below.

In matrix form, the partial derivative for layer \(l\) is computed as:

\[\begin{split}\frac{\partial J(\theta)}{\partial\theta^{(l)}} &= \frac{\partial J(\theta)}{\partial z^{(l+1)}} \cdot \frac{\partial z^{(l+1)}}{\partial \theta^{(l)}} \\ &= \frac{\partial J(\theta)}{\partial z^{(l+1)}} \cdot \frac{\partial (\theta^{(l)}a^{(l)})}{\partial \theta^{(l)}} \\ &= \frac{\partial J(\theta)}{\partial z^{(l+1)}} \cdot a^{(l)}\end{split}\tag{5}\]

Let

\[\delta^{(l)} = \frac{\partial J(\theta)}{\partial z^{(l)}} \tag{6}\]

Then

\[\begin{split}\frac{\partial J(\theta)}{\partial\theta^{(l)}} &= \frac{\partial J(\theta)}{\partial z^{(l+1)}} \cdot a^{(l)} \\ &= \delta^{(l+1)} \cdot a^{(l)}\end{split}\tag{7}\]

(2) Part two: deriving the error delta

In the derivation process above, there is this equation:

\[\delta^{(l)} = \frac{\partial J (\theta)}{\partial z^{(l)}}\]

What does this mean? Below it is derived separately for the output layer and for the middle layers, to interpret this equation.

    • Output layer

The error function is as follows (the regularization term is omitted here):

\[J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log(h_\theta(x^{(i)}))_k + (1-y_k^{(i)})\log(1-h_\theta(x^{(i)}))_k\right]\]

This expression is the total error; the error contributed by each neuron of the output layer can be written in matrix form as:

\[C = -[y\log(h_\theta(x)) + (1-y)\log(1-h_\theta(x))]\tag{8}\]

Therefore, the error of the output layer is:

\[\begin{split}\delta^{(L)} &= \frac{\partial J(\theta)}{\partial z^{(L)}} = \frac{\partial C}{\partial z^{(L)}} \\ &= -\frac{\partial}{\partial z^{(L)}}\left[y\log(g(z^{(L)})) + (1-y)\log(1-g(z^{(L)}))\right] \\ &= -\frac{y}{g(z^{(L)})}g'(z^{(L)}) - \frac{1-y}{1-g(z^{(L)})}(-g'(z^{(L)})) \\ &= \frac{g(z^{(L)})-y}{g(z^{(L)})(1-g(z^{(L)}))}\,g'(z^{(L)}) \\ &= g(z^{(L)})-y \\ &= a^{(L)}-y\end{split}\tag{9}\]

This result is meaningful: the \(\delta\) value of the output layer is exactly the difference between the system's output and the sample's \(y\) value (the last two steps use the sigmoid property \(g'(z) = g(z)(1-g(z))\)). Therefore we call \(\delta\) the error of each neuron in each layer of the network.

    • Derivation of middle-layer error

For layer \(l\):

\[\begin{split}\delta^{(l)} &= \frac{\partial J(\theta)}{\partial z^{(l)}} \\ &= \frac{\partial J(\theta)}{\partial z^{(l+1)}} \cdot \frac{\partial z^{(l+1)}}{\partial z^{(l)}} \\ &= \delta^{(l+1)} \cdot \frac{\partial [(\theta^{(l+1)})^T g(z^{(l)})]}{\partial z^{(l)}} \\ &= \delta^{(l+1)} \cdot (\theta^{(l+1)})^T g'(z^{(l)})\end{split}\tag{10}\]

That is, the error of layer \(l\) can be calculated from the error of layer \(l+1\), which is exactly the conclusion stated earlier.

This completes the derivation of backpropagation.

III. Program implementation

The example comes from the programming exercises of Andrew Ng's machine learning course. The samples are the same as those used for multi-class digit recognition in the logistic regression section.

1. Compute the loss function and gradient
function [J, grad] = nnCostFunction(nn_params, ...
                                    input_layer_size, ...
                                    hidden_layer_size, ...
                                    num_labels, ...
                                    X, y, lambda)
% Reshape nn_params back into Theta1 and Theta2, the weight matrices
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);

% Variables to return
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));

% ------ Forward propagation: compute the output ------
% Input layer
a1 = [ones(m, 1) X];        % add the +1 bias unit to X
% Hidden layer
a2 = sigmoid(a1 * Theta1');
a2 = [ones(m, 1) a2];
% Output layer
a3 = sigmoid(a2 * Theta2');

% ------ Recode the sample labels y ------
% [1 0 0 0 0 0 0 0 0 0] -- the label is 1
% [0 1 0 0 0 0 0 0 0 0] -- the label is 2
Y = zeros(m, num_labels);
for i = 1:m
    Y(i, y(i)) = 1;
end

% ------ Loss function J ------
J = (-sum(sum(Y .* log(a3))) - sum(sum((1 - Y) .* log(1 - a3)))) / m;

% Remove theta0 (the bias column) before regularizing
t1 = Theta1(:, 2:end);
t2 = Theta2(:, 2:end);
regularize = lambda / 2 / m * (sum(sum(t1.^2)) + sum(sum(t2.^2)));
J = J + regularize;

% ------ Backpropagation: compute each layer's error ------
delta3 = a3 - Y;
delta2 = delta3 * Theta2 .* a2 .* (1 - a2);
delta2 = delta2(:, 2:end);

% ------ Compute the gradient ------
Theta1_grad = (delta2' * a1 + [zeros(size(t1, 1), 1) t1] * lambda) / m;
Theta2_grad = (delta3' * a2 + [zeros(size(t2, 1), 1) t2] * lambda) / m;

% Unroll gradients
grad = [Theta1_grad(:); Theta2_grad(:)];
end
2. Forward propagation and the delta calculation need the sigmoid function and its derivative

2.1 Sigmoid function
function g = sigmoid(z)
g = 1.0 ./ (1.0 + exp(-z));
end
2.2 Derivative of the sigmoid function
function g = sigmoidGradient(z)
g = sigmoid(z) .* (1 - sigmoid(z));
end
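A quick sanity check (illustrative): the derivative of the sigmoid reaches its maximum of 0.25 at z = 0 and vanishes where the sigmoid saturates.

sigmoidGradient(0)            % ans = 0.2500
sigmoidGradient([-10 0 10])   % approximately [0  0.25  0]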
3. Training process

3.1 Random initialization of the theta parameter matrices
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
% Unroll parameters
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];

In logistic regression, the theta vector can be initialized to identical values, such as all zeros or all ones. But this does not work in a neural network.

The reason is that in a neural network the neurons are fully connected: every node of layer n-1 is connected to every node of layer n.

If the theta matrices are initialized to identical values, then every neuron in the same layer performs exactly the same computation, and many neurons doing the same operation contributes nothing to fitting the data; it just wastes resources and creates redundancy. This is the symmetry problem.

The implementation of the random initialization parameters is as follows:

function W = randInitializeWeights(L_in, L_out)
W = zeros(L_out, 1 + L_in);
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
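The fixed epsilon_init = 0.12 works well for this exercise. A common heuristic, mentioned in the course notes, is to tie the range to the layer sizes instead; shown here only as an illustrative alternative:

% Alternative choice of the initialization range, based on the layer sizes
epsilon_init = sqrt(6) / sqrt(L_in + L_out);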
3.2 Set the training parameters and train
options = optimset('MaxIter', 100);
% Regularization parameter
lambda = 1;
% Loss function
costFunction = @(p) nnCostFunction(p, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, X, y, lambda);
% Minimize the cost function to obtain the parameters
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);
% Recover the parameter matrices of the two-layer network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));
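fmincg is a helper function supplied with the course exercise. If it is not available, a similar call can be made with the built-in fminunc (a sketch using the legacy optimset options interface; the result may differ slightly):

options = optimset('GradObj', 'on', 'MaxIter', 100);   % our cost function also returns the gradient
[nn_params, cost] = fminunc(costFunction, initial_nn_params, options);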
4. Prediction
pred = predict(Theta1, Theta2, X);
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);

The resulting accuracy is about 3 percentage points higher than that of logistic regression.

The reason is that a neural network can build a more complex model than logistic regression, so its ability to fit the data is stronger.

The predict function uses the trained parameter matrices and forward propagation to compute the output layer. The output layer gives, for an input sample passed through the neural network, something like the probability of belonging to each class. As in logistic regression, the class with the maximum value is taken as the final result.

function p = predict(Theta1, Theta2, X)
m = size(X, 1);
num_labels = size(Theta2, 1);
p = zeros(size(X, 1), 1);
h1 = sigmoid([ones(m, 1) X] * Theta1');
h2 = sigmoid([ones(m, 1) h1] * Theta2');
[dummy, p] = max(h2, [], 2);
end
