[Machine Learning] Study Notes: Neural Networks


Introduction

For a nonlinear classification problem with a large number of features, the regression algorithms discussed earlier become computationally very expensive and may also lead to overfitting.

At this point, we can choose to use a neural network algorithm.
Neural networks were originally conceived as a way to mimic how the brain learns.

A neuron has multiple dendrites that serve as input channels for information and an axon that serves as the output channel. The output of one neuron can be used as the input to another neuron. A neuron is similar in concept to a classifier in a multi-class classification problem: it receives multiple inputs and produces different outputs under different weights.

Model representation

The model can be written in the following form:
\[\begin{bmatrix}x_0 \newline x_1 \newline x_2 \newline \end{bmatrix}\rightarrow\begin{bmatrix}\ \ \ \newline \end{bmatrix}\rightarrow h_\theta(x)\]
This can be called a single-layer feedforward network, consisting of the input layer \(x\), the output layer, and the hidden layer in between.

Each layer after the input layer has a weight matrix and a bias unit used to calculate its output.
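Concretely (stated here as the standard convention, not spelled out in the original), for a layer \(i\) with \(s_{i-1}\) inputs and \(s_i\) units, the weight matrix and the bias have the shapes:
\[\omega^{(i)}\in\mathbb{R}^{s_i\times s_{i-1}},\qquad b^{(i)}\in\mathbb{R}^{s_i}\]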

Forward propagation


First, recall how \(h_\theta(x)\) is computed in the binary classification setting of logistic regression:
\[\begin{bmatrix}x_0 \newline x_1 \newline x_2\end{bmatrix} \rightarrow \begin{bmatrix}g(z^{(2)})\end{bmatrix} \rightarrow h_\theta(x)\]

Can be written as:
\[z^{(2)}=\omega^{(2)}a^{(1)}+b^{(2)}\\a^{(2)}=g(z^{(2)})\\h_\theta(x)=a^{(2)}\]
Forward propagation in a neural network builds on this by adding more layers, so that the output of one layer serves as the input of the next:
\[z^{(i)}=\omega^{(i)}a^{(i-1)}+b^{(i)}\\a^{(i)}=g(z^{(i)})\\z^{(i+1)}=\omega^{(i+1)}a^{(i)}+b^{(i+1)}\\\dots\]
It is important to note that each layer has multiple units, so the weights of each layer form a two-dimensional matrix.
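As a minimal sketch of these formulas in NumPy (not the course's implementation; the layer sizes and the `forward_propagation` helper below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, weights, biases):
    """Compute the activations of every layer.

    weights[i] is the matrix omega^{(i+2)} with shape
    (units in layer i+2, units in layer i+1); biases[i] is b^{(i+2)}.
    Returns the list of activations a^{(1)}, ..., a^{(L)}.
    """
    activations = [x]            # a^{(1)} is the input layer
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b            # z^{(i)} = omega^{(i)} a^{(i-1)} + b^{(i)}
        a = sigmoid(z)           # a^{(i)} = g(z^{(i)})
        activations.append(a)
    return activations

# Example: a 3-2-1 network with random (illustrative) parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
biases = [rng.standard_normal(2), rng.standard_normal(1)]
x = np.array([1.0, 0.5, -0.5])
h = forward_propagation(x, weights, biases)[-1]   # h_theta(x)
```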

Backpropagation

Intuitive understanding

However, with only the initial (e.g. randomly chosen) weight matrices and bias units, the predicted values will be far from ideal.
So, how do you make the predicted value match the real value?
\[z^{(i)}=\omega^{(i)}a^{(i-1)}+b^{(i)}\]
It can be seen that the final output can be changed by changing the \(a\), \(\omega\), and \(b\) of each layer, but \(a\) itself cannot be changed directly.
So essentially it comes down to changing \(\omega\) and \(b\) to bring the predicted value close to the true value.
The idea, as in logistic regression and linear regression, is to first construct a cost function, then drive its value down to a minimum with gradient descent, thereby obtaining suitable \(\omega\) and \(b\).
When using gradient descent, the gradient of each \(\omega\) and \(b\) needs to be computed: the larger the absolute value of a gradient, the more sensitive the cost function is to a change in that parameter, and the faster the cost function changes.
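Concretely, with learning rate \(\alpha\), each gradient descent step applies the standard update rule (stated here for completeness, not taken from the original):
\[\omega^{(i)}:=\omega^{(i)}-\alpha\frac{\partial C}{\partial \omega^{(i)}},\qquad b^{(i)}:=b^{(i)}-\alpha\frac{\partial C}{\partial b^{(i)}}\]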

Deriving the gradient formulas

Take the network in the 3Blue1Brown (3B1B) video as an example:

The cost can be written as the squared error between the activation of the last layer \(a^{(L)}\) and the true value \(y\): \((a^{(L)}-y)^2\). (Note: here \(L=4\); some textbooks multiply the squared error by \(1/2\).)
We now want to find the gradients with respect to \(\omega\) and \(b\).
Take \(\frac{\partial C_0}{\partial \omega^{(L)}}\) as an example:

Finding the gradient means finding how sensitive the cost function is to a change in the parameters.
Notice that changing \(\omega^{(L)}\) first affects \(z^{(L)}\), which then affects \(a^{(L)}\), and finally affects \(C_0\).
Using this structure, \(\frac{\partial C_0}{\partial \omega^{(L)}}\) can be decomposed:

This is the so-called chain rule:
\[\begin{split}\frac{\partial C_0}{\partial \omega^{(L)}}&=\frac{\partial z^{(L)}}{\partial \omega^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}}\\&=a^{(L-1)}\sigma'(z^{(L)})\,2(a^{(L)}-y)\end{split}\]

The gradient of \(b^{(L)}\) can be obtained in the same way:
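Applying the same chain rule, and noting that \(\partial z^{(L)}/\partial b^{(L)}=1\), this works out to:
\[\frac{\partial C_0}{\partial b^{(L)}}=\frac{\partial z^{(L)}}{\partial b^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial C_0}{\partial a^{(L)}}=\sigma'(z^{(L)})\,2(a^{(L)}-y)\]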

The network above has only one neuron per layer; if there are multiple units per layer, the formula above still holds.
As mentioned before, the weight matrix is two-dimensional, so two subscripts \(j,k\) are used to index \(\omega\):

The chain rule then becomes:
\[\begin{split}\frac{\partial C_0}{\partial \omega_{jk}^{(L)}}&=\frac{\partial z_j^{(L)}}{\partial \omega_{jk}^{(L)}}\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}\frac{\partial C_0}{\partial a_j^{(L)}}\\&=a^{(L-1)}_k\,\sigma'(z^{(L)}_j)\,2(a^{(L)}_j-y_j)\end{split}\]
To extend this formula to other layers, i.e. to \(\frac{\partial C}{\partial \omega_{jk}^{(l)}}\), only the last factor \(\frac{\partial C}{\partial a_j^{(l)}}\) in the formula needs to be recomputed.
Summarized as follows:
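In place of the original summary figure, the key recursion (following the same chain rule, applied one layer back) is that the sensitivity of the cost to an activation in layer \(L-1\) is obtained by summing over the units of layer \(L\) that it feeds into:
\[\frac{\partial C_0}{\partial a_k^{(L-1)}}=\sum_{j}\frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}}\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}\frac{\partial C_0}{\partial a_j^{(L)}}=\sum_{j}\omega_{jk}^{(L)}\,\sigma'(z_j^{(L)})\,2(a_j^{(L)}-y_j)\]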

Therefore, when computing the gradient, the first two factors \(a^{(L-1)}_k\) and \(\sigma'(z^{(L)}_j)\) can be computed directly, while the last factor \(\frac{\partial C_0}{\partial a_j^{(L)}}\) is computed at the output layer and then passed back one layer at a time; this is roughly what "backpropagation" means.
Andrew Ng's Machine Learning course gives a concrete calculation procedure, which can be understood with this idea.
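A minimal NumPy sketch of this idea (not the course's implementation; the `backprop` helper and its indexing convention are illustrative assumptions, with `weights[l]` mapping layer `l` to layer `l+1` in 0-based indexing):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Gradients of the squared-error cost C_0 = sum (a^{(L)} - y)^2
    for a single example, using the chain rule from the notes.
    Returns (grad_w, grad_b), lists matching weights and biases."""
    # Forward pass: store z^{(l)} and a^{(l)} for every layer.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    grad_w = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]

    # Output layer: dC_0/da^{(L)} = 2 (a^{(L)} - y), then pass it backward.
    delta_a = 2.0 * (activations[-1] - y)
    for l in range(len(weights) - 1, -1, -1):
        delta_z = delta_a * sigmoid_prime(zs[l])        # dC_0/dz^{(l)}
        grad_b[l] = delta_z                             # dz/db = 1
        grad_w[l] = np.outer(delta_z, activations[l])   # dz_j/dw_{jk} = a_k of previous layer
        delta_a = weights[l].T @ delta_z                # dC_0/da of previous layer
    return grad_w, grad_b
```

The gradients returned here can be sanity-checked against numerical differentiation of the cost computed by the forward-propagation sketch above.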

TIPS: Stochastic gradient descent


In the batch approach used so far, every weight update traverses all the samples and then takes the mean, which is inefficient. Instead, the samples can be divided into several equally sized mini-batches, and the weights are updated after traversing each mini-batch. Although the descent path may not be the shortest, the speed improves considerably. This is the stochastic (mini-batch) gradient descent algorithm.
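A rough sketch of this idea in NumPy, reusing the hypothetical `backprop` helper from the previous section; the hyperparameter values are placeholders:

```python
import numpy as np

def sgd(X, Y, weights, biases, learning_rate=0.5, batch_size=32, epochs=10, rng=None):
    """Mini-batch stochastic gradient descent.
    X has one sample per row; Y holds the matching targets.
    Relies on backprop(x, y, weights, biases) as sketched above."""
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle samples each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Accumulate gradients over this mini-batch only.
            grad_w = [np.zeros_like(W) for W in weights]
            grad_b = [np.zeros_like(b) for b in biases]
            for i in batch:
                gw, gb = backprop(X[i], Y[i], weights, biases)
                grad_w = [acc + g for acc, g in zip(grad_w, gw)]
                grad_b = [acc + g for acc, g in zip(grad_b, gb)]
            # Update the weights after each mini-batch, using the batch mean.
            m = len(batch)
            for l in range(len(weights)):
                weights[l] -= learning_rate * grad_w[l] / m
                biases[l] -= learning_rate * grad_b[l] / m
    return weights, biases
```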
