The first week of deep learning research


What follows is only my personal understanding; if it contains mistakes, please go easy on me.
(So far I have only read a few deep learning surveys and the neural network chapter of Tom Mitchell's book "Machine Learning", so my understanding is limited. Sections 3 and 4 are written only in general terms and are barely worth a look. Section 5 is purely my reading notes and poorly expressed; if anything is unclear, read Tom's book instead.)

The organization of this article:
1. My general understanding of deep learning
2. A brief history of development
3. Perceptron model
4. The gradient descent training method of Perceptron
5. Backpropagation algorithm (BP)

1. My general understanding of deep learning
Deep learning is a class of methods based on artificial neural networks. A multilayer neural network consists of an input layer, an output layer, and multiple hidden layers. In general, the input layer receives the basic representation of an object, and each hidden layer holds another feature representation of it: the lower hidden layers represent the object's low-level features, the higher hidden layers represent its high-level features, and the features at each layer are characterized through the coefficients connecting one layer to the next. The neural network's job is to extract the object's high-level features from its low-level representation, and the output layer then outputs the object's concrete class. Take image recognition as an example: the pixels of an image are fed in at the input layer, higher-level features of the image are extracted layer by layer, and iterative training keeps adjusting the network's coefficients until the pixels of an input image yield the correct image class at the output layer (admittedly a somewhat forced description).
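To make the layered picture concrete, below is a minimal sketch of a forward pass through a small multilayer network, in Python with NumPy. The layer sizes, the sigmoid activation, and the random coefficients are illustrative choices of my own, not taken from any particular system:

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# Toy 3-layer network: 4 "pixels" -> 3 hidden features -> 2 output classes.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 4))  # input -> hidden coefficients
W2 = rng.normal(scale=0.1, size=(2, 3))  # hidden -> output coefficients

x = np.array([0.2, 0.8, 0.1, 0.5])       # basic representation (e.g. pixel values)
h = sigmoid(W1 @ x)                      # hidden layer: a higher-level representation
o = sigmoid(W2 @ h)                      # output layer: scores for each class
print(h, o)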

2. A brief history of development
In the 1940s and 1950s, the perceptron model (a single artificial neuron) was proposed, modeled on the operating mechanism of biological neurons; by the 1960s it had become clear that the representational power of the single-layer perceptron was weak, and research interest cooled. In the 1980s the backpropagation (BP) algorithm was proposed, making it possible to train multilayer networks, but in practice training was generally limited to 3-layer networks, because BP was not up to training more layers. The more layers an artificial neural network has, the stronger its representational power and the fewer hidden-layer nodes it needs, so research on artificial neural networks remained limited. In 2006, Hinton published the paper on Deep Belief Networks, proposing a layer-wise training method, which made artificial neural network research catch fire again.

3. Perceptron model

(Figure omitted: the basic perceptron model.) The perceptron has an input part, x1, x2, ..., xn; a coefficient for each input, w1, w2, ..., wn (together called the weight vector); a threshold term w0; an activation function; and an output. The perceptron is a simple linear classification model: when the linear combination of the inputs (w1·x1 + w2·x2 + ... + wn·xn) is greater than the threshold, the activation function outputs 1, otherwise it outputs −1. If w0 is also folded in as the weight of a constant input x0 = 1, the rule can be written as a single formula: output 1 when w0 + w1·x1 + ... + wn·xn > 0, otherwise −1. Once training data are fed in and the coefficients are adjusted, the perceptron becomes a useful classifier. For two inputs, for example, the decision boundary is a straight line.

The perceptron model can only represent linearly separable functions and cannot represent nonlinear ones, so it was destined to be improved upon.
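As a sketch of the decision rule just described (with made-up weights, and folding w0 in as the weight of a constant input x0 = 1):

import numpy as np

def perceptron_output(w, x):
    # w[0] plays the role of w0; a constant input x0 = 1 is prepended to x.
    x = np.concatenate(([1.0], x))
    return 1 if np.dot(w, x) > 0 else -1

# Made-up weights giving the decision boundary -0.5 + x1 + x2 = 0 (a straight line).
w = np.array([-0.5, 1.0, 1.0])
print(perceptron_output(w, np.array([0.6, 0.2])))   # 1  (above the line)
print(perceptron_output(w, np.array([0.1, 0.1])))   # -1 (below the line)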

4. The gradient descent training method of Perceptron
Training a perceptron means learning the most appropriate coefficients: the coefficients that best characterize the target function, or equivalently, that give the perceptron the smallest error on a fixed set of training examples.
Expressed mathematically, it is this:
For gradient descent training, take the unthresholded output: o = w0 + w1·x1 + ... + wn·xn
The error is given by the formula E(w) = (1/2)·Σ_{d∈D} (t_d − o_d)², where D is the set of training examples, t_d is the target output for training example d, and o_d is the output of the unit on training example d.
So the task now is to adjust the weights so that E is minimized.
This is actually an optimization problem.
Assuming there are only two coefficients, w0 and w1, the relationship between E and (w0, w1) looks like this:

(Figure omitted: the parabolic error surface E over the w0–w1 plane, with an arrow along the negative gradient.)
The arrow shows the direction opposite to the gradient at a point, i.e. the direction of steepest descent along the error surface in the w0–w1 plane. It can be seen that by following this direction one can descend the error surface to the point where the error E is smallest.
What to do?
To minimize E, start from an arbitrary initial weight vector, then modify the vector in small steps. Each step changes the weight vector in the direction of steepest descent along the error surface; repeat the process until E reaches its minimum.
The gradient is the vector of partial derivatives of E with respect to each weight:
∇E(w) = [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]
Each step changes the weights by w ← w + Δw, where Δw = −η·∇E(w); component-wise, Δwi = −η·∂E/∂wi, which for the E defined above works out to Δwi = η·Σ_{d∈D} (t_d − o_d)·x_{i,d} (η is the learning rate).
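Here is a minimal sketch of this training loop for the unthresholded linear unit, i.e. batch gradient descent with the Δwi rule above. The data, learning rate, and number of epochs are placeholder choices of my own:

import numpy as np

def train_linear_unit(X, t, eta=0.02, epochs=200):
    # X: (n_samples, n_features) inputs; t: (n_samples,) target outputs.
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # constant input x0 = 1 for w0
    w = np.zeros(X.shape[1])                      # arbitrary initial weight vector
    for _ in range(epochs):
        o = X @ w                                 # o_d for every training example
        w += eta * X.T @ (t - o)                  # delta_wi = eta * sum_d (t_d - o_d) * x_id
    return w

# Recover the line t = 1 + 2*x from slightly noisy samples (made-up data).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 1))
t = 1 + 2 * X[:, 0] + rng.normal(scale=0.01, size=50)
print(train_linear_unit(X, t))  # approximately [1.0, 2.0]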

5. Backpropagation algorithm (BP)
Cascading multiple linear units still produces only a linear function, and we want a network that can represent nonlinear functions. The perceptron unit will not do (its thresholded output is discontinuous, hence not differentiable), but the sigmoid unit will. The sigmoid unit is the neuron used in neural networks trained with the BP algorithm.
Unlike the perceptron, the sigmoid unit uses a different activation function:
o = σ(w0 + w1·x1 + ... + wn·xn), where σ(y) = 1 / (1 + e^(−y))
σ is smooth and differentiable, with the convenient property σ'(y) = σ(y)·(1 − σ(y)).
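In code, the sigmoid and the derivative property just mentioned (a trivial sketch):

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_deriv_from_output(o):
    # The derivative expressed through the unit's own output: sigma'(y) = o * (1 - o).
    return o * (1.0 - o)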
All right. Back to the BP algorithm.
(Figure omitted: a multilayer network of sigmoid units.)
How does BP train a multilayer network?
A: Still with gradient descent, but slightly improved.
Since BP trains the multilayer network by gradient descent, how is the error E defined?
It is defined as:
E(w) = (1/2)·Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²
where outputs is the set of the network's output units, t_kd is the target value of training example d at the k-th output unit, and o_kd is the actual output for training example d at the k-th output unit. For each output unit k (a unit in the last layer), the error term is
δ_k = o_k·(1 − o_k)·(t_k − o_k)
In the definition of E above, do the neurons of the hidden layers have target output values? If not, how must their error terms be defined?
A: For each hidden unit h, the error term is
δ_h = o_h·(1 − o_h)·Σ_{k∈outputs} w_kh·δ_k
Because a training example only provides target values t_k for the network's outputs, there is no direct target value from which to compute the error of a hidden unit. Instead, the error term of a hidden unit is computed indirectly: take the weighted sum of the error terms δ_k of the output units influenced by hidden unit h, weighting each δ_k by w_kh, the weight from hidden unit h to output unit k. This weight captures the degree to which hidden unit h is "responsible" for the error of output unit k.

Then every weight can be updated by the rule:
w_ji ← w_ji + η·δ_j·x_ji
where x_ji is the i-th input to unit j, and η (eta), here as above, is the learning rate.
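Putting the pieces together, here is a minimal sketch of BP for a 3-layer network of sigmoid units, doing one stochastic gradient update per training example. The network shape, the XOR toy data, and the learning rate are placeholder choices of my own:

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def bp_step(x, t, W_hid, W_out, eta=0.5):
    # One stochastic-gradient BP update on a single training example (x, t).
    x1 = np.concatenate(([1.0], x))        # constant input x0 = 1 for hidden biases
    o_h = sigmoid(W_hid @ x1)              # hidden unit outputs
    h1 = np.concatenate(([1.0], o_h))      # constant input for output biases
    o_k = sigmoid(W_out @ h1)              # output unit outputs
    delta_k = o_k * (1 - o_k) * (t - o_k)  # output error terms (use the target directly)
    # Hidden error terms: weighted sum of the output deltas, weights w_kh.
    delta_h = o_h * (1 - o_h) * (W_out[:, 1:].T @ delta_k)
    W_out += eta * np.outer(delta_k, h1)   # w_kh <- w_kh + eta * delta_k * o_h
    W_hid += eta * np.outer(delta_h, x1)   # w_hi <- w_hi + eta * delta_h * x_i
    return W_hid, W_out

def predict(x, W_hid, W_out):
    h1 = np.concatenate(([1.0], sigmoid(W_hid @ np.concatenate(([1.0], x)))))
    return sigmoid(W_out @ h1)

# Toy usage: a 2-3-1 network learning XOR (illustrative; BP can get stuck in local minima).
rng = np.random.default_rng(0)
W_hid = rng.normal(scale=0.5, size=(3, 3))   # 3 hidden units, 2 inputs + bias
W_out = rng.normal(scale=0.5, size=(1, 4))   # 1 output unit, 3 hidden + bias
data = [([0., 0.], [0.]), ([0., 1.], [1.]), ([1., 0.], [1.]), ([1., 1.], [0.])]
for _ in range(5000):
    for x, t in data:
        W_hid, W_out = bp_step(np.array(x), np.array(t), W_hid, W_out)
for x, _ in data:
    print(x, predict(np.array(x), W_hid, W_out))  # outputs should move toward 0, 1, 1, 0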
