**A Summary of Neural Networks**
I find that every day, as I look at new things, I gain a new understanding of them, and also of the knowledge I learned before.

I had listened to some of Zhang Yuhong's lessons before, and today I went to read some of his deep learning series in the Yunqi community. It introduces the history of neural networks. The teacher is very humorous and covers a lot of theory; he can always say a thing or two about any topic, though the articles sometimes feel too broad. If you are interested, go read them. A link to the teacher's articles is attached.

"Once through the noble gate, the 'depth' is like the sea: how deep is deep learning?" (Part 1 of the introductory series)

**Machine Learning**

The two functions of machine learning:

(1) Facing the past: mining the collected historical data (used for training) to discover the patterns hidden in it. This is called descriptive analysis.

(2) Facing the future: using the established model to make predictions on new input data. This is called predictive analysis.

The formal definition of machine learning

According to Dr. Hung-yi Lee of National Taiwan University, machine learning, in form, can be roughly described as using statistical or inferential methods to find, from data, a function that maps a given input to the expected output (as shown in Figure 2-5 of the original article). By convention, we write the input variable as uppercase X and the output variable as uppercase Y, so machine learning formally amounts to finding the mapping Y = f(X).
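A minimal sketch of "finding Y = f(X) from data" (the hidden rule, the noise level, and the use of a least-squares line fit are all my own illustrative choices, not from the original articles):

```python
import numpy as np

# Toy "historical data": inputs X and observed outputs Y,
# generated here from a hidden rule Y = 2*X + 1 plus a little noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 2 * X + 1 + rng.normal(0, 0.1, X.shape)

# "Machine learning" in miniature: search for an f (here, a straight
# line) that maps X to Y.  np.polyfit does a least-squares fit.
slope, intercept = np.polyfit(X, Y, deg=1)
print(round(slope, 1), round(intercept, 1))  # close to the hidden rule (2, 1)

# Predictive analysis: apply the learned f to a new input.
x_new = 20.0
y_pred = slope * x_new + intercept
```

The descriptive half is the fit (discovering the pattern in past data); the predictive half is evaluating the learned f on an input it has never seen.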

Specifically, to do machine learning well, you need to take three big strides:

(1) How to find a family of candidate functions that could achieve the desired mapping. This is the modeling problem.

(2) How to find a reasonable evaluation criterion to assess the quality of a function. This is the evaluation problem.

(3) How to quickly find the best-performing function. This is the optimization problem (for example, this is exactly the job of gradient descent in machine learning).

**The M-P Neuron Model (1943)**

This model simulates the working mechanism of neurons in the brain. It is the "M-P neuron model", proposed in the 1940s and still in use today.

In this model, a neuron receives input signals from n other neurons. The strength of each connection is expressed by a weight. The neuron superimposes the received input values according to these weights, compares the sum against the neuron's own threshold, and then produces an output through an "activation function". (Conceptually this is the perceptron, a concept that came about 15 years later.)
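The mechanism just described fits in a few lines of Python (the weights and thresholds below are hand-picked for illustration, not learned):

```python
def mp_neuron(inputs, weights, threshold):
    """M-P neuron: a weighted sum of inputs is compared against a
    threshold and passed through a step activation function."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Hand-set weights/thresholds realizing the basic Boolean operations:
AND = lambda x1, x2: mp_neuron([x1, x2], [1, 1], threshold=2)
OR  = lambda x1, x2: mp_neuron([x1, x2], [1, 1], threshold=1)
NOT = lambda x1:     mp_neuron([x1],     [-1],   threshold=0)

print(AND(1, 1), OR(0, 1), NOT(1))  # 1 1 0
```

Note that here a human chose the weights and thresholds; "learning" is precisely the process of adjusting them automatically, which is the learning rule discussed next.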

The so-called learning rule of a neural network is simply the rule for adjusting the weights and thresholds.

This model can implement the logical operations AND, OR, and NOT, but it cannot implement XOR.

XOR means: if and only if the inputs x1 and x2 are different, the output is 1; otherwise, the output is 0. You can crudely remember "XOR" this way: a man paired with a woman outputs 1, while pairing likes with likes comes to nothing (output 0). The teacher's explanation is humorous and easy to remember φ(>ω<*).

To put it simply, the perceptron model is a network structure composed of two layers of neurons: the input layer receives signals from the outside world and passes them through an activation function (with a threshold) to the output layer, so it is also called a "threshold logic unit". This simple logical unit then gradually evolved, becoming more and more complex, into today's research hotspot: deep learning networks.

The atomic Boolean functions AND, OR, and NOT are linearly separable. For Boolean functions that are not linearly separable (such as the XOR operation), no simple linear hyperplane can separate the two classes.

If there is a problem, we must find a way to solve it.

We add a layer of neurons between the input layer and the output layer, calling it the hidden layer. In this way, the neurons in both the hidden layer and the output layer have activation functions. But it was not until 1975 (by which point a full decade had passed since Ivakhnenko proposed the concept of multilayer neural networks) that the perceptron's "XOR" difficulty was completely solved in theory.
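To see why one hidden layer is enough for XOR, here is a sketch with hand-set weights (one of many possible decompositions; the particular choice XOR = AND(OR, NAND) is mine for illustration):

```python
def step(x):
    return 1 if x >= 0 else 0

def xor(x1, x2):
    # Hidden layer: two neurons with hand-set weights (not learned).
    h1 = step(x1 + x2 - 0.5)     # behaves like OR
    h2 = step(-x1 - x2 + 1.5)    # behaves like NAND
    # Output layer: AND of the two hidden neurons.
    return step(h1 + h2 - 1.5)   # XOR = (x1 OR x2) AND (x1 NAND x2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))
```

Each hidden neuron draws one linear boundary; the output neuron combines the two half-planes, which is exactly what a single-layer perceptron cannot do.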

**Multilayer Feedforward Neural Networks**

The M-P model is a model of a single neuron; a multilayer perceptron is also referred to as a multilayer neural network (this is the author's own view; if there are errors, readers are asked to point them out). In common multilayer neural networks, each layer of neurons is connected only to the neurons of the next layer; neurons within the same layer are not connected to each other, nor are there cross-layer connections. This simplified structure is referred to as a "multilayer feedforward neural network".

The essence of neural network learning is to use a "loss function" to adjust the weights in the network so as to reduce the loss.

How, in the end, should the weights of a neural network be adjusted?

The first method is "error back propagation" (Back Propagation, or BP for short).

The second, improved method is the current mainstream approach: the "layer-wise pre-training" mechanism commonly used in "deep learning". Unlike BP's "back-to-front" way of training the parameters, "deep learning" uses a "front-to-back" layer-by-layer training method.

**BP Neural Networks**

Simply put, we first set the initial weights randomly, then compute the current network output, and then, according to the difference between the network output and the expected output, use an iterative algorithm to adjust the parameters of the preceding layers in the reverse direction, until the network converges and stabilizes.

BP neural network training can be divided into two steps (for details, see the original author's articles):

(1) Propagate the signal forward and output the classification result;

(2) Propagate the error backward and adjust the network weights. If the intended goal is not achieved, repeat steps (1) and (2).
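The two steps above can be sketched in a few lines of numpy (a toy 2-4-1 network trained on the XOR problem from earlier; the layer sizes, learning rate, and epoch count are my own arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Training data: the XOR problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 0: randomly set the initial weights of a small 2-4-1 network.
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)

lr = 1.0
for epoch in range(10000):
    # (1) Forward propagation: compute the current network output.
    H = sigmoid(X @ W1 + b1)
    out = sigmoid(H @ W2 + b2)

    # (2) Backward propagation: push the output error back through
    # the layers and adjust the weights to reduce it.
    d_out = (out - Y) * out * (1 - out)   # output-layer error term
    d_H = (d_out @ W2.T) * H * (1 - H)    # hidden-layer error term
    W2 -= lr * H.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_H
    b1 -= lr * d_H.sum(axis=0)

mse = float(np.mean((out - Y) ** 2))
print(out.ravel().round(2), mse)   # outputs should approach [0, 1, 1, 0]
```

The loop literally alternates the two steps: forward to get an output, backward to apportion the error and nudge every weight against its gradient.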

The method was first presented by David Rumelhart, Geoffrey Hinton, and Ronald Williams in the 1986 Nature paper "Learning Representations by Back-propagating Errors". This paper was the first to systematically and concisely describe the application of the back-propagation algorithm to neural network models.

The BP back-propagation algorithm apportions the error correction directly down to each individual neuron. But in networks with more layers, by the time the residual error has propagated to the front layers (i.e., near the input layer), its influence has become very small, and the gradient may even vanish ("gradient diffusion"), which seriously hurts training accuracy. The root cause is that for non-convex functions, once the gradient disappears it gives no guidance, and training may get trapped in a local optimum. This "gradient diffusion" phenomenon grows more and more severe as the number of network layers increases. In other words, since the gradient shrinks layer by layer, weight adjustments in the front layers become less and less effective, so the BP algorithm only suits shallow network structures (usually no more than 3 layers). This restricts BP's ability to represent data, and hence its performance ceiling. Moreover, even for the vanilla BP algorithm, the cost of training the network parameters is not small; the time consumed is quite "considerable".
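The shrinkage is easy to quantify: the derivative of the sigmoid, s·(1−s), never exceeds 0.25, and backprop multiplies one such factor in per layer. A tiny sketch (the pre-activation value 0.5 is an arbitrary example):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# In backprop, each extra sigmoid layer multiplies the gradient by
# sigmoid'(z) = s * (1 - s), which is at most 0.25.  Stacking layers
# therefore shrinks the error signal geometrically.
z = 0.5                        # an arbitrary pre-activation value
s = sigmoid(z)
layer_factor = s * (1 - s)     # derivative of sigmoid at z (< 0.25)

for n_layers in (1, 3, 10):
    grad_scale = layer_factor ** n_layers
    print(n_layers, "layers -> gradient scaled by", grad_scale)
```

After ten layers the front-layer gradient is scaled by under one millionth, which is the "gradient diffusion" the text describes.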

Here is an intuitive picture of the gradient.

The gentler the slope (the smaller the gradient), the slower the process of reaching the peak (the function's extremum); the steeper the slope (the larger the gradient), the faster we reach the peak, if we ignore the gravity and resistance a real climber would face (for a computer, this means converging to the extreme point ever faster).

If we flip the picture and turn peak-finding into valley-finding (that is, look for a minimum), the method of finding the steepest slope does not fundamentally change; only its direction reverses. If the slope in the climbing process is called the gradient, then this method of seeking the valley bottom can be called "gradient descent". It is also easy to see the problem with "gradient descent": it readily converges to a local minimum. Just as a climber may sigh that "beyond every mountain there is a higher mountain", when probing for the valley floor we may find that "beyond every valley there is a lower valley", yet we cannot see the whole mountain range, for we are standing within it.
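The local-minimum trap can be shown on a one-dimensional "mountain range" with two valleys (the particular function and learning rate below are my own illustrative choices):

```python
# Gradient descent on f(x) = x^4 - 3x^2 + x, which has two valleys:
# a deeper one near x = -1.3 and a shallower one near x = 1.13.
def f(x):
    return x**4 - 3 * x**2 + x

def grad(x):
    return 4 * x**3 - 6 * x + 1   # derivative of f

def descend(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)         # step against the gradient (downhill)
    return x

left = descend(-2.0)    # slides into the deeper (global) valley
right = descend(2.0)    # slides into the shallower (local) valley
print(left, right, f(left) < f(right))
```

Which valley we end up in depends only on where we start; starting from the right, gradient descent never discovers that "one valley is lower than another".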

Thus the second kind of improvement was conceived and born.

**Deep Learning**

The building block of the deep belief network (DBN) is the restricted Boltzmann machine (RBM). The construction of a DBN actually falls into two steps: (1) train each layer's RBM separately and without supervision, ensuring that as the feature vectors are mapped into different feature spaces, as much feature information as possible is preserved; (2) at the last layer of the DBN, attach a BP network, which takes the RBM's output feature vector as its input feature vector, train the classifier with supervision, and fine-tune the network weights.
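As a rough structural sketch of step (1), here is greedy layer-by-layer pretraining with simplified CD-1 RBMs (all sizes, learning rates, and the toy binary data are invented for illustration; real DBN training is considerably more involved, and the supervised fine-tuning of step (2) is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

class RBM:
    """A restricted Boltzmann machine trained with one step of
    contrastive divergence (CD-1) -- a simplified sketch."""
    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible bias
        self.c = np.zeros(n_hidden)    # hidden bias

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0, lr=0.1):
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden
        pv1 = self.visible_probs(h0)                      # reconstruct
        ph1 = self.hidden_probs(pv1)
        n = len(v0)
        self.W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += lr * (v0 - pv1).mean(axis=0)
        self.c += lr * (ph0 - ph1).mean(axis=0)

# Step (1): greedy, unsupervised, layer-by-layer pretraining.
data = (rng.random((100, 6)) < 0.5).astype(float)  # toy binary data
layer_sizes = [6, 4, 2]
rbms, layer_input = [], data
for n_vis, n_hid in zip(layer_sizes, layer_sizes[1:]):
    rbm = RBM(n_vis, n_hid)
    for _ in range(50):
        rbm.cd1_step(layer_input)
    rbms.append(rbm)
    layer_input = rbm.hidden_probs(layer_input)  # feed features upward

# Step (2) would attach a BP-trained classifier on top of
# layer_input and fine-tune the whole stack; omitted here.
print(layer_input.shape)
```

The key point is the loop structure: each RBM is trained on the previous layer's features, front to back, which is exactly the "front-to-back" layer-wise training contrasted with BP earlier.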

The convolutional neural network (CNN) is a major milestone of deep learning. The concept of "convolution": so-called convolution is nothing more than the weighted "superposition" of one function with another along a certain dimension.

In a topic report, Academician Li Deyi mentioned the problem of understanding convolution, which is very interesting. He asked: what is convolution? Take, for example, a wire that is constantly being bent. Suppose the heating function is f(t) and the heat-dissipation function is g(t); then the temperature at this moment is the convolution of f(t) and g(t). Or, in a given environment, if the sound-production function of a sounding body is f(t) and the environment's echo function is g(t), then the sound received in that environment is the convolution of f(t) and g(t).
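The "weighted superposition" view is easiest to see in the discrete one-dimensional case (the signal and smoothing kernel below are arbitrary examples):

```python
import numpy as np

# Discrete convolution: each output value is a weighted "superposition"
# of the input, with the weights given by the kernel.
signal = np.array([0., 1., 2., 3., 4.])
kernel = np.array([0.25, 0.5, 0.25])   # a small smoothing kernel

# Computed by hand from the definition: (f * g)[n] = sum_k f[k] g[n-k]
manual = np.array([
    sum(signal[k] * kernel[n - k]
        for k in range(len(signal))
        if 0 <= n - k < len(kernel))
    for n in range(len(signal) + len(kernel) - 1)
])

print(np.allclose(manual, np.convolve(signal, kernel)))  # True
```

The hand-written sum and numpy's `np.convolve` agree, confirming that convolution is just this slide-multiply-accumulate superposition.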

Leaving the input layer aside, a typical convolutional neural network usually consists of several convolutional layers, activation layers, pooling layers, and fully connected layers.

(Figure: a typical CNN structure and its topology)

Convolutional layer: this is the core of the convolutional neural network. Through the design ideas of "local receptive fields" and "weight sharing", the convolutional layer achieves two important goals: dimensionality reduction of high-dimensional input data, and automatic extraction of the core features of the original data.
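A minimal sketch of weight sharing (the 32×32 input and 3×3 averaging kernel are arbitrary illustrative choices): one small kernel is slid over the whole input, so the layer's learnable parameters number only the kernel's entries, versus the enormous count a fully connected mapping of the same shapes would need.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2-D convolution ('valid' mode, no padding)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the SAME weights are reused at every spatial position
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(32 * 32, dtype=float).reshape(32, 32)
kernel = np.ones((3, 3)) / 9.0          # 3x3 averaging kernel
feature_map = conv2d_valid(image, kernel)

conv_params = kernel.size                     # 9 shared weights
dense_params = image.size * feature_map.size  # fully connecting the same shapes
print(feature_map.shape, conv_params, dense_params)
```

Nine shared weights versus nearly a million dense ones: that gap is exactly the dimensionality-reduction benefit of local connectivity plus weight sharing.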

Activation layer: its function is to pass the linear output of the previous layer through a nonlinear activation function, so that the network can approximate arbitrary functions, enhancing its representational power. In deep learning, ReLU (Rectified Linear Unit) is currently the more popular activation function, because it converges faster and largely avoids the vanishing-gradient problem.

Pooling layer: also known as the subsampling layer. Simply put, it exploits local correlations to "sample" the data, keeping the useful information while reducing the data volume. Clever sampling also provides invariance to local transformations, which enhances the generalization ability of the convolutional network.
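A common concrete instance is 2×2 max pooling, sketched here on a made-up 4×4 feature map:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample by taking the max of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h // 2 * 2, :w // 2 * 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1., 3., 2., 0.],
               [4., 2., 1., 1.],
               [0., 1., 5., 2.],
               [2., 2., 1., 3.]])
pooled = max_pool_2x2(fm)
print(pooled)   # [[4. 2.] [2. 5.]]
```

Each output keeps only the strongest response in its 2×2 neighbourhood, so the data size drops by a factor of four while a small shift of a feature within its block leaves the output unchanged.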

Fully connected layer: this layer is equivalent to the traditional multilayer perceptron (MLP; for example, the BP network we explained earlier [2]). In general, "convolution-activation-pooling" forms one basic processing stack. After several such stacks, the features to be processed have changed significantly: on the one hand, the dimensionality of the input data has been reduced to a size the "fully connected" network can handle; on the other hand, the fully connected layer's input is no longer "mud and sand flowing together" (good and bad mixed), but the result of repeated refinement, so the quality of the final output is well under control.

Three core concepts of the convolutional layer: local connectivity, spatial arrangement, and weight sharing.

Those interested can read the teacher's original article:

"Local connections to cut down parameters, weight sharing shoulder to shoulder" (Part 11 of the deep learning introductory series)

Thanks to teacher Zhang Yuhong for his articles. (Becoming a science writer is teacher Zhang's goal.)

If you need to reprint this article, please indicate the source.