At present, deep learning (Deep Learning, DL) is one of the hottest topics in the algorithm field; its influence is visible not only on the Internet and in artificial intelligence, but in virtually every area of daily life, where deep learning has driven great change. To learn deep learning, you should first become familiar with some basic concepts of neural networks (Neural Networks, NN). Of course, the neural networks discussed here are not biological neural networks; it seems more reasonable to call them artificial neural networks (Artificial Neural Networks, ANN). A neural network is an algorithm, or model, in the field of artificial intelligence. Neural networks have since developed into an interdisciplinary field, and with the progress of deep learning they have once again received attention and respect.

Why say "again"? In fact, most neural network algorithms and models were studied quite early. After some initial progress, however, neural network research fell into a long trough, and only with Hinton's advances in deep learning did neural networks once more attract people's attention. This article summarizes some basic knowledge of neural networks and then leads into the concept of deep learning; where anything is written improperly, please point it out in the comments.

1. Neuron model

The neuron is the most basic structure of a neural network, its basic building block, and its design is inspired entirely by the way biological neurons transmit information. Those who have studied biology know that a neuron has two states: excited and inhibited. In general, most neurons are inhibited; but once a neuron receives stimulation that raises its potential above a threshold, it is activated into an "excited" state and transmits chemical signals (in fact, information) to other neurons.

The structure of a biological neuron is shown below:

In 1943, McCulloch and Pitts abstracted the neuron structure above into a simple model, forming the artificial neuron model still in use today, the "M-P neuron model", as shown below:

From the M-P neuron model, the neuron's output can be seen to be:

$y = f(\sum_{i=1}^{n}w_{i}x_{i}-\theta)$

where $\theta$ is the activation threshold of the neuron mentioned earlier, and the function $f(\cdot)$ is called the activation function. As shown, $f(\cdot)$ can be represented by a step function, which activates when the input exceeds the threshold and suppresses otherwise. But this is a bit too crude: the step function is not smooth, is discontinuous, and is not differentiable, so the more common choice is to represent $f(\cdot)$ with the sigmoid function.

The expression and plot of the sigmoid function are as follows:

$f(x) = \frac{1}{1+e^{-x}}$
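As a concrete illustration, an M-P neuron with a sigmoid activation can be sketched in a few lines of Python; the weights and threshold below are made-up values for demonstration only:

```python
import math

def sigmoid(x):
    """Smooth activation function replacing the non-differentiable step."""
    return 1.0 / (1.0 + math.exp(-x))

def mp_neuron(x, w, theta):
    """M-P neuron: weighted sum of inputs minus threshold, passed through f."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) - theta)

# Two inputs, hand-picked weights w and threshold theta
y = mp_neuron([1.0, 0.0], [0.5, 0.5], 0.2)
```

The output is a value in (0, 1), interpretable as how strongly the neuron fires.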

2. Perceptrons and Neural Networks

A perceptron (Perceptron) is a structure composed of two layers of neurons: the input layer accepts external input signals, and the output layer (also called the perceptron's functional layer) consists of M-P neurons. The figure below shows a perceptron whose input layer has three neurons (denoted $x_{0}$, $x_{1}$, $x_{2}$):

It is not difficult to see that the perceptron model can be represented by the following formula:

$y = f(wx + b)$

where $w$ is the connection weight from the perceptron's input layer to its output layer, and $b$ is the bias of the output layer. The perceptron is in fact a discriminative linear classification model, which can solve simple linearly separable problems such as AND, OR, and NOT, as illustrated below:
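These linearly separable logic functions can be realized by fixing a perceptron's weights by hand; a minimal sketch with a step activation (the weight and bias values are hand-picked for illustration, not learned):

```python
def perceptron(x, w, b):
    # Step activation: fire (1) if w·x + b > 0, otherwise stay suppressed (0)
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else 0

# AND: fires only when both inputs are 1
AND = lambda x1, x2: perceptron([x1, x2], [1, 1], -1.5)
# OR: fires when at least one input is 1
OR = lambda x1, x2: perceptron([x1, x2], [1, 1], -0.5)
# NOT: inverts its single input
NOT = lambda x1: perceptron([x1], [-1], 0.5)
```

Each function corresponds to one straight line separating the 0-points from the 1-points in the input plane.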

But because it has only one layer of functional neurons, its learning ability is very limited. It turns out that a single-layer perceptron cannot solve even the simplest non-linearly-separable problem, XOR. Readers who want to understand the XOR problem, or the proof that the perceptron cannot solve it, can look them up.

There is a bit of history about the perceptron and the XOR problem that we should briefly understand. The perceptron can only perform simple linear classification tasks, but at the time enthusiasm ran so high that nobody recognized this. So when the AI giant Minsky pointed it out, things changed. In 1969, Minsky published the book "Perceptrons", which demonstrated the perceptron's weaknesses with detailed mathematics, in particular that it cannot solve even a simple classification task like XOR. **Minsky believed that if the computational layers were increased to two, the computational cost would become too large and there was no effective learning algorithm; so, he argued, there was no value in studying deeper networks.** Owing to Minsky's great influence and the pessimistic tone of the book, many scholars and laboratories abandoned neural network research, and the field fell into an ice age, a period also called the "AI winter". Nearly 10 years later, research on two-layer neural networks brought about the recovery of the field.

We know that many, even most, of the problems in daily life are not linearly separable; so how do we handle non-linear problems? This is where the concept of "multiple layers" that this part introduces comes in. Since a single-layer perceptron cannot solve non-linear problems, we use a multilayer perceptron; the figure below shows a two-layer perceptron solving the XOR problem:

After the network is built, training yields the final classification shown below:

Thus a multilayer perceptron can solve non-linear problems, and such multilayer structures are what we usually call neural networks. However, as Minsky feared, although a multilayer perceptron can in theory solve non-linear problems, the complexity of real-life problems far exceeds XOR, so we often want to build networks with many layers, and finding a learning algorithm for a multilayer neural network is a huge challenge. How, for example, should we determine the at least 33 parameters (ignoring the bias parameters) of the 4-layer network structure shown below?
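Before turning to training algorithms, note that the two-layer XOR solution can even be written down by hand. In this sketch the weights are hand-set for illustration, not obtained by training: the hidden layer computes OR and NAND, and the output neuron ANDs them together, which is exactly XOR:

```python
def step(s):
    """Step activation: 1 if the net input is positive, else 0."""
    return 1 if s > 0 else 0

def xor(x1, x2):
    # Hidden layer: two functional neurons
    h1 = step(x1 + x2 - 0.5)    # OR of the inputs
    h2 = step(-x1 - x2 + 1.5)   # NAND of the inputs
    # Output layer: AND of the two hidden outputs
    return step(h1 + h2 - 1.5)
```

The hidden layer maps the four input points into a space where a single straight line can separate the classes, which is what the single-layer perceptron could not do.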

3. Error Backpropagation Algorithm

The **main purpose of neural network training, or learning, is to obtain, through a learning algorithm, the parameters with which the neural network solves the specified problem; the parameters include the connection weights between neurons in each layer and the biases**. As the designers of the algorithm, we usually construct the network structure according to the actual problem; determining the parameters requires the neural network to iterate over the training samples with a learning algorithm to find the optimal parameter set.

Speaking of neural network learning algorithms, we must mention the most outstanding and most successful representative: the error backpropagation (Error BackPropagation, BP) algorithm. The BP algorithm is most commonly applied to multilayer feedforward neural networks, the most widely used kind of network.

The main flow of BP algorithm can be summarized as follows:

**Input**: training set $D=\{(x_k, y_k)\}_{k=1}^{m}$; learning rate $\eta$;

Process:

1. Randomly initialize all connection weights and thresholds of the network within (0, 1)

2. Repeat:

3. For each $(x_{k}, y_{k}) \in D$ do

4. Calculate the output of the network for the current sample under the current parameters;

5. Calculate the gradient term of the output-layer neurons;

6. Calculate the gradient term of the hidden layer neurons;

7. Update the connection weights and thresholds

8. End for

9. Until the stopping condition is reached

**Output**: a multilayer feedforward neural network with determined connection weights and thresholds

Remark: the derivation of the BP algorithm's formulas will be supplemented later.
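The steps above can be sketched as a small Python program. This is a minimal illustrative implementation for a 2-input, 1-output network with one hidden layer of sigmoid neurons; the hidden-layer size, learning rate, and fixed epoch count used as the stopping condition are all assumptions, since the pseudocode leaves them open:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_bp(data, n_hidden=4, eta=0.5, epochs=10000, seed=0):
    """Minimal BP training for a 2-input, 1-output, one-hidden-layer net."""
    rnd = random.Random(seed)
    # Step 1: random initialization in (0, 1)
    w_h = [[rnd.random() for _ in range(2)] for _ in range(n_hidden)]
    b_h = [rnd.random() for _ in range(n_hidden)]
    w_o = [rnd.random() for _ in range(n_hidden)]
    b_o = rnd.random()
    for _ in range(epochs):                       # step 2: repeat
        for x, y in data:                         # step 3: for each sample
            # Step 4: forward pass under the current parameters
            h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                 for ws, b in zip(w_h, b_h)]
            o = sigmoid(sum(w * hi for w, hi in zip(w_o, h)) + b_o)
            # Step 5: output-layer gradient term
            g = o * (1 - o) * (y - o)
            # Step 6: hidden-layer gradient terms (using the old w_o)
            e = [hi * (1 - hi) * w_o[j] * g for j, hi in enumerate(h)]
            # Step 7: update weights and biases
            for j in range(n_hidden):
                w_o[j] += eta * g * h[j]
                for i in range(2):
                    w_h[j][i] += eta * e[j] * x[i]
                b_h[j] += eta * e[j]
            b_o += eta * g
    def predict(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
             for ws, b in zip(w_h, b_h)]
        return sigmoid(sum(w * hi for w, hi in zip(w_o, h)) + b_o)
    return predict

xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
net = train_bp(xor_data)
```

Note that this sketch adds biases rather than subtracting thresholds; the two conventions differ only by a sign.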

4. Common Neural Network Models

4.1 Boltzmann Machines and Restricted Boltzmann Machines

One class of neural network models defines an "energy" for the network state: the network reaches its ideal state when the energy is minimized, and training the network amounts to minimizing this energy function. The Boltzmann machine is such an energy-based model. Its neurons are divided into two layers: the visible layer and the hidden layer. The visible layer represents the data's input and output, while the hidden layer is understood as the data's intrinsic representation. The Boltzmann machine's neurons are Boolean, taking only the values 0 and 1. The standard Boltzmann machine is fully connected, that is, the neurons within each layer are also interconnected, so its computational complexity is very high and it is difficult to apply to practical problems. Therefore we often use a special variant, the restricted Boltzmann machine (Restricted Boltzmann Machine, RBM), which has no intra-layer connections and only inter-layer connections, and can be viewed as a bipartite graph. The structures of the Boltzmann machine and the RBM are shown below:

RBMs are usually trained with the contrastive divergence (Contrastive Divergence, CD) algorithm.
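A single CD-1 update for a binary RBM can be sketched as follows. This is an illustrative sketch of the algorithm's structure only; the plain-list representation, learning rate, and seed are assumptions:

```python
import math
import random

rnd = random.Random(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(p):
    """Draw a Boolean value with firing probability p."""
    return 1 if rnd.random() < p else 0

def cd1_step(v0, W, a, b, eta=0.1):
    """One CD-1 update on a single binary visible vector v0.
    W[i][j]: weight between visible unit i and hidden unit j;
    a, b: visible and hidden biases (updated in place)."""
    n_v, n_h = len(a), len(b)
    # Up: P(h=1 | v0), then sample the hidden states
    ph0 = [sigmoid(b[j] + sum(W[i][j] * v0[i] for i in range(n_v)))
           for j in range(n_h)]
    h0 = [sample(p) for p in ph0]
    # Down: reconstruct the visible layer, then go up once more
    pv1 = [sigmoid(a[i] + sum(W[i][j] * h0[j] for j in range(n_h)))
           for i in range(n_v)]
    v1 = [sample(p) for p in pv1]
    ph1 = [sigmoid(b[j] + sum(W[i][j] * v1[i] for i in range(n_v)))
           for j in range(n_h)]
    # Contrastive update: <v h>_data minus <v h>_reconstruction
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += eta * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += eta * (v0[i] - v1[i])
    for j in range(n_h):
        b[j] += eta * (ph0[j] - ph1[j])
```

Repeating this step over the training samples drives the RBM's reconstructions toward the data distribution.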

4.2 RBF Network

An RBF (Radial Basis Function) network is a single-hidden-layer feedforward neural network that uses radial basis functions as the activation functions of its hidden neurons, while the output layer is a linear combination of the hidden-layer neurons' outputs. An RBF network is shown below:

Training an RBF network usually takes two steps:

1> Determine the centers of the neurons; common methods include random sampling and clustering.

2> Determine the parameters of the network; the common algorithm is the BP algorithm.
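The two steps can be sketched for a one-dimensional toy problem. Note that this sketch chooses centers by random sampling and fits the output-layer weights with simple gradient descent rather than full BP; the kernel width, number of centers, learning rate, and epoch count are all made-up illustrative values:

```python
import math
import random

def rbf(x, c, beta=1.0):
    """Gaussian radial basis function centered at c."""
    return math.exp(-beta * (x - c) ** 2)

def train_rbf(data, n_centers=5, eta=0.1, epochs=2000, seed=0):
    rnd = random.Random(seed)
    # Step 1: pick centers by random sampling from the training inputs
    centers = [x for x, _ in rnd.sample(data, n_centers)]
    # Step 2: fit the linear output-layer weights by gradient descent
    w = [0.0] * n_centers
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            phi = [rbf(x, c) for c in centers]
            err = y - (sum(wi * p for wi, p in zip(w, phi)) + b)
            for j in range(n_centers):
                w[j] += eta * err * phi[j]
            b += eta * err
    return lambda x: sum(wi * rbf(x, c) for wi, c in zip(w, centers)) + b

# Fit a smooth 1-D function, e.g. sin on [0, 6.2]
data = [(x / 10, math.sin(x / 10)) for x in range(0, 63)]
f = train_rbf(data)
```

Because the output layer is linear in the weights, step 2 is an easy convex fit once the centers are fixed.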

4.3 ART Networks

An ART (Adaptive Resonance Theory) network is an important representative of competitive learning. It is composed of a comparison layer, a recognition layer, a recognition-layer threshold, and a reset module. ART alleviates the "stability-plasticity dilemma" of competitive learning: plasticity refers to a neural network's ability to learn new knowledge, while stability refers to its need to retain its memory of old knowledge while learning the new. This gives ART networks an important advantage: they can perform incremental learning or online learning.

4.4 SOM Networks

A SOM (Self-Organizing Map) network is an unsupervised neural network based on competitive learning. It maps high-dimensional input data to a low-dimensional space (usually two-dimensional) while preserving the topological structure of the input data in the high-dimensional space, that is, sample points that are similar in the high-dimensional space are mapped to neighboring neurons in the network's output layer. The structure of a SOM network is shown below:
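The competitive-learning idea can be sketched with a minimal one-dimensional SOM; the grid size, learning rate, neighbourhood width, and linear decay schedule below are all illustrative choices:

```python
import math
import random

def train_som(data, grid=5, eta=0.5, sigma=1.5, epochs=50, seed=0):
    """Sketch of a 1-D SOM: a line of `grid` neurons, each holding a weight
    vector of the same dimension as the input."""
    rnd = random.Random(seed)
    dim = len(data[0])
    w = [[rnd.random() for _ in range(dim)] for _ in range(grid)]
    for t in range(epochs):
        decay = 1.0 - t / epochs          # shrink updates over time
        for x in data:
            # Competition: find the best-matching unit (closest weights)
            bmu = min(range(grid),
                      key=lambda i: sum((wi - xi) ** 2
                                        for wi, xi in zip(w[i], x)))
            # Cooperation: neighbours of the BMU also move toward x,
            # which is what preserves the topology of the input space
            for i in range(grid):
                h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    w[i][d] += eta * decay * h * (x[d] - w[i][d])
    return w
```

After training, inputs that are close together activate neurons that are close together on the grid.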

4.5 Structure Adaptive Network

As mentioned earlier, a typical neural network has its structure specified in advance, and the purpose of training is to use the training samples to determine suitable parameters such as connection weights and thresholds. A structure-adaptive network, by contrast, also treats the network structure as one of the goals of learning, and seeks during training the network structure that best fits the characteristics of the data.

4.6 Recurrent Neural Networks and Elman Networks

Unlike feedforward neural networks, recurrent neural networks (Recurrent Neural Networks, RNN) allow ring structures in the network, letting the outputs of some neurons feed back as input signals. With this structure and feedback process, the network's output state at time $t$ depends not only on the input at time $t$ but also on the network state at time $t-1$, so it can handle dynamic changes related to time.
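This dependence on the previous state can be seen in a toy recurrent step with scalar weights; the weight values are hand-picked for illustration:

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One recurrent step: the new state mixes the current input with the
    previous state, so the output at time t carries history."""
    return math.tanh(W_xh * x_t + W_hh * h_prev + b)

# Scalar toy run over a sequence: each state depends on all earlier inputs
h = 0.0
states = []
for x in [1.0, 0.0, -1.0, 0.5]:
    h = rnn_step(x, h, W_xh=0.8, W_hh=0.5, b=0.0)
    states.append(h)
```

Note that the second state is nonzero even though the second input is 0: the information arrives entirely through the carried state.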

The Elman network is one of the most commonly used recurrent neural networks, and its structure is as follows:

RNNs are generally trained with a generalized BP algorithm. It is worth mentioning that **the network's result $O(t+1)$ at time $(t+1)$ is the product of the interaction between the current input and all of the history**, which achieves the purpose of modeling time series. In a certain sense, therefore, it is no exaggeration to regard the RNN as providing depth in time within deep learning.

The statement that **the network's result $O(t+1)$ at time $(t+1)$ is the product of the interaction between the current input and all of the history** is, however, not very accurate, because "gradient divergence" also occurs along the time axis: the gradient produced at time $t$ vanishes after a few layers of history along the timeline and does not affect the distant past at all. So "all of the history" is only the ideal case; in practice this influence can be sustained for only a number of time steps. In other words, the error signal from later time steps often cannot travel far enough back to affect the network at earlier time steps, which makes long-range dependencies difficult to learn.

To solve this gradient divergence along the time axis, the machine learning field developed the long short-term memory unit (Long Short-Term Memory, LSTM), which implements memory over time through the switching of gates and prevents gradient divergence. In fact, besides learning from historical information, RNNs and LSTMs can also be designed with a bidirectional structure, i.e. bidirectional RNN and bidirectional LSTM, exploiting historical and future information at the same time.

5. Deep Learning

Deep learning refers to deep neural network models, generally meaning neural network structures with three or more layers.

In theory, the more parameters a model has, the greater its "capacity", which means it can accomplish more complex learning tasks. Just as the earlier multilayer perceptron revealed to us, the number of layers in a neural network directly determines its ability to model reality. Under normal circumstances, however, complex models train inefficiently and easily fall into overfitting, so they long struggled to find favor. Specifically, as the network gets deeper, the optimization becomes more and more prone to local optima (i.e. overfitting: a good fit on the training samples but poor performance on the test set). At the same time, a problem that cannot be neglected is that "gradient vanishing" (or gradient divergence) becomes more serious as the layers increase. We often use the sigmoid function in the hidden layers' functional neurons, and for a signal of amplitude 1, the gradient decays to at most 0.25 of its value for each layer it passes through during BP backpropagation. Once there are many layers, the lower layers receive essentially no effective training signal after this exponential decay.
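The 0.25-per-layer decay is easy to verify numerically: the sigmoid's derivative is $f'(x) = f(x)(1-f(x))$, which peaks at $x=0$ with value 0.25, so even in the best case ten sigmoid layers shrink a unit gradient by about six orders of magnitude:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)          # maximal at x = 0, where it equals 0.25

grad = 1.0
for _ in range(10):             # backpropagate through 10 sigmoid layers
    grad *= sigmoid_grad(0.0)   # best case: every layer sits at the peak

print(grad)                     # 0.25**10, i.e. less than 1e-6
```

In practice the neurons rarely sit exactly at the derivative's peak, so the real decay is even faster.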

To solve the training problem of deep neural networks, one effective method is unsupervised layer-wise training (unsupervised layer-wise training). Its basic idea is to train one layer of hidden nodes at a time, using the outputs of the previously trained hidden layer as the inputs for training the next hidden layer; this is called "pre-training" (pre-training). After pre-training is complete, the whole network is "fine-tuned" (fine-tuning). For example, in Hinton's deep belief network (Deep Belief Network, DBN), each layer is an RBM, i.e. the whole network can be viewed as several stacked RBMs. Under unsupervised training, the first layer is trained first as an RBM on the training samples, following the standard RBM procedure; then the hidden nodes of the pre-trained first layer are treated as the input nodes of the second layer, and the second layer is pre-trained; after every layer's pre-training is complete, the BP algorithm is used to train the whole network.

**In fact, "pre-training + fine-tuning" can be viewed as dividing a large number of parameters into groups, first finding a locally good setting for each group, and then jointly optimizing these local results globally. This effectively saves training cost while still exploiting the freedom offered by the model's large number of parameters.**
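The layer-by-layer procedure can be sketched abstractly. Here `train_rbm` is a hypothetical helper (not defined in this article) that trains one RBM on its inputs and returns it together with the hidden activations it produces:

```python
def greedy_pretrain(data, layer_sizes, train_rbm):
    """Greedy layer-wise pre-training sketch: train one RBM per layer,
    feeding each trained layer's hidden activations to the next layer."""
    stack, h = [], data
    for n_hidden in layer_sizes:
        rbm, h = train_rbm(h, n_hidden)   # pre-train this layer on h
        stack.append(rbm)
    return stack  # afterwards, fine-tune the whole stack with BP

# Dummy stand-in for train_rbm, just to show the data flow
fake_train_rbm = lambda h, n: (("rbm", n), h)
stack = greedy_pretrain([[0, 1], [1, 0]], [4, 3], fake_train_rbm)
```

The returned stack is the pre-trained DBN; fine-tuning with BP then adjusts all layers jointly.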

Another way to save training overhead is "weight sharing" (weight sharing), which lets a group of neurons use the same connection weights; this strategy plays an important role in convolutional neural networks (Convolutional Neural Networks, CNN). A CNN is shown below:

A CNN can be trained with the BP algorithm, but during training, whether in a convolutional layer or a subsampling layer, each group of neurons (i.e. each "plane") uses the same connection weights, which greatly reduces the number of parameters that need to be trained.
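Weight sharing is easiest to see in a one-dimensional convolution sketch: every output position reuses the same small kernel, so the parameter count is the kernel size plus one bias, independent of the input length (the kernel values below are arbitrary):

```python
def conv1d(x, kernel, bias):
    """Valid 1-D convolution: every output position reuses the same
    kernel weights and bias (weight sharing)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]

x = [1, 2, 3, 4, 5]
y = conv1d(x, kernel=[1, 0, -1], bias=0)
# A fully connected layer mapping x to an output of the same size would
# need len(x) * len(y) weights; sharing cuts it to len(kernel) + 1.
```

Here 4 shared parameters replace the 15 weights a fully connected layer of the same shape would need.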

6. References

1. Zhou Zhihua, "Machine Learning"

2. Question and Answer: http://www.zhihu.com/question/34681168

[Machine Learning & Algorithm] Neural network basics