Why do you need machine learning?
Some tasks are too complicated to code directly. We cannot anticipate every nuance and hand-code a response to it, so machine learning is necessary. Instead, we give a large amount of data to a machine learning algorithm and let it explore the data and build a model that solves the problem. Two examples: recognizing a three-dimensional object from a new viewpoint, under new lighting conditions, in a cluttered scene; and writing a program that computes the probability that a credit card transaction is fraudulent.
The machine learning approach is as follows: instead of writing a program by hand for each specific task, we collect many examples that specify the correct output for a given input. A learning algorithm then uses these examples to produce a program that does the job. This program can look very different from a handwritten one: it may contain millions of numbers. If it is trained well, it works on new cases as well as on the cases it was trained on. If the data changes, the program can be updated by training it on the new data. A large amount of computation is now much cheaper than paying someone to write a task-specific program.
Applications of machine learning include:

Pattern recognition: objects in real scenes, facial identities or expressions, speech recognition.

Anomaly detection: unusual sequences of credit card transactions, unusual patterns of sensor readings in a nuclear power plant.

Prediction: future stock prices or currency exchange rates, a person's viewing preferences.
What is a neural network?
A neural network is a general class of machine learning models: a specific set of algorithms that has revolutionized the field of machine learning. Neural networks are general function approximators, so they can be applied to almost any machine learning problem that involves learning a complex mapping from inputs to outputs. In general, neural network architectures fall into three categories:

Feedforward neural network: the most common type. The first layer is the input and the last layer is the output. If there is more than one hidden layer, it is called a "deep" neural network. It computes a series of transformations that change the similarities between cases, and the activities of the neurons in each layer are a nonlinear function of the activities in the layer below.

Recurrent neural network: the connection graph contains directed cycles, so following the arrows can bring you back to where you started. Recurrent networks can have complicated dynamics, which makes them difficult to train. They are a natural way to model sequential data: they are equivalent to a very deep network with one hidden layer per time slice, except that the same weights are used at every time slice. They can remember information in their hidden state, but it is hard to train them to use this ability.

Symmetrically connected network: like a recurrent network, but the connections between units are symmetric (that is, the connection weights are the same in both directions). These networks are easier to analyze than recurrent networks, but they are more limited in what they can do. A symmetrically connected network without hidden units is called a "Hopfield network", and one with hidden units is called a "Boltzmann machine."
I) Perceptron
As the first generation of neural networks, the perceptron is a computational model with only one neuron. The raw input vector is first converted into a feature vector by handwritten programs that define the features; the perceptron then learns how to weight each feature to get a single scalar. If that scalar is above a certain threshold, the input vector is classified as a positive example of the target class. The standard perceptron is a feedforward model: the input is passed to the neuron, processed, and turned into an output, with input at the bottom and output at the top, as shown in the following figure. But there are limitations: once the handwritten features have been fixed, there are strong limits on what a perceptron can learn. One limitation is devastating for pattern recognition, because the whole point of pattern recognition is to recognize patterns despite transformations such as translation. If these transformations form a group, the learning part of a perceptron cannot learn to recognize the patterns, so multiple feature units are needed to recognize the transformations of sub-patterns.
Networks without hidden units are also very limited in the input-output mappings they can model. Adding more layers of linear units does not help, because a composition of linear functions is still linear, and fixed output nonlinearities are not enough to build such mappings either. What is needed is multiple layers of adaptive, nonlinear hidden units.
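The learning procedure and its limits can be sketched in a few lines of Python. This is a minimal illustration, not code from the original text: the function names, the learning rate, the zero threshold, and the toy AND and XOR datasets are all assumptions chosen for the example. A single threshold unit learns AND, which is linearly separable, but no setting of its weights can fit XOR.

```python
def predict(weights, bias, x):
    """Binary threshold unit: output 1 if the weighted sum is above the threshold (0 here)."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=50, lr=1.0):
    """Perceptron learning rule on a list of (features, label) pairs."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            if error != 0:
                mistakes += 1
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # every training case is classified correctly
            break
    return weights, bias


def accuracy(samples, weights, bias):
    """Fraction of samples the unit classifies correctly."""
    correct = sum(predict(weights, bias, x) == t for x, t in samples)
    return correct / len(samples)


# Toy datasets (assumed for illustration): AND is linearly separable, XOR is not.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_w, and_b = train_perceptron(AND_DATA)
xor_w, xor_b = train_perceptron(XOR_DATA)
print("AND accuracy:", accuracy(AND_DATA, and_w, and_b))
print("XOR accuracy:", accuracy(XOR_DATA, xor_w, xor_b))
```

Training on XOR never reaches a mistake-free pass, so the loop simply stops after the epoch limit. Fitting XOR requires either better features or a layer of nonlinear hidden units.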
II) Convolutional Neural Network
Machine learning research has focused extensively on object detection, but many factors still make it difficult to identify objects:

Segmentation and occlusion of objects;

Lighting, which affects pixel intensities;

Deformation: objects appear in a variety of forms;

Affordances: objects with the same function can have different physical shapes;

Changes caused by differences in viewpoint;

The dimension-hopping problem.
The replicated feature approach is the main method that current CNNs use for object detection. It places many copies of the same feature detector at different positions, which greatly reduces the number of free parameters to be learned. It uses several different feature types, each with its own map of replicated detectors, which allows each image patch to be represented in several ways.
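A small Python sketch can make the idea of replicated detectors concrete. It is an illustration under stated assumptions, not the implementation used by any particular CNN: the 3 × 3 vertical-bar kernel, the 8 × 8 toy images, and all function names are hypothetical. One shared detector is applied at every position, so it finds the same pattern wherever it appears, and it needs far fewer parameters than giving every position its own detector.

```python
def feature_map(image, kernel):
    """Slide one shared feature detector (kernel) over every image position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def argmax_2d(grid):
    """Return (row, col) of the largest value in a 2D list."""
    return max(
        ((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
        key=lambda rc: grid[rc[0]][rc[1]],
    )


def image_with_bar(size, top, left):
    """Toy image: all zeros except a vertical 3-pixel bar starting at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i in range(3):
        image[top + i][left] = 1
    return image


# One vertical-bar detector, replicated across all positions.
VERTICAL_BAR = [[-1, 2, -1],
                [-1, 2, -1],
                [-1, 2, -1]]

map_a = feature_map(image_with_bar(8, 1, 2), VERTICAL_BAR)
map_b = feature_map(image_with_bar(8, 4, 5), VERTICAL_BAR)
peak_a, peak_b = argmax_2d(map_a), argmax_2d(map_b)

# Free parameters: one shared 3x3 kernel vs. a separate 3x3 kernel per output position.
shared_params = 3 * 3
unshared_params = len(map_a) * len(map_a[0]) * 3 * 3
print(peak_a, peak_b, shared_params, unshared_params)
```

Moving the bar by three rows and three columns moves the peak of the feature map by the same amount, while the number of learned weights stays at nine.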
CNNs can be used for everything from handwritten digit recognition to 3D object recognition, but identifying objects in color images is much more complicated than recognizing handwritten digits: there are 100 times as many classes (1000 vs. 10) and roughly 100 times as many pixels (256 × 256 color vs. 28 × 28 grayscale).
The ILSVRC-2012 competition on ImageNet provided a dataset of about 1.2 million high-resolution training images. The test images are unlabeled, and entrants must identify the types of objects in them. The winner, Alex Krizhevsky, developed a deep convolutional neural network. Not counting some max-pooling layers, the architecture has seven hidden layers: the early layers are convolutional and the last two are globally connected. The activation functions in every hidden layer are rectified linear units, which train faster than logistic units, and competitive normalization is used to suppress hidden activities, which helps with variations in intensity. On the hardware side, the network ran on a highly efficient implementation of convolutional nets on two Nvidia GTX 580 GPUs (more than 1000 fast cores), which are well suited to matrix multiplication and have high memory bandwidth.
III) Recurrent Neural Network
The Recurrent Neural Network (RNN) has two powerful properties that together let it compute anything a computer can compute: (1) a distributed hidden state that allows it to store a large amount of information about the past efficiently, and (2) nonlinear dynamics that allow it to update the hidden state in complicated ways. This computational power, combined with vanishing (or exploding) gradients, makes RNNs difficult to train. When backpropagating through many layers, the gradient shrinks exponentially if the weights are small and grows exponentially if the weights are large. A typical feedforward network has only a few hidden layers, so it can cope with these exponential effects. In an RNN trained on long sequences, on the other hand, the gradient easily vanishes (or explodes). Even with good initial weights, it is hard to detect that the current target output depends on an input from many time steps earlier, so RNNs have difficulty handling long-range dependencies.
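The exponential effect is easy to see numerically. The sketch below assumes the simplest possible case, a linear scalar recurrence h_t = weight * h_(t-1), which is an illustrative simplification and not a full RNN; the function name is hypothetical. Each step back in time multiplies the gradient by the same weight, so the result is the weight raised to the number of steps.

```python
def backprop_gradient_scale(weight, steps):
    """Factor applied to a gradient sent back through `steps` time steps
    of the linear scalar recurrence h_t = weight * h_(t-1)."""
    scale = 1.0
    for _ in range(steps):
        scale *= weight  # each step multiplies by d h_t / d h_(t-1) = weight
    return scale


for steps in (5, 50):
    small = backprop_gradient_scale(0.5, steps)
    large = backprop_gradient_scale(1.5, steps)
    print(f"{steps:>3} steps: weight 0.5 -> {small:.3e}, weight 1.5 -> {large:.3e}")
```

Over 5 steps both factors stay in a manageable range, which matches the remark that networks with only a few layers can cope. Over 50 steps the factor is either vanishingly small or very large.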
There are several effective ways to train an RNN:

Long short-term memory: build the RNN out of small modules that are designed to remember values for a long time.

Hessian-free optimization: deal with the vanishing gradient problem by using an optimizer that can detect directions with a tiny gradient but even smaller curvature.

Echo state networks: carefully initialize the input → hidden, hidden → hidden, and output → hidden connections so that the hidden state has a huge reservoir of weakly coupled oscillators that can be selectively driven by the input.

Good initialization with momentum: initialize as in echo state networks, then learn all of the connections using momentum.
IV) Long Short-Term Memory Network
Hochreiter & Schmidhuber (1997) built the long short-term memory (LSTM) network, which solved the problem of getting an RNN to remember things for a long time. They designed a memory cell using logistic and linear units with multiplicative interactions. Information is written into the cell whenever its "write" gate is open, and it stays in the cell as long as its "keep" gate is open. Information can be read from the cell by opening its "read" gate.
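The gating idea can be sketched with a heavily simplified memory cell in Python. This is an illustration only, not the LSTM equations from the paper: the function name, the hand-set gate values, and the explicit `keep` gate are assumptions of the sketch. In a real LSTM the gates are the outputs of logistic units computed from the current input and hidden state.

```python
def step_memory_cell(stored, new_value, write, keep, read):
    """One time step of a simplified memory cell.

    Each gate is a number in [0, 1] that multiplies the signal it controls.
    Here the gates are supplied by hand to keep the sketch small.
    """
    stored = keep * stored + write * new_value  # multiplicative gating
    output = read * stored
    return stored, output


# (new_value, write, keep, read) for each time step: a hypothetical schedule.
schedule = [
    (1.7, 1.0, 0.0, 0.0),  # write 1.7 into the cell
    (9.9, 0.0, 1.0, 0.0),  # ignore the input, keep the stored value
    (4.2, 0.0, 1.0, 0.0),  # keep again
    (0.0, 0.0, 1.0, 1.0),  # read the stored value out
]

stored, outputs = 0.0, []
for new_value, write, keep, read in schedule:
    stored, output = step_memory_cell(stored, new_value, write, keep, read)
    outputs.append(output)
print(outputs)
```

The value written at the first step survives the intermediate inputs untouched and appears at the output only when the read gate opens.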
RNNs can read cursive handwriting. The input is a sequence of pen-tip coordinates (x, y, p), where p indicates whether the pen is up or down, and the output is a sequence of characters. Graves & Schmidhuber (2009) showed that RNNs with LSTM are the best systems for reading cursive handwriting; they used a sequence of small images as input instead of pen coordinates.
V) Hopfield Networks
Recurrent networks of nonlinear units are generally hard to analyze: they can settle to a stable state, oscillate, or follow chaotic trajectories. A Hopfield network consists of binary threshold units with recurrent connections between them. In 1982, John Hopfield realized that if the connections are symmetric, there is a global energy function: each binary "configuration" of the whole network has an energy, and the binary threshold decision rule causes the network to settle to a minimum of this energy function. A neat way to use this type of computation is to store memories as energy minima of the network. Using energy minima as memories gives a content-addressable memory: a whole item can be retrieved by knowing only part of its content.
Each time we memorize a configuration, we hope to create a new energy minimum. But if two nearby minima merge, the capacity of the Hopfield network is limited. Elizabeth Gardner found a better storage rule that uses the full capacity of the weights. Rather than trying to store the vectors in one shot, she cycled through the training set many times and used the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector.
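The energy function and the settling process can be sketched in Python. This is a minimal illustration under several assumptions: states are coded as +1 and -1 rather than 1 and 0, biases are omitted, the function names are hypothetical, and storage uses the simple one-shot Hebbian rule rather than Gardner's perceptron-based rule. Starting from a corrupted copy of a stored pattern, repeated application of the binary threshold rule lowers the energy until the stored memory is recovered.

```python
def store(patterns):
    """One-shot Hebbian storage: symmetric weights, no self-connections."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights


def energy(weights, state):
    """Global energy E = -1/2 * sum over i, j of w_ij * s_i * s_j (biases omitted)."""
    n = len(state)
    return -0.5 * sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))


def settle(weights, state, max_sweeps=10):
    """Apply the binary threshold rule one unit at a time until nothing changes."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            total = sum(weights[i][j] * state[j] for j in range(len(state)))
            new = 1 if total >= 0 else -1
            if new != state[i]:
                state[i], changed = new, True
        if not changed:
            break
    return state


memory = [1, -1, 1, -1, 1, -1, 1, -1]
weights = store([memory])
probe = [1, -1, 1, -1, 1, 1, -1, -1]  # two units corrupted
recalled = settle(weights, probe)
print(recalled == memory, energy(weights, probe), energy(weights, recalled))
```

The corrupted probe has higher energy than the stored pattern, and settling moves the state downhill to the stored minimum. This is the content-addressable behavior described above: partial content is enough to retrieve the whole item.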
VI) Boltzmann Machine Network
The Boltzmann machine is a stochastic recurrent neural network that can be regarded as the stochastic, generative counterpart of the Hopfield network. It was one of the first neural networks capable of learning internal representations. The learning algorithm aims to maximize the product of the probabilities that the machine assigns to the binary vectors in the training set, which is equivalent to maximizing the sum of the log probabilities assigned to the training vectors. It is also equivalent to maximizing the probability of obtaining exactly the training vectors if we did the following: (1) let the network settle to its stationary distribution several different times with no external input; (2) sample the visible vector once each time.
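The equivalence between the product and the sum of log probabilities can be written out explicitly. The notation is an assumption introduced here for illustration: W stands for the weights, D for the training set, and p(v | W) for the probability the machine assigns to a visible binary vector v. Because the logarithm is monotonically increasing, both objectives are maximized by the same weights.

```latex
\[
W^{*}
\;=\; \arg\max_{W} \prod_{\mathbf{v} \in \mathcal{D}} p(\mathbf{v} \mid W)
\;=\; \arg\max_{W} \sum_{\mathbf{v} \in \mathcal{D}} \log p(\mathbf{v} \mid W)
\]
```

Working with the sum is usually more convenient, since a product of many small probabilities underflows quickly and the sum splits into one term per training vector.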
In 2012, Salakhutdinov and Hinton described an efficient mini-batch learning procedure for Boltzmann machines. A restricted variant of the model, with no connections between units in the same layer, is known as the restricted Boltzmann machine (RBM). Please check the original papers for details.
VII) Deep Belief Network
Backpropagation is the standard method in artificial neural networks for calculating each neuron's contribution to the error after a batch of data has been processed, but it has some problems. First, it requires labeled training data, while almost all data is unlabeled. Second, the learning time does not scale well, which means that networks with many hidden layers learn very slowly. Third, it can get stuck in poor local optima. For deep networks, this is far from good enough.
Unsupervised learning approaches overcome these limitations of backpropagation. They keep the efficiency and simplicity of using a gradient method to adjust the weights, but use it to model the structure of the sensory input. In particular, they adjust the weights to maximize the probability that a generative model would have produced the sensory input. A belief network is a directed acyclic graph composed of random variables. It lets us infer the states of the unobserved variables, and it lets us adjust the interactions between variables to make the network more likely to generate the training data.
Early graphical models relied on experts to define the graph structure and the conditional probabilities. The graphs were sparsely connected, and the focus was on performing correct inference rather than on learning. For neural networks, learning is central, and the aim is not interpretability or sparse connectivity to make inference easier.
VIII) Deep Autoencoders
Deep autoencoders provide mappings in both directions (encoding and decoding) and look like a very good way to do nonlinear dimensionality reduction. The learning time is linear (or better) in the number of training cases, and the final encoding model is fairly compact and fast. However, it is difficult to optimize deep autoencoders using backpropagation: if the initial weights are small, the backpropagated gradient vanishes. Two remedies are available: use unsupervised layer-by-layer pretraining, or initialize the weights carefully, as in echo state networks.
There are three different types of shallow autoencoders used for pretraining: (1) an RBM as an autoencoder; (2) a denoising autoencoder; and (3) a contractive autoencoder. For datasets without a large number of labeled cases, pretraining helps subsequent discriminative learning. For very large labeled datasets, initializing the weights with unsupervised pretraining is not necessary, even for deep networks. Pretraining was the first good way to initialize the weights of deep networks, but other methods now exist. If the network is made much larger, however, pretraining is needed again.