From Image to Knowledge: An Analysis of the Principles of Deep Neural Networks for Image Understanding

Abstract: This paper analyzes in detail the basic principles by which deep neural networks recognize images. For convolutional neural networks, it discusses the principle and function of each layer of the network in image recognition: the convolutional layers, the pooling (subsampling) layers, the fully connected (hidden) layers, and the softmax output layer. For recurrent neural networks, it explains the powerful capabilities they exhibit on sequential data. The feedforward and learning processes of the general deep neural network model are discussed in detail. A deep learning model formed by combining a convolutional neural network with a recurrent neural network can even automatically generate text descriptions for images. As a technology that has re-emerged in recent years, deep learning has made remarkable progress in many fields of artificial intelligence, but the interpretability of neural network models remains a difficult problem. This paper discusses the basic principles of image recognition with deep learning from a theoretical perspective and analyzes in detail the conversion process from image to knowledge.
1 Introduction
Traditional machine learning techniques often process natural data in its raw form, which greatly limits the learning ability of the model; building a pattern recognition or machine learning system typically requires considerable expertise to extract features from raw data (such as pixel values) and convert them into an appropriate internal representation. Deep learning can extract features automatically: it is a form of representation learning.
Deep learning allows multiple processing layers to form complex computational models that automatically learn representations of data at multiple levels of abstraction. These methods have greatly advanced speech recognition, visual object recognition, object detection, drug discovery, and genomics. By using the backpropagation (BP) algorithm, deep learning can discover hidden, complex structure in large datasets.
"Express Learning" is able to automatically discover the features that need to be detected from the original input data. The deep learning approach consists of multiple layers, each of which completes a transformation (usually a non-linear transformation) and represents a lower-level feature representation as a more abstract feature. As long as there are enough levels of conversion, even very complex patterns can be automatically learned. For the task of image classification, the neural network will automatically reject irrelevant features, such as background color, object position, etc., but will automatically enlarge useful features such as shapes. Images are often used as primitive inputs in the form of a pixel matrix, so the first learning function in a neural network is usually to detect the presence or absence of the edges of a particular direction and shape, and the position of those edges in the image. The second layer often detects the specific layout of multiple edges while ignoring minor changes in the edge position. The third layer can combine specific edge layouts into one part of the actual object. The subsequent layers will combine these parts to achieve the recognition of objects, which is often done through a fully connected layer. For deep learning, these features and hierarchies do not need to be artificially designed: they can all be obtained through a common learning process.
2 The Training Process of Neural Networks
As shown in Fig. 1, the architecture of a deep learning model is generally a stack of relatively simple modules, each of which computes a nonlinear mapping from input to output. Each module exhibits both selectivity and invariance with respect to its input. A neural network with multiple nonlinear layers, usually 5 to 20 deep, can be selectively sensitive to minute details while remaining insensitive to irrelevant ones, such as the background.
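To make this stacked-module view concrete, the following minimal NumPy sketch (the layer sizes and the ReLU nonlinearity are illustrative assumptions, not details taken from the paper) composes several simple modules, each computing a nonlinear mapping from input to output:

    import numpy as np

    rng = np.random.default_rng(0)

    def module(x, W, b):
        # One module: an affine map followed by a ReLU nonlinearity.
        return np.maximum(0.0, W @ x + b)

    # A stack of three modules mapping an 8-dim input to a 2-dim output.
    sizes = [8, 16, 16, 2]
    params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
              for n, m in zip(sizes[:-1], sizes[1:])]

    x = rng.standard_normal(8)
    for W, b in params:
        x = module(x, W, b)
    print(x)  # the network's output for this input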
In the early stages of pattern recognition research, researchers hoped to replace hand-crafted feature extraction with multi-layer networks, but the training process of neural networks was not widely understood. It was not until the mid-1980s that researchers discovered, and proved, that multilayer architectures can be trained by simple stochastic gradient descent. As long as each module corresponds to a relatively smooth function, the backpropagation procedure can be used to compute the gradient of the error function with respect to the parameters.
Fig. 1 The feedforward process of a neural network
Fig. 2 The backward error-propagation process of a neural network
Fig. 3 The chain rule
As shown in Fig. 2, a complex neural network computes the gradient of the objective function with respect to the parameters of each module through the backpropagation procedure. The mathematical principle behind backpropagation is the chain rule, shown in Fig. 3. The key idea is that the gradient of the objective function with respect to the input of a module can be computed once the gradient with respect to the output of that module is known; applying this rule repeatedly propagates the gradient through all modules, so the gradient (i.e., the error) is transmitted backward, layer by layer, from the last layer to the original input.
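A minimal sketch of this process for a two-layer network with a squared-error objective (the sizes and the tanh nonlinearity are illustrative assumptions): the gradient with respect to each module's input is obtained from the gradient with respect to its output, exactly as the chain rule prescribes, followed by one stochastic-gradient-descent step.

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(4), rng.standard_normal(2)   # one training example
    W1 = rng.standard_normal((8, 4)) * 0.1
    W2 = rng.standard_normal((2, 8)) * 0.1

    # Forward pass: two modules (tanh, then linear) and a squared-error loss.
    h = np.tanh(W1 @ x)
    out = W2 @ h
    loss = 0.5 * np.sum((out - y) ** 2)

    # Backward pass (chain rule): the gradient w.r.t. a module's input is
    # computed from the gradient w.r.t. its output, then passed downward.
    d_out = out - y                # dL/d(out)
    dW2 = np.outer(d_out, h)      # dL/dW2
    d_h = W2.T @ d_out            # propagate to the hidden layer's output
    d_pre = d_h * (1 - h ** 2)    # through the tanh nonlinearity
    dW1 = np.outer(d_pre, x)      # dL/dW1

    # One stochastic-gradient-descent step on the parameters.
    lr = 0.1
    W1 -= lr * dW1
    W2 -= lr * dW2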
In the late 1990s, neural networks and other machine learning methods based on backpropagation were widely criticized, and the computer vision and speech recognition communities ignored such models. It was commonly thought that learning useful multi-stage feature extractors with little prior knowledge was infeasible. In particular, it was feared that simple gradient descent would get trapped in a local minimum that might be far from the global minimum.
In practice, however, poor local minima are rarely a problem for large networks. Experience shows that the system almost always reaches solutions of very similar quality, regardless of the initial conditions. Recent theoretical and empirical studies also suggest that local minima are not a serious issue. Instead, the landscape contains a large number of saddle points, where the gradient is zero and the training process can stall. Analysis shows, however, that most saddle points have similar objective-function values, so it matters little at which saddle point the training process gets stuck.
The convolutional neural network (CNN) is a special type of feedforward neural network. It is widely believed that this kind of feedforward network is easier to train and generalizes better, especially for images, and convolutional neural networks have been widely used in computer vision.
3 Convolutional Neural Networks and Image Understanding
Convolutional neural networks (CNNs) are typically applied to inputs in tensor form; for example, a color image corresponds to three two-dimensional matrices representing the pixel intensities of the three color channels. Much other input data also comes in tensor form: signal sequences, language, audio spectrograms, 3D video, and so on. Convolutional neural networks have four key characteristics: local connections, shared weights, pooling, and the use of many layers.
As shown in Fig. 4, a typical CNN can be interpreted as a sequence of stages. The first stages are composed of two kinds of layers: convolutional layers and pooling layers. The input and output of a convolutional layer are both sets of matrices. A convolutional layer consists of multiple convolution kernels, each of which is a matrix acting as a filter that produces a specific feature map; each feature map is one output unit of the convolutional layer. The feature map is then passed to the next layer through a nonlinear activation function (such as ReLU). Different feature maps use different convolution kernels, but within the same feature map the connections between different positions and the input share the same weights. The reason for this is twofold. First, in tensor-form data such as images, neighboring positions are often highly correlated, so local features can be detected from them. Second, the same pattern may appear in different locations: if a local feature appears in one place, it may also appear anywhere else. Mathematically, computing a feature map with a convolution kernel corresponds to a discrete convolution, hence the name.
Fig. 4 A convolutional neural network for image understanding
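A minimal sketch of one convolution stage, assuming a single 3×3 kernel, ReLU activation, and "valid" borders (all illustrative choices): each output position is the weighted sum of the kernel with a local patch of the input, and the same weights are shared across all positions.

    import numpy as np

    def conv2d_valid(image, kernel):
        # Discrete 2-D convolution (deep-learning convention: the kernel is
        # not flipped): slide the shared kernel over every local patch.
        kh, kw = kernel.shape
        H, W = image.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.default_rng(0).standard_normal((8, 8))
    # A vertical-edge detector: responds strongly where intensity changes
    # from left to right, illustrating how one kernel yields one feature map.
    kernel = np.array([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])
    feature_map = np.maximum(0.0, conv2d_valid(image, kernel))  # ReLU
    print(feature_map.shape)  # (6, 6)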
In fact, studies have shown that no matter what kind of images are being recognized, the kernels learned in the first convolutional layers differ little from one another, because they all match simple edges. The role of convolution kernels is to extract local micro-features: if a position matches a particular edge, the corresponding position in the resulting feature map takes a large value. If multiple kernels match multiple features at adjacent locations, these features combine into a recognizable object. Since real-world images are largely composed of many simple edges, object recognition can be realized by detecting the presence or absence of a series of simple edges.
While the convolutional layer detects local features in the output of the previous layer, the function of the pooling layer is to merge semantically similar features and to bring features at neighboring positions closer together. Because the relative position of the features that make up a particular object can vary slightly, the pooling layer takes the strongest activation within each local patch of the feature map, reducing the dimension of the intermediate representation (the size of the feature map), so that the model can still detect a feature even when it is displaced or distorted to some degree. The gradient computation and parameter training of a CNN are the same as for a conventional deep network, and all the parameters in the convolution kernels are learned by training.
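The following sketch (the 2×2 pool size is an illustrative choice) shows this merging step: max-pooling keeps the strongest response in each local patch, so the pooled output is unchanged when a feature shifts slightly within its patch.

    import numpy as np

    def max_pool(fmap, size=2):
        # Partition the feature map into size×size patches and keep the
        # strongest activation in each patch.
        H, W = fmap.shape
        return fmap[:H - H % size, :W - W % size] \
            .reshape(H // size, size, W // size, size).max(axis=(1, 3))

    fmap = np.zeros((4, 4))
    fmap[0, 0] = 1.0            # a feature detected at position (0, 0)
    shifted = np.zeros((4, 4))
    shifted[1, 1] = 1.0         # the same feature shifted by one pixel
    # Both versions pool to the same 2×2 map: small displacements vanish.
    print(np.array_equal(max_pool(fmap), max_pool(shifted)))  # True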
CNNs have been applied in many fields since the early 1990s, including natural images, face and hand detection, face recognition, and object detection. Convolutional networks have also been used for speech recognition and document reading systems, where they are known as time-delay neural networks. Such document reading systems jointly train a convolutional neural network with a probabilistic model that imposes natural-language constraints. In addition, there are many CNN-based optical character recognition and handwriting recognition systems.
4 Recurrent Neural Networks and Natural Language Understanding
Recurrent neural networks (RNNs) are more natural for processing variable-length sequential data such as speech and text. Unlike feedforward neural networks, an RNN has an internal state: it maintains a "state vector" in its hidden units that implicitly encodes information about the past of the sequence. When the RNN receives a new input, it combines the state vector with the new input to produce an output that depends on the entire sequence. RNNs and CNNs can also be combined to form a more comprehensive and accurate understanding of an image.
Fig. 5 A recurrent neural network
As shown in Fig. 5, if we unroll the recurrent network over discrete time steps and treat the outputs at different time steps as the outputs of different neurons in a deep network, the RNN can be regarded as a deep feedforward neural network and trained with the conventional backpropagation procedure. This method of backpropagating through time steps is called BPTT (backpropagation through time). Although RNNs are very powerful dynamical systems, their training still runs into a serious problem: because the gradient can grow or shrink at every time step, after many steps of backpropagation the gradients typically explode or vanish, and the network's internal state retains only a weak memory of inputs from the distant past.
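A minimal sketch of the unrolled computation, assuming a tanh recurrence (the sizes and weight scale are illustrative): the same weights are applied at every time step, and backpropagating through many steps multiplies the gradient repeatedly by the recurrent weight matrix, which is why it tends to explode or vanish.

    import numpy as np

    rng = np.random.default_rng(0)
    H, T = 16, 50                        # state size, number of time steps
    W_hh = rng.standard_normal((H, H)) * 0.3
    W_xh = rng.standard_normal((H, H)) * 0.3

    # Forward pass: one state vector, updated at every time step.
    h = np.zeros(H)
    states = []
    for t in range(T):
        x_t = rng.standard_normal(H)
        h = np.tanh(W_xh @ x_t + W_hh @ h)   # the shared recurrence
        states.append(h)

    # BPTT: propagate a gradient from the last step back to the first.
    grad = np.ones(H)
    norms = []
    for h_t in reversed(states):
        grad = W_hh.T @ (grad * (1 - h_t ** 2))  # chain rule through one step
        norms.append(np.linalg.norm(grad))
    # Compare the gradient norm after 1 step vs. after all T steps: over
    # many steps it typically changes by orders of magnitude.
    print(norms[0], norms[-1])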
One solution to this problem is to add an explicit memory module to the network to strengthen its memory of the distant past. The long short-term memory (LSTM) network is such a model; the core element it introduces is the memory cell. LSTM networks have proven more effective than conventional RNNs, especially when the network has several layers at each time step.
Fig. 6 The long short-term memory (LSTM) model
As shown in Fig. 6, in the LSTM network structure the input of the previous layer acts on the output through multiple paths, and the introduction of gates gives the network the ability to focus. An LSTM can more naturally remember inputs from long ago. The memory cell is a special unit that acts like an accumulator or a "gated leaky neuron": it has a direct connection from its previous state to its next state, so it can copy its current state and accumulate external signals. At the same time, thanks to the forget gate, an LSTM can learn to decide when to clear the contents of the memory cell.
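A minimal sketch of one LSTM step (the sizes and initialization are illustrative assumptions): the cell state c acts as the accumulator, the forget gate decides how much of it to clear, and the input and output gates control what enters and leaves the cell.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h, c, P):
        # One LSTM time step; P holds a weight matrix and bias per gate.
        z = np.concatenate([x, h])            # current input + previous state
        f = sigmoid(P["Wf"] @ z + P["bf"])    # forget gate: how much of c to keep
        i = sigmoid(P["Wi"] @ z + P["bi"])    # input gate: how much new content
        o = sigmoid(P["Wo"] @ z + P["bo"])    # output gate: how much to expose
        g = np.tanh(P["Wg"] @ z + P["bg"])    # candidate content
        c = f * c + i * g                     # the accumulator ("gated leaky neuron")
        h = o * np.tanh(c)                    # new hidden state
        return h, c

    rng = np.random.default_rng(0)
    X, H = 4, 8
    P = {f"W{k}": rng.standard_normal((H, X + H)) * 0.1 for k in "fiog"}
    P.update({f"b{k}": np.zeros(H) for k in "fiog"})
    h, c = np.zeros(H), np.zeros(H)
    for t in range(5):
        h, c = lstm_step(rng.standard_normal(X), h, c, P)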
5 Automatic Generation of Image Descriptions
As shown in Fig. 7, a striking demonstration in the field of deep learning combines convolutional networks and recurrent networks to generate image captions automatically. First, the original image is understood by a convolutional neural network (CNN) and transformed into a distributed semantic representation. A recurrent neural network (RNN) then transforms this high-level representation into natural language.
Fig. 7 Automatic generation of image descriptions
In addition to exploiting the memory mechanism of the RNN, one can add an attention mechanism, so that the model produces different captions by focusing on different parts of the picture. The attention mechanism also makes the model more interpretable: much as in attention-based machine translation with RNNs, we can see which part of the image the model is attending to as it generates each word from the semantic representation.
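A minimal encoder-decoder sketch of this idea, assuming PyTorch and toy sizes (the architecture and hyperparameters are illustrative, not the systems described above, and attention is omitted for brevity): a small CNN encodes the image into a vector, which initializes an LSTM decoder that emits a word distribution at each step.

    import torch
    import torch.nn as nn

    class CaptionModel(nn.Module):
        def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
            super().__init__()
            # CNN encoder: image -> distributed semantic representation.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.img_to_h = nn.Linear(32, hidden_dim)
            # RNN decoder: representation -> natural-language caption.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.to_vocab = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            feats = self.encoder(images).flatten(1)             # (B, 32)
            h0 = torch.tanh(self.img_to_h(feats)).unsqueeze(0)  # (1, B, H)
            c0 = torch.zeros_like(h0)
            emb = self.embed(captions)                          # (B, T, E)
            out, _ = self.rnn(emb, (h0, c0))
            return self.to_vocab(out)                           # (B, T, V) word scores

    model = CaptionModel(vocab_size=1000)
    logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
    print(logits.shape)  # torch.Size([2, 7, 1000])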
6 Future Prospects
Unsupervised learning helped revitalize the field of deep learning, but the sheer success of supervised learning has since overshadowed it. In the long run, we expect unsupervised learning to become the more important approach: human and animal learning is largely unsupervised, since we discover the structure of the world by observing it, not by being told the name of every object. We expect much of the future progress in image understanding to come from training end-to-end models that combine conventional CNNs with RNNs trained by reinforcement learning to achieve better attention mechanisms. Systems combining deep learning and reinforcement learning are still in their infancy, but they already outperform passive vision systems on classification tasks and have achieved impressive results in learning to play video games.