Deep learning is one of the most important breakthroughs in the field of artificial intelligence in the past ten years. It has been a great success in speech recognition, natural language processing, computer vision, image and video analysis, multimedia and many other fields. This paper focuses on the latest research progress of deep learning in object recognition, object detection and video analysis, and discusses its development trend.
1. A review of the history and development of deep learning
Today's deep learning models are neural networks. The history of neural networks can be traced back to the 1940s, and they were popular in the 1980s and 1990s. Neural networks try to solve various machine learning problems by imitating the cognitive mechanisms of the brain. In 1986, Rumelhart, Hinton and Williams published the famous back-propagation algorithm for training neural networks in Nature; it is still widely used today.
Later, however, for a variety of reasons, most researchers abandoned neural networks for a long period. Neural networks have a large number of parameters and often overfit: accuracy is very high on the training set but poor on the test set. This was partly due to the small size of the training sets at the time. Computing resources were also limited, so training even a small network could take a long time. In general, neural networks showed no significant advantage in recognition accuracy over other models, and they were harder to train.
As a result, more researchers turned to classifiers such as support vector machines, boosting and nearest neighbors. These classifiers can be simulated by a neural network with one or two hidden layers, so they are called shallow machine learning models. They no longer mimic the cognitive mechanisms of the brain; instead, different systems are designed for different tasks, each adopting different hand-designed features. For example, speech recognition adopts Gaussian mixture models and hidden Markov models, object recognition adopts SIFT features, face recognition adopts LBP features, and pedestrian detection adopts HOG features.
In 2006, Geoffrey Hinton proposed deep learning. Deep learning has since achieved great success in many areas and received wide attention. There are several reasons why neural networks were able to regain their vitality. First, the advent of big data has largely eased the overfitting problem in training; for example, the ImageNet training set has millions of labeled images. Second, the rapid development of computer hardware provides powerful computing capacity, which makes it possible to train large-scale neural networks; a single GPU can integrate thousands of cores. In addition, the design of neural network models and training methods has made great strides. For example, to improve the training of neural networks, researchers proposed unsupervised, layer-by-layer pre-training. It gives the network parameters a good starting point before the network is optimized with back-propagation, so that a better local minimum is reached when training completes.
The most influential breakthrough of deep learning in computer vision took place in 2012, when Hinton's team used deep learning to win the ImageNet image classification competition. ImageNet is one of the most influential competitions in computer vision today. Its training and test samples are images from the Internet. There are more than a million training samples, and the task is to classify the test samples into 1000 classes. Since 2009, many computer vision groups, including groups from industry, have taken part in the annual competition, and their methods have gradually converged. In the 2012 competition, the teams ranked second to fourth all used traditional computer vision methods with hand-designed features, and their accuracies differed by less than 1%. Hinton's team, competing for the first time, used deep learning and beat the runner-up by more than 10 percentage points. This result caused a great shock in the computer vision community and set off the upsurge of deep learning.
Another important challenge in computer vision is face recognition. Labeled Faces in the Wild (LFW) is today's best-known face recognition test set; it was created in 2007. Before that, most face recognition test sets were collected under controlled conditions in the laboratory. LFW collected face photos of more than 5,000 celebrities from the Internet to evaluate the performance of face recognition algorithms under uncontrolled conditions. These photos often show complex variations in lighting, expression, pose, age, and occlusion.
The LFW test set contains 6000 pairs of face images. 3000 pairs are positive samples, in which both images belong to the same person; the remaining 3000 pairs are negative samples, in which the two images belong to different people. Random guessing gives 50% accuracy. Studies have shown that the recognition rate of the human eye on the LFW test set is 97.53% if only the central area of the face, excluding hair, is visible, and 99.15% if the entire image, including background and hair, is visible. The classic face recognition algorithm Eigenface achieves only 60% on this test set. Among non-deep-learning algorithms, the best recognition rate is 96.33%. At present, deep learning reaches a 99.47% recognition rate.
While receiving wide attention in academia, deep learning has also had a huge impact on industry. Six months after Hinton's team won the ImageNet competition, Google and Baidu released new search engines based on image content. They adopted the deep learning model used by Hinton in the ImageNet competition and found that applying it to their own data greatly improved the accuracy of image search. In 2012, Baidu established the Institute of Deep Learning, and in May 2014 it set up a new deep learning laboratory in Silicon Valley, hiring Stanford professor Andrew Ng as chief scientist. In December 2013, Facebook set up a new AI lab in New York and hired Yann LeCun, a well-known deep learning scholar and the inventor of convolutional networks, as its chief scientist. In January 2014, Google acquired the deep learning startup DeepMind for $400 million. In view of the great influence of deep learning in academia and industry, MIT Technology Review named it one of the world's top ten breakthrough technologies of 2013.
2. What makes deep learning different?
Many people ask what the key differences are between deep learning and other machine learning methods, and what the secret of its success is. We briefly discuss this from several aspects below.
2.1. Feature Learning
The biggest difference between deep learning and traditional pattern recognition is that it learns features automatically from big data rather than relying on hand-designed features. Good features can greatly improve the performance of a pattern recognition system. In pattern recognition applications over the past decades, hand-designed features have been dominant. They rely on the designer's prior knowledge and can hardly take advantage of big data. Because parameters are tuned by hand, only a small number of parameters can appear in the design of a feature. Deep learning can automatically learn feature representations from big data, and these representations can contain many thousands of parameters. Designing effective features by hand is a rather lengthy process. Looking back at the history of computer vision, it often took 5 to 10 years for a widely recognized feature to emerge. For a new application, deep learning can quickly learn new and effective feature representations from training data.
A pattern recognition system consists of two main components, features and a classifier, which are closely related, yet in traditional methods they are optimized separately. In the neural network framework, the feature representation and the classifier are optimized jointly, which maximizes the performance of their collaboration. Take the convolutional network model that Hinton's team used in the 2012 ImageNet competition as an example. It was their first time in the ImageNet image classification competition, so they had little prior knowledge. The feature representation of the model contains 60 million parameters, learned from millions of samples. Surprisingly, the features learned from ImageNet generalize very well and can be applied successfully to other datasets and tasks, such as object detection, tracking, and retrieval. Another famous competition in computer vision is PASCAL VOC. However, its training set is small and not suitable for training deep learning models. Some researchers applied the features learned from ImageNet to object detection on PASCAL VOC and increased the detection rate by 20%.
Since feature learning is so important, what is a good feature? In an image, various complex factors are often combined in a nonlinear manner. For example, a face image contains information about identity, pose, age, expression, and lighting. The key to deep learning is to separate these factors successfully through multiple layers of nonlinear mapping, so that, for example, in the last hidden layer of the deep model different neurons represent different factors. If this hidden layer is used as the feature representation, then face recognition, pose estimation, expression recognition and age estimation all become very simple, because the factors are then related in a simple linear way and no longer interfere with each other.
2.2. The advantages of deep structure
The "deep" in deep learning means that the neural network has a deep structure consisting of many layers. Other commonly used machine learning models, such as support vector machines and boosting, have shallow structures. It has been proven theoretically that a three-layer neural network (an input layer, an output layer, and one hidden layer) can approximate any classification function. So why do we need deep models?
Theoretical studies show that, for specific tasks, if the model is not deep enough, the number of computational units it needs grows exponentially. This means that although shallow models can express the same classification functions, they require many more parameters and training samples. A shallow model provides a local representation. It divides the high-dimensional image space into a number of local regions, and each local region stores at least one template obtained from the training data. The shallow model matches a test sample against these templates one by one and predicts its category from the matching results. For example, in a support vector machine the templates are the support vectors; in a nearest-neighbor classifier the templates are all the training samples. As the classification problem becomes more complex, the image space must be divided into more and more local regions, so more and more parameters and training samples are needed.
The key to how a deep model reduces the number of parameters is the reuse of computational units in the middle layers. For example, it can learn a hierarchical feature representation for face images. The lowest layers learn filters from the raw pixels that describe local edge and texture features; by combining various edge filters, the middle-layer filters describe different types of facial parts; the highest layers describe global features of the whole face. Deep learning provides a distributed feature representation. In the highest hidden layer, each neuron represents an attribute classifier, for attributes such as gender, ethnicity and hair color. Each neuron divides the image space in two, so a combination of N neurons can express 2^N local regions, whereas a shallow model needs at least 2^N templates to express these regions. Thus the deep model is more expressive and more efficient.
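The counting argument can be checked numerically. The following is a minimal sketch, not taken from the article: it assumes each neuron is a simple threshold unit (output 1 where w . x > 0), uses random vectors as stand-ins for images, and uses the hypothetical helper name `count_regions`. With N such neurons the number of distinct binary codes, and hence regions, is at most 2 to the power N, while a template-based model would need one template per region.

```python
import numpy as np

def count_regions(weights, points):
    """Count the distinct on/off codes that threshold neurons assign to the points.

    Each row of `weights` is one neuron; it outputs 1 where w . x > 0, else 0.
    Every distinct code corresponds to one region of the input space.
    """
    codes = ((points @ weights.T) > 0).astype(np.uint8)
    return len(np.unique(codes, axis=0))

rng = np.random.default_rng(0)
dim = 8
points = rng.normal(size=(20000, dim))   # random samples standing in for images
for n in (1, 2, 3, 4):
    weights = np.eye(dim)[:n]            # assumption: neuron i tests the sign of coordinate i
    regions = count_regions(weights, points)
    print(n, "neurons ->", regions, "regions observed; upper bound", 2 ** n)
```

The neurons here are deliberately simple; the point is only that the number of regions grows exponentially with the number of units while the number of parameters grows linearly.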
2.3. Ability to extract global features and contextual information
A deep model has powerful learning capacity and efficient feature representation; it extracts information layer by layer, from pixel-level raw data up to abstract semantic concepts. This gives it a prominent advantage in extracting global features and contextual information from images, and has brought new ideas for solving some traditional computer vision problems, such as image segmentation and key-point detection. In face segmentation, to predict which facial part (eyes, nose, mouth, hair) each pixel belongs to, the common practice is to take a small area around the pixel, extract texture features (such as local binary patterns), and then classify the pixel with a shallow model such as a support vector machine based on these features. Because the local area contains a limited amount of information, classification errors are frequent, so constraints such as smoothness and shape priors are imposed on the segmented image. In fact, even under local occlusion, the human eye can estimate the labels of the occluded area from the information in other areas of the face. This means that global and contextual information is important for local judgments, and that this information is lost from the very beginning in approaches based on local features.
Ideally, the model should take the whole image as input and predict the whole segmentation map directly. Image segmentation can then be treated as a high-dimensional data transformation problem. This not only uses contextual information but also implicitly incorporates shape priors in the course of the transformation. However, because the content of a whole image is so complex, shallow models can hardly capture global features effectively. The emergence of deep learning makes this idea feasible, and it has been successful in face segmentation, human body segmentation, face image registration, human pose estimation and other areas.
2.4. Joint deep learning
Some computer vision researchers treat deep learning models as black boxes, but this view is incomplete. In fact, traditional computer vision systems and deep learning models are closely related, and this relationship can be used to propose new deep models and new training methods. A successful example is joint deep learning for pedestrian detection. A computer vision system contains a number of key modules. For example, a pedestrian detector includes feature extraction, part detectors, modeling of the geometric deformation of parts, inference of part occlusion, a classifier, and so on. In joint deep learning, a correspondence is established between the layers of the deep model and the modules of the vision system. If some effective key module in a vision system has no corresponding layer in existing deep learning models, it can inspire a new deep model. For example, a large body of research on object detection has shown that modeling the geometric deformation of object parts effectively improves the detection rate, but common deep models have no corresponding layer. Joint deep learning and its subsequent work therefore proposed a new deformation layer and a deformation pooling layer to achieve this function.
From the perspective of training, the modules of a computer vision system are trained or hand-designed one by one, and in the pre-training stage of a deep model the layers are likewise trained one by one. If we can establish the correspondence between the computer vision system and the deep model, the experience accumulated in vision research can guide the pre-training of the deep model. A model pre-trained in this way can at least reach results comparable to the traditional computer vision system. On this basis, deep learning then uses back-propagation to optimize all the layers jointly, so that they cooperate optimally and the performance of the whole network improves greatly.
3. The application of deep learning in object recognition
3.1. ImageNet image classification
The most important progress of deep learning in object recognition is reflected in the image classification task of the ImageNet ILSVRC challenge. The lowest top-5 error rate of traditional computer vision methods on this test set is 26.172%. In 2012, Hinton's team used a convolutional network to cut the error rate on this test set to 15.315%. This network structure is known as AlexNet. Compared with traditional convolutional networks, it has three important differences. First, it uses the dropout training strategy: during training, some neurons in the input layer and the middle layers are randomly set to zero. This simulates noise and various disturbances in the input data that cause some neurons to miss certain visual patterns. Dropout makes training converge more slowly, but the resulting network model is more robust. Second, it adopts the rectified linear unit (ReLU) as the nonlinear activation function. This not only greatly reduces the computational complexity but also makes the neuron outputs sparse, and sparse feature representations are more robust to various disturbances. Third, it generates more training samples by mirroring the training images and adding random translations, which reduces overfitting.
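The three ingredients can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the code of the network described above: the nonlinearity is written as a rectified linear unit, dropout is written in the common "inverted" form that rescales the surviving units during training, and the helper names, drop probability and shift range are hypothetical.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative inputs become zero, so outputs are sparse."""
    return np.maximum(x, 0.0)

def dropout(x, drop_prob, rng, training=True):
    """Inverted dropout: zero random units during training and rescale the rest."""
    if not training or drop_prob == 0.0:
        return x
    keep = rng.random(x.shape) >= drop_prob
    return x * keep / (1.0 - drop_prob)

def augment(image, rng, max_shift=2):
    """Randomly mirror an H x W image and shift it by a few pixels (zero padded)."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                      # horizontal mirror
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(image, max_shift)               # zero border on every side
    h, w = image.shape
    top, left = max_shift + dy, max_shift + dx
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
activations = relu(rng.normal(size=(2, 6)))
print(dropout(activations, drop_prob=0.5, rng=rng))
print(augment(np.arange(16.0).reshape(4, 4), rng))
```

At test time `dropout` is called with `training=False`, so the full network is used unchanged.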
In the ImageNet ILSVRC 2013 competition, the top 20 groups all used deep learning, which shows its influence. The winner was the team of Rob Fergus at New York University. The deep model they used was again a convolutional network, with a further optimized network structure. The top-5 error rate dropped to 11.197%; the model is called Clarifai.
Deep learning made further important progress in 2014. In the ILSVRC 2014 competition, the winner GoogLeNet lowered the top-5 error rate to 6.656%. Its distinguishing feature is a much deeper convolutional network, with more than 20 layers, which was unthinkable before. A very deep network structure makes back-propagation of the prediction error difficult. Because the prediction error is propagated from the top layer down to the bottom, the error that reaches the bottom layers is very small and can hardly drive updates of the bottom-layer parameters. GoogLeNet's strategy is to add supervisory signals directly to several middle layers, which means that the intermediate and low-level feature representations must also be able to classify the training data accurately. How to train very deep network models effectively remains an important topic for future research. Although deep learning has been a great success on ImageNet, a practical problem is that the training sets of many applications are small. How can deep learning be applied in this case? There are three approaches for readers' reference. (1) Use the model trained on ImageNet as the starting point and continue training it on the target training set with back-propagation, adapting the model to the specific application. ImageNet then plays the role of pre-training. (2) If the target training set is not large enough, fix the parameters of the lower layers at the values learned on ImageNet and update only the upper layers. This is because the lower-layer parameters are the hardest to update, and the low-level filters learned from ImageNet tend to describe various kinds of local edge and texture information, which apply well to images in general. (3) Use the model trained on ImageNet directly, taking the output of its highest hidden layer as the feature representation in place of the commonly used hand-designed features.
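Approaches (2) and (3) share one mechanism: the lower layers are frozen and only a layer on top of them is trained on the small target set. The sketch below illustrates this under loud assumptions: a random matrix `frozen_w1` stands in for pretrained lower-layer weights (no real ImageNet model is loaded), the data are synthetic, and the top layer is a plain logistic regression trained by gradient descent. All names are hypothetical.

```python
import numpy as np

def train_top_layer(features, labels, steps=200, lr=0.1):
    """Train only a logistic-regression output layer on top of frozen features."""
    w = np.zeros(features.shape[1])
    b = 0.0
    losses = []
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        losses.append(-np.mean(labels * np.log(probs + 1e-12)
                               + (1 - labels) * np.log(1 - probs + 1e-12)))
        grad = probs - labels                      # gradient of the loss w.r.t. the logits
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b, losses

rng = np.random.default_rng(0)
frozen_w1 = rng.normal(size=(20, 16)) / np.sqrt(20)   # stand-in for pretrained lower-layer weights
x = rng.normal(size=(200, 20))                        # small synthetic target training set
y = (x[:, 0] > 0).astype(float)
features = np.maximum(x @ frozen_w1, 0.0)             # lower layer is applied but never updated
w, b, losses = train_top_layer(features, y)
print("loss before and after training the top layer:", losses[0], losses[-1])
```

Approach (1) differs only in that the lower-layer weights would also receive gradient updates, usually with a small learning rate.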
3.2. Face recognition
Another important breakthrough of deep learning in object recognition is face recognition. The biggest challenge in face recognition is to distinguish intra-class variation caused by factors such as lighting, pose and expression from inter-class variation caused by differences in identity. The distributions of these two kinds of variation are nonlinear and extremely complex, and traditional linear models cannot separate them effectively. The purpose of deep learning is to obtain a new feature representation through multiple layers of nonlinear transformation. This representation should remove as much intra-class variation as possible while preserving inter-class variation.
Face recognition comprises two tasks, face verification and face identification. Face verification determines whether two face photos belong to the same person. It is a binary classification problem, and random guessing is correct 50% of the time. Face identification assigns a face image to one of N categories, where the categories are defined by identity. It is a multi-class classification problem and is more challenging; its difficulty increases with the number of categories, and random guessing is correct with probability 1/N. For both tasks, the feature representation of a face can be learned with a deep model.
In 2013, one study used the face verification task as the supervisory signal and learned face features with a convolutional network, obtaining a 92.52% recognition rate on LFW. This result is lower than those of later deep learning methods, but it already exceeds most non-deep-learning algorithms. Because face verification is a binary classification problem, learning face features from it is relatively inefficient. This can be understood in several ways. One of the major problems deep learning faces is overfitting. As a binary classification problem, face verification is relatively simple, and it is easy to overfit the training set. In contrast, face identification is a more challenging multi-class problem that is harder to overfit and better suited to learning face features with a deep model. Furthermore, in face verification each pair of training samples is manually labeled as one of two classes, which carries little information, whereas in face identification each training sample is manually labeled as one of N classes, which carries much more information.
At CVPR 2014, DeepID and DeepFace both used face identification as the supervisory signal, and obtained recognition rates of 97.45% and 97.35% on LFW. They use convolutional networks to predict an N-dimensional label vector and take the highest hidden layer as the face feature. During training this layer has to distinguish a large number of face classes (for example, 1000 classes in DeepID), so it contains rich information about inter-class variation and generalizes well. Although the face identification task is used in training, the learned features can be applied to face verification and to recognizing new people who do not appear in the training set. For example, the task tested on LFW is face verification, which differs from the face identification task used in training, and the identities in the DeepID and DeepFace training sets do not overlap with those in the LFW test set.
Face features learned through the face identification task alone contain considerable intra-class variation. DeepID2 uses face identification and face verification jointly as supervisory signals; the resulting face features keep inter-class variation while minimizing intra-class variation, which raises the face recognition rate on LFW to 99.15%. On a Titan GPU, DeepID2 takes only 35 milliseconds to extract the features of a face image, and this can be done offline. After PCA compression an 80-dimensional feature vector is obtained, which can be used for fast online face comparison. In follow-up work, DeepID2+ improved on DeepID2 by enlarging the network structure, adding training data, and adding supervisory signals at every layer, reaching a 99.47% recognition rate on LFW.
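One way to picture how two supervisory signals can be combined is a loss with two terms. The sketch below is an assumption-laden illustration and not the published formulation: the identification term is a softmax cross-entropy over identities, the verification term is a contrastive-style penalty that pulls features of the same person together and pushes features of different people at least `margin` apart, and `weight` and `margin` are hypothetical values.

```python
import numpy as np

def identification_loss(logits, identity):
    """Softmax cross-entropy: classify a face into one of N identities."""
    shifted = logits - logits.max()                 # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[identity]

def verification_loss(f1, f2, same_identity, margin=1.0):
    """Contrastive-style term on a pair of feature vectors."""
    dist = np.linalg.norm(f1 - f2)
    if same_identity:
        return 0.5 * dist ** 2                      # shrink intra-class variation
    return 0.5 * max(0.0, margin - dist) ** 2       # keep inter-class variation

def joint_loss(logits1, id1, logits2, id2, f1, f2, weight=0.05, margin=1.0):
    """Identification loss of both faces plus a weighted verification loss."""
    ident = identification_loss(logits1, id1) + identification_loss(logits2, id2)
    verif = verification_loss(f1, f2, id1 == id2, margin)
    return ident + weight * verif

rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=4), rng.normal(size=4)     # toy feature vectors of two faces
logits1, logits2 = rng.normal(size=10), rng.normal(size=10)
print(joint_loss(logits1, 3, logits2, 3, f1, f2))
```

The identification term alone leaves features of the same person free to spread out; the verification term is what penalizes that spread.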
Some people think that the success of deep learning lies merely in fitting a data set with a complex model that has a large number of parameters. This view is also incomplete. In fact, further research shows that the features of DeepID2+ have many important and interesting properties. For example, the responses of its top-layer neurons are moderately sparse, strongly selective to face identity and to various face attributes, and strongly robust to local occlusion. In previous research, obtaining such properties usually required adding various explicit constraints to the model. DeepID2+ acquires these compelling properties automatically through large-scale learning, and the theoretical analysis behind this deserves further study.
4. The application of deep learning in object detection
Deep learning has also greatly advanced object detection in images. Object detection is a harder task than object recognition. An image may contain multiple objects of different categories, and object detection must determine the location and category of each object. The progress of deep learning in object detection is also reflected in the ImageNet ILSVRC challenge. In 2013 the organizers added an object detection task, which requires detecting 200 classes of objects in 40,000 Internet images. The winning method for the object detection task that year still used hand-designed features, and its mean average precision (mAP) was only 22.581%. In ILSVRC 2014, deep learning raised the mAP sharply to 43.933%. Influential work includes R-CNN, OverFeat, GoogLeNet, DeepID-Net, Network in Network, VGG, and spatial pyramid pooling in deep CNNs. The widely used deep learning pipeline for object detection was proposed in R-CNN. First, a non-deep-learning method (such as selective search) proposes candidate regions; then a deep convolutional network extracts features from each candidate region; finally a linear classifier such as a support vector machine classifies each region as object or background based on these features. DeepID-Net further improved this pipeline, greatly increasing the detection rate, and provided a detailed experimental analysis of the contribution of each step. In addition, the design of the deep convolutional network structure is critical. If a network structure improves the accuracy of image classification, it can usually also significantly improve the performance of the object detector.
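For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision for a single class and then average it over classes. It is an illustration only: the exact ILSVRC evaluation protocol (for example, how detections are matched to ground-truth boxes) is not reproduced, the true-positive flags are assumed to be given, and the function names are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Average of the precision values measured at each correctly detected object.

    Detections are ranked by score; objects that are never detected contribute zero.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if is_true_positive[i]:
            hits += 1
            precision_sum += hits / rank            # precision at this rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0

def mean_average_precision(per_class):
    """Mean of the per-class average precisions."""
    return sum(average_precision(*c) for c in per_class) / len(per_class)

# Toy example with two object classes (scores, true-positive flags, ground-truth count).
toy = [
    ([0.9], [True], 1),
    ([0.9, 0.8, 0.7], [True, False, True], 2),
]
print(mean_average_precision(toy))
```

In the second toy class the precisions at the two hits are 1/1 and 2/3, so its average precision is 5/6.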
The success of deep learning is also evident in pedestrian detection. On the largest pedestrian detection test set (Caltech), the widely used HOG features and deformable part model have an average error rate of 68%. The best result based on deep learning so far is 20.86%. In recent research, many ideas that have proven effective in object detection have found counterparts in deep learning. For example, joint deep learning proposes a deformation layer to model the geometric deformation between the parts of an object; multi-stage deep learning can simulate the cascade classifiers commonly used in object detection; the switchable deep network can express a mixture model over the parts of an object; and a deep-model pedestrian detector has been adapted to a target scene through transfer learning.
5. Deep Learning for video analysis
The application of deep learning to video classification is still in its infancy, and much work remains. To describe the static image features of video, a deep model learned from ImageNet can be used; the difficulty lies in describing dynamic features. In previous vision methods, the description of dynamic features often relied on optical flow estimation, key-point tracking, and dynamic textures. How to embody this information in a deep model is the difficult point. The most straightforward approach is to treat video as a three-dimensional image and apply convolutional networks directly, learning three-dimensional filters at each layer. But this idea clearly ignores the difference between the temporal and spatial dimensions. Another simple but more effective approach is to compute the optical flow field in preprocessing and use it as an input channel of the convolutional network. There is also research that uses deep autoencoders to extract dynamic textures in a nonlinear way, whereas traditional methods mostly model them with linear dynamical systems. In some of the latest research, long short-term memory (LSTM) networks are attracting wide attention; they can capture long-term dependencies and model complex dynamics in video.
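Treating video as a three-dimensional image means that a filter slides over time as well as over height and width. The sketch below is a deliberately naive illustration with hypothetical names: a single-channel clip stored as a (time, height, width) array, one filter, no padding or stride, and the cross-correlation form that convolutional networks usually call convolution.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Slide one 3-D filter over a (time, height, width) clip without padding."""
    t, h, w = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i + kt, j:j + kh, k:k + kw] * kernel)
    return out

# A filter that only looks along the time axis: the difference of two consecutive frames.
temporal_diff = np.array([-1.0, 1.0]).reshape(2, 1, 1)

frame = np.arange(9.0).reshape(3, 3)
static_clip = np.tile(frame, (4, 1, 1))            # four identical frames, no motion
print(conv3d_valid(static_clip, temporal_diff))    # responds only where frames change
```

A filter like `temporal_diff` gives no response on a static clip, which hints at how learned three-dimensional filters can pick up motion; the time axis is nevertheless handled exactly like a spatial axis, which is the limitation noted above.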
6. Future development prospects
The development of deep learning in image recognition is in the ascendant, and there is great room for growth. This section explores several possible directions. In object recognition and object detection, the trend is toward larger and deeper network structures. AlexNet in ILSVRC 2012 has only 5 convolutional layers and 2 fully connected layers, while the network structures used by GoogLeNet and VGG in ILSVRC 2014 have more than 20 layers. A deeper network structure makes back-propagation more difficult. At the same time, the scale of training data is growing rapidly. There is therefore an urgent need for new algorithms and new parallel computing systems that use big data more effectively to train larger and deeper models.
Compared with image recognition, the application of deep learning to video classification is far from mature. The image features obtained by training on ImageNet can be applied directly and effectively to various image-related recognition tasks (such as image classification, image retrieval, object detection and image segmentation) and to other image test sets, showing good generalization. However, deep learning has not yet produced comparable features for video analysis. Achieving this requires not only building large-scale training data sets (one newly built database contains 1 million YouTube videos) but also researching new deep models suited to video analysis. In addition, training deep models for video analysis requires much more computation.
In image- and video-related applications, the outputs predicted by a deep model, such as segmentation maps or object detection boxes, tend to be correlated in space and time. Studying deep models with structured outputs is therefore also a key direction. Although neural networks aim to solve general machine learning problems, domain knowledge plays an important role in the design of deep models. In image- and video-related applications, the most successful model is the deep convolutional network, which exploits the special structure of images. Its two most important operations, convolution and pooling, both derive from domain knowledge about images. Using domain knowledge to introduce new, effective operations and layers into deep models is important for improving image recognition performance. For example, the pooling layer brings local translation invariance, and the deformation pooling layer describes the geometric deformation of object parts better. In future studies, this can be extended further to achieve rotation invariance, scale invariance, and robustness to occlusion.
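The local translation invariance of pooling can be seen in a tiny example. This is a minimal sketch assuming non-overlapping max pooling with a hypothetical helper `max_pool2d`: a pattern that moves by one pixel inside a pooling window leaves the pooled output unchanged, while a move into a different window does not.

```python
import numpy as np

def max_pool2d(image, size=2):
    """Non-overlapping max pooling over size x size windows (extra rows/columns dropped)."""
    h, w = image.shape
    h, w = h - h % size, w - w % size
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

original = np.zeros((4, 4))
original[0, 0] = 1.0            # a single bright pixel

small_shift = np.zeros((4, 4))
small_shift[1, 1] = 1.0         # moved by one pixel, still inside the same 2 x 2 window

large_shift = np.zeros((4, 4))
large_shift[2, 2] = 1.0         # moved into a different window

print(np.array_equal(max_pool2d(original), max_pool2d(small_shift)))
print(np.array_equal(max_pool2d(original), max_pool2d(large_shift)))
```

The invariance is only local, which is why the text calls for new layers to handle larger deformations, rotation and scale.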
Studying the relationship between deep models and traditional computer vision systems not only helps us understand why deep learning succeeds but can also inspire new models and training methods. Joint deep learning and multi-stage deep learning are two examples, and more work can be done in this direction. Although deep learning has achieved great success in practice and the properties that deep models acquire through big-data training (such as sparsity, selectivity, and robustness to occlusion) are compelling, much theoretical analysis remains to be done. For example: when does training converge, how can a better local minimum be reached, which invariances useful for recognition does each layer's transformation gain, and what information does it lose? Mallat's recent quantitative analysis of deep network structures using wavelets is an important exploration in this direction.
The great success of deep learning in image recognition is bound to have a significant impact on multimedia-related applications. We look forward to more researchers exploring, in the near future, how to use image features obtained by deep learning to drive rapid progress in various applications.
7. Concluding remarks
Since 2012, deep learning has greatly advanced research in image recognition, most notably in ImageNet ILSVRC and face recognition, and it is rapidly being extended to all kinds of problems related to image recognition. The essence of deep learning is to learn features automatically from big data through multiple layers of nonlinear transformation, replacing hand-designed features. The deep structure gives it strong expressive power and learning capacity; it is especially good at extracting complex global features and contextual information, which is difficult for shallow models. In an image, various hidden factors tend to be entangled in complex, nonlinear ways, and deep learning disentangles these factors so that different neurons in the highest hidden layer represent different factors, which makes classification easier.
The deep model is not a black box. It is closely related to traditional computer vision systems, but it allows the modules of the system (that is, the layers of the neural network) to be learned jointly and optimized as a whole, which greatly improves performance. Applications related to image recognition are also driving the rapid development of deep learning in network structure, layer design and training methods. We can foresee that in the coming years deep learning will enter a period of rapid development in theory, algorithms and applications, and we expect more and more excellent work to have a profound impact on academia and industry.
Research progress and prospect of deep learning in image recognition