Deep learning in image recognition


Original source: http://mp.weixin.qq.com/s?__biz=MzAwNDExMTQwNQ==&mid=209152042&idx=1&sn=Fa0053e66cad3d2f7b107479014d4478#rd

1. The Development History of Deep Learning

Deep learning is one of the most important breakthroughs in artificial intelligence of the past decade. It has been applied successfully in speech recognition, natural language processing, computer vision, image and video analysis, multimedia, and many other fields. Existing deep learning models are neural networks. The origins of neural networks can be traced back to the 1940s, and they enjoyed great popularity in the 1980s and 1990s. Neural networks attempt to solve machine learning problems by simulating the cognitive mechanisms of the brain. In 1986, Rumelhart, Hinton, and Williams published in the journal Nature the famous back-propagation algorithm for training neural networks, which is still widely used today.

Neural networks have a large number of parameters and often suffer from overfitting: although their accuracy on the training set is very high, their performance on the test set can be poor. This was partly because the training data sets of the time were small, and computing resources were limited, so even training a small network took a long time. Compared with other models, neural networks showed no significant advantage in recognition accuracy.

Therefore, more scholars turned to classifiers such as support vector machines (SVM), boosting, and nearest neighbors. These classifiers can be simulated by a neural network with one or two hidden layers, so they are called shallow machine learning models. In this paradigm, different systems are designed for different tasks, each with different hand-crafted features: object recognition uses the scale-invariant feature transform (SIFT), face recognition uses local binary patterns (LBP), and pedestrian detection uses histogram of oriented gradients (HOG) features.
In 2006, Hinton proposed deep learning. It has since achieved great success in many fields and received wide attention. Neural networks were able to regain their vitality for several reasons. First, the emergence of large-scale training data largely alleviated the problem of overfitting; for example, the ImageNet training set has millions of labeled images. Second, the rapid development of computer hardware provided powerful computing capacity: a single GPU chip can integrate thousands of cores, which makes training large-scale neural networks feasible. Third, the model designs and training methods of neural networks made great strides. For example, to ease training, scholars proposed unsupervised, layer-by-layer pre-training, which brings the network parameters to a good starting point before the network is optimized with back-propagation, so that training finishes at a better local minimum.

The most influential breakthrough of deep learning in computer vision took place in 2012, when Hinton's research group won the ImageNet image classification competition with deep learning. The teams ranked 2nd to 4th used traditional computer vision methods with hand-designed features, and the differences in accuracy among them were no more than 1%. Hinton's group outperformed the second-place team by more than 10% (see Table 1). This result caused a great shock in the field of computer vision and triggered the current upsurge of deep learning.

Another important challenge in computer vision is face recognition. Studies have shown that, if only the central area of the face (excluding hair) is visible, human recognition accuracy on the Labeled Faces in the Wild (LFW) database is 97.53%; if the entire image, including background and hair, is visible, human accuracy is 99.15%. The classical Eigenface algorithm achieves only a 60% recognition rate on the LFW test set; among non-deep-learning algorithms, the highest recognition rate is 96.33%. At present, deep learning can reach a 99.47% recognition rate.
Six months after Hinton's group won the ImageNet competition, Google and Baidu both released new search engines based on image content. They applied deep learning models to their own data and found that the accuracy of image search improved greatly. Baidu established its Institute of Deep Learning in 2012 and, in May 2014, set up a new deep learning laboratory in Silicon Valley, hiring Stanford's renowned professor Andrew Ng as chief scientist. Facebook set up a new AI laboratory in New York in December 2013, hiring Yann LeCun, a prominent deep learning scholar and the inventor of convolutional networks, as its chief scientist. In January 2014, Google spent $400 million to acquire DeepMind, a deep learning startup. In 2013, MIT Technology Review listed deep learning as one of the top ten technological breakthroughs in the world, given its immense influence in academia and industry.
2. What Makes Deep Learning Different?

What are the key differences between deep learning and other machine learning methods, and why has it succeeded in so many fields?

• Feature Learning
The biggest difference between deep learning and traditional pattern recognition methods is that its features are learned automatically from big data rather than designed by hand. Good features can greatly improve the performance of a pattern recognition system. For decades, hand-crafted features dominated all kinds of pattern recognition applications. Manual design relies mainly on the designer's prior knowledge and has difficulty exploiting big data; because parameters must be tuned by hand, the number of parameters a hand-designed feature may contain is very limited. Deep learning can automatically learn feature representations from big data, and these representations can contain thousands of parameters.
Designing effective features by hand often takes five to ten years, whereas for a new application deep learning can quickly learn new, effective feature representations from the training data.
A pattern recognition system consists of two parts: features and a classifier. In traditional methods, features and classifier are optimized separately. Within a neural network framework, the feature representation and the classifier are optimized jointly, maximizing the performance of their collaboration.
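As a concrete illustration of joint optimization, the sketch below defines a small convolutional feature extractor and a linear classifier as one module, so that a single loss trains both together. PyTorch is our choice of framework here (the article does not prescribe one), and the architecture is a toy example, not any specific published model.

```python
# Minimal sketch of joint feature/classifier optimization (PyTorch assumed).
# One cross-entropy loss back-propagates through the classifier *and* the
# feature extractor, so the two are optimized together, not separately.
import torch
import torch.nn as nn

class JointModel(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extractor: learned from data, not hand-designed.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Classifier: optimized jointly with the features above.
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.classifier(f)

model = JointModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()      # gradients flow into both parts
optimizer.step()     # one update improves features and classifier together
```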

The convolutional network model adopted by Hinton's group in the 2012 ImageNet competition [9] contains 60 million parameters learned from millions of samples. Features learned from ImageNet have very strong generalization ability and can be applied successfully to other data sets and tasks, such as object detection, tracking, and retrieval. Another famous competition in computer vision is PASCAL VOC; however, its training set is small and not suitable for training deep learning models. Nevertheless, by applying features learned from ImageNet to object detection on PASCAL VOC, some scholars raised the detection rate by 20%.

Since feature learning is so important, what makes a good feature? In an image, complex factors of variation are usually combined in a non-linear way. For example, a face image contains identity, pose, age, expression, illumination, and other information. The key to deep learning is that it successfully disentangles these factors through multiple layers of non-linear mappings: in the last hidden layer of a deep model, for instance, different neurons represent different factors. If this hidden layer is taken as a feature representation, face recognition, pose estimation, expression recognition, and age estimation all become very simple, because the factors are now related in a simple, linear way and no longer interfere with each other.

• Advantages of Deep Structures
The "deep" in deep learning means that the neural network structure is deep, composed of many layers, whereas other commonly used machine learning models, such as support vector machines and boosting, have shallow structures. A three-layer neural network (with an input layer, an output layer, and one hidden layer) can approximate any classification function. Why, then, do we need deep models?
Research shows that, for a given task, if the depth of the model is insufficient, the number of computational units it needs grows exponentially. This means that although shallow models can express the same classification functions, they require far more parameters and training samples. A shallow model provides a local representation: it divides the high-dimensional image space into several local regions, and each local region stores at least one template obtained from the training data, as shown in Figure 1(a). The shallow model matches a test sample against the templates and predicts its category from the matching results. For example, in a support vector machine the templates are the support vectors; in a nearest-neighbor classifier the templates are all of the training samples. As the complexity of a classification problem increases, the image space must be divided into more and more local regions, so more and more parameters and training samples are needed. Although many deep models today have very large numbers of parameters, if they were replaced with shallow networks, the number of parameters required to fit the data equally well would be infeasibly large.

The key to reducing the parameters of a deep model is to reuse the computational units of the middle layers. Taking face recognition as an example, deep learning can learn a hierarchical feature representation of face images: the bottom layer learns filters from raw pixels that capture local edges and textures; the middle layers combine these edge filters to describe different types of facial organs; and the highest layer describes the global appearance of the whole face.
Deep learning provides a distributed representation of features. In the highest hidden layer, each neuron represents an attribute classifier (Figure 1(b)), for example for gender, ethnicity, or hair color. Each neuron divides the image space in two, so a combination of N neurons can express 2^N local regions, whereas a shallow model would need at least 2^N templates to express the same regions. The deep model is thus more expressive and more efficient.
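The counting argument can be made concrete with a toy numpy sketch (ours, not from the article): N thresholded "neurons" jointly assign one of 2^N binary region codes to each input, whereas a template-based shallow model would need a separate template per region.

```python
# Toy illustration of distributed vs. local representation (numpy).
# N thresholded "neurons" jointly assign one of 2**N binary codes to each
# input; a template/local method would need on the order of 2**N templates.
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 16                     # 8 binary neurons over 16-dim inputs
W = rng.standard_normal((N, D))  # one random hyperplane per neuron

def region_code(x):
    """Each neuron halves the space; together they index 2**N regions."""
    bits = (W @ x > 0).astype(int)
    return "".join(map(str, bits))

xs = rng.standard_normal((1000, D))
codes = {region_code(x) for x in xs}
print(f"{N} neurons can address 2**{N} = {2**N} regions; "
      f"these 1000 samples already occupy {len(codes)} distinct ones")
```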
• Ability to Extract Global Features and Contextual Information
Deep models have strong learning capacity and efficient feature expression. An even more important advantage is that they extract information from pixel-level raw data all the way up to abstract semantic concepts, which gives them an outstanding ability to capture global features and contextual information in images. This brings new ways of thinking to traditional computer vision problems such as segmentation and key-point detection.
Take face image labeling as an example (Figure 2). To predict which facial organ (eye, nose, mouth) each pixel belongs to, common practice is to take a small patch around the pixel, extract texture features from it (for example, local binary patterns), and then classify with a shallow model such as a support vector machine. Because a local patch contains a limited amount of information, this often produces classification errors, so smoothness and shape-prior constraints are added to the segmented image afterward.

Even under local occlusion, the human eye can estimate the labels of the occluded parts from information in other regions of the face. Clearly, global and contextual information is important for local judgments, and it is lost at the very start of approaches based on local features. Ideally, a model should take the whole image as input and directly predict the whole segmentation map, so that image segmentation becomes a high-dimensional data transformation problem. This not only exploits contextual information but also implicitly incorporates shape priors during the transformation. However, because whole-image content is so complex, shallow models cannot capture global features effectively. The emergence of deep learning has made this idea possible, and it has succeeded in face segmentation, body segmentation, face image registration, human pose estimation, and other tasks.
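The idea of "whole image in, whole label map out" can be sketched as a small fully convolutional network that maps an image directly to per-pixel class scores. This is a minimal illustration under our own assumptions, not the architecture used in the cited face-parsing work.

```python
# Minimal sketch of "whole image in, whole label map out" (PyTorch assumed).
# A fully convolutional net keeps spatial structure end to end, so every
# pixel's prediction can draw on context through the deeper layers.
import torch
import torch.nn as nn

class WholeImageLabeler(nn.Module):
    def __init__(self, num_parts=4):   # e.g. background/eye/nose/mouth
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(),
            # A large kernel here widens the receptive field, letting
            # distant face regions inform each pixel's label -- the
            # global/contextual cue discussed in the text.
            nn.Conv2d(64, 64, 7, padding=3), nn.ReLU(),
            nn.Conv2d(64, num_parts, 1),   # per-pixel class scores
        )

    def forward(self, x):
        return self.net(x)               # (B, num_parts, H, W)

model = WholeImageLabeler()
img = torch.randn(1, 3, 64, 64)
label_map = model(img).argmax(dim=1)     # predicted part index per pixel
print(label_map.shape)                   # torch.Size([1, 64, 64])
```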

• Joint Deep Learning
Some computer vision researchers regard deep learning models as black boxes; this view is not comprehensive. Traditional computer vision systems and deep learning models are in fact closely related, and this connection can be used to propose new deep models and training methods. Joint deep learning for pedestrian detection is a successful example. A computer vision system consists of several key modules; a pedestrian detector, for instance, includes feature extraction, part detectors, geometric modeling of parts, occlusion reasoning over parts, and a classifier. In joint deep learning, a correspondence is established between each layer of the deep model and each module of the vision system. If a key module of the vision system has no corresponding layer in existing deep models, this can inspire a new deep model. For example, research on large-scale object detection shows that modeling the geometric deformation of object parts effectively improves detection rates; since common deep models had no corresponding layer, joint deep learning and its follow-up work proposed new deformation layers and deformation pooling layers to realize this function.
From the perspective of training, the modules of a computer vision system are trained one by one or designed by hand, and in the pre-training stage of a deep model each layer is likewise trained individually. If the correspondence between computer vision systems and deep models is established, the experience accumulated in vision research can guide the pre-training of deep models, and a model pre-trained in this way can already achieve results comparable to a traditional computer vision system. On this basis, deep learning then uses back-propagation to optimize all layers jointly, so that their cooperation reaches an optimum and the performance of the whole network improves considerably.
3. Applications of Deep Learning
3.1. Deep Learning in Object Recognition
• ImageNet Image Classification
The most important progress of deep learning in object recognition is in the image classification task of the ImageNet ILSVRC challenge. The lowest error rate achieved by traditional computer vision methods on this test set was 26.172%. In 2012, Hinton's research group used a convolutional network to reduce the error rate to 15.315%. This network structure, known as AlexNet, differs from traditional convolutional networks in three respects. First, AlexNet employs the dropout training strategy, randomly setting the outputs of some neurons in the input and hidden layers to zero during training. This simulates noise perturbations of the data that cause some neurons to miss certain visual patterns; dropout makes training converge more slowly, but the resulting model is more robust. Second, AlexNet uses rectified linear units (ReLU) as its non-linear activation function. This not only greatly reduces computational complexity, but also makes neuron outputs sparse and more robust to various perturbations. Third, AlexNet reduces overfitting by generating more training samples, mirroring the training images and adding random translation perturbations.
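The three ingredients are easy to see in a few lines of modern framework code. The sketch below uses PyTorch and torchvision, which is an anachronism of convenience (AlexNet was trained with custom GPU code), and the layer sizes follow only AlexNet's first stage, loosely.

```python
# Sketch of AlexNet's three ingredients in a modern framework (PyTorch
# assumed; an illustration, not the original implementation).
import torch
import torch.nn as nn
import torchvision.transforms as T

# (3) Data augmentation: random crops and horizontal flips enlarge the
# effective training set and reduce overfitting.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(),            # (2) rectified linear unit: cheap, sparse outputs
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Dropout(p=0.5),    # (1) randomly zero neurons during training
    nn.LazyLinear(1000),  # 1000 ImageNet classes
)

block.train()             # dropout is active only in training mode
x = torch.randn(2, 3, 224, 224)
print(block(x).shape)     # torch.Size([2, 1000])
```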

In the ImageNet ILSVRC 2013 competition, the top 20 teams all used deep learning techniques. The winner was Rob Fergus's team at New York University; its deep model was a convolutional network with a further optimized structure, achieving an error rate of 11.197%. The model is called Clarifai.
In ILSVRC 2014, the winner, GoogLeNet [18], lowered the error rate to 6.656%. GoogLeNet's salient feature is that it greatly increases the depth of the convolutional network to more than 20 layers, which was unthinkable before. Very deep networks make back-propagation of the prediction error difficult: the error is transmitted from the topmost layer downward and is so attenuated by the time it reaches the bottom layers that it can hardly drive updates to their parameters. GoogLeNet's strategy is to add supervisory signals directly to several intermediate layers, which forces the middle and lower feature representations to be capable of accurately classifying the training data themselves. How to train very deep network models effectively remains an important topic for future research.
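A hedged sketch of this strategy, often called deep supervision: an auxiliary classifier attached to an intermediate layer contributes a down-weighted loss, so lower layers receive a direct training signal. The tiny two-stage network is our own toy, not GoogLeNet itself; the 0.3 weight is the value reported in the GoogLeNet paper.

```python
# Sketch of auxiliary ("deep") supervision in the spirit of GoogLeNet
# (PyTorch assumed). An intermediate classifier injects the training signal
# part-way down, so low layers receive usable gradients in a very deep net.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeplySupervised(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.aux_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(32, num_classes))
        self.main_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(64, num_classes))

    def forward(self, x):
        h1 = self.stage1(x)
        h2 = self.stage2(h1)
        return self.aux_head(h1), self.main_head(h2)

model = DeeplySupervised()
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
aux_logits, main_logits = model(x)
# The auxiliary loss is down-weighted (0.3 is GoogLeNet's published weight).
loss = F.cross_entropy(main_logits, y) + 0.3 * F.cross_entropy(aux_logits, y)
loss.backward()
```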
While deep learning has achieved great success on ImageNet, many practical applications have much smaller training sets; how can deep learning be applied in such cases? Three methods are worth noting, as sketched below. (1) Use the model trained on ImageNet as a starting point, and continue training on the target training set with back-propagation to adapt it to the specific application; ImageNet then plays the role of pre-training. (2) If the target training set is not large enough, fix the lower layers of the network at the values learned on ImageNet and update only the upper layers. This is because the lower layers are the hardest to update, while the filters they learned from ImageNet tend to describe generic local edge and texture information that generalizes well across images. (3) Use the model trained on ImageNet directly, taking the output of its highest hidden layer as a feature representation in place of common hand-designed features.
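The three recipes, sketched with torchvision's ResNet-18 as a stand-in for an ImageNet-pretrained model (our choice of convenience; the article predates these libraries, and the 20-class target task is invented for illustration):

```python
# Three transfer recipes with a torchvision model (an assumption of
# convenience; weights download on first use).
import torch
import torch.nn as nn
from torchvision import models

# (1) Fine-tune everything: ImageNet weights are just the starting point.
m1 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
m1.fc = nn.Linear(m1.fc.in_features, 20)          # 20 target classes
opt1 = torch.optim.SGD(m1.parameters(), lr=1e-3)  # updates all layers

# (2) Freeze the lower layers, train only the top.
m2 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in m2.parameters():
    p.requires_grad = False            # generic edge/texture filters stay fixed
m2.fc = nn.Linear(m2.fc.in_features, 20)          # new head is trainable
opt2 = torch.optim.SGD(m2.fc.parameters(), lr=1e-3)

# (3) Use the network as a fixed feature extractor for another classifier.
m3 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
m3.fc = nn.Identity()                  # expose the highest hidden layer
m3.eval()
with torch.no_grad():
    feats = m3(torch.randn(8, 3, 224, 224))       # 512-dim features
print(feats.shape)                                # torch.Size([8, 512])
```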


• Face Recognition
Another important breakthrough of deep learning in object recognition is face recognition. The biggest challenge in face recognition is distinguishing intra-class variation, caused by illumination, pose, and expression, from inter-class variation caused by differences in identity. The distributions of these two kinds of variation are non-linear and extremely complex, and traditional linear models cannot separate them effectively. The purpose of deep learning is to obtain, through multi-layer non-linear transformations, new feature representations that remove as much intra-class variation as possible while preserving inter-class variation.
Face recognition comprises two kinds of tasks: face verification and face identification. Face verification determines whether two face photos belong to the same person; it is a binary classification problem, and the accuracy of random guessing is 50%. Face identification assigns a face image to one of N identity classes. As a multi-class problem it is more challenging, its difficulty grows with the number of classes, and the accuracy of random guessing is 1/N. Both tasks can serve as supervision for learning facial feature representations with deep models.
In 2013, one study used the face verification task as the supervisory signal and a convolutional network to learn facial features, obtaining a 92.52% recognition rate on LFW. This result is lower than those of the deep learning methods described below, but it already surpassed most non-deep-learning algorithms. Because face verification is a binary classification problem, it is a relatively inefficient way to learn facial features and overfits the training set easily. Face identification is a far more challenging multi-class problem; it is less prone to overfitting and better suited to learning facial features with deep models. Put another way, in face verification each training sample is labeled as one of only two classes and carries little information, whereas in face identification each sample is labeled as one of N classes and carries much more.
At the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), DeepID and DeepFace both used face identification as the supervisory signal and obtained recognition rates of 97.45% and 97.35% on LFW, respectively (see Table 2). They use convolutional networks to predict N-dimensional identity label vectors and take the highest hidden layer as the face feature. Because this layer must distinguish a large number of identities during training (for example, about 10,000 identity classes in DeepID), it contains rich inter-class information and generalizes strongly. Although face identification is the task used in training, the learned features apply equally to face verification, and to identities never seen in the training set. For example, the LFW test task is face verification, unlike the identification task used in training, and the identities in the DeepID and DeepFace training sets do not overlap with those in the LFW test set.

Facial features learned through the identification task alone still contain substantial intra-class variation. DeepID2 combines face identification and face verification as joint supervisory signals; the resulting features minimize intra-class variation while preserving inter-class variation, raising the face recognition rate on LFW to 99.15%. Using a Titan GPU, DeepID2 takes only 35 milliseconds to extract the features of a face image, and this step can be done offline. After compression by principal component analysis (PCA), an 80-dimensional feature vector is finally obtained, which can be used for fast online face comparison. In follow-up work, DeepID2+ achieved a 99.47% recognition rate on LFW by enlarging the network structure, increasing the training data, and adding supervisory information to every layer.
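A minimal sketch of such joint supervision, under our own assumptions (the toy embedding network, the margin, and the 0.05 loss weight are illustrative choices, not DeepID2's published values): an N-way identification loss and a pairwise verification loss are applied to the same feature layer.

```python
# Sketch of DeepID2-style joint supervision (PyTorch assumed): an N-way
# identification loss plus a pairwise verification loss on the same features.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 55 * 47, 160))  # toy feature net
ident_head = nn.Linear(160, 1000)        # e.g. 1000 training identities

def verification_loss(f1, f2, same, margin=1.0):
    """Pull same-identity pairs together, push different ones apart."""
    d = F.pairwise_distance(f1, f2)
    return torch.where(same, d.pow(2),
                       F.relu(margin - d).pow(2)).mean()

x1, x2 = torch.randn(16, 3, 55, 47), torch.randn(16, 3, 55, 47)
y1, y2 = torch.randint(0, 1000, (16,)), torch.randint(0, 1000, (16,))
f1, f2 = embed(x1), embed(x2)

ident = F.cross_entropy(ident_head(f1), y1)     # preserves inter-class info
verif = verification_loss(f1, f2, y1.eq(y2))    # shrinks intra-class variation
loss = ident + 0.05 * verif                     # weighting is our guess
loss.backward()
```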
Some people attribute the success of deep learning simply to fitting a data set with a complex model that has a large number of parameters; this view is also far from complete. The success of DeepID2+, for example, also stems from many important and interesting properties of what it learns: the responses of its top-layer neurons are moderately sparse, highly selective for face identity and a variety of facial attributes, and strongly robust to local occlusion. In previous studies, obtaining such properties usually required adding explicit constraints to the model, yet DeepID2+ acquires them automatically through large-scale learning. The theory behind this deserves further study in the future.

3.2. Deep Learning in Object Detection

Object detection is more difficult than object recognition: an image may contain multiple objects of different categories, and detection must determine the location as well as the category of each object. In 2013, the organizers of ImageNet ILSVRC added an object detection task, requiring the detection of 200 categories of objects in 40,000 Internet images. The 2013 winner used hand-designed features and reached a mean average precision (mAP) of only 22.581%. In ILSVRC 2014, deep learning raised the mAP to 43.933%. Influential work in this line includes R-CNN, OverFeat, GoogLeNet, DeepID-Net, Network-in-Network, VGG, and spatial pyramid pooling in deep CNNs. R-CNN first proposed the now widely used deep-learning-based object detection pipeline, sketched below: first use a non-deep-learning method (such as selective search) to propose candidate regions, then use a deep convolutional network to extract features from each candidate region, and finally use a linear classifier, such as a support vector machine, to classify each region as an object category or background based on those features. DeepID-Net further refined this pipeline and markedly improved the detection rate, with a detailed experimental analysis of each step's contribution. The design of the deep convolutional network structure is also critical: a network structure that improves accuracy on the image classification task usually improves detector performance significantly as well.
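The R-CNN pipeline just described, written as schematic Python. The helper functions (selective_search, crop_and_resize, svm) are hypothetical stand-ins for the corresponding components, not a real library API:

```python
# Schematic R-CNN-style detection loop (PyTorch assumed for the CNN part).
# selective_search, crop_and_resize, and svm are hypothetical stand-ins.
import torch

def rcnn_detect(image, cnn, svm, selective_search, crop_and_resize):
    detections = []
    # Step 1: a non-deep method proposes class-agnostic candidate boxes.
    for box in selective_search(image):
        # Step 2: a deep convolutional network turns each candidate region
        # into a fixed-length feature vector.
        patch = crop_and_resize(image, box, size=(224, 224))
        with torch.no_grad():
            feature = cnn(patch.unsqueeze(0)).squeeze(0)
        # Step 3: a linear classifier (an SVM in R-CNN) labels the region
        # as one of the object classes or as background.
        label, score = svm(feature)
        if label != "background":
            detections.append((box, label, score))
    return detections
```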
The success of deep learning is also reflected in pedestrian detection. On the largest pedestrian detection test set (Caltech), the widely used combination of histogram-of-oriented-gradients (HOG) features and deformable part models has an average miss rate of 68%; the best deep-learning-based result at present is 20.86%. In the latest research, many ideas proven effective in object detection have been incorporated into deep learning. For example, joint deep learning proposed a deformation layer to model the geometric deformation of object parts; multi-stage deep learning can simulate the cascaded classifiers commonly used in object detection; a switchable deep network can express mixture models of object parts; and the work in [35] adapts a deep-model pedestrian detector to a target scene through transfer learning.


3.3. Deep Learning in Video Analysis
The application of deep learning to video classification is still in its infancy, and much work remains. Static image features of video frames can be obtained from deep models learned on ImageNet; the difficulty lies in describing dynamic features. In previous visual research, the description of dynamic features typically relied on optical flow estimation, tracking of key points, and dynamic textures; the hard part is embodying such information in a deep model. The most straightforward approach is to treat the video as a three-dimensional image and apply convolutional networks directly, learning three-dimensional filters at each layer, but this obviously ignores the differences between the temporal and spatial dimensions. Another simple but more effective idea is to compute an optical flow field (or other dynamic features) in a preprocessing step and feed its spatial distribution to the convolutional network as an additional input channel. There is also work that uses deep autoencoders to extract dynamic textures in a non-linear way. In the latest research, long short-term memory networks (LSTM) have received wide attention for their ability to capture long-term dependencies and model complex dynamics in video.
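As a sketch of the LSTM idea: a shared CNN embeds each frame, and an LSTM aggregates the frame features over time. This is a toy illustration in PyTorch under our own assumptions, not a reproduction of any cited model (the 101 classes merely echo common action recognition benchmarks).

```python
# Sketch of LSTM-based video classification (PyTorch assumed): a shared
# CNN embeds each frame, and an LSTM models the temporal dependencies.
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.frame_cnn = nn.Sequential(          # per-frame spatial features
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, clip):                     # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.frame_cnn(clip.flatten(0, 1))    # (B*T, 16)
        feats = feats.view(B, T, -1)                  # restore the time axis
        _, (h_n, _) = self.lstm(feats)                # final hidden state
        return self.head(h_n[-1])                     # (B, num_classes)

model = VideoClassifier()
clip = torch.randn(2, 8, 3, 64, 64)              # 2 clips of 8 frames each
print(model(clip).shape)                         # torch.Size([2, 101])
```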

4. Future Development Prospects
The application of deep learning in image recognition is in the ascendant, with great room for development in the future.
One trend in research on object recognition and object detection is the use of larger and deeper network structures. In ILSVRC 2012, AlexNet contained only five convolutional layers and two fully connected layers; in ILSVRC 2014, GoogLeNet and VGG used network structures of more than 20 layers. Deeper structures make back-propagation harder, and at the same time the scale of training data is growing rapidly, so there is an urgent need for new algorithms and new parallel computing systems that use big data more effectively to train larger and deeper models.
Compared with image recognition, the application of deep learning to video classification is far from mature. Image features learned on ImageNet transfer directly and effectively to all kinds of image-related recognition tasks (image classification, image retrieval, object detection, image segmentation, and so on) and to other image test sets, showing good generalization. But deep learning has not yet produced a comparable feature for video analysis. To achieve this, not only must large-scale training data sets be established (the latest literature has built a database of one million YouTube videos), but new deep models suited to video analysis must also be investigated. The computational cost of training deep models for video analysis is also much higher.
In image- and video-related applications, the output predictions of deep models, such as segmentation maps or object detection boxes, tend to have spatial and temporal correlations. Studying deep models with structured outputs is therefore another key direction.
Although neural networks aim to solve machine learning problems in a general sense, domain knowledge plays an important role in the design of deep models. In image- and video-related applications, the most successful models are deep convolutional networks, whose design exploits the special structure of images: their two most important operations, convolution and pooling, both derive from domain knowledge about images. Introducing new, effective operations and layers into deep models through the study of domain knowledge is an important way to improve image and video recognition performance. For example, the pooling layer brings local translation invariance, and the deformation pooling layer mentioned above better describes the geometric deformation of object parts. Future studies could extend these ideas further toward rotation invariance, scale invariance, and robustness to occlusion.
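The claim that pooling brings local translation invariance can be verified directly. In this tiny PyTorch check (ours, not from the article), shifting a feature by one pixel within the pooling window leaves the pooled output unchanged:

```python
# Tiny demonstration that max pooling confers local translation invariance
# (PyTorch assumed): a one-pixel shift inside the 2x2 pooling window does
# not change the pooled output.
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 4, 4)
x[0, 0, 0, 0] = 1.0                        # a "feature" at position (0, 0)
shifted = torch.roll(x, shifts=1, dims=3)  # the same feature at (0, 1)

print(F.max_pool2d(x, 2))                  # both inputs pool to the
print(F.max_pool2d(shifted, 2))            # same 2x2 output
```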


Studying the relationship between deep models and traditional computer vision systems can not only help us understand why deep learning succeeds, but also inspire new models and training methods. Joint deep learning and multi-stage deep learning will see more work along these lines in the future.


Although deep learning has achieved great success in practice, and the properties exhibited by deep models trained on big data (such as sparsity, selectivity, and robustness to occlusion) are intriguing, much work remains in the theoretical analysis behind them. For example: when does training converge? How can a better local minimum be reached? What does each layer's transformation contribute to recognition, what invariances does it gain, and what information is lost? Recently, Mallat made quantitative analyses of deep network structures using wavelets, an important exploration in this direction.
5. Conclusion
A deep model is not a black box; it is closely related to traditional computer vision systems, but the joint learning and overall optimization of all of the neural network's layers greatly improve performance. Applications related to image recognition are, in turn, driving the rapid development of deep learning in network structure, layer design, and training methods. It is foreseeable that, in the coming years, deep learning will enter a period of high-speed development in theory, algorithms, and applications.

