Deep Learning (review, 2015, application)

Source: Internet
Author: User
Tags: svm

0. Original

Deep Learning Algorithms with Applications to Video Analytics for a Smart City: A Survey

1. Target Detection

The goal of target detection is to pinpoint the location of the target in an image. Many works based on deep learning have been proposed; we review the following representative ones:

Szegedy et al. [28] modified a deep convolutional network, replacing the last layer with a regression layer whose aim is to produce a binary mask for the target box. In addition, a multi-scale strategy is proposed to improve detection accuracy. As a result, they achieved an average precision of 0.305 over the 20 categories of VOC2007.

Unlike Szegedy's work, Girshick et al. [29] proposed a deep model based on bottom-up region extraction to solve the target detection problem. Figure 5 shows the pipeline of the algorithm. First, about 2,000 candidate regions are extracted from the input image. Then, for each region, a large CNN is used to extract features. Finally, each region is classified with a linear SVM. According to the paper, this method significantly improves detection accuracy, raising mAP by more than 30% relative to the previous best result.
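The three-step pipeline (region proposal → CNN features → linear SVM) can be sketched in miniature. Everything below is a toy stand-in, not Girshick's actual components: the sliding window replaces selective search, flattening the crop replaces the CNN, and the hand-set weight vector replaces a trained SVM.

```python
import numpy as np

def propose_regions(image, size=4, stride=4):
    """Toy stand-in for bottom-up region proposal: slide a fixed window."""
    h, w = image.shape
    boxes = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            boxes.append((y, x, size, size))
    return boxes

def extract_feature(image, box):
    """Toy stand-in for the CNN feature extractor: flatten the crop."""
    y, x, h, w = box
    return image[y:y + h, x:x + w].reshape(-1)

def classify(feature, weight, bias):
    """Linear SVM-style score: positive means 'target'."""
    return float(feature @ weight + bias)

# Tiny synthetic image: a bright square on a dark background.
img = np.zeros((8, 8))
img[4:8, 4:8] = 1.0

boxes = propose_regions(img)
w = np.ones(16) / 16.0   # hand-set scorer that responds to brightness
scores = [classify(extract_feature(img, b), w, -0.5) for b in boxes]
best = boxes[int(np.argmax(scores))]
```

The highest-scoring box lands on the bright square, which is all the three-step strategy amounts to once the learned parts are swapped out.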

【Girshick did exactly what I had in mind: first extract regions, then construct a feature space, and finally classify with a classifier (a three-step strategy). The "deep learning" used for target detection here is the "construct the feature space" step, implemented with a CNN. So the CNN is actually not a classifier but a representation of information, that is, a way of constructing feature vectors. Wait: if the information cannot naturally be expressed in vector form, and we force a CNN to extract vector-form features anyway, won't the results be poor? Back to the point: a CNN is a representation of information, currently known for extracting features from images, just like HOG or Haar features. Then how would one automatically extract the features of an action or event?】

Similarly, Erhan et al. [30] propose a saliency-based deep network to detect arbitrary targets. They use a deep neural network in a class-agnostic way to produce the candidate regions of the target, called bounding boxes. In their work, the object detection problem is turned into a regression problem on the coordinates of the bounding boxes. During training, back-propagation is used to update the box coordinates, the confidence scores, and the learned features, so as to solve the problem of matching predicted boxes to ground-truth boxes. In short, they customized a deep neural network for the purpose of target detection.
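Casting detection as regression on box coordinates, trained by back-propagation, reduces in the simplest case to gradient descent on a coordinate loss. A minimal sketch (squared error directly on the (x, y, w, h) values; the real network of course regresses through many layers, and the coordinates here are made up for illustration):

```python
import numpy as np

def box_loss(pred, target):
    """Squared error on the (x, y, w, h) box coordinates."""
    return float(np.sum((pred - target) ** 2))

def grad_step(pred, target, lr=0.1):
    """One back-propagation-style gradient step on the coordinates."""
    return pred - lr * 2.0 * (pred - target)

pred = np.array([0.20, 0.20, 0.50, 0.50])    # initial box guess
target = np.array([0.25, 0.30, 0.40, 0.60])  # ground-truth box
for _ in range(50):
    pred = grad_step(pred, target)           # predicted box converges to target
```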

【Their idea is: take the object detection problem, look at it from a different angle, transform it into another problem, and then tailor a deep network to that problem. From this article you can learn how to solve a problem by recasting it as a different problem in a deep-network way. This enlarges the range of problem domains the method (deep networks) can cover.】

Target detection has many applications in the smart city, such as pedestrian detection, vehicle detection, and detection in unmanned surveillance. Deep learning algorithms can handle a wide range of different targets. Therefore, in smart city systems, deep models are often used to process large-scale data.

2. Target Tracking

The goal of target tracking is to locate the target in subsequent frames of a video sequence, given its location in the first frame. We review some representative works as follows.

Wang et al. [31], aiming at the visual tracking problem, propose to learn deep features using a stacked denoising autoencoder. According to the paper, with the learned deep features the tracker outperforms 7 other recent trackers; on 10 test sequences, the average center error is only 7.3 pixels and the average success rate is 85.5%. In addition, the tracker with deep features runs at an average of 15 frames per second, which meets practical application requirements.
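A minimal NumPy sketch of what a stacked denoising autoencoder's forward pass looks like: masking corruption, a tied-weight encode/decode pair, and stacking by feeding one layer's code to the next. The layer sizes and random inputs are made up, and training (gradient updates on the reconstruction error) is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, p=0.3):
    """Masking noise: randomly zero out a fraction p of the inputs."""
    return x * (rng.random(x.shape) >= p)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAE:
    """One denoising autoencoder layer with tied weights."""
    def __init__(self, n_in, n_hidden):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b = np.zeros(n_hidden)   # hidden bias
        self.c = np.zeros(n_in)       # visible bias

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def reconstruct(self, x):
        # Training would minimize the error between this and the clean x;
        # the gradient updates are omitted in this sketch.
        return sigmoid(self.encode(corrupt(x)) @ self.W.T + self.c)

# Stack two layers: layer 2 operates on the codes produced by layer 1.
layers = [DenoisingAE(64, 32), DenoisingAE(32, 16)]
x = rng.random((5, 64))      # 5 toy patch vectors
h = x
for layer in layers:
    h = layer.encode(h)      # final h is the deep feature used downstream
```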

【This paper is impressive: it uses another deep learning architecture (the stacked denoising autoencoder) to learn deep features. The features are not confined to the CNN features in R-CNN, but are whatever features the problem needs. A CNN-style method can only give CNN features; other DL architectures besides CNN can also do "feature extraction" for specific problems. So from this point of view, deep learning starts from feature extraction. Exactly what problem, what data, what features to extract, and what deep architecture to use depends on your own accumulation and understanding.】

Li et al. [16], for the tracking problem, use a CNN to learn a discriminative feature representation. Figure 7 is a block diagram of the algorithm. As can be seen from the block diagram, a pool of multiple CNNs with different kernels is used to handle all possible low-level cues, which is designed to distinguish the target from its surroundings. One can also see that a class-specific CNN is used to track a certain type of object (such as a face).

【This author uses a CNN to extract features, with an emphasis on being discriminative. There are many tricks and options in how to use it, such as here: multiple CNNs, plus a class-specific CNN. Does that mean that besides class-specific CNNs there are also CNNs with no specified class, that is, "unsupervised" CNNs? Well, when using a CNN only to extract features, you do not have to specify a class; if it is used to classify, you do. However, the former should not be called an unsupervised CNN, because that CNN is not there to classify but to extract features.】

Wang et al. [33] propose learning a feature hierarchy that is robust to the motion and shape changes of the target. This hierarchy and the corresponding adaptation are shown in Figure 8. First, the method proposed in [34], a temporally constrained two-layer neural network, is used to learn generic features offline from auxiliary video; these generic features are robust to complex motion changes. Then, based on the sequence of the specified target, some features are further learned online. The advantage is that these online features can capture the shape changes of the target. For example, the tracker presented in this paper can deal not only with the non-rigid shape changes of basketball players, but also with target-specific shape changes. According to the paper, this feature learning method significantly improves tracking, especially for targets with complex motion changes.

【This work is based on [34], whose authors are early cultivators of DL, such as Andrew Ng and Kai Yu; what [34] does is propose a method for extracting invariant features of images. The author of this paper directly adopts that method, then designs another method on top of it, composed of two stages, and obtains the hierarchical features presented here. Another good idea: feature self-learning can solve many problems that traditional thinking cannot, such as a tracked target that changes greatly. So, think about where this idea of feature learning can be used in your own problems.】

Target tracking can be used in smart city surveillance systems. The ability to automatically track suspects or target vehicles is important for public safety. For massive volumes of video data, deep learning methods can enhance the performance of tracking systems.

3. Face Recognition

Face recognition consists of two main tasks: the first is face verification, and the second is face identification. The goal of the former is, given two faces, to decide whether they belong to the same person. The goal of the latter is to determine the identity of a given face based on a known set of faces. Recently, many deep learning methods have yielded good results on both of these problems.
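The two tasks differ only in how a learned feature vector is used: verification thresholds a pairwise distance, identification takes the nearest gallery identity. A toy sketch, assuming the deep features are already extracted (the 2-D vectors, names, and threshold below are all made up):

```python
import numpy as np

def verify(f1, f2, threshold=0.5):
    """Face verification: same person iff the feature distance is small.
    In practice the threshold would be tuned on a validation set."""
    return bool(np.linalg.norm(f1 - f2) < threshold)

def identify(query, gallery):
    """Face identification: return the identity of the nearest
    gallery face in feature space."""
    names = list(gallery)
    dists = [np.linalg.norm(query - gallery[n]) for n in names]
    return names[int(np.argmin(dists))]

# Toy 2-D "deep features" for a known gallery and a probe face.
gallery = {"alice": np.array([0.0, 0.0]), "bob": np.array([1.0, 1.0])}
probe = np.array([0.1, 0.0])
```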

Huang et al. [35] use a convolutional deep belief network to learn hierarchical features. The main contributions of the paper are as follows:
1. A local convolutional restricted Boltzmann machine is designed to handle the global structure of a class of objects such as faces.
2. Deep learning is applied to LBP features [36] rather than raw pixel values, in order to capture more complex features.
3. To strengthen the multilayer network, learning the parameters of the network architecture becomes very important.
Figure 9 shows the local convolutional restricted Boltzmann machine. The paper reports that the learned features can match the best hand-designed features of the time. In fact, in later work, deep features leave hand-designed features far behind.
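Since the deep model here operates on LBP features rather than raw pixels, a quick reminder of the basic LBP operator may help (this is the plain 8-neighbour, 3x3 variant; [36] describes the full descriptor built on it):

```python
import numpy as np

def lbp_code(patch):
    """Basic local binary pattern: threshold the 8 neighbours of a
    3x3 patch against its centre and pack the bits into one byte."""
    center = patch[1, 1]
    # Neighbours in clockwise order starting at the top-left corner.
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum(int(v >= center) << k for k, v in enumerate(neighbours))

flat = np.ones((3, 3))                       # uniform patch
peak = np.zeros((3, 3)); peak[1, 1] = 1.0    # centre brighter than all
```

A histogram of such codes over an image region gives the LBP feature the deep model consumes.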

【Feature learning again; this time it is a restricted Boltzmann machine, plus local convolution. So DL is not limited to LeCun's 1989 article, not limited to CNNs, but something more essential. In addition, this article already hints that DL need not be applied directly to raw pixels: after one layer of processing (LBP features), the DL method is then applied.】

Taigman et al. [15] present a face alignment algorithm based on a 3D face model, and use a 9-layer network to learn a face representation. Figure 10 is a block diagram of this 9-layer network. The first three layers extract low-level features (such as edges and textures). The next three layers are locally connected, learning a different set of filters for each location of the face image, because different regions have different local statistics. The following two layers are fully connected and capture correlations between features at distant positions of the face image. The last layer is a K-way softmax over the class labels. The training goal is to maximize the probability of the correct class by minimizing the cross-entropy loss for each training sample. The paper shows that this method achieves near-human performance on the LFW data set.
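The K-way softmax plus cross-entropy objective at the top of that network, in a few lines (the logits are toy values; the layers that would produce them are omitted):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return float(-np.log(softmax(logits)[label]))

logits = np.array([2.0, 0.5, -1.0])   # toy 3-way identity scores
loss_if_class0 = cross_entropy(logits, 0)
loss_if_class2 = cross_entropy(logits, 2)
```

Minimizing this loss pushes probability mass onto the correct identity, which is exactly "maximizing the probability of the correct class".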

【From this 9-layer network one can see: 1. Lower layers extract low-level features, higher layers extract high-level features (really?). 2. The deep network simply extracts features, which has nothing to do with classification; the last layer here does the classification (the neural-network method described in any general pattern recognition book). So, in fact, the classification problem has nothing to do with DL (really?). 3. Although it has 9 layers, every layer's design is well grounded, not blind nonsense.】

Sun et al. [37], for the face verification problem, also use DL to extract deep features, to which they gave their own name: DeepID. Figure 11 shows the feature extraction steps. First of all......

【】

Lu et al. [38] automatically learn a hierarchical feature representation from raw pixels by a joint feature learning method. Figure 12 is the algorithm's block diagram. First, each face image is divided into non-overlapping regions, which are combined to learn the feature weight matrices. Then, the features learned in each region are pooled and expressed as local histogram features. Finally, these local features are concatenated into one long feature vector that represents the face. In addition, the joint learning model is organized hierarchically, so that hierarchical information can be mined. This method achieves relatively good results on 5 face datasets.
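The split-into-regions, pool-to-histogram, concatenate steps can be sketched as follows. The learned feature mapping is replaced by the raw region values purely for illustration, and the grid and bin sizes are made up:

```python
import numpy as np

def region_histogram(region, bins=8):
    """Pool one region's responses into a normalized local histogram."""
    hist, _ = np.histogram(region, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def face_descriptor(image, grid=4, bins=8):
    """Split the face into non-overlapping regions and concatenate the
    regions' local histograms into one long feature vector."""
    h, w = image.shape
    rh, rw = h // grid, w // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            region = image[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
            feats.append(region_histogram(region, bins))
    return np.concatenate(feats)

face = np.random.default_rng(1).random((32, 32))   # toy "face" image
vec = face_descriptor(face)                        # 16 regions x 8 bins
```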

【Feature learning occupies only a limited part of the whole pipeline here: it starts from pixels and ends at intermediate features; then traditional feature methods take over, such as histogram descriptors, which are then concatenated into a feature vector (why? Is simple concatenation really the best way? Since the feature vector is so important, isn't simple concatenation too casual?). In addition, this article mentions the idea of "mining hierarchical information with a deep hierarchical architecture", as mentioned earlier. So, what is hierarchical information? And how does one use a deep hierarchy to mine this so-called hierarchical information?】

Face recognition has been widely used in security systems and human-computer interaction systems. It is still a challenging problem because of uncertainties such as illumination, pose, and expression. Deep learning can use big data to train deep models that yield more effective facial features. In the future, face recognition systems in smart cities will rely heavily on hierarchical features learned by deep models.

4. Image Classification

Image classification has been a hot research field in the past few decades. Many good methods have been proposed, such as bag-of-words representations, spatial pyramid matching, topic models, part-based models, sparse coding, and so on. These methods use raw pixel values or hand-designed features, and do not obtain data-driven representations. Recently, deep learning has yielded very good results on image classification. We review some of the following works:

Krizhevsky et al. [11] (Hinton's group; the historic image classification work of 2012) designed a deep convolutional network with 60 million parameters and 650,000 neurons. This deep model has 5 convolutional layers, some followed by max-pooling layers, then 3 fully connected layers, and finally a 1000-way softmax layer. Figure 13 is a block diagram of this deep model. It lowered the top-5 error rate of image classification to 15.3% in the ILSVRC 2012 competition, whose data set contains 1.2 million high-resolution images in 1000 categories. This competition is renowned in the field of computer vision and is held once a year; it attracts not only academia but also industry. For example, Google won the 2014 competition with a 6.66% error rate. For now, high-performance computing is critical for deep learning: Baidu, a Chinese search engine company, reported a 5.98% error rate using a supercomputer called Minwa, which contains 36 server nodes with 4 NVIDIA GPUs per node. (I don't feel like writing more; this is all copied anyway.)
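The quoted parameter count can be roughly checked from the standard AlexNet layer sizes. This ignores the two-GPU grouping in the original (which removes some cross-connections), so it slightly overcounts relative to the ~60 million figure:

```python
def conv_params(n_filters, k, in_channels):
    """k x k convolution: weights plus one bias per filter."""
    return n_filters * (k * k * in_channels) + n_filters

def fc_params(n_in, n_out):
    """Fully connected layer: weight matrix plus biases."""
    return n_in * n_out + n_out

total = (conv_params(96, 11, 3)          # conv1
         + conv_params(256, 5, 96)       # conv2
         + conv_params(384, 3, 256)      # conv3
         + conv_params(384, 3, 384)      # conv4
         + conv_params(256, 3, 384)      # conv5
         + fc_params(6 * 6 * 256, 4096)  # fc6
         + fc_params(4096, 4096)         # fc7
         + fc_params(4096, 1000))        # fc8 -> 1000-way softmax
```

Notice that the three fully connected layers dominate the count, which is why later architectures worked hard to shrink them.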

【Hinton's group's network, AlexNet, has historical significance and is worth in-depth study.】

Lu et al. [44] proposed multi-manifold deep metric learning (a real mouthful). First, each image is modeled with a manifold; then the manifold model is fed into the multilayer network of the deep model and mapped to another feature space. In particular, each deep network is class-specific, so different classes have different parameters. Then, a maximal manifold margin criterion is used to learn the parameters of these manifolds. In the test phase, the class-specific deep networks compute the similarity between the test image and all training classes; the nearest one gives the classification result. This approach yields good results on 5 widely used data sets.

【Deep learning meets manifolds. First of all, modeling an image with a manifold is, when you think about it, quite magical. A "class-specific deep model" is mentioned here, so the deep model does not just extract features, but extracts them for specific classes, which is a bit more involved. Does it combine target-specific characteristics with target-independent ones? Is doing that really a good idea?】

Zuo et al. [14], for image classification, design an end-to-end hierarchical convolutional recurrent neural network (C-HRNN) to mine contextual correlations. Figure 15 shows the structure of the C-HRNN. First, a 5-layer CNN extracts mid-level representations of image regions. Then the output of the 5th layer is processed at multiple scales. Within each scale, spatial dependencies are mined through the relations between a region and its neighbors; across scales, scale dependencies are expressed by the correspondence between higher-scale and lower-scale regions. Finally, the HRNN outputs of the different scales are fed to two fully connected layers. C-HRNN not only makes full use of the CNN's representational power, but also effectively encodes the spatial and scale dependencies of different regions. The algorithm achieves good results on 4 image classification data sets.

【This author adds spatial dependencies and scale dependencies on top of a CNN, which amounts to an improvement of CNN. In other words, once you have a good grasp of the essence of CNNs, you can freely use them in your own problems, combined with your understanding of other basic features and how to express them, and so design a deep model for your own problems. This is what I want to do: design a deep model that solves my own problems.】

5. Scene Labeling

The goal of scene labeling is to assign a semantic label to each pixel of a scene image. This is very challenging, because in many cases some categories are hard to distinguish, and in the real world the pixels of an "object" can vary in size, illumination, pose, and so on. Deep learning methods have given very good results on the scene labeling problem. We review some of these works.

Shuai et al. [19] propose using a CNN as a parametric model to learn discriminative features for scene labeling. Figure 16 is a block diagram of the method. First, global scene semantics are used to remove ambiguities in the local context, by transferring inter-class dependencies and priors obtained from similar exemplars. Then, at the pixel level, the global potentials are integrated as global beliefs. By combining the global and local beliefs, the labeling result is obtained. Finally, a large-margin, metric-learning-based method is used to improve the accuracy of the global beliefs. This model achieves good results on the SiftFlow and Stanford data sets.
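The idea of letting a global scene prior disambiguate local per-pixel scores can be illustrated with a toy blend. The paper's actual integration is learned; the fixed average, the two classes, and the scores below are all made up for illustration:

```python
import numpy as np

def combine_beliefs(local, scene_prior, alpha=0.5):
    """Blend per-pixel local class scores with a global scene prior
    and return the winning label for each pixel."""
    combined = alpha * local + (1.0 - alpha) * scene_prior
    return combined.argmax(axis=-1)

# One ambiguous pixel: locally "road" (class 0) edges out "river" (class 1),
# but the global scene prior says this is a river scene.
local = np.array([[[0.6, 0.4]]])        # shape (H, W, C) = (1, 1, 2)
scene_prior = np.array([0.2, 0.8])      # shape (C,)
labels = combine_beliefs(local, scene_prior)
```

The global belief flips the locally preferred label, which is exactly the kind of ambiguity removal the paper is after.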

【On the basis of a CNN, the authors put forward the concepts of global beliefs and local beliefs, then stir them together into a model. From this article, perhaps one can learn how to transform a CNN and design one's own model to solve one's own problems.】

Shuai et al. [45] (this person again) propose an RNN with a directed acyclic graph structure (DAG-RNN) to model long-range semantic dependencies among image units. Figure 17 shows the proposed network structure. The dependencies between image units are first modeled as an undirected cyclic graph (UCG); since an RNN cannot directly process UCG-structured graphs, the UCG is decomposed into several directed acyclic graphs (DAGs). Each DAG is processed with a DAG-RNN to obtain a hidden layer, and these hidden layers are used to produce context-aware feature maps. In this way, local representations are embedded into the abstract gist of the image, and the effect is greatly enhanced. The paper reports that DAG-RNNs perform well on data sets such as SiftFlow, CamVid, and Barcelona.
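One DAG sweep over a 4-connected grid can be sketched as below: a toy southeast sweep with random weights, where each unit's hidden state aggregates its own input and its DAG predecessors' hidden states. The full model decomposes the UCG into four such sweeps (SE, SW, NE, NW) and combines their hidden states; only one sweep is shown here, and the sizes and weights are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def dag_rnn_sweep_se(features, W, U):
    """Southeast DAG sweep over a 4-connected grid: each unit's hidden
    state aggregates its input with the hidden states of its north
    and west predecessors in the DAG."""
    H, Wd, _ = features.shape
    d = W.shape[1]
    h = np.zeros((H, Wd, d))
    for i in range(H):
        for j in range(Wd):
            ctx = np.zeros(d)
            if i > 0:
                ctx += h[i - 1, j]   # north predecessor
            if j > 0:
                ctx += h[i, j - 1]   # west predecessor
            h[i, j] = np.tanh(features[i, j] @ W + ctx @ U)
    return h

feats = rng.random((4, 5, 8))        # 4x5 grid of 8-d unit features
W = rng.normal(0, 0.1, (8, 16))
U = rng.normal(0, 0.1, (16, 16))
hidden = dag_rnn_sweep_se(feats, W, U)
```

Because the sweep order respects the DAG, information from the top-left corner can reach the bottom-right unit, which is the long-range dependency the paper wants to capture.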

【This survey is written by Chinese authors, so the articles it reviews are mostly by Chinese researchers. Is it that Chinese research has reached the world's top level, or that they are promoting the work of their compatriots? Still, since you chose to read this survey, you have to trust the authors' professional standards. This article does not use a CNN; well, finally a deep learning method applied to a CV problem that is not CNN-based. It is an RNN, but note that the RNN does not deal with pixels directly; there is a layer of processing first (the UCG). These graphs may be something from probabilistic graphical models; is this the legendary mixture of probabilistic graphical models and deep models?】

Wang et al. [46] propose an unsupervised joint feature learning and encoding framework for the RGB-D scene labeling problem. First, feature learning and encoding are performed with a two-layer network called JFLE (joint feature learning and encoding). To make JFLE more general, the input data are modeled with a nonlinear stacked deep model called JDFLE. The input to this model is densely sampled patches from the RGB-D image, and the output is the corresponding patch features, which are then pooled into superpixel features. Finally, a linear SVM maps these superpixel features to scene labels. This method works well on the NYU Depth data set.

【This article is a typical use of a deep model: data encoding, that is, exchanging the data for another representation, or extracting a more powerful representation; whatever comes after that, hand it to the SVM.】

Scene labeling can ...

6. Personal Summary
    1. The "Video Analytics for a Smart City" in the title of the article is pure clickbait.
    2. In a word: how to design a deep model to solve a problem. The core problem that deep models solve is mining and expressing data features, that is, automatically constructing the feature space and the feature vectors. As for how much space this occupies in the whole pipeline, it depends on one's understanding of the problem itself and of the various methods.
    3. The basic models in deep learning, CNN, DBM and the like, come from the founders of this field. How many pits they fell into, how many detours they took, only they know.
    4. I saw a comment online on Hinton's 2012 paper, saying that since his 2006 article Hinton had faced constant doubts, and finally made his move in the 2012 image classification competition. The description is very vivid: when the master strikes, it is truly extraordinary. Others inch forward at a snail's pace, a fraction of a point at a time; he improved the result by 10 percentage points outright. All at once it shook heaven and earth, and many disciples around the world flocked to his school.
