CS231n Lecture 8: Study Notes on Object Detection and Localization

Last Update:2016-09-06 Source: Internet

Author: User

Tags svm


These notes combine the lecture 8 video with the notes at: http://chuansong.me/n/353443351445

This lecture covers three tasks: classification (Classification), localization (Localization), and detection (Detection).

The overview figure shows the following:
1. Classification assigns a given image to one of several predefined categories. It assumes that the image contains only one object from those categories.
2. Localization finds the region where the object is located and marks it with a box (the bounding box). Besides the position (x, y), the box also carries its size (w, h). As with classification, the image contains only a single object.
3. Detection can be seen as an extension of localization: given an image or video frame, find where all the targets are and give the category of each. The difference from localization is that the number of objects in the image is not known in advance.
4. Instance segmentation (Instance segmentation) builds on detection by outlining the contour of each object. There is also semantic segmentation (Semantic segmentation).

Localization

A simple training procedure for a localization network:

1. First train a classification model. What we mainly keep is its convolutional part, which serves as the feature extractor.

2. Attach a fully connected regression network (a fully-connected "regression head") for the bounding box after the trained convolutional network.

3. Train the regression head, just as you would train a classification network.

4. At test time, the convolutional network is followed by two fully connected heads, one for bounding-box localization and one for classification.

This raises a question: should the network predict one bounding box per category, or a single box for the whole image regardless of category? Both approaches are in use.
The outputs are:
I. Classification head: C numbers (one score per category)
II. Bounding-box head:
1. Class agnostic: 4 numbers (1 box)
2. Class specific: C x 4 numbers (1 box per class)
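As a small sketch of the two box-head variants above, the helpers below only do the bookkeeping: how many numbers each head outputs, and which 4 numbers belong to the predicted class. The names `head_output_sizes` and `pick_box` are hypothetical and chosen for illustration; no network is involved.

```python
def head_output_sizes(num_classes, class_specific):
    """Return (number of classification outputs, number of box outputs).

    Illustrative helper: the classification head emits one score per class,
    and the box head emits 4 numbers (x, y, w, h) per predicted box.
    """
    num_boxes = num_classes if class_specific else 1
    return num_classes, num_boxes * 4


def pick_box(box_outputs, predicted_class, class_specific):
    """Select the 4 box numbers that go with the predicted class."""
    if not class_specific:
        return box_outputs[:4]  # class agnostic: there is only one box
    start = predicted_class * 4
    return box_outputs[start:start + 4]
```

With C = 1000, the class-agnostic head outputs 4 numbers and the class-specific head outputs 4000.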

Another common way of producing bounding boxes is the sliding window.
A side note on object proposals: object proposal (OP) methods fall into two groups. The first group is grouping methods, which first break the image into pieces and then merge them, such as selective search. The second group is window scoring methods, which generate a large number of windows, score them, and filter out the low-scoring ones, such as objectness. The sliding window used here belongs to the second group. See
http://blog.csdn.net/zxdxyz/article/details/46119369

Each sliding window is fed to the CNN, which predicts a bounding box and a score for it. Finally, the scores are used to merge the predicted boxes into the result.

The sliding window has to score every location of the image, output the box regression results, and finally localize the object in the image.

Each box is described by (x, y, w, h), so the number of box outputs is the number of boxes multiplied by 4.

The paper discussed here takes a multi-scale approach. A note on multi-scale: traditional detection/localization algorithms keep the input image fixed and use sliding windows of different sizes to handle objects of different scales. For a CNN, the sliding window size is the input image size used during training and cannot be changed. So a CNN supports multiple scales by fixing the window size and changing the size of the input image. Specifically, the image to be processed is resized to each of the chosen scales, the dense sampling described above is run at each scale, and the results from all scales are combined to obtain the final result.
Given an image with a box drawn on it, we know that the image contains some object, but the box may not fit well or may differ from what was seen in training, so images at different scales are used for detection. For a CNN, using a different scale means resizing the image. Because the input size of the final FC layers is fixed (for example 5x5), inputs of different sizes produce feature maps of different sizes after POOL5. In that case every possible 5x5 region is taken as input to obtain a feature vector.
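The idea of fixing the window and rescaling the image can be sketched with plain geometry. This is a minimal illustration, not the method of any particular paper: the function names, the stride, and the scale list are assumptions, and the actual image resizing and CNN evaluation are left out.

```python
def sliding_windows(img_w, img_h, win, stride):
    """Yield the top-left corner of every win x win window inside the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield x, y


def multiscale_windows(img_w, img_h, win, stride, scales):
    """Keep the window size fixed and rescale the image instead.

    Returns boxes (x, y, w, h) mapped back to original image coordinates.
    A window on a shrunken image covers a larger area of the original.
    """
    boxes = []
    for s in scales:
        scaled_w, scaled_h = int(round(img_w * s)), int(round(img_h * s))
        for x, y in sliding_windows(scaled_w, scaled_h, win, stride):
            boxes.append((x / s, y / s, win / s, win / s))
    return boxes
```

For example, `multiscale_windows(4, 4, 2, 2, [1.0, 0.5])` yields four 2x2 windows at full scale plus one window that covers the whole 4x4 image at half scale.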

For example, suppose the network is trained on 14x14 inputs and is applied to larger images at test time. Each 14x14 view produces one classification prediction. A 16x16 test image contains 4 different 14x14 views, so the network produces 4 classification predictions, which form a 2x2 classification map with C channels. The fully connected part is then computed as 1x1 convolutions, so the whole system can be regarded as a fully convolutional network.
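The 14x14 versus 16x16 arithmetic can be checked with a few lines. The layer stack below is an assumption made up for this illustration (a 5x5 convolution, a 2x2 pooling with stride 2, a 5x5 layer standing in for the first fully connected layer, then two 1x1 layers); it is chosen only so that a 14x14 input collapses to a single prediction.

```python
def out_size(n, kernel, stride=1):
    """Spatial size after a convolution or pooling without padding."""
    return (n - kernel) // stride + 1


# Assumed toy network as (kernel, stride) pairs; not taken from the lecture.
LAYERS = [(5, 1), (2, 2), (5, 1), (1, 1), (1, 1)]


def output_map_size(n):
    """Side length of the classification map for an n x n input."""
    for kernel, stride in LAYERS:
        n = out_size(n, kernel, stride)
    return n
```

Under these assumptions a 14x14 input gives a 1x1 output and a 16x16 input gives a 2x2 output, that is, 4 predictions computed in one pass.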
Localization is then built on top of classification (recognition).
After the trained CNN comes a regressor network consisting of two fully connected layers. During training, only these two fully connected layers need to be trained. The output of the regressor network is a bounding box. Its last layer is class specific, that is, it has to be trained separately for each class. So if there are 1000 categories, the regressor network outputs 1000 bounding boxes, one per class.
For localization, the classification network and the regressor network are run together at every scale at test time. At each scale, the classification network gives the probability distribution over categories for the image patch, and the regressor network gives a bounding box for each class, so every bounding box has a corresponding confidence. Finally, this information is combined and the boxes are selected to produce the final localization result.

Detection

Classifying every location at every scale is impractical because the amount of computation is too large, so the idea is to pick out only the regions that are likely to contain an object. This is now the common practice, and the best-known region proposal method is Selective Search.

Selective search is only one of the region proposal (RP) methods; another is Edge Boxes. Why use region proposals? Because they exploit texture, edge, and color information in the image, they can keep a high recall even with few windows (a few thousand or even a few hundred) (the term recall will be explained alongside related concepts when it comes up again). This greatly reduces the time cost of the subsequent steps, and the candidate windows are of higher quality than sliding windows (sliding windows have fixed aspect ratios).

Selective Search:
Put simply, the image is first broken into small pieces, following the idea of superpixels (superpixel), and these are then merged according to hand-defined distance measures.

Next come the three best-known detection algorithms: R-CNN, Fast R-CNN, and Faster R-CNN:

R-CNN:
(1) Input a test image.
(2) Use the selective search algorithm to extract about 2000 region proposals from the image.
(3) Warp each region proposal to 227x227 and feed it to the CNN. The output of the CNN's FC7 layer is used as the feature.
(4) Feed the CNN feature of each region proposal to an SVM for classification.
(5) For the region proposals classified by the SVM, apply bounding-box regression, a linear regression that corrects the region proposal.
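Step (5) can be made concrete with a small sketch of the regression targets. It uses the center-and-size offset parameterization as we recall it from the R-CNN paper, so treat the exact form as an assumption; the function names are hypothetical, and the linear regressor that would predict the offsets from CNN features is not shown.

```python
import math


def encode(proposal, target):
    """Offsets that turn a proposal into the ground-truth box.

    Both boxes are (center_x, center_y, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode(proposal, offsets):
    """Apply predicted offsets to a proposal to get the corrected box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + tx * pw, py + ty * ph,
            pw * math.exp(tw), ph * math.exp(th))
```

The regressor is trained to predict the `encode` values, and at test time `decode` moves and resizes the proposal accordingly.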

A few remarks on the framework above:

The figure above is the test-time flowchart. Before testing, we first have to train the CNN model used for feature extraction and the SVMs used for classification: a model pre-trained on ImageNet (AlexNet/VGG16) is fine-tuned to obtain the CNN used for feature extraction, and then the SVMs are trained on the features that this CNN extracts from the training set.
Every region proposal is scaled to the same size because the input to the CNN's fully connected layers must have fixed dimensions.
One step is missing from the figure: for the region proposals classified by the SVM, bounding-box regression is applied. Bounding-box regression is a linear regression that corrects the region proposal so that the extracted window agrees better with the target's true window. A window extracted by region proposal cannot match the manual annotation exactly. If the region proposal is offset too far from the target, then even when the classification is correct, the IoU (the intersection of the region proposal and the ground-truth window divided by their union) is below 0.5, which means the target still counts as not detected.
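The IoU criterion mentioned above is easy to write down. This is a minimal sketch assuming boxes are given as corner coordinates (x1, y1, x2, y2); the function name is chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one pixel in each direction overlap in a 1x1 square, so their IoU is 1/7, well below the 0.5 threshold.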

Fast R-CNN
It addresses the following problems of R-CNN:
1. Slow running speed
2. Training is split into several stages and the procedure is tedious
3. The SVMs and regressors are trained after the fact: the CNN features are not updated together with the SVMs and regressors
Before turning to Fast R-CNN, another network needs mentioning, SPP-net: for input images of different sizes, the feature map after the CNN is divided into the same number of blocks and pooled, giving a vector of the same length.

Since these 2000 region proposals are all parts of the same image, we can extract the convolutional features once for the whole image and then simply map the location of each region proposal onto the convolutional feature map. That way the convolutional features are extracted only once per image, and the convolutional features of each region proposal are then fed to the fully connected layers for the subsequent steps. (For a CNN, most of the computation is spent on convolutions, so this saves a lot of time.) The problem now is that region proposals differ in size, so they cannot be fed directly into the fully connected layers, whose input must have a fixed length. SPP-net solves exactly this problem:

The figure shows the SPP-net network structure. An arbitrary image is fed to the CNN, and after the convolutions we obtain the convolutional features (for VGG16, for example, the last convolutional layer is conv5_3, with 512 feature maps in total). The window in the figure is the region of the feature map that corresponds to a region proposal. We only need to map the features of these differently sized windows to the same dimension and use them as the input to the fully connected layers, which guarantees that the convolutional features are extracted only once per image. SPP-net uses spatial pyramid pooling (spatial pyramid pooling): each window is divided into 4*4, 2*2, and 1*1 blocks, and each block is reduced with max-pooling. After the SPP layer, every window therefore gives a feature vector of length (4*4+2*2+1)*512, which is then used as the input to the fully connected layers.
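A minimal sketch of the pooling step follows, written in plain Python on nested lists so that the bin arithmetic is visible. The function names and the channel-by-channel output order are assumptions for illustration; real implementations work on tensors and may order the vector differently.

```python
def spp_single_channel(fmap, levels=(4, 2, 1)):
    """Max-pool one feature map (2D list) into 4x4 + 2x2 + 1x1 = 21 values."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n            # floor of the bin's start row
            r1 = -(-(i + 1) * h // n)    # ceiling of the bin's end row
            for j in range(n):
                c0 = (j * w) // n
                c1 = -(-(j + 1) * w // n)
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def spp(feature_maps, levels=(4, 2, 1)):
    """Concatenate the pooled values of every channel into one vector."""
    vector = []
    for fmap in feature_maps:
        vector.extend(spp_single_channel(fmap, levels))
    return vector
```

Whatever the window size, each channel contributes 21 values, so 512 channels give a vector of length 21*512 = 10752.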

Framework of Fast R-CNN:

(1) The ROI pooling layer is effectively a simplified version of SPP-net. SPP-net uses pyramid maps of several sizes for each proposal, while the ROI pooling layer only pools to a single 7x7 feature map. For VGG16, conv5_3 has 512 feature maps, so every region proposal yields a 7*7*512-dimensional feature vector as input to the fully connected layers.
(2) R-CNN training is split into three stages, whereas Fast R-CNN uses softmax directly in place of the SVMs and adds bounding-box regression to the network through a multi-task loss, so the whole training process is end-to-end (apart from the region proposal extraction stage).
(3) During fine-tuning, Fast R-CNN also fine-tunes part of the convolutional layers, which gives better detection results.
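Point (1) can be sketched for a single channel as follows. This is an illustration under stated assumptions rather than the Fast R-CNN implementation: the ROI is given in integer image pixels, the feature map is assumed to be 16 times smaller than the image, and the rounding used to project the ROI onto the feature map is one simple choice among several.

```python
def roi_pool(fmap, roi, out_size=7, stride=16):
    """Max-pool the part of fmap under roi into an out_size x out_size grid.

    fmap: 2D list (one channel); roi: (x1, y1, x2, y2) in integer image pixels.
    """
    h, w = len(fmap), len(fmap[0])
    # Project the ROI onto the feature map, clamp it, keep at least one cell.
    fx1 = min(max(roi[0] // stride, 0), w - 1)
    fy1 = min(max(roi[1] // stride, 0), h - 1)
    fx2 = min(max(-(-roi[2] // stride), fx1 + 1), w)
    fy2 = min(max(-(-roi[3] // stride), fy1 + 1), h)
    rh, rw = fy2 - fy1, fx2 - fx1
    pooled = []
    for i in range(out_size):
        r0 = fy1 + (i * rh) // out_size
        r1 = fy1 + -(-(i + 1) * rh // out_size)
        row = []
        for j in range(out_size):
            c0 = fx1 + (j * rw) // out_size
            c1 = fx1 + -(-(j + 1) * rw // out_size)
            row.append(max(fmap[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Applying this to each of the 512 channels and flattening the results gives the 7*7*512 vector described in (1).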

Faster R-CNN
In the region proposal + CNN classification framework, the quality of the region proposals directly affects detection accuracy. If we could find a method that extracts only a few hundred or fewer high-quality candidate windows while keeping recall high, it would not only speed up detection but also improve detection performance (fewer false positives). This is how the RPN (Region Proposal Network) came about.

The core idea of RPN is to use a convolutional neural network to produce region proposals directly, and the method used is essentially a sliding window. The design is ingenious: the RPN only needs to slide once over the last convolutional layer, because the anchor mechanism and bounding-box regression yield region proposals at multiple scales and aspect ratios.

Let's look directly at the RPN network structure diagram (using the ZF model, from Zeiler and Fergus). Given an input image (say at a resolution of 600*1000), the convolutions produce the feature map of the last convolutional layer (about 40*60 in size). A 3*3 convolution kernel (the sliding window) is convolved over this feature map. The last convolutional layer has 256 feature maps, so each 3*3 region yields a 256-dimensional feature vector, which is followed by the cls layer (box-classification layer) and the reg layer (box-regression layer) for classification and bounding-box regression respectively (similar to Fast R-CNN, except that there are only two classes here: target and background). For each position of the 3*3 sliding window, region proposals at 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1) are predicted in the input image at the same time. This mapping mechanism is called the anchor. So for this 40*60 feature map there are about 20,000 anchors in total (40*60*9 = 21,600), that is, about 20,000 region proposals are predicted.
What are the benefits of this design? Although a sliding window strategy is still used, the sliding window runs over the convolutional feature map, which is 16*16 times smaller than the original image (after 4 2*2 pooling operations along the way). Multiple scales are handled with 9 kinds of anchors, corresponding to three scales and three aspect ratios. Combined with the bounding-box regression that follows, even a window that falls outside these 9 anchor shapes can still yield a region proposal close to the target.
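The anchor layout can be sketched in a few lines. The scales, aspect ratios, feature map size, and the factor of 16 come from the text above; the function names, the convention that each anchor keeps an area of scale squared, and the placement of anchor centers in the middle of each feature map cell are assumptions for illustration.

```python
def base_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (width, height) for every scale/ratio pair, ratio = width/height.

    Assumption: each anchor keeps an area of scale * scale.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s * r ** 0.5, s / r ** 0.5))
    return shapes


def all_anchors(feat_h, feat_w, stride=16):
    """Place the 9 base anchors at every feature map position.

    Returns boxes (x1, y1, x2, y2) in input image coordinates.
    """
    boxes = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Assumed convention: the anchor center is the center of the cell.
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for w, h in base_anchors():
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```

For a 40*60 feature map, `len(all_anchors(40, 60))` is 21600, the 40*60*9 figure from the text.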

Faster R-CNN Architecture

Deep learning object detection algorithms based on regression

Faster R-CNN is currently the mainstream object detection method, but its speed does not meet real-time requirements. YOLO-style methods are gradually showing their importance. These methods use the idea of regression: given an input image, they directly regress the bounding box and category of the target at multiple locations in the image.
YOLO (CVPR2016, oral)

Let's look at the YOLO detection flowchart above:
(1) Given an input image, first divide it into a 7*7 grid.
(2) For each grid cell, predict 2 bounding boxes (including, for each box, the confidence that it contains a target, together with the probabilities of the cell's region over the categories).
(3) The previous step gives 7*7*2 predicted target windows. Windows with low probability are removed by a threshold, and finally NMS removes redundant windows.
As you can see, the whole process is very simple: there is no intermediate region proposal step to look for targets, and position and category are determined directly by regression.
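The thresholding and NMS of step (3) can be sketched as a greedy procedure. This is a minimal illustration of non-maximum suppression in general, not YOLO's exact post-processing; the threshold values and function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_thresh=0.2, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    detections: list of (score, (x1, y1, x2, y2)).
    Drops low-scoring windows, then keeps a window only if it does not
    overlap an already kept, higher-scoring window by more than iou_thresh.
    """
    candidates = sorted((d for d in detections if d[0] >= score_thresh),
                        key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in candidates:
        if all(iou(box, kept_box) <= iou_thresh for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

In practice this is usually run separately for each category, so that overlapping windows of different classes do not suppress each other.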

So how can the location and category of targets be predicted directly on the grid cells at different locations? Above is the YOLO network structure diagram. The front of the network is similar to the GoogLeNet model. What matters is the structure of the last two layers: the convolutional layers are followed by a 4096-dimensional fully connected layer, which is then fully connected to a 7*7*30 tensor. The 7*7 is the number of grid cells. On each grid cell we predict two possible target locations, the confidence of each, and the category. That is, each cell predicts two targets, each target has 4 coordinate values (center point coordinates plus width and height) and 1 confidence value, and there are 20 categories (the 20 VOC categories), giving a (4+1)*2+20 = 30-dimensional vector in total. In this way the 4096-dimensional full-image feature is used to regress the required information (box information plus category) directly on each grid cell.
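To make the 30-dimensional layout concrete, here is a sketch that splits one cell's vector into its parts. The ordering (the two boxes first, then the 20 class probabilities) is an assumption made for illustration, and multiplying box confidence by class probability to score each class reflects our understanding of the YOLO paper rather than anything stated above.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (VOC)


def cell_vector_length():
    """(4 coordinates + 1 confidence) per box, plus C class probabilities."""
    return B * 5 + C


def split_cell(vec):
    """Split one cell's vector into B boxes (x, y, w, h, conf) and class probs."""
    boxes = [tuple(vec[b * 5:(b + 1) * 5]) for b in range(B)]
    class_probs = list(vec[B * 5:])
    return boxes, class_probs


def class_scores(vec):
    """Per-box, per-class scores: box confidence times class probability."""
    boxes, class_probs = split_cell(vec)
    return [[box[4] * p for p in class_probs] for box in boxes]
```

The whole output is S*S cells of 30 numbers each, that is, 7*7*30 = 1470 values per image.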

Summary: YOLO turns the object detection task into a regression problem, which greatly speeds up detection, so that YOLO can process 45 images per second. And because every target window is predicted using full-image information (full context), the false positive rate is significantly reduced. But YOLO also has a problem: without a region proposal mechanism, regressing on only a 7*7 grid makes localization imprecise, which also means that YOLO's detection accuracy is not very high.

SSD: Single Shot MultiBox Detector
The problem with YOLO was analyzed above: using full-image features to regress targets on a coarse 7*7 grid is not very precise. Can the idea of region proposals be combined with it to achieve precise localization? SSD does this by combining YOLO's regression idea with Faster R-CNN's anchor mechanism.

The figure above is the framework of SSD. First, SSD obtains target locations and categories the same way YOLO does, by regression. But whereas YOLO predicts a location using full-image features, SSD predicts a location using the features around that location (which feels more reasonable). So how is the correspondence between a location and its features established? You may already have guessed: with Faster R-CNN's anchor mechanism. As shown in the SSD framework diagram, if the feature map of some layer (Figure b) is 8*8 in size, a 3*3 sliding window is used to extract the features at each location, and these features are then regressed to the target's coordinates and category (Figure c).
Unlike Faster R-CNN, the anchors are placed on several feature maps, which makes use of features from multiple layers and naturally covers multiple scales (a 3*3 sliding window has a different receptive field on the feature maps of different layers).
Summary: SSD combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN and regresses on multi-scale regional features at every location of the full image. It keeps YOLO's speed while making its window predictions about as accurate as those of Faster R-CNN. SSD reaches 72.1% mAP on VOC2007 and runs at 58 frames per second on a GPU.
Summary: YOLO brought a new idea to object detection, and SSD's performance shows that object detection is becoming truly feasible in real applications.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
