Andrew Ng Deep Learning Notes, Course 4 Week 3: Object Detection


1. Object localization

Image detection problems fall into three categories:

1. Image classification: is there a car? (the result concerns a single object)

2. Classification with localization: identify the car and its location (the result concerns a single object)

3. Object detection: detect and localize different objects (the result may contain multiple objects)

The output layer for classification with localization can be expressed as follows:

1. pc: whether a target is present

2. bx: x coordinate of the target center

3. by: y coordinate of the target center

4. bh: height of the target

5. bw: width of the target

6. c1: whether the target belongs to class 1

7. c2: whether the target belongs to class 2

8. c3: whether the target belongs to class 3

During training, bx, by, bh, bw are given by hand-labeled ground-truth values.

Loss function:

    • When pc = 1, i.e. y1 = 1, all components contribute:

L(ŷ, y) = (ŷ1 − y1)² + (ŷ2 − y2)² + ⋯ + (ŷ8 − y8)²

    • When pc = 0, i.e. y1 = 0, only the pc component matters:

L(ŷ, y) = (ŷ1 − y1)²

Of course, in real localization applications a better choice is:

  • output c1, c2, c3 through a softmax;
  • apply squared error (or a similar loss) to the four bounding-box values;
  • apply the logistic regression loss to pc (squared error also works).

In practice, though, plain squared error on all components already achieves good results.
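As a concrete illustration, the two-case loss above can be sketched in a few lines of NumPy. The function name and the flat 8-component label layout [pc, bx, by, bh, bw, c1, c2, c3] follow the list above; plain squared error is used throughout, as the notes suggest:

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared-error loss for the label [pc, bx, by, bh, bw, c1, c2, c3].

    When pc = 1 every component contributes; when pc = 0 only the pc
    component matters (the remaining label entries are "don't care").
    """
    if y[0] == 1:
        return float(np.sum((y_hat - y) ** 2))
    return float((y_hat[0] - y[0]) ** 2)
```

A perfect prediction gives a loss of 0; when the ground truth says no object is present, only the predicted pc is penalized, no matter what the other seven components are.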

2. Landmark detection

Besides the rectangle detection described above, a network can also detect the key feature points (landmarks) of the target. Its output is:

1. pc

2. l1_x, l1_y (the coordinates of the first key feature point)

3. l2_x, l2_y

…

By labeling the coordinates of these feature points in the training data, we can locate and mark the different features of a face. AR applications are built on this kind of facial landmark recognition, for example face warping or adding head accessories.

Similarly, in human pose detection we can capture a person's posture by labeling key points at different parts of the body.
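A sketch of how such a label vector might be assembled. The helper name and the flat [pc, l1_x, l1_y, l2_x, l2_y, …] layout follow the list above; this is an illustration, not code from the course:

```python
import numpy as np

def landmark_label(points, present=True):
    """Build the label [pc, l1_x, l1_y, l2_x, l2_y, ...] from a list of
    (x, y) landmark coordinates; pc marks whether the face/pose is present."""
    pc = 1.0 if present else 0.0
    coords = np.asarray(points, dtype=float).ravel()  # flatten (x, y) pairs
    return np.concatenate(([pc], coords))
```

With, say, 64 facial landmarks the label has 1 + 2 × 64 = 129 components, and every training image must label the points in the same order.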

3. Object detection

A simple algorithm for object detection is sliding window detection.

First, collect some target and non-target images as training samples and train a CNN classifier on them.

Note: the training images are small and cropped to contain little more than the target itself.

Then choose a suitable window on the test image, slide it from left to right and from top to bottom, and run the trained CNN on each window position to judge whether it contains a target.

If it does, that window is a target region; if not, it is a non-target region.

Advantage: the principle is simple, and no manual selection of candidate regions is needed (any window the classifier accepts is a target region).

Disadvantage: the window size and stride strongly affect detection accuracy. Since every window is run through the CNN, a small stride means an enormous number of forward passes and poor performance, while a large stride or window can miss targets.

In short, the sliding window algorithm as described is neither fast enough nor flexible enough.
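The scan described above can be sketched as a pair of nested loops. Here `classify` stands in for the trained CNN applied to one crop; the function name and the square-window simplification are assumptions made for illustration:

```python
def sliding_window_detect(image_w, image_h, win, stride, classify):
    """Enumerate window positions left-to-right, top-to-bottom and run
    `classify(x, y, win)` (a stand-in for the trained CNN) on each crop.
    Returns the windows judged to contain the target, as (x, y, w, h)."""
    hits = []
    for y in range(0, image_h - win + 1, stride):
        for x in range(0, image_w - win + 1, stride):
            if classify(x, y, win):
                hits.append((x, y, win, win))
    return hits
```

The cost problem is visible in the loop structure: the number of CNN evaluations grows quadratically as the stride shrinks, which is exactly what the convolutional implementation in the next section avoids.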

4. Convolutional implementation of sliding windows

Converting fully connected layers to convolutional layers:

In last week's course, Ng showed that a 1×1 convolution is equivalent to applying a fully connected network to each position of a three-dimensional volume. Similarly, a fully connected layer can be replaced by a convolutional layer: the first one by a kernel covering the full spatial extent of its input, and subsequent ones by 1×1 kernels. Note that the number of kernels must equal the number of hidden units, so that the output has the same number of channels.

The resulting output layer has dimension 1 × 1 × 4 and represents the four class scores.

Now feed an entire 16×16×3 image into the trained model; the blue part of the figure marks the sliding-window size. Sliding that window with a stride of 2 gives four crops, and convolving each separately would give four 10×10×16 feature maps. But the crops overlap heavily, so those computations are largely repeated; convolving the whole image once instead produces a single 12×12×16 feature map in which the four windows share the activations of the overlapping regions. The same holds for the subsequent pooling and fully connected (now convolutional) layers.

Thus, sliding a window over the whole image and convolving each crop is equivalent to convolving the whole image directly. The convolutional layers implement the sliding window for us: instead of splitting the input into four subsets and running forward propagation on each, we feed the full image into the network once, and the overlapping (common) regions share a large amount of computation.

It is worth noting that the window stride is tied to the chosen max-pool size: for a stride of 4, simply use a 4 × 4 max pool.

With this method we feed the whole image into the trained convolutional network. There is no need to split the image into windows: a single forward pass simultaneously yields the predictions for every window position.
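A minimal single-channel sketch of why this works: if the whole trained network is collapsed into one 14×14 kernel, it gives a 1×1 output on a training-size crop, but applied to a larger 16×16 image with stride 2 it yields a 2×2 grid of outputs, one per window position, in a single pass. This is pure NumPy and shows shapes only; the real network has many layers and channels:

```python
import numpy as np

def valid_corr(img, kernel, stride=1):
    """2-D valid cross-correlation with a stride: a minimal stand-in for
    one convolutional layer (single channel, no padding)."""
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

kernel = np.ones((14, 14))  # plays the role of the whole trained net
print(valid_corr(np.zeros((14, 14)), kernel).shape)           # (1, 1): one crop, one prediction
print(valid_corr(np.zeros((16, 16)), kernel, stride=2).shape)  # (2, 2): all stride-2 windows at once
```

Each cell of the 2×2 output corresponds to one of the four window positions, and the overlapping pixels are read only once rather than once per window.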

5. Bounding box prediction

A remaining problem with the sliding window algorithm is that it may not output accurate bounding boxes.

To solve this, we can use the YOLO (You Only Look Once) algorithm.

YOLO divides the image into an n × n grid (3 × 3 here for simplicity), then applies the classification-and-localization algorithm to each grid cell; each cell's output has the same form as the localization output above.

A cell outputs pc = 0 when no target's center falls inside it, and pc = 1 when one does: each object is assigned to the single cell that contains its midpoint, even when the object itself spans several cells (its full extent is captured by the bh, bw values, which may exceed the cell size).

YOLO is implemented convolutionally: rather than running the network n² times, once per grid cell, a single convolutional pass computes all cells at once, so the algorithm is efficient and fast enough for real-time detection.

Bounding box details:

When using YOLO, the convention chosen for the bounding-box parameters in the training label y, i.e. how the box is specified relative to its grid cell, has a large effect on prediction accuracy. One reasonable convention (the papers describe others):

    • for each cell, the upper-left corner is (0, 0) and the lower-right corner is (1, 1);
    • the midpoint coordinates bx, by are given in these cell coordinates, so each lies between 0 and 1;
    • the width and height bh, bw are given as multiples of the cell size, so they can be greater than 1 when the object extends beyond its cell.
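Under that convention, converting a cell-relative prediction back to whole-image coordinates is a small calculation. This is a sketch; the function and argument names are made up for illustration:

```python
def cell_box_to_image(row, col, bx, by, bw, bh, grid=3, img=1.0):
    """Convert a YOLO cell-relative box to image coordinates.

    (bx, by) is the box center inside cell (row, col), each in [0, 1];
    bw, bh are multiples of the cell size and may exceed 1.
    `img` is the image side length (a square image is assumed here)."""
    cell = img / grid                 # side length of one grid cell
    cx = (col + bx) * cell            # absolute center x
    cy = (row + by) * cell            # absolute center y
    return cx, cy, bw * cell, bh * cell
```

For example, with a 3×3 grid a predicted width bw = 3 corresponds to a box as wide as the entire image, which is how a cell can describe an object much larger than itself.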
6. Intersection over union

Intersection over union (IoU) measures how accurate a detected bounding box is:

IoU = (area of intersection) / (area of union)

In object detection, a detection is generally counted as correct when IoU ≥ 0.5. Raising this threshold makes the evaluation stricter; the larger the IoU, the better the localization.
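The formula is straightforward to implement for axis-aligned boxes. This standard sketch (not course code) uses the (x1, y1, x2, y2) corner format:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```

Identical boxes give IoU = 1, disjoint boxes give 0, and the max(0, ·) clamps handle the non-overlapping case without special branching.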

7. Non-max suppression

In YOLO, several neighboring cells may each claim the same target. Non-max suppression (NMS) chooses among them as follows:

1. Discard all boxes whose pc is below a threshold.

2. Of the rest, select the box with the largest pc, then use IoU to discard the remaining boxes that overlap it heavily; repeat this process until no boxes are left.
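Those two steps can be sketched in plain Python. The thresholds 0.6 and 0.5 are typical example values rather than fixed by the course, and boxes use the (x1, y1, x2, y2) format:

```python
def nms(boxes, scores, pc_thresh=0.6, iou_thresh=0.5):
    """Non-max suppression: drop boxes scoring below pc_thresh, then
    repeatedly keep the highest-scoring box and discard the rest that
    overlap it with IoU >= iou_thresh."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # Step 1: threshold on pc, then sort remaining boxes by score.
    cand = sorted(((s, b) for s, b in zip(scores, boxes) if s >= pc_thresh),
                  key=lambda t: t[0], reverse=True)
    keep = []
    # Step 2: greedily keep the best box, suppress heavy overlaps, repeat.
    while cand:
        _, best = cand.pop(0)
        keep.append(best)
        cand = [(s, b) for s, b in cand if iou(best, b) < iou_thresh]
    return keep
```

Two nearly coincident boxes collapse into the higher-scoring one, while a distant box (IoU ≈ 0 with the winner) survives as a separate detection.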

8. Anchor boxes

So far each grid cell contains at most one object. What should we do if more than one object's midpoint falls into the same grid cell?

This is where anchor boxes come in: the output vector is extended with one full set of components for each anchor box, and each object is assigned to the anchor box whose shape its bounding box most resembles.

In the YOLO algorithm, selection works as in the previous section; non-max suppression is simply carried out separately for each anchor box.

Edge cases:

    • with two anchor boxes but three objects in the same cell, some additional handling is required;

    • two objects in the same cell whose boxes match the same anchor shape also need a tie-breaking rule.

Both situations are rare, however, and do not greatly affect the detection algorithm.

Choosing anchor boxes:

    • usually the shapes are specified by hand: pick 5–10 shapes varied enough to cover the shapes of the objects we want to detect;

    • a more advanced method is k-means: cluster the shapes of the objects in the training set and use the cluster results to select a representative group of anchor boxes.
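The "best-matching shape" idea can be made concrete by comparing boxes by shape alone: slide both to a common corner and compute their IoU, then assign each labeled object to the highest-scoring anchor. The anchor shapes below are made up for illustration:

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes compared by shape only, i.e. aligned at a corner."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def best_anchor(obj_wh, anchors):
    """Index of the anchor whose shape best matches the object's (w, h)."""
    return max(range(len(anchors)), key=lambda i: shape_iou(obj_wh, anchors[i]))

anchors = [(1.0, 2.0), (2.0, 1.0)]  # one tall anchor, one wide anchor
```

A tall object then lands in the tall anchor's slot of the output vector and a wide object in the wide anchor's slot, which is how two differently shaped objects can share one grid cell.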

9. Putting it together: the YOLO algorithm

This section combines the previous pieces into the full YOLO pipeline: construct the grid-cell training labels (with anchor boxes), obtain all cell predictions in a single convolutional forward pass, discard boxes with low pc, and run non-max suppression to produce the final detections.

10. Region proposals (optional)

R-CNN:

R-CNN (Regions with Convolutional Neural Networks) first selects candidate regions likely to contain objects, avoiding the wasted computation a traditional sliding window spends on large object-free areas.

With R-CNN we no longer run the detector at every sliding-window position; instead we run the convolutional network only on a small number of candidate-region windows.

Implementation: an image segmentation algorithm divides the image into many colored blobs; windows are then placed around these blobs and their contents fed to the network, greatly reducing the number of windows to process.

There are three region proposal methods:

    • R-CNN: classifies one proposed region at a time, so it runs slowly.

    • Fast R-CNN: uses the convolutional implementation of sliding windows (similar to Section 4) to classify all proposals at once.

    • Faster R-CNN: uses a convolutional network to generate the region proposals themselves, further improving running speed.

Even so, Faster R-CNN still runs more slowly than YOLO.
