Convolutional Neural Networks (III): Object Detection


This chapter covers object localization and object detection (including multi-object detection).

1. Object Localization

For classification, after the original image passes through the convolutional layers, the softmax layer outputs a 4x1 vector:

Note that the class label may also be a probability. The four components above correspond to pedestrian, car, motorcycle, and background.
The model for object localization and detection is as follows:

For localization, after the original image passes through the convolutional layers, the output layer produces an 8x1 vector. In addition to the 3x1 class-label vector of an ordinary CNN classifier, it contains (bx, by), the coordinates of the center of the target, and bh and bw, the height and width of the rectangular region containing the target. It also contains pc, the probability that the region contains a target, ranging from 0 to 1: the larger pc is, the more likely a target is present. By convention, the upper-left corner of the image is the origin (0, 0) and the lower-right corner is (1, 1). During training, the values of bx, by, bh, and bw come from manual annotation; for example, bx = 0.5, by = 0.7, bh = 0.3, bw = 0.4.

The output label can be expressed as:

If pc = 0, no target is detected, and the remaining seven components of the label can be ignored.
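
As a sketch of this label layout (the class ordering and helper name are illustrative, not from the original):

```python
import numpy as np

CLASSES = ["pedestrian", "car", "motorcycle"]  # a background image has pc = 0

def make_label(pc, box=None, class_name=None):
    """Build the 8-element label y = [pc, bx, by, bh, bw, c1, c2, c3]."""
    y = np.zeros(8)
    y[0] = pc
    if pc == 1:
        y[1:5] = box                            # (bx, by, bh, bw), all in [0, 1]
        y[5 + CLASSES.index(class_name)] = 1.0  # one-hot class label
    return y

# A car centered at (0.5, 0.7) with height 0.3 and width 0.4:
y_car = make_label(1, box=(0.5, 0.7, 0.3, 0.4), class_name="car")
# No target: only pc = 0 matters; the other seven values are "don't care".
y_bg = make_label(0)
```
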
For the loss function, if squared error is used, there are two cases:

Of course, besides squared error, you can also use a logistic-regression loss for pc and a softmax loss over the class labels c1, c2, c3. In practice, squared error already achieves good results.
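
Under the squared-error choice, the two cases can be sketched as follows (the label layout follows the previous section):

```python
import numpy as np

def detection_loss(y_true, y_pred):
    """Squared-error loss for y = [pc, bx, by, bh, bw, c1, c2, c3].

    Case 1: pc = 1 -> all eight components contribute.
    Case 2: pc = 0 -> only the pc term matters; the rest are "don't care".
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true[0] == 1:
        return float(np.sum((y_true - y_pred) ** 2))
    return float((y_true[0] - y_pred[0]) ** 2)
```
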

2. Landmark Detection

In addition to detecting a target's category and position with a rectangular region, we can locate just the coordinates of the target's key feature points. These key points are called landmarks.
For example, in face recognition we can locate and detect the coordinates of facial feature points, as shown below:

Suppose the network detects 64 feature points on a face, plus a flag for whether the image contains a face at all; the output label then has 64×2 + 1 = 129 values. Detecting facial feature points enables emotion classification and is also used in AR (augmented reality) applications.
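
The landmark label can be sketched the same way (the layout is illustrative):

```python
import numpy as np

def landmark_label(is_face, points=None):
    """Label for 64 facial landmarks: [is_face, l1x, l1y, ..., l64x, l64y]."""
    y = np.zeros(129)                 # 64 * 2 + 1 = 129 values
    y[0] = float(is_face)
    if is_face:
        y[1:] = np.asarray(points, dtype=float).ravel()  # 64 (x, y) pairs
    return y
```
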

Beyond facial feature points, the same idea can detect human body poses, as shown below:

3. Object Detection

A simple method for object detection is the sliding-window algorithm. First, collect target and non-target images as the training set. Note that the training images are small and should contain little more than the target itself, as shown below:

Then use this training set to train a CNN model with a high recognition rate.

Finally, on the test image, choose an appropriate window size and stride, and slide the window from left to right and top to bottom. Each window region is fed to the trained CNN for classification: if a target is present, the window is marked as a target region; otherwise it is not.


The sliding-window algorithm is simple in principle and requires no manual selection of target regions (any window classified as containing the target becomes a target region). But its disadvantages are obvious. First, the window size and stride must be set by hand: a window that is too small or too large, or a stride that is too large, reduces detection accuracy. Second, a CNN forward pass is needed for every window position, so with a small window and a small stride the whole detection algorithm runs very slowly. The sliding-window algorithm is therefore simple, but it is neither fast enough nor flexible enough.
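
The brute-force procedure above can be sketched as follows (the window size, stride, and stand-in classifier are illustrative):

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield (row, col, crop) for each window position, left to right, top to bottom."""
    H, W = image.shape[:2]
    wh, ww = window
    for r in range(0, H - wh + 1, stride):
        for c in range(0, W - ww + 1, stride):
            yield r, c, image[r:r + wh, c:c + ww]

def detect(image, classify, window=(64, 64), stride=16):
    """Run the classifier on every window; return positions flagged as targets.

    `classify` stands in for the trained CNN: it maps a crop to True/False.
    """
    return [(r, c) for r, c, crop in sliding_windows(image, window, stride)
            if classify(crop)]
```

Every window triggers one full classifier evaluation, which is exactly the cost the next section removes.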

4. Convolutional Implementation of Sliding Windows

The sliding-window algorithm can be implemented with convolutions, which increases running speed by eliminating repeated computation.

First, note that the CNN applied to a single sliding window contains fully connected layers. The first step of the convolutional implementation is to turn those fully connected layers into convolutional layers, as shown below:

Converting a fully connected layer into a convolutional layer is easy: use a filter whose spatial size equals that of the previous layer's output. The final output layer is then 1x1x4, representing the four class scores.

Once the fully convolutional network for a single window is built, the same parameters and structure can be applied to the whole test image. For example, for a 16x16x3 image with stride 2, the network's output layer is 2x2x4: the 2x2 positions are the results for four windows. For a larger 28x28x3 image, the output layer is 8x8x4, giving 64 window results.

The original sliding-window algorithm runs the CNN forward pass repeatedly: four times for a 16x16x3 image, 64 times for a 28x28x3 image. The convolutional implementation needs only one forward pass regardless of image size, because the computation shared by overlapping windows is reused, greatly reducing cost. It is worth noting that the window stride is tied to the max-pool size: for a stride of 4, you only need to set the max pool to 4x4.
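
A minimal NumPy sketch of the key idea, with shapes chosen for illustration (a real implementation would use a deep-learning framework's convolution layer): the fully connected weights, reshaped to the window size, act as one big convolution filter, and every spatial position of the output equals the fully connected layer applied to one window.

```python
import numpy as np

def fc_as_conv(feature_map, W):
    """Apply FC weights W as a convolution over the whole feature map.

    W has shape (out_units, kh, kw, in_ch): the FC weights reshaped to the
    window size the layer was trained on. Output position (i, j) equals the
    FC layer applied to the window starting at (i, j).
    """
    out_units, kh, kw, in_ch = W.shape
    H, Wd, _ = feature_map.shape
    out = np.zeros((H - kh + 1, Wd - kw + 1, out_units))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feature_map[i:i + kh, j:j + kw, :]
            out[i, j] = np.tensordot(W, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
fmap = rng.standard_normal((6, 6, 16))   # e.g. a 6x6x16 feature map
W = rng.standard_normal((4, 5, 5, 16))   # FC layer trained on 5x5x16 windows
out = fc_as_conv(fmap, W)                # shape (2, 2, 4): 4 window results
```
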

5. Bounding Box Predictions

The sliding-window algorithm sometimes fails to cover the target exactly, as with the blue window shown below.

The YOLO (You Only Look Once) algorithm solves this problem and produces more accurate target regions (such as the red window).
YOLO first divides the original image into an n x n grid of cells, each representing a region. To simplify the description, the image here is divided into a 3x3 grid.

Then, using the convolutional sliding-window idea from the previous section, a CNN is built whose output layer for the original image has dimension 3x3x8: 3x3 corresponds to the nine grid cells, and the output for each cell contains 8 elements:


The grid can also be made finer. The smaller each cell, the lower the probability that the centers of multiple targets fall into the same cell, which is exactly what we want.
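
A sketch of how a ground-truth box might be assigned to its grid cell (the cell-relative parameterization below is the common YOLO convention and is assumed here, not stated in the original):

```python
def grid_label(cx, cy, h, w, n=3):
    """Assign a target with center (cx, cy) and size (h, w), all measured
    relative to the whole image, to one cell of an n x n grid.

    Returns the responsible cell (row, col) and (bx, by, bh, bw), where
    bx, by are the center coordinates relative to that cell (in [0, 1))
    and bh, bw are the height and width in cell units (these may exceed 1
    when the box is larger than a cell).
    """
    col = min(int(cx * n), n - 1)
    row = min(int(cy * n), n - 1)
    return (row, col), (cx * n - col, cy * n - row, h * n, w * n)
```

Only the cell containing the target's center is responsible for that target, which is why finer grids separate nearby targets better.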

6. Intersection over Union

IoU, the ratio of intersection to union, is used to evaluate the accuracy of a detected target region.

As shown below, the red box is the ground-truth target region and the blue box is the detected region. Their intersection is the green part and their union is the purple part. The IoU ratio defines how close the blue box is to the red box:

IoU measures the closeness of any two regions. Its value ranges from 0 to 1: the closer to 1, the closer the two regions.
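
A straightforward IoU implementation for axis-aligned boxes, here given as (x1, y1, x2, y2) corner coordinates (a format assumed for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    # overlap rectangle; width/height clamp to 0 when the boxes are disjoint
    ix1, iy1 = max(xa1, xb1), max(ya1, yb1)
    ix2, iy2 = min(xa2, xb2), min(ya2, yb2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (xa2 - xa1) * (ya2 - ya1)
    area_b = (xb2 - xb1) * (yb2 - yb1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```
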

7. Non-Max Suppression

In the YOLO algorithm, multiple cells may detect the same target; for example, several adjacent cells may each claim the same target's center.

Suppose the three green cells detect one target and the three red cells detect another. How do we decide which cell is the most accurate for each? The answer is the non-max suppression algorithm.

Non-max suppression is simple. Each cell outputs a pc value reflecting how confident it is that it contains the target's center. First, select the cell and region with the largest pc; then compute the IoU between that region and every other region, and remove all cells and regions whose IoU with it exceeds a threshold (for example, 0.5). This leaves exactly one cell for that target: the one with the largest, most trusted pc. Next, select the cell with the largest pc among those remaining and repeat. In the end, each target maps to exactly one cell and region, as shown below:

In summary, the non-max suppression algorithm proceeds as follows:

    1. Remove all cells whose pc value is below a threshold (for example, 0.6);
    2. Select the cell with the largest pc value and use IoU to discard the cells that overlap it heavily;
    3. Repeat step 2 for the remaining cells.
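
The three steps above can be sketched as follows (the box format and default thresholds are illustrative):

```python
def nms(boxes, scores, pc_threshold=0.6, iou_threshold=0.5):
    """Non-max suppression over boxes given as (x1, y1, x2, y2).

    Step 1: drop low-pc boxes. Step 2: keep the highest-scoring remaining
    box and discard boxes overlapping it heavily. Step 3: repeat.
    Returns the indices of the kept boxes.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    remaining = [i for i, s in enumerate(scores) if s >= pc_threshold]
    remaining.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while remaining:
        best = remaining.pop(0)
        keep.append(best)
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```
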
8. Anchor Boxes

So far, each grid cell can detect at most one target. How can the YOLO algorithm detect overlapping targets, such as a person standing in front of a car? The answer is to use anchor boxes of different shapes.

As shown below, the same cell contains two targets: a person and a car. To detect both at once, we set up two anchor boxes: anchor box 1 detects the person and anchor box 2 detects the car. In effect, each cell gets an extra copy of the output, so the output dimension grows from 3x3x8 to 3x3x2x8 (which can also be written 3x3x16). Here, 2 is the number of anchor boxes, used to detect multiple targets in one cell at the same time. Each anchor box has its own pc value; if both exceed the threshold, two targets are detected.

When running the YOLO algorithm, apply the non-max suppression from the previous section to each anchor box separately; the anchor boxes can be processed in parallel.

Incidentally, the anchor box shapes can be chosen by hand, or by a machine learning algorithm such as k-means clustering: cluster the shapes of all the targets in the training set and pick the dominant shapes as the anchor boxes.
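
A sketch of the clustering idea (plain k-means on Euclidean distance, initialized from the first k boxes; YOLOv2 actually clusters with an IoU-based distance, so this is only illustrative):

```python
import numpy as np

def anchor_shapes(box_whs, k=2, iters=20):
    """Cluster the (width, height) pairs of training boxes into k anchor shapes."""
    whs = np.asarray(box_whs, dtype=float)
    centers = whs[:k].copy()          # naive init: first k box shapes
    for _ in range(iters):
        # assign every box shape to its nearest center ...
        d = np.linalg.norm(whs[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # ... and move each center to the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = whs[labels == j].mean(axis=0)
    return centers
```
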

9. YOLO Algorithm

This section reviews the full YOLO algorithm pipeline built from the previous sections. The network structure, shown below, uses two anchor boxes.

  1. For each grid cell, get 2 predicted bounding boxes.
  2. Get rid of low-probability predictions.
  3. For each class (pedestrian, car, motorcycle), use non-max suppression to generate the final predictions.


10. Region Proposals

The sliding-window algorithm introduced earlier scans every region of the original image, even blank areas or areas that obviously contain no target, as shown below. This reduces efficiency and wastes running time.

To solve this problem, region proposals avoid scanning useless regions. The idea is to first run a segmentation algorithm on the original image, and then perform target detection only on the segmented blocks.

There are three region-proposal methods:
R-CNN: classifies one proposed region at a time, which is slow.
Fast R-CNN: implements the per-region computation convolutionally, similar to Section 4.
Faster R-CNN: uses a convolutional network to generate the region proposals as well, further improving the running speed.
Even so, Faster R-CNN is still slower than YOLO.
