ILSVRC2016 Object Detection Task Review: Video Object Detection (VID)


Original URL: http://www.cnblogs.com/laiqun/p/6501865.html


Image object detection has made great progress over the past three years, with detection performance improving significantly. In fields such as video surveillance and driver assistance, however, video-based object detection is in even wider demand. Because video suffers from motion blur, occlusion, diverse object appearances, and varying illumination, still-image detection techniques alone cannot detect objects in video reliably. Exploiting cues such as temporal information and context in video is therefore the key to improving video object detection performance.

ILSVRC2015 introduced a new task, object detection from video (VID), which provides good data support for researchers. The ILSVRC2015 VID evaluation metric is the same as for image object detection: the mAP over detection windows. For video object detection, however, a good detector must not only detect accurately on every frame but also keep its results consistent and continuous (i.e., for a specific object, an excellent detector should detect it continuously without confusing it with other objects). ILSVRC2016 added a new sub-task to VID to address this issue (see Part 4, on the temporal consistency of video object detection).

In ILSVRC2016, on the two VID sub-tasks that use no external data, the top three places were all swept by teams from China (see Tables 1 and 2). Based on the material published by the four teams NUIST, CUVideo, MCG-ICT-CAS, and ITLab-Inha, this article summarizes the video object detection methods used in ILSVRC2016.

Table 1. ILSVRC2016 VID Results (no external data)

Table 2. ILSVRC2016 VID tracking results (no external data)

Judging from the participating teams' reports [2-5], video object detection algorithms mainly follow this framework:

Treat each video frame as an independent image and obtain detections with an image object detection algorithm;

Use the temporal and contextual information in the video to refine these detections;

Further refine the detections based on tracking trajectories seeded from high-quality detection windows.

This article has four parts: the first three describe how to improve the accuracy of video object detection, and the last describes how to ensure the consistency of the detection results.

1. Single-Frame Image Object Detection

This stage usually splits the video into independent frames and obtains robust single-frame detections by choosing a strong image object detection framework and applying various techniques that improve image detection accuracy. These were summarized in detail in "ILSVRC2016 Object Detection Task Review (Part 1): Image Object Detection" and are not repeated here.

Combining the teams' materials, we believe that the selection of training data and the choice of network architecture are both crucial to detection performance.

Training Data Selection

First, consider the ILSVRC2016 VID training data: the VID database covers 30 categories, with 3,862 video clips in the training set totaling more than 1.12 million frames. Numerically, this seems ample for training 30-category detectors. However, frames from the same video clip share a single background and adjacent frames differ little, so for training existing detection models the VID training set is highly redundant and lacks diversity; it needs to be expanded. For the competition task, images containing the VID categories can be extracted from the ILSVRC DET and ILSVRC LOC data. CUVideo, NUIST, and MCG-ICT-CAS used ILSVRC VID+DET as the training set, while ITLab-Inha used ILSVRC VID+DET, COCO, and so on. Note that when building the new training set, samples must be balanced and redundancy removed (CUVideo and MCG-ICT-CAS sampled part of the VID training set for model training; ITLab-Inha selected a fixed number of images per category; NUIST used a model trained on DET to filter the VID data). For the same network, using the expanded dataset improves detection accuracy by around 10%.
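To make the subsampling-and-balancing idea concrete, here is a minimal sketch, assuming a `clips` structure of the form `{clip_id: [(frame_path, class_labels), ...]}`; the stride and per-class cap are illustrative values, not numbers any team reported.

import random
from collections import defaultdict

def build_training_list(clips, stride=15, per_class_cap=2000, seed=0):
    """Subsample every `stride`-th frame per clip, then cap images per class."""
    random.seed(seed)
    per_class = defaultdict(list)
    for clip_id, frames in clips.items():
        # adjacent frames are near-duplicates, so keep only every stride-th one
        for frame_path, labels in frames[::stride]:
            for cls in labels:
                per_class[cls].append(frame_path)
    selected = set()
    for cls, paths in per_class.items():
        # balance samples across categories by capping each class
        random.shuffle(paths)
        selected.update(paths[:per_class_cap])
    return sorted(selected)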

Network Structure Selection

Network architecture also has a significant impact on detection performance. In our experiments on the VID validation set, with the same training data, a Faster R-CNN [7] model based on ResNet-101 [6] is about 12% more accurate than a Faster R-CNN model based on VGG-16 [8]. This was the key to MSRA's wins in the 2015 ILSVRC and COCO competitions. This year's top teams basically all used ResNet/Inception base networks; CUVideo used the 269-layer GBD-Net [9].

2. Improving Classification Scores

Some video frames suffer from motion blur, low resolution, and occlusion; even the best image object detection algorithms currently cannot detect objects in them well. Fortunately, the temporal and contextual information in video can help with this class of problems. The most representative methods are motion-guided propagation (MGP) and multi-context suppression (MCS) from T-CNN [10].

MGP

Single-frame detection misses many objects, and those missed objects may appear in the detections of adjacent frames. We can therefore use optical flow to propagate the current frame's detections forward and backward; this MGP processing improves object recall. As shown in Figure 1, propagating the detection windows at time T forward and backward fills in the objects missed at times T-1 and T+1.

Figure 1. MGP schematic diagram [10]
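Here is a minimal sketch of the propagation idea, assuming detections are dicts `{"box": (x1, y1, x2, y2), "score": float, "cls": int}` per frame, `flow_fwd[t]` is a dense flow field (H x W x 2) from frame t to t+1, and `flow_bwd[t]` maps frame t+1 back to t. The exact propagation rule in T-CNN differs in detail; this only illustrates the mechanism.

import numpy as np

def shift_box(box, flow_field):
    """Translate a box by the mean flow vector inside it."""
    x1, y1, x2, y2 = (max(0, int(round(v))) for v in box)
    region = flow_field[y1:y2, x1:x2]
    if region.size == 0:
        return box
    dx, dy = region.reshape(-1, 2).mean(axis=0)
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)

def mgp(dets_per_frame, flow_fwd, flow_bwd, window=1, decay=0.8):
    """Propagate each frame's detections to +/- `window` neighboring frames."""
    out = [list(d) for d in dets_per_frame]
    T = len(dets_per_frame)
    for t in range(T):
        for det in dets_per_frame[t]:
            box = det["box"]
            for dt in range(1, window + 1):        # forward via flow t -> t+1
                if t + dt >= T:
                    break
                box = shift_box(box, flow_fwd[t + dt - 1])
                out[t + dt].append({"box": box, "score": det["score"] * decay ** dt,
                                    "cls": det["cls"]})
            box = det["box"]
            for dt in range(1, window + 1):        # backward via flow t -> t-1
                if t - dt < 0:
                    break
                box = shift_box(box, flow_bwd[t - dt])
                out[t - dt].append({"box": box, "score": det["score"] * decay ** dt,
                                    "cls": det["cls"]})
    # duplicates would normally be merged by a final per-frame NMS pass (omitted)
    return out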

MCS

Treating video frames as separate images with an image detection algorithm does not take advantage of the context of the whole video. While objects of any category could in principle appear in a video, a single video clip contains only a few categories, and those categories tend to co-occur (a clip with ships may contain whales, but is unlikely to contain zebras). The detections over the whole clip can therefore be analyzed statistically: sort all detection windows by score and keep the categories with higher scores; windows of the remaining low-scoring categories are likely false positives, and their scores are suppressed (Figure 2). By ranking the correct categories higher, MCS-processed detections improve detection accuracy.

Figure 2. Multi-context suppression schematic diagram [10]
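The following is a minimal sketch of this clip-level suppression, under the assumption that per-class evidence is measured by the sum of its detection scores across the whole clip; the keep ratio and penalty are illustrative, not the values used in T-CNN [10].

from collections import defaultdict

def mcs(dets_per_frame, keep_ratio=0.2, penalty=0.4):
    """Suppress scores of classes that rank low over the entire video clip."""
    class_mass = defaultdict(float)
    for frame_dets in dets_per_frame:
        for det in frame_dets:
            class_mass[det["cls"]] += det["score"]
    # keep only the top fraction of classes by accumulated score
    ranked = sorted(class_mass, key=class_mass.get, reverse=True)
    kept = set(ranked[:max(1, int(len(ranked) * keep_ratio))])
    for frame_dets in dets_per_frame:
        for det in frame_dets:
            if det["cls"] not in kept:
                det["score"] *= penalty   # down-weight likely false positives
    return dets_per_frame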

3. Refining Results with Tracking Information

MGP, described above, can fill in objects missed on some frames, but it is not very effective when an object is missed over many consecutive frames; object tracking solves this problem well. All four teams, CUVideo, NUIST, MCG-ICT-CAS, and ITLab-Inha, used tracking algorithms to further improve the recall of video object detection. The basic procedure for obtaining object sequences with a tracking algorithm is as follows (a minimal sketch is given after the list):

Run an image object detection algorithm to obtain good still-image detections;

Select the detection with the highest score as the starting anchor for tracking;

Starting from the selected anchor, track forward and backward through the whole video clip, producing a tracking trajectory;

Pick the highest-scoring detection among the remaining ones as the next anchor; note that if this window already appears on a previous trajectory, skip it and pick the next one;

Iterate, using a score threshold as the termination condition.
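Here is a minimal sketch of this greedy loop. `track(video, frame_idx, box)` is a placeholder for any single-object tracker that returns one box per frame as `{frame_idx: box}`; it is not any specific team's implementation.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def build_trajectories(video, detections, track, score_thresh=0.5, overlap_thresh=0.5):
    """detections: list of (frame_idx, box, score) candidates for anchors."""
    pool = sorted(detections, key=lambda d: d[2], reverse=True)
    trajectories = []
    for frame_idx, box, score in pool:
        if score < score_thresh:           # termination condition
            break
        # skip anchors already covered by an existing trajectory
        if any(frame_idx in traj and iou(traj[frame_idx], box) > overlap_thresh
               for traj in trajectories):
            continue
        # track forward and backward from the anchor over the whole clip
        trajectories.append(track(video, frame_idx, box))
    return trajectories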

The resulting trajectories can be used both to improve object recall and, as long-range temporal context, to refine the detection results.

4. Temporal Consistency of Video Object Detection

For video object detection, besides ensuring accurate detection on every frame, each object must also be tracked stably over a long time. To this end, ILSVRC2016 added a VID sub-task that computes the mAP over each object's tracking trajectory (tracklet/tubelet) to evaluate the temporal consistency, i.e., the tracking continuity, of a detection algorithm.

Evaluation metric: image object detection mAP evaluates the accuracy of each detection window, whereas the temporal-consistency evaluation judges whether each object's tracking trajectory is accurate. In image object detection, a window is a positive example if its category matches the ground truth and its IoU with the ground-truth window exceeds 0.5. In the temporal-consistency evaluation, a detected trajectory is a positive example if it and a ground-truth trajectory correspond to the same object (same TrackID) and the proportion of detected windows whose IoU with the ground-truth windows exceeds 0.5 is large enough; the trajectory's score is the average of all window scores on the sequence. This analysis shows that splitting one object's trajectory into multiple segments, or mixing other objects into one object's trajectory, both reduce the consistency score.
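To make the trajectory-level criterion concrete, here is a minimal sketch (reusing the `iou()` helper from the tracking sketch above). The required overlap ratio `min_ratio` is an assumption for illustration; the official evaluation kit defines the exact protocol.

def tracklet_positive(pred, gt, min_ratio=0.5, iou_thresh=0.5):
    """pred, gt: {frame_idx: box} for one predicted and one ground-truth trajectory,
    already matched by TrackID."""
    shared = set(pred) & set(gt)
    if not shared:
        return False
    hits = sum(1 for f in shared if iou(pred[f], gt[f]) > iou_thresh)
    return hits / len(shared) >= min_ratio

def tracklet_score(window_scores):
    """A trajectory's score is the average of its window scores."""
    return sum(window_scores) / len(window_scores)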

How, then, can the temporal consistency of objects in video detection be ensured? This article suggests starting from the following three aspects:

In the image detection stage, make the per-frame detection results as accurate as possible;

Track high-quality detection windows and ensure the quality of the trajectories (minimizing drift during tracking);

The trajectories obtained from the previous two steps may overlap or adjoin one another; apply post-processing accordingly.

The ITLab-Inha team proposed a multi-object tracking algorithm based on changing point detection [11]. The algorithm first detects objects and then tracks them, analyzing changing points along the trajectory during tracking; this alleviates drift and terminates drifting trajectories in time.

For the consistency problem of video object detection, the author's team, MCG-ICT-CAS, proposed a method that generates object tubelets by combining detection and tracking.

A. Tracking-based object tubelet (tracking trajectory)

B. Detection-based object tubelet

C. Fused tubelet based on detection and tracking

Figure 3. Tubelets generated by tracking / detection / detection + tracking

Figure 3(a) shows an object tubelet obtained with a tracking algorithm (red bounding boxes); the green bounding boxes are the object's ground truth. As time passes, the tracking window gradually drifts off the object and may eventually lose it entirely. MCG-ICT-CAS proposed a detection-based tubelet-generation method; as Figure 3(b) shows, the detection-based tubelet windows (red bounding boxes) are localized more accurately, but the object's motion blur causes the detector to miss it on some frames. This analysis shows that tracking-based tubelets have higher recall but less accurate localization, while detection-based windows are precisely localized but have lower recall. Because the two are complementary, MCG-ICT-CAS further proposed a tubelet-fusion algorithm that fuses the detection and tracking tubelets, merging windows that cover the same object and stitching together interrupted tubelets.
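Here is a minimal sketch of fusing a detection tubelet with a tracking tubelet for one object, reusing `iou()` from the tracking sketch above. The merge rule here (prefer the better-localized detection window, stitch gaps from tracking) is an assumption for illustration; the published method's exact rule may differ.

def fuse_tubelets(det_tube, trk_tube, iou_thresh=0.5):
    """det_tube, trk_tube: {frame_idx: (box, score)} for the same object."""
    fused = {}
    for f in sorted(set(det_tube) | set(trk_tube)):
        if f in det_tube and f in trk_tube:
            d_box, d_score = det_tube[f]
            t_box, t_score = trk_tube[f]
            if iou(d_box, t_box) > iou_thresh:
                # merge recurring windows covering the same object
                fused[f] = (d_box, max(d_score, t_score))
            else:
                fused[f] = det_tube[f]   # trust the better-localized detection
        elif f in det_tube:
            fused[f] = det_tube[f]
        else:
            fused[f] = trk_tube[f]       # stitch frames the detector missed
    return fused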

As shown in Figure 4, compared with tubelets generated by detection or tracking alone, the recall of the detection windows on the fused tubelets remains higher as the IoU threshold increases, indicating that the fused windows combine a high recall rate with more precise localization. The fused tubelets improved mAP by 12.1% on the VID test set.

Figure 4. Recall of object tubelets generated by different methods

Summary

This article reviewed video object detection algorithms through the lens of the ILSVRC2016 VID task. Compared with image object detection, current video object detection pipelines are cumbersome, and the information contained in video itself is far from fully mined. How to simplify the video detection pipeline to make it real-time, how to further exploit the rich information in video for higher detection accuracy, and how to ensure the consistency of video detections may be the key problems for video object detection to solve next.

References

[1] ILSVRC2016 related reports

[2] CUVideo slides

[3] NUIST slides

[4] MCG-ICT-CAS slides

[5] ITLab-Inha slides

[6] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[7] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015: 91-99.

[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[9] Zeng X, Ouyang W, Yang B, et al. Gated bi-directional CNN for object detection. In European Conference on Computer Vision. Springer International Publishing, 2016: 354-369.

[10] Kang K, Li H, Yan J, et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos. arXiv preprint arXiv:1604.02532, 2016.

[11] Lee B, Erdenee E, Jin S, et al. Multi-class multi-object tracking using changing point detection. In European Conference on Computer Vision. Springer International Publishing, 2016: 68-83.
