"Reprint" ILSVRC2016 Target Detection task review: Video target detection (VID)
Reproduced from: http://geek.csdn.net/news/detail/133792
Image object detection has made great progress over the past three years, with detection performance improving substantially. In fields such as video surveillance and driver assistance, however, video-based object detection is in even wider demand. Because of motion blur, occlusion, diverse shape changes, and varied illumination, applying still-image object detection techniques alone does not detect targets in video well. How to exploit targets' temporal information and the context within a video is the key to improving video object detection performance.
ILSVRC2015 added a video object detection task (object detection from video, VID), which provides good data support for researchers. The VID evaluation metric in ILSVRC2015 is the same as that for image object detection: the mAP computed over detection windows. For video object detection, however, a good detector must not only detect accurately on each frame but also keep its results consistent and continuous (that is, a specific target should be detected continuously and not confused with other targets). ILSVRC2016 added a new subtask to VID for this purpose (see Part 4, on the temporal consistency of video object detection).
In ILSVRC2016, on the two VID subtasks that do not use external data, the top three places were all taken by teams from China (see Table 1 and Table 2). This article summarizes the video object detection methods used in ILSVRC2016, based on the material published by four teams: NUIST, CUVideo, MCG-ICT-CAS, and ITLab-Inha.
Table 1. ILSVRC2016 VID results (no external data)
Table 2. ILSVRC2016 VID tracking results (no external data)
From the teams' reports [2-5], video object detection algorithms mainly follow this framework: treat each video frame as an independent image and apply an image object detection algorithm to obtain per-frame results; use the video's temporal and contextual information to refine those results; and further refine the results based on tracking trajectories seeded from high-quality detection windows.
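To make the framework concrete, here is a minimal Python sketch of the three-stage pipeline. It only illustrates the structure described above; the `detect`, `refine`, and `track_correct` callables are hypothetical placeholders for the still-image detector and the refinement steps, not any team's actual code.

```python
from typing import Any, Callable, List

def vid_pipeline(frames: List[Any],
                 detect: Callable,         # still-image detector: frame -> [detection]
                 refine: Callable,         # temporal/context refinement (e.g., MGP, MCS)
                 track_correct: Callable,  # tracking-based correction
                 ) -> List[Any]:
    # Stage 1: treat each video frame as an independent image.
    per_frame = [detect(f) for f in frames]
    # Stage 2: refine the per-frame results with temporal and contextual cues.
    refined = refine(per_frame, frames)
    # Stage 3: further correct the results using trajectories seeded
    # at high-quality detection windows.
    return track_correct(refined, frames)
```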
This article is divided into four parts: the first three describe how to improve the accuracy of video object detection, and the last describes how to ensure its consistency.

1. Single-Frame Image Object Detection
In this stage, the video is split into independent frames, and robust single-frame detection results are obtained by choosing a strong image object detection framework and applying various techniques that improve image detection accuracy. These were summarized in detail in "ILSVRC2016 Object Detection Task Review (Part 1): Image Object Detection" and are not repeated here.
Combining our own experiments with the participating teams' materials, we believe that the choice of training data and the choice of network structure are both crucial to improving detection performance.
Training Data Selection
First, consider the ILSVRC2016 VID training data: the VID database covers 30 categories; the training set contains 3,862 video snippets, totaling more than 1.12 million frames. Judging by the numbers alone, this seems more than enough to train detectors for 30 categories. However, frames within the same video snippet share a single background and change little from one frame to the next. As training data for existing object detection models, the VID training set is therefore highly redundant and lacks diversity, and it needs to be augmented. Within the rules of the competition, images containing the VID categories can be drawn from the ILSVRC DET and ILSVRC LOC data. CUVideo, NUIST, and MCG-ICT-CAS used ILSVRC VID+DET as the training set, while ITLab-Inha used ILSVRC VID+DET, COCO, and other data. Note that when building the new training set, samples must be balanced and redundancy removed (CUVideo and MCG-ICT-CAS sampled part of the VID training set to train their models; ITLab-Inha selected a fixed number of images per category for training; NUIST used a model trained on DET to filter the VID data). For the same network, using the augmented dataset improves detection accuracy by about 10%.
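As an illustration of removing redundancy, here is a minimal sketch that subsamples VID snippets by keeping every k-th frame with a per-snippet cap. The stride and cap are assumed values for illustration; the teams' actual selection rules (per-category quotas, DET-model filtering) are described above and differ from this.

```python
def subsample_vid(snippets, stride=10, max_per_snippet=100):
    """Reduce VID frame redundancy: adjacent frames are near-duplicates,
    so keep every `stride`-th frame, at most `max_per_snippet` per snippet.

    snippets: dict mapping snippet_id -> ordered list of frame paths.
    Stride and cap are illustrative values, not any team's settings.
    """
    selected = []
    for snippet_id, frame_paths in snippets.items():
        for path in frame_paths[::stride][:max_per_snippet]:
            selected.append((snippet_id, path))
    return selected
```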
Network Structure Selection
Different network structures have a large influence on detection performance. In our experiments on the VID validation set, with the same training data, a Faster R-CNN [7] model based on ResNet-101 [6] achieved detection accuracy about 12% higher than a Faster R-CNN model based on VGG-16 [8]. This was the key to MSRA's wins in the 2015 ILSVRC and COCO competitions. The leading teams in this year's competition basically all used ResNet/Inception base networks; CUVideo used the 269-layer GBD-Net [9].

2. Improving the Classification Loss

MGP
Single-frame detection results miss many targets, while the detections on adjacent frames may contain those missed targets. We can therefore use optical flow to propagate the current frame's detections forward and backward; after this motion-guided propagation (MGP), the target recall rate improves. As shown in Figure 1, the detection windows at time T are propagated backward and forward, filling in the targets missed at times T-1 and T+1.
Figure 1. Illustration of MGP [10]
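Below is a minimal sketch of the MGP idea under stated assumptions: detections are shifted to neighboring frames by a hypothetical `flow_shift` helper (standing in for a real optical-flow estimate), and propagated boxes receive decayed scores. The decay heuristic and window size are illustrative, not the exact scheme of T-CNN [10].

```python
def motion_guided_propagation(dets_by_frame, flow_shift, decay=0.8, window=1):
    """Propagate each frame's detections to +/- `window` neighboring frames.

    dets_by_frame: {t: [(cls, score, (x1, y1, x2, y2)), ...]}
    flow_shift(t, box, dt): hypothetical helper that moves `box` from
        frame t to frame t+dt according to an optical-flow estimate.
    Propagated boxes receive decayed scores (an assumed heuristic).
    """
    propagated = {t: list(dets) for t, dets in dets_by_frame.items()}
    for t, dets in dets_by_frame.items():
        for dt in range(-window, window + 1):
            if dt == 0 or (t + dt) not in propagated:
                continue
            for cls, score, box in dets:
                moved = flow_shift(t, box, dt)
                propagated[t + dt].append((cls, score * decay ** abs(dt), moved))
    return propagated  # typically followed by per-frame NMS
```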
MCS

Treating video frames as independent images and applying an image detection algorithm does not make full use of the context of the whole video. Although a video could in principle contain targets of any category, a single video snippet contains only a few categories, and those categories tend to co-occur (a snippet containing a ship may also contain a whale, but is unlikely to contain a zebra). We can therefore run a statistical analysis over the detection results of the whole snippet: rank all detection windows by score, keep the high-scoring categories, and suppress the scores of the remaining low-scoring categories, which are likely false positives (Figure 2). After this multi-context suppression (MCS), correctly classified results are ranked ahead of wrongly classified ones, which improves detection accuracy.

Figure 2. Illustration of multi-context suppression [10]
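A minimal sketch of the MCS idea under assumed parameters: categories that appear often among the snippet's top-scoring windows are treated as present, and all other categories' scores are penalized. The thresholds and penalty factor are illustrative, not the values used in [10].

```python
from collections import Counter

def multi_context_suppression(dets, top_k=200, min_hits=5, penalty=0.4):
    """Suppress categories unlikely to be present in this video snippet.

    dets: [(cls, score, box), ...] pooled over the whole snippet.
    Categories with at least `min_hits` detections among the `top_k`
    highest-scoring windows are kept as "present"; every other
    category's scores are multiplied by `penalty` (all values assumed).
    """
    ranked = sorted(dets, key=lambda d: d[1], reverse=True)
    hits = Counter(cls for cls, _, _ in ranked[:top_k])
    present = {cls for cls, n in hits.items() if n >= min_hits}
    return [(cls, score if cls in present else score * penalty, box)
            for cls, score, box in dets]
```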
3. Using Tracking Information for Correction
The MGP described above fills in targets missed on some frames, but it is not very effective when a target is missed over many consecutive frames; target tracking handles this case well. All four teams (CUVideo, NUIST, MCG-ICT-CAS, and ITLab-Inha) used tracking algorithms to further improve the recall of video object detection. The basic flow for obtaining target sequences with a tracking algorithm is: run an image object detector to obtain good detection results; select the highest-scoring detection as the starting anchor for tracking; track forward and backward through the whole video snippet from the chosen anchor to generate a trajectory; and repeat the procedure iteratively, using a score threshold as the termination condition.
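Here is a minimal sketch of this anchor-and-track loop. The `track` callable is a hypothetical single-object tracker returning a per-frame trajectory, and the 0.5 IoU coverage rule and score threshold are assumed values for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def build_tracklets(dets_by_frame, frames, track, score_thresh=0.8):
    """Iteratively seed a tracker at the highest-scoring remaining detection.

    track(frames, t, box): hypothetical single-object tracker returning a
        trajectory {frame_idx: box} from a forward + backward pass at frame t.
    Stops when no anchor scores above `score_thresh` (assumed value).
    """
    remaining = [(t, cls, score, box)
                 for t, ds in dets_by_frame.items() for cls, score, box in ds]
    tracklets = []
    while remaining:
        anchor = max(remaining, key=lambda d: d[2])
        t, cls, score, box = anchor
        if score < score_thresh:
            break  # no confident anchors left
        trajectory = track(frames, t, box)
        tracklets.append((cls, trajectory))
        # Drop the anchor and any detection the new tracklet already covers.
        remaining = [d for d in remaining if d is not anchor and
                     (d[0] not in trajectory or iou(d[3], trajectory[d[0]]) < 0.5)]
    return tracklets
```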
The tracking trajectories can be used to improve target recall, and can also serve as long-range sequence-level context for correcting the detection results.

4. Temporal Consistency of Video Object Detection
For video object detection, in addition to ensuring detection accuracy on each frame, each target should also be tracked stably over a long time. To this end, ILSVRC2016 added a VID subtask that computes the mAP over each target's trajectory (tracklet/tubelet), evaluating the temporal consistency, or tracking continuity, of a detection algorithm.
How, then, can we ensure the temporal consistency of targets in video detection? We believe the problem can be approached from three directions: (1) make the per-frame detection results as accurate as possible in the image detection stage; (2) track from high-quality detection windows and keep the tracking quality high (minimizing drift during tracking); (3) post-process the tracking results from the first two steps where trajectories overlap or adjoin.
The ITLab-Inha team proposed a multi-object tracking algorithm based on changepoint detection [11]. The algorithm first detects targets and then tracks them, analyzing the trajectory during tracking; this mitigates drift and terminates tracking promptly when the trajectory becomes abnormal.
For the consistency problem of video object detection, the author's team, MCG-ICT-CAS, proposed a method that generates target tubelets based on both detection and tracking.
Figure 3. Tubelets generated by tracking, by detection, and by detection + tracking
Figure 3(a) shows a target tubelet (red bounding boxes) obtained with a tracking algorithm; the green bounding boxes are the target's ground truth. Over time, the tracking window gradually drifts off the target and may even lose it. MCG-ICT-CAS proposed generating tubelets from detections instead: as shown in Figure 3(b), the detection-based tubelet windows (red bounding boxes) are located more accurately, but the detector misses the target when it is motion-blurred. From this analysis we can see that tracking-based tubelets have high recall but inaccurate localization, whereas detection-based tubelets are localized more accurately but have lower recall. Because the two are complementary, MCG-ICT-CAS further proposed a tubelet fusion algorithm that merges the detection tubelets with the tracking tubelets, fusing overlapping windows and stitching together interrupted tubelets.
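As an illustration of this complementarity, here is a minimal per-frame fusion sketch, assuming a simple rule: where a detection box and a tracking box for the same target overlap, keep the better-localized detection box; where the detection tubelet is interrupted, fill the gap with the tracking box. It reuses the `iou` helper from the tracking sketch above and is not MCG-ICT-CAS's actual fusion algorithm.

```python
def fuse_tubelets(det_tube, trk_tube, frame_range, iou_thresh=0.5):
    """Fuse a detection tubelet with a tracking tubelet for one target.

    det_tube, trk_tube: {frame_idx: (x1, y1, x2, y2)}, each possibly with gaps.
    Illustrative rule only: prefer the better-localized detection box when
    the two overlap, and stitch detection gaps with the tracking box.
    """
    fused = {}
    for t in frame_range:
        d, k = det_tube.get(t), trk_tube.get(t)
        if d is not None and k is not None and iou(d, k) >= iou_thresh:
            fused[t] = d   # boxes agree: keep the accurate detection box
        elif d is not None:
            fused[t] = d   # detection only
        elif k is not None:
            fused[t] = k   # detector missed here: fill from tracking
    return fused
```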
As shown in Figure 4, compared with tubelets generated by detection or tracking alone, the recall of the detection windows in the fused tubelets stays high as the IoU threshold increases, indicating that the fused windows not only maintain high recall but are also more accurately localized. The fused tubelets improved mAP by 12.1% on the VID test set.
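For reference, the kind of evaluation behind Figure 4 can be sketched as recall at increasing IoU thresholds, with a simple greedy matching assumed for illustration (again reusing the `iou` helper above).

```python
def recall_at_thresholds(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Recall of predicted boxes against ground truth at rising IoU thresholds.

    preds, gts: {frame_idx: [box, ...]}. Each ground-truth box is matched
    greedily to its best unused prediction (assumed matching scheme).
    """
    recalls = {}
    for th in thresholds:
        hit, total = 0, 0
        for t, gt_boxes in gts.items():
            total += len(gt_boxes)
            unused = list(preds.get(t, []))
            for g in gt_boxes:
                best = max(unused, key=lambda p: iou(p, g), default=None)
                if best is not None and iou(best, g) >= th:
                    hit += 1
                    unused.remove(best)
        recalls[th] = hit / total if total else 0.0
    return recalls
```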
Figure 4. Recall of tubelets generated by different methods

5. Summary
This article has introduced video object detection algorithms in the context of the ILSVRC2016 VID competition task. Compared with image object detection, current video object detection pipelines are cumbersome, and the information contained in video is far from fully exploited. How to streamline the video detection pipeline to make it real-time, how to further mine the rich information in video for higher detection accuracy, and how to ensure the consistency of video object detection are likely the key problems to be solved next.
[1] ILSVRC2016 related reports:
http://image-net.org/challenges/ilsvrc+coco2016
[2] CUVideo slide download link:
http://image-net.org/challenges/talks/2016/GBD-Net.pdf
[3] NUIST slide download link:
http://image-net.org/challenges/talks/2016/Imagenet%202016%20VID.pptx
[4] MCG-ICT-CAS slide download link:
http://image-net.org/challenges/talks/2016/MCG-ICT-CAS-ILSVRC2016-Talk-final.pdf
[5] ITLab-Inha slide download link:
http://image-net.org/challenges/talks/2016/ILSVRC2016_ITLab_for_pdf.pdf
[6] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.
[7] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems. 2015: 91-99.
[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[9] Zeng X, Ouyang W, Yang B, et al. Gated bi-directional CNN for object detection[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 354-369.
[10] Kang K, Li H, Yan J, et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos[J]. arXiv preprint arXiv:1604.02532, 2016.
[11] Lee B, Erdenee E, Jin S, et al. Multi-class multi-object tracking using changing point detection[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 68-83.