Notes on "Visual Tracking with Fully Convolutional Networks"


Some background first. This is an ICCV 2015 paper by Lijun Wang, a student of Professor Lu at Dalian University of Technology (http://202.118.75.4/lu/publications.html), written in collaboration with Xiaogang Wang's team at the Chinese University of Hong Kong (CUHK). I attended a talk at CUHK in July and saw an early demonstration of the work, and the results were impressive. Prof. Xiaogang Wang is a leading figure in deep learning and Professor Lu is a leading figure in tracking, so this paper is the product of a strong collaboration.

Now to the paper itself. The authors first study CNNs from the perspective of visual tracking and identify two kinds of features with complementary properties.

CNN features from different layers suit different tracking problems. The higher the layer, the more abstract the features and the more semantic information they carry. These features are good at distinguishing between object categories and are robust to deformation and occlusion (Figure a). Their weakness is that they cannot tell apart objects within the same category, such as different people (Figure b below). Features from lower layers are more local and help separate the target from the background (Figure b below), but they cannot handle drastic changes in the target's appearance (Figure a). During tracking, the method therefore switches between the two kinds of features in real time, depending on whether distracters are present.

Three observations and three contributions:

The authors stress three observations about CNNs in tracking. They matter because they inspired the way an ImageNet-pretrained CNN is applied to visual tracking. The authors' CVPR 2016 paper [1] continues this line of thought.

The three observations are:

1. Although the receptive field of CNN feature maps is large, the activated feature maps are sparse and localized. The activated regions are highly correlated to the regions of semantic objects. This means that it is feasible to locate the target with CNN feature maps, which is the foundation of the method.

2. Many CNN feature maps are noisy or unrelated for the task of discriminating a particular target from its background. This means that feature maps are useful, but not all of them: some are noisy or redundant, so a selection mechanism is needed.

3. Different layers encode different types of features. Higher layers capture semantic concepts on object categories, whereas lower layers encode more discriminative features to capture intra-class variations. This means that feature maps from different layers (conv4 and conv5) have different characteristics, and they are used to handle the different situations that arise in tracking.
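Observations 1 and 2 can be made concrete with a small sketch. The scoring rule below (the share of a feature map's activation that falls inside the target box) is only an illustrative heuristic chosen for these notes, not the selection method of the paper, and the function names and toy maps are hypothetical.

```python
def inside_box_ratio(feature_map, box):
    """Fraction of a feature map's total activation that lies inside box.

    feature_map: 2-D list of non-negative activations (list of rows).
    box: (top, left, bottom, right), with bottom and right exclusive.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in feature_map)
    if total == 0:
        return 0.0  # a map with no activation says nothing about the target
    inside = sum(sum(row[left:right]) for row in feature_map[top:bottom])
    return inside / total


def rank_feature_maps(feature_maps, box):
    """Return channel indices sorted from most to least target-focused."""
    scores = [inside_box_ratio(fm, box) for fm in feature_maps]
    return sorted(range(len(feature_maps)), key=lambda i: scores[i], reverse=True)


# Three toy 4x4 "feature maps"; the target occupies rows 1-2, columns 1-2.
target_box = (1, 1, 3, 3)
localized = [[0, 0, 0, 0], [0, 5, 4, 0], [0, 3, 6, 0], [0, 0, 0, 0]]
noisy = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
background = [[7, 0, 0, 2], [0, 0, 0, 0], [0, 0, 0, 0], [3, 0, 0, 6]]

ranking = rank_feature_maps([noisy, localized, background], target_box)
print(ranking)  # [1, 0, 2]
```

The localized map ranks first, the uniformly noisy map second, and the background-only map last, which is the kind of ordering a selection mechanism needs to produce.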

The corresponding three contributions are as follows:

1. The authors analyze the features of CNNs pretrained on large-scale image classification and find properties that are useful for visual tracking. In other words, different computer vision tasks call for different features.

2. The authors propose a new tracking method that uses the feature outputs of two different convolutional layers, so that they complement each other in handling drastic appearance changes and in distinguishing the target itself.

3. A method is designed to automatically select the relevant feature maps while discarding the irrelevant and noisy ones.

Overall framework:



The framework works as follows:

The first step is feature map selection: for a given target, the most relevant feature maps are selected from the conv4-3 and conv5-3 layers of the VGG network. Concretely, the selection is formulated as an objective function with L1-norm regularization.
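To illustrate why an L1-regularized objective performs selection, here is a minimal sketch. It assumes a generic lasso formulation (reconstruct the target mask from a weighted sum of flattened feature maps, with an L1 penalty on the weights) solved by iterative soft-thresholding; the exact objective in the paper may differ, and all names and toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; this is the proximal operator of the L1 norm."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def l1_select(feature_columns, target_mask, lam=0.5, step=0.05, iterations=500):
    """Minimise 0.5 * ||sum_k w_k * f_k - mask||^2 + lam * ||w||_1 with ISTA.

    feature_columns: list of flattened feature maps f_k (equal lengths).
    target_mask: flattened foreground mask of the same length.
    Returns one weight per feature map; zero weights mean "not selected".
    """
    weights = [0.0] * len(feature_columns)
    for _ in range(iterations):
        # Reconstruction of the mask from the currently weighted maps.
        prediction = [
            sum(w * col[i] for w, col in zip(weights, feature_columns))
            for i in range(len(target_mask))
        ]
        residual = [p - t for p, t in zip(prediction, target_mask)]
        # Gradient step on the squared error, followed by L1 shrinkage.
        for k, col in enumerate(feature_columns):
            gradient = sum(r * c for r, c in zip(residual, col))
            weights[k] = soft_threshold(weights[k] - step * gradient, step * lam)
    return weights


# Toy data: six pixels, the target covers pixels 1 and 2.
target_mask = [0, 1, 1, 0, 0, 0]
on_target = [0, 1, 1, 0, 0, 0]      # fires exactly on the target
on_background = [1, 0, 0, 1, 0, 1]  # fires only on the background
weak_noise = [0.1] * 6              # faint, uniform response

weights = l1_select([on_target, on_background, weak_noise], target_mask)
selected = [k for k, w in enumerate(weights) if w != 0.0]
print(selected)  # [0]
```

The L1 penalty drives the weights of the background and noise maps to exactly zero, so only the map that responds on the target survives. The step size must stay small enough for the iteration to converge; 0.05 is safe for this toy data but is not a general recommendation.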

In the second step, a general network, GNet, is built on the conv5-3 feature maps to capture the category information of the target.

In the third step, a specific network, SNet, is built on the conv4-3 feature maps to discriminate the target from the background.

In the fourth step, GNet and SNet are both initialized on the first frame, but the two networks use different update strategies.

In the fifth step, for each new frame, a region of interest (ROI) centered on the target position in the previous frame, which contains both the target and its background context, is passed through the fully convolutional network.

In the sixth step, GNet and SNet each produce a foreground heat map. The target position in the new frame is predicted from these two heat maps.

In the seventh step, distracter detection decides which of the two heat maps is used to determine the final target position.

Some details
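Steps five to seven can be sketched in a few lines of plain Python. This is a simplified illustration under several assumptions that are not taken from the notes: the heat maps have the same resolution as the ROI, the GNet map is trusted by default, and a distracter is declared when the response outside a window around the GNet peak exceeds a fixed fraction of the response inside it. The function names and the threshold of 0.2 are hypothetical.

```python
def crop_roi(frame, center, half_size):
    """Step 5: cut a square ROI around the previous target position.

    Returns the ROI and the (row, col) offset of its top-left corner, so that
    positions found inside the ROI can be mapped back to frame coordinates.
    """
    row, col = center
    top, left = max(0, row - half_size), max(0, col - half_size)
    bottom = min(len(frame), row + half_size + 1)
    right = min(len(frame[0]), col + half_size + 1)
    return [r[left:right] for r in frame[top:bottom]], (top, left)


def peak(heat_map):
    """Step 6: position (row, col) of the strongest foreground response."""
    cells = [(r, c) for r in range(len(heat_map)) for c in range(len(heat_map[0]))]
    return max(cells, key=lambda rc: heat_map[rc[0]][rc[1]])


def distracter_ratio(heat_map, center, radius):
    """Response mass outside the window around center, relative to inside."""
    inside = outside = 0.0
    for r, row in enumerate(heat_map):
        for c, value in enumerate(row):
            if abs(r - center[0]) <= radius and abs(c - center[1]) <= radius:
                inside += value
            else:
                outside += value
    return outside / inside if inside > 0 else float("inf")


def locate(gnet_map, snet_map, radius=1, threshold=0.2):
    """Step 7: use the GNet map unless it shows a strong distracter."""
    gnet_pos = peak(gnet_map)
    if distracter_ratio(gnet_map, gnet_pos, radius) > threshold:
        return peak(snet_map), "snet"
    return gnet_pos, "gnet"


# Toy 8x8 frame; the target was at (3, 3) in the previous frame.
frame = [[0] * 8 for _ in range(8)]
roi, offset = crop_roi(frame, (3, 3), half_size=2)  # 5x5 ROI, offset (1, 1)

# Stand-ins for the network outputs on that ROI (hand-written, not computed).
gnet_map = [[0.0] * 5 for _ in range(5)]
gnet_map[1][1], gnet_map[3][3] = 0.9, 0.8  # target plus a similar distracter
snet_map = [[0.0] * 5 for _ in range(5)]
snet_map[1][1], snet_map[3][3] = 0.7, 0.1  # SNet suppresses the distracter

roi_pos, source = locate(gnet_map, snet_map)
frame_pos = (offset[0] + roi_pos[0], offset[1] + roi_pos[1])
print(source, frame_pos)  # snet (2, 2)
```

Because the GNet map contains a second strong response far from its peak, the sketch falls back to the SNet map, which mirrors the idea that the category-level features cannot separate the target from a similar object while the more discriminative features can.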

It is worth mentioning that the authors rely on many detailed techniques that help improve the results.

For example, when updating the model, the authors take target drift and heat map matching into account.

References

[1] Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "STCT: Sequentially Training Convolutional Networks for Visual Tracking", in Proc. CVPR 2016.
