Correlation Filter in Visual Tracking Series II: Fast Visual Tracking via Dense Spatio-temporal Context Learning paper notes

Continuing the series from the previous post, where we discussed MOSSE, the ancestor of correlation-filter trackers, let's see how it can be refined further. The paper discussed today is the STC tracker published at ECCV by Zhang Kaihua's team: Fast Visual Tracking via Dense Spatio-temporal Context Learning. Readers who work on tracking should be familiar with this team; Compressive Tracking, for example, is one of their best-known works. The MATLAB source code for the paper has been released at the following link:

http://www4.comp.polyu.edu.hk/~cslzhang/STC/STC.htm

First take a look at their tracking algorithm:

Look at the update step: the fast Fourier transform should look familiar. Indeed, this paper is basically the same as the MOSSE method, so where is the innovation? In my view it lies in three points: first, the dense spatio-temporal context is used as the selling point; second, the correlation-filter approach is repackaged in the language of probability theory; third, scale changes are also taken into account when the template is updated.

So what is the dense spatio-temporal context? The simplest version of the idea can be expressed with the figure in the paper: during tracking, appearance changes, occlusion, and other factors make it difficult to track the target on its own, but if the area around the target is also taken into account (the spatial context), the risk of tracking failure can be reduced to a certain extent. In the illustrated example, the target is hard to trace once it is occluded, but if the surrounding pixels are taken into account (the red box), the surrounding environment can be used to locate the target. That is the single-frame case; considering multiple frames correspondingly yields the temporal context, together the spatio-temporal context. So where does the word "dense" come from? We'll explain that later.

With the main idea in place, let's see how probability theory is used to support it. Assume $\mathbf{x}\in\mathbb{R}^{2}$ is a location and $o$ denotes the target to be tracked. First define the following confidence map to measure the likelihood that the target appears at $\mathbf{x}$:
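In the notation used throughout these notes (the paper itself writes the confidence map as $c(\mathbf{x})$), formula (1) is:

$$m(\mathbf{x}) = P(\mathbf{x}\mid o) \tag{1}$$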

Then define $X^{c}=\{c(\mathbf{z})=(I(\mathbf{z}),\mathbf{z}) \mid \mathbf{z}\in\Omega_{c}(\mathbf{x}^{\star})\}$ as the set of context features, where $\mathbf{x}^{\star}$ denotes the target location, $\Omega_{c}(\mathbf{x}^{\star})$ denotes a neighborhood around the point $\mathbf{x}^{\star}$ that is twice the size of the tracked target, and $I(\mathbf{z})$ is the image grayscale value at point $\mathbf{z}$. In plain terms, this takes the image patch twice the size of the target box, centered at $\mathbf{x}^{\star}$, as the feature, such as the red box in the figure. We then use the law of total probability to expand (1), with the context features as the intermediate variable:
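$$m(\mathbf{x}) = P(\mathbf{x}\mid o) = \sum_{c(\mathbf{z})\in X^{c}} P(\mathbf{x}\mid c(\mathbf{z}),\, o)\, P(c(\mathbf{z})\mid o) \tag{2}$$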

Formula (2) splits into two factors. The left one, $P(\mathbf{x}\mid c(\mathbf{z}),o)$, is the probability that the target appears at point $\mathbf{x}$ given the target and its context features; the right one, $P(c(\mathbf{z})\mid o)$, is the probability that a context feature belongs to the target, i.e. the target's context prior. The role of the right factor is to select context whose appearance is similar to the target's, while the left factor judges whether a similar-looking choice is spatially plausible, which helps avoid drift during tracking.

Then, since the target's position is known in the first frame, a confidence map can be constructed there such that locations closer to the target have higher probability. The author defines the confidence map values as in formula (3):
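$$m(\mathbf{x}) = b\, e^{-\left|\frac{\mathbf{x}-\mathbf{x}^{\star}}{\alpha}\right|^{\beta}} \tag{3}$$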

where $b$, $\alpha$, $\beta$ are all empirical constants. Think back to the MOSSE method discussed last time: $m(\mathbf{x})$ is in fact the response output we talked about there, except that MOSSE directly uses a Gaussian shape, while here it is defined by formula (3). Also, where does the word "dense" in the paper's title show up? Right here: for every point near the target, formula (3) defines a probability value. Traditional tracking methods might sample randomly or at intervals, whereas here a probability value is defined for every single point, hence "dense". In fact all current CF-class methods use dense sampling; the explicit formulation of this concept appears in the later CSK method, and the authors of this paper simply recast it as dense spatio-temporal learning. Enough digression; let's continue and solve for $P(\mathbf{x}\mid c(\mathbf{z}),o)$ and $P(c(\mathbf{z})\mid o)$.

First look at $P(c(\mathbf{z})\mid o)$, the context prior of the target, defined as follows:
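Following the paper, the prior is the Gaussian-weighted image intensity:

$$P(c(\mathbf{z})\mid o) = I(\mathbf{z})\, \omega_{\sigma}(\mathbf{z}-\mathbf{x}^{\star}), \qquad \omega_{\sigma}(\mathbf{z}) = a\, e^{-\frac{|\mathbf{z}|^{2}}{\sigma^{2}}} \tag{4}$$

where $a$ is a normalization constant and $\sigma$ is a scale parameter.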

It is a Gaussian weighting of the image grayscale values near the target box (other features are also possible; a later paper in this series will cover that). With $P(c(\mathbf{z})\mid o)$ in hand, and $m(\mathbf{x})$ already given, we can plug both into (2) and solve for $P(\mathbf{x}\mid c(\mathbf{z}),o)$. The routine is the same as MOSSE: first express $m(\mathbf{x})$ as the convolution (cross-correlation) of $P(\mathbf{x}\mid c(\mathbf{z}),o)$ with $P(c(\mathbf{z})\mid o)$, go to the frequency domain via FFT to turn it into a pointwise multiplication, then inverse-transform back to the spatial domain and take the location of the maximum response as the target position. Specifically, setting $P(\mathbf{x}\mid c(\mathbf{z}),o) = h^{sc}(\mathbf{x}-\mathbf{z})$, we have
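$$m(\mathbf{x}) = \sum_{\mathbf{z}\in\Omega_{c}(\mathbf{x}^{\star})} h^{sc}(\mathbf{x}-\mathbf{z})\, I(\mathbf{z})\, \omega_{\sigma}(\mathbf{z}-\mathbf{x}^{\star}) = h^{sc}(\mathbf{x}) \otimes \left(I(\mathbf{x})\, \omega_{\sigma}(\mathbf{x}-\mathbf{x}^{\star})\right) \tag{5}$$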

The author also emphasizes that $h^{sc}(\mathbf{x}-\mathbf{z})$ is a measure of the relative distance and direction between the target's location and its local context, and is not a symmetric function.

In addition, the convolution $f \otimes g$ is defined as:
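$$(f\otimes g)(t) = \int f(\tau)\, g(t-\tau)\, d\tau \qquad\text{or, discretely,}\qquad (f\otimes g)(m) = \sum_{n} f(n)\, g(m-n)$$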

So formula (5) is indeed a convolution (with $\mathbf{x}$ playing the role of $t$ or $m$, and $\mathbf{z}$ the role of $\tau$ or $n$). According to the convolution theorem:
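$$\mathcal{F}\!\left(m(\mathbf{x})\right) = \mathcal{F}\!\left(h^{sc}(\mathbf{x})\right) \odot \mathcal{F}\!\left(I(\mathbf{x})\, \omega_{\sigma}(\mathbf{x}-\mathbf{x}^{\star})\right) \tag{6}$$

where $\odot$ denotes element-wise multiplication, so the template can be solved directly in the frequency domain:

$$h^{sc}(\mathbf{x}) = \mathcal{F}^{-1}\!\left(\frac{\mathcal{F}\!\left(b\, e^{-\left|\frac{\mathbf{x}-\mathbf{x}^{\star}}{\alpha}\right|^{\beta}}\right)}{\mathcal{F}\!\left(I(\mathbf{x})\, \omega_{\sigma}(\mathbf{x}-\mathbf{x}^{\star})\right)}\right) \tag{7}$$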

Unlike MOSSE, STC trains its template $h^{sc}(\mathbf{x}-\mathbf{z})$ from the first frame alone; during tracking, $h^{sc}$ is updated in the same running-average way as MOSSE, which I won't repeat here. In addition, the paper gives an update method for the target box size. The basic idea can be understood as follows. Look again at formula (5), $m(\mathbf{x})=\sum\nolimits_{\mathbf{z}\in\Omega_{c}(\mathbf{x}^{\star})} h^{sc}(\mathbf{x}-\mathbf{z})\, I(\mathbf{z})\, \omega_{\sigma}(\mathbf{z}-\mathbf{x}^{\star})$. Isn't $\omega_{\sigma}(\mathbf{z}-\mathbf{x}^{\star})$ just a Gaussian-shaped weight? Put somewhat loosely, it wraps the target in a circle: high weight inside the circle, low weight outside. So if the target gets bigger, we just enlarge the circle, and we can enlarge or shrink the circle by adjusting $\sigma$.
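Since the released code is MATLAB, here is a minimal NumPy sketch of the core learn/detect steps, just to make the FFT manipulation concrete. This is a sketch under my own assumptions, not the authors' implementation: the function names, the regularization term `eps`, and the simplified boundary handling are all mine; the values $\alpha = 2.25$, $\beta = 1$ follow the settings reported in the paper.

```python
import numpy as np

def gaussian_window(h, w, sigma):
    """Weight window w_sigma centered on the context patch (used in the context prior)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / sigma ** 2)

def target_confidence(h, w, alpha=2.25, beta=1.0):
    """Desired confidence map m(x) = exp(-|(x - x*)/alpha|^beta), target at the patch center."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    dist = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)
    return np.exp(-((dist / alpha) ** beta))

def learn_hsc(patch, sigma, eps=1e-5):
    """Invert formula (5) in the frequency domain: F(h^sc) = F(m) / F(I * w_sigma)."""
    h, w = patch.shape
    prior = patch * gaussian_window(h, w, sigma)      # context prior P(c(z)|o)
    return np.fft.fft2(target_confidence(h, w)) / (np.fft.fft2(prior) + eps)

def detect(Hstc, patch, sigma):
    """Confidence map on a new frame: m = F^-1( F(H^stc) .* F(I * w_sigma) ).

    Peak location gives the new target position (circular-boundary effects ignored here).
    """
    h, w = patch.shape
    prior = patch * gaussian_window(h, w, sigma)
    conf = np.real(np.fft.ifft2(Hstc * np.fft.fft2(prior)))
    return np.unravel_index(np.argmax(conf), conf.shape), conf.max()

# Per-frame template update, in the same running-average spirit as MOSSE
# (rho is a learning rate; Hstc is the frequency-domain template):
#   Hstc = (1 - rho) * Hstc + rho * learn_hsc(patch, sigma)
```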

Suppose that from frame $t$ to frame $t+1$ the target size is multiplied by $s$, which is equivalent to scaling the coordinate system by a factor of $s$. For convenience set $(u,v)=(sx,sy)$, and, without loss of generality, assume the target sits at coordinate $(0,0)$ in frame $t$. Then:
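Writing $m_{t}(0,0)$ as an integral over the context region and substituting $(u,v)=(sx,sy)$ (so that $dx\,dy = du\,dv/s^{2}$), formula (8) should read:

$$m_{t}(0,0) = \iint h_{t}^{sc}(-x,-y)\, I_{t}(x,y)\, \omega_{\sigma_{t}}(x,y)\, dx\, dy = \frac{1}{s^{2}} \iint h_{t}^{sc}\!\left(-\tfrac{u}{s},-\tfrac{v}{s}\right) I_{t}\!\left(\tfrac{u}{s},\tfrac{v}{s}\right) \omega_{\sigma_{t}}\!\left(\tfrac{u}{s},\tfrac{v}{s}\right) du\, dv \tag{8}$$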

From $\omega_{\sigma}(x,y)=a\, e^{-\frac{x^{2}+y^{2}}{\sigma^{2}}}$ and $\omega_{\sigma}(x/s,\, y/s)=a\, e^{-\frac{x^{2}+y^{2}}{(s\sigma)^{2}}}$ we get $\omega_{\sigma}(x/s,\, y/s)=\omega_{s\sigma}(x,y)$, so formula (8) can be further deduced as:
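$$m_{t}(0,0) = \frac{1}{s^{2}} \iint h_{t}^{sc}\!\left(-\tfrac{u}{s},-\tfrac{v}{s}\right) I_{t}\!\left(\tfrac{u}{s},\tfrac{v}{s}\right) \omega_{s\sigma_{t}}(u,v)\, du\, dv \tag{9}$$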

Then, as we move from frame $t$ to frame $t+1$, we match up the correspondingly scaled coordinates, so $h_{t}^{sc}(u/s,\, v/s) \approx h_{t+1}^{sc}(u,v)$ and $I_{t}(u/s,\, v/s) \approx I_{t+1}(u,v)$, and formula (9) becomes:
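$$m_{t}(0,0) \approx \frac{1}{s^{2}} \iint h_{t+1}^{sc}(-u,-v)\, I_{t+1}(u,v)\, \omega_{s\sigma_{t}}(u,v)\, du\, dv \tag{10}$$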

Suppose the target shrinks from frame $t$ to frame $t+1$. Just as in the scaling-up case, we split the integral into two parts: one is the red-box part (the context box size of frame $t+1$), and the other is the blue box (the context box size of frame $t$) minus the red-box part. Expressed as a formula:
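$$m_{t}(0,0) \approx \frac{1}{s^{2}} \left( \iint_{\Omega_{t+1}} h_{t+1}^{sc}(-u,-v)\, I_{t+1}(u,v)\, \omega_{s\sigma_{t}}(u,v)\, du\, dv + \iint_{\Omega_{t}\setminus\Omega_{t+1}} h_{t+1}^{sc}(-u,-v)\, I_{t+1}(u,v)\, \omega_{s\sigma_{t}}(u,v)\, du\, dv \right) \tag{11}$$

where $\Omega_{t+1}$ is the red box (the frame-$t+1$ context region) and $\Omega_{t}\setminus\Omega_{t+1}$ is the blue box minus the red box.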

Because $\omega$ is Gaussian-shaped, the weights in the second (right-hand) part above are very small, so that whole term can be treated as 0; and taking $s\sigma_{t}$ as $\sigma_{t+1}$, the left-hand term is approximately $m_{t+1}(0,0)$ (written $c_{t+1}(0,0)$ in the paper's notation):
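$$m_{t}(0,0) \approx \frac{1}{s^{2}} \iint_{\Omega_{t+1}} h_{t+1}^{sc}(-u,-v)\, I_{t+1}(u,v)\, \omega_{\sigma_{t+1}}(u,v)\, du\, dv = \frac{1}{s^{2}}\, m_{t+1}(0,0) \tag{12}$$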

So we have:
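$$s \approx \sqrt{\frac{m_{t+1}(0,0)}{m_{t}(0,0)}} = \sqrt{\frac{m_{t+1}\!\left(\mathbf{x}_{t+1}^{\star}\right)}{m_{t}\!\left(\mathbf{x}_{t}^{\star}\right)}} \tag{13}$$

That is, the scale change between consecutive frames is the square root of the ratio of their maximum confidence values.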

What remains are a few tricks, such as using a sliding window to average $s$; interested readers can consult the original paper. That's about it for this article. To sum up, what I find most attractive is the probability-theory backing and the subsequent window-size adaptation; as for the context, using other features should further improve the robustness of the algorithm. The source code is on the author's homepage; if you are interested, download it and give it a run, and when you do, try it on videos like the woman sequence ~
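For completeness, the smoothing scheme in the paper, as I read it, is a sliding-window average of the per-frame estimates followed by a damped update ($n$ is the window length and $\lambda$ a fixed learning rate):

$$\bar{s}_{t} = \frac{1}{n}\sum_{i=1}^{n} s_{t-i}, \qquad s_{t+1} = (1-\lambda)\, s_{t} + \lambda\, \bar{s}_{t}, \qquad \sigma_{t+1} = s_{t}\, \sigma_{t}$$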
