From http://www.aichengxu.com/view/2426102
Abstract
Object tracking is one of the most important components in many computer vision applications. Although much progress has been made in recent years in sharing source code and datasets, it remains extremely important to develop a library and benchmark to evaluate the current state-of-the-art tracking algorithms. After a brief review of recent advances in online object tracking, we carry out extensive experiments with various evaluation criteria to study the performance of these algorithms. To facilitate performance evaluation and analysis, the test image sequences are annotated with different attributes. Based on the quantitative results, we identify effective approaches for robust tracking and suggest potential future research directions in the field.
1. Introduction
Object tracking is one of the most important components in many computer vision applications, such as surveillance, human-computer interaction, and medical imaging [60, 12]. Given the initial state of a target (such as its position and size) in one frame of a video, the goal of tracking is to estimate the states of the target in the subsequent frames. Although object tracking has been studied for decades and much progress has been made in recent years [28, 16, 47, 5, 40, 26, 19], it remains a very challenging problem. Numerous factors affect the performance of a tracking algorithm, such as illumination variation, occlusion, and background clutter, and no single tracking method can successfully handle all scenarios. Therefore, it is crucial to evaluate the performance of state-of-the-art trackers to demonstrate their strengths and weaknesses, and to help identify future research directions in this field so that more robust algorithms can be designed.
For a comprehensive performance evaluation, it is critical to collect representative datasets. Several datasets exist for visual tracking in surveillance scenarios, such as the VIVID [14], CAVIAR [21], and PETS datasets. However, the target objects in these surveillance sequences are usually humans or small cars, and the background is usually static. Although a subset of tracking datasets for generic scenes [47, 5, 33] provides bounding-box annotations, most sequences are not well annotated. For sequences without labeled ground truth, it is difficult to evaluate tracking algorithms, since reported results are based on inconsistently annotated target locations.
In recent years, more tracking source code has become publicly available, such as the OAB [22], IVT [47], MIL [5], L1 [40], and TLD [31] algorithms, which are commonly used for evaluation. However, the input and output formats of most trackers differ, which makes large-scale performance evaluation inconvenient. In this work, we build a code library and a test dataset: the code library includes most of the publicly available trackers, and the test dataset is annotated with ground truth to facilitate evaluation. In addition, each sequence in the dataset is annotated with attributes, such as occlusion, fast motion, and illumination variation, that often affect tracking performance.
A common problem in evaluating tracking algorithms is that results are reported on only a few sequences with one fixed set of initial conditions or parameters; consequently, these results do not reflect the overall performance of an algorithm. For fair and comprehensive evaluation, we propose to perturb the ground-truth initial state both spatially and temporally. Although the importance of initialization to robustness is well known in the tracking community, it is rarely addressed in the literature. To the best of our knowledge, this is the first comprehensive work to address and analyze the initialization problem in object tracking. We analyze the performance of each algorithm using precision plots based on a location error metric and success plots based on an overlap metric.
The contributions of this work are threefold:
Dataset. We build a tracking dataset with 50 fully annotated sequences to facilitate tracking-algorithm evaluation.
Code library. We integrate most of the publicly available trackers into a code library with unified input and output formats to facilitate large-scale performance evaluation. The library currently contains 29 tracking algorithms.
Robustness evaluation. The initial bounding boxes are sampled temporally and spatially to evaluate the robustness and characteristics of each tracker. Each tracker is extensively evaluated by analyzing more than 660,000 output bounding boxes.
This work focuses on online single-target tracking [1]. The code library, annotated datasets, and all tracking results are available at http://visualtracking.net.
2. Related Work
In this section, we review recent object tracking algorithms in terms of several main modules: target representation, search mechanism, and model update. In addition, some methods have been proposed that combine several trackers or exploit context information.
Representation scheme. Target representation is a major component of any visual tracker, and many schemes have been proposed [35]. Since the pioneering work of Lucas and Kanade [37, 8], holistic templates (raw intensity values) have been widely used for tracking [25, 39, 2]. Subsequently, subspace-based tracking methods [11, 47] were proposed to better account for appearance changes. In addition, Mei et al. [40] proposed a tracking method based on sparse representation to handle corrupted target appearance, and this line of work has recently been further improved [41, 57, 64, 10, 55, 42]. Beyond templates, many other visual features have been adopted by tracking algorithms, such as color histograms [16], histograms of oriented gradients (HOG) [17, 52], covariance region descriptors [53, 46, 56], and Haar-like features [54, 22]. Recently, discriminative models have been widely used for tracking [15, 4]; they distinguish the target from the background with an online-learned binary classifier. Numerous learning methods have been adapted to the tracking problem, such as SVM [3], structured output SVM [26], ranking SVM [7], boosting [4, 22], semi-boosting [23], and multiple instance boosting [5]. To make trackers more robust to pose variation and partial occlusion, the target can be divided into parts, where each part is described by a descriptor or histogram. In [1], several histograms are used to describe the target within a predefined grid structure. Kwon and Lee [32] propose a method that automatically updates the topology of local patches to handle large pose changes. To better handle appearance variations, several recent approaches integrate multiple representations [62, 51, 33].
Search mechanism. To estimate the state of the target object, both deterministic and stochastic methods have been used. When the tracking problem is posed within an optimization framework and the objective function is differentiable with respect to the motion parameters, gradient descent methods can be used to locate the target efficiently [37, 16, 20, 49]. However, these objective functions are usually nonlinear and contain many local minima. To alleviate this problem, dense sampling methods have been adopted [22, 5, 26] at the cost of a high computational load. On the other hand, stochastic search algorithms such as particle filters [28, 44] have been widely used, since they are relatively insensitive to local minima and computationally efficient [47, 40, 30].
Model update. To account for appearance variations, it is crucial to update the target representation or model. Matthews et al. [39] address the template update problem of the Lucas-Kanade algorithm [37], where the template is updated by combining a fixed reference template extracted from the first frame with the result of the most recent frame. Effective update algorithms have also been proposed based on online mixture models [29], online boosting [22], and incremental subspace update [47]. For discriminative models, the main effort has been to improve the sample collection process so that the online-trained classifier is more robust [23, 5, 31, 26]. Despite much progress, it remains difficult to design an adaptive appearance model that avoids tracking drift.
Context and tracker fusion. Context information is also important for tracking. Some recent tracking algorithms exploit auxiliary objects or the local visual information around the target to assist tracking [59, 24, 18]. Context information is especially helpful when the target is fully occluded or leaves the image region [24]. To improve tracking performance, several tracker fusion methods have recently been proposed. Santner et al. [48] propose a method that combines static, moderately adaptive, and highly adaptive trackers to account for appearance changes. Within a Bayesian framework, some algorithms maintain and select multiple trackers [34] or multiple features [61] to better account for appearance changes.
3. Evaluated Algorithms and Datasets
For a fair evaluation, we test only tracking algorithms whose original source or binary code is publicly available, since every implementation inevitably involves technical details and specific parameter settings [2]. Table 1 lists the evaluated algorithms. We also evaluate the trackers in the VIVID testbed [14], including the mean shift (MS-V), template matching (TM-V), ratio shift (RS-V), and peak difference (PD-V) methods.
| Method | Representation | Search | MU | Code | FPS |
| CPF [44] | L, IH | PF | N | C | 109 |
| LOT [43] | L, color | PF | Y | M | 0.70 |
| IVT [47] | H, PCA, GM | PF | Y | MC | 33.4 |
| ASLA [30] | L, SR, GM | PF | Y | MC | 8.5 |
| SCM [65] | L, SR, GM+DM | PF | Y | MC | 0.51 |
| L1APG [10] | H, SR, GM | PF | Y | MC | 2.0 |
| MTT [64] | H, SR, GM | PF | Y | M | 1.0 |
| VTD [33] | H, SPCA, GM | MCMC | Y | MC-E | 5.7 |
| VTS [34] | L, SPCA, GM | MCMC | Y | MC-E | 5.7 |
| LSK [36] | L, SR, GM | LOS | Y | M-E | 5.5 |
| ORIA [58] | H, T, GM | LOS | Y | M | 9.0 |
| DFT [49] | L, T | LOS | Y | M | 13.2 |
| KMS [16] | H, IH | LOS | N | C | 3,159 |
| SMS [13] | H, IH | LOS | N | C | 19.2 |
| VR-V [15] | H, color | LOS | Y | MC | 109 |
| Frag [1] | L, IH | DS | N | C | 6.3 |
| OAB [22] | H, Haar, DM | DS | Y | C | 22.4 |
| SemiT [23] | H, Haar, DM | DS | Y | C | 11.2 |
| BSBT [50] | H, Haar, DM | DS | Y | C | 7.0 |
| MIL [5] | H, Haar, DM | DS | Y | C | 38.1 |
| CT [63] | H, Haar, DM | DS | Y | MC | 64.4 |
| TLD [31] | L, BP, DM | DS | Y | MC | 28.1 |
| Struck [26] | H, Haar, DM | DS | Y | C | 20.2 |
| CSK [27] | H, T, DM | DS | Y | M | 362 |
| CXT [18] | H, BP, DM | DS | Y | C | 15.3 |
Table 1. Evaluated tracking algorithms (MU: model update, FPS: frames per second). Representation: L: local, H: holistic, T: template, IH: intensity histogram, BP: binary pattern, PCA: principal component analysis, SPCA: sparse PCA, SR: sparse representation, DM: discriminative model, GM: generative model. Search mechanism: PF: particle filter, MCMC: Markov chain Monte Carlo, LOS: local optimum search, DS: dense sampling search. Model update: N: no, Y: yes. Code: M: MATLAB, C: C/C++, MC: mixed MATLAB and C/C++; suffix -E: executable binary code.
In recent years, many benchmark datasets have been developed for various vision problems, such as the Berkeley segmentation dataset [38], the FERET face recognition dataset [45], and the optical flow dataset [9]. There are also datasets for tracking in surveillance scenarios, such as the VIVID [14] and CAVIAR [21] datasets. For generic visual tracking, more sequences have been used in evaluations [47, 5]. However, most of these sequences lack ground-truth annotations, and reported quantitative results may be generated with different initial conditions. To facilitate fair performance evaluation, we collect and annotate most of the commonly used tracking test sequences. Figure 1 shows the first frame of each sequence, in which the target object is initialized with a bounding box.
Figure 1. The tracking sequences used for evaluation. The figure shows the bounding box of the target object in the first frame of each sequence. The sequences are ordered by our ranking results (see the supplementary material): sequences toward the top left are more difficult to track than those toward the bottom right. Note that two targets are annotated in the jogging sequence.
Attributes of the test sequences. Evaluating tracking algorithms is difficult because many factors affect tracking performance. To better evaluate and analyze the strengths and weaknesses of tracking approaches, we categorize the sequences by annotating them with 11 attributes, listed in Table 2.
The attribute distribution of our dataset is shown in Figure 2(a). Some attributes, such as OPR and IPR, occur more frequently than others. The figure also shows that a sequence is often annotated with multiple attributes. In addition to summarizing performance over the whole dataset, we construct subsets corresponding to the attributes to characterize the performance of the algorithms under each specific challenging condition. For example, the OCC subset contains 29 sequences and can be used to analyze how well trackers handle occlusion. The attribute distribution within the OCC subset is shown in Figure 2(b), and the distributions for the other subsets are provided in the supplementary material.
| Attribute | Description |
| IV | Illumination Variation: the illumination in the target region changes significantly. |
| SV | Scale Variation: the ratio of the bounding-box size in the first frame to that in the current frame is outside the range [1/ts, ts], ts > 1 (ts = 2). |
| OCC | Occlusion: the target is partially or fully occluded. |
| DEF | Deformation: non-rigid object deformation. |
| MB | Motion Blur: the target region is blurred due to the motion of the target or the camera. |
| FM | Fast Motion: the motion of the ground truth is larger than tm pixels (tm = 20). |
| IPR | In-Plane Rotation: the target rotates in the image plane. |
| OPR | Out-of-Plane Rotation: the target rotates out of the image plane. |
| OV | Out-of-View: part of the target leaves the view. |
| BC | Background Clutter: the background near the target has similar color or texture to the target. |
| LR | Low Resolution: the number of pixels inside the ground-truth bounding box is less than tr (tr = 400). |
Table 2. List of the attributes annotated for the test sequences. The thresholds used in this work are also listed.
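To make the thresholded attributes concrete, the following is a minimal sketch, not the code used in this benchmark, of how the SV, FM, and LR flags in Table 2 could be derived from ground-truth annotations; the (x, y, w, h) box format, the use of box area as the size measure, and the helper name annotate_attributes are assumptions.

```python
import numpy as np

def annotate_attributes(gt_boxes, ts=2.0, tm=20.0, tr=400.0):
    """Flag the thresholded attributes of Table 2 for one sequence.

    gt_boxes: array of ground-truth (x, y, w, h) boxes, one row per frame.
    Thresholds ts, tm, tr follow Table 2; the size measure (area) is assumed.
    """
    areas = gt_boxes[:, 2] * gt_boxes[:, 3]
    ratio = areas / areas[0]                      # size relative to the first frame
    centers = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2.0
    motion = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    return {
        # SV: size ratio falls outside [1/ts, ts] in some frame
        "SV": bool(((ratio > ts) | (ratio < 1.0 / ts)).any()),
        # FM: ground-truth motion between consecutive frames exceeds tm pixels
        "FM": bool((motion > tm).any()),
        # LR: the ground-truth box contains fewer than tr pixels
        "LR": bool((areas < tr).any()),
    }
```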
Figure 2. (a) Attribute distribution of the entire test dataset, and (b) attribute distribution of the sequences in the occlusion (OCC) subset.
4. Evaluation Method
In this work, we use precision and success rate for quantitative analysis. In addition, we evaluate the robustness of the tracking algorithms in two aspects.
Precision plot. A widely used metric for tracking precision is the center location error, defined as the average Euclidean distance between the center of the tracked target and the manually labeled ground-truth center. The average center location error over all frames of a sequence is used to summarize the overall performance on that sequence. However, when a tracker loses the target, its output location can be random, so the average error may not measure tracking performance correctly [6]. Recently, the precision plot [6, 27] has been adopted to measure overall tracking performance. It shows the percentage of frames whose estimated location is within a given threshold distance of the ground truth. As a representative precision score for each tracker, we use the score at a threshold of 20 pixels [6].
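As a minimal sketch of the precision plot described above, not the benchmark's own implementation, the center location error and the 20-pixel precision score could be computed as follows; the (x, y, w, h) box format and the function names are assumptions.

```python
import numpy as np

def center_location_error(pred_boxes, gt_boxes):
    """Euclidean distance between predicted and ground-truth box centers per frame.

    Both inputs are arrays of (x, y, w, h) rows, one per frame.
    """
    pred_c = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2.0
    gt_c = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2.0
    return np.linalg.norm(pred_c - gt_c, axis=1)

def precision_curve(errors, thresholds=np.arange(0, 51)):
    """Fraction of frames whose center error is within each distance threshold."""
    return np.array([(errors <= t).mean() for t in thresholds])

# Representative precision score at the 20-pixel threshold used in the text:
# errors = center_location_error(pred_boxes, gt_boxes)
# precision_at_20 = (errors <= 20).mean()
```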
Success plot. Another evaluation metric is the bounding-box overlap. Given the tracked bounding box γt and the ground-truth bounding box γa, the overlap score is defined as S = |γt ∩ γa| / |γt ∪ γa|, where ∩ and ∪ denote the intersection and union of the two regions, respectively, and |·| denotes the number of pixels in a region. To measure performance over a sequence of frames, we count the number of successful frames whose overlap S is larger than a given threshold to. The success plot shows the ratio of successful frames as this threshold varies from 0 to 1. Using the success rate at one specific threshold (e.g., to = 0.5) may not be fair or representative; instead, we use the area under the curve (AUC) of each success plot to rank the tracking algorithms.
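The overlap score, success plot, and AUC ranking described above can be sketched as follows; this is an illustrative implementation under the assumption of axis-aligned (x, y, w, h) boxes, not the benchmark's own code.

```python
import numpy as np

def overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x, y, w, h) boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_curve(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 101)):
    """Ratio of successful frames (overlap > t_o) as t_o varies from 0 to 1."""
    scores = np.array([overlap(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return np.array([(scores > t).mean() for t in thresholds])

def auc_score(success):
    """Area under the success curve; with uniformly spaced thresholds the mean
    of the curve approximates the AUC used to rank trackers."""
    return float(success.mean())
```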
Robustness evaluation. The conventional way to evaluate a tracker is to initialize it with the ground-truth location in the first frame, run it through a test sequence, and report the average precision or success rate. We refer to this as one-pass evaluation (OPE). However, a tracker may be sensitive to initialization, and different initializations at different start frames can make its performance better or worse. We therefore propose two ways to evaluate a tracker's robustness to initialization: perturbing the initialization temporally (i.e., starting at different frames) and spatially (i.e., starting with different bounding boxes). These tests are referred to as temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE), respectively.
The proposed test scenarios occur in many real-world applications, since a tracker is often initialized by an object detector that may introduce errors in location or scale; a detector may also be used to re-initialize the tracker at different time instants. By investigating a tracker's behavior under these robustness evaluations, we can gain a deeper understanding of the tracking algorithms.
Temporal robustness evaluation. Given one start frame together with the ground-truth bounding box of the target, the tracker is initialized at that frame and run to the end of the sequence, i.e., over one segment of the entire sequence. The tracker is evaluated on each segment, and the overall statistics are aggregated.
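As a hedged illustration of how the TRE start frames might be generated (20 segments per sequence, as noted in Section 5, each running to the last frame), consider the sketch below; the helper name and the even spacing of start frames are assumptions.

```python
def tre_segments(num_frames, num_segments=20):
    """Start/end frame pairs for temporal robustness evaluation (TRE).

    Each segment starts at a different frame and runs to the end of the
    sequence, so earlier start frames yield longer segments.
    """
    step = max(1, num_frames // num_segments)
    starts = range(0, num_frames, step)
    return [(s, num_frames - 1) for s in starts][:num_segments]

# Example: a 400-frame sequence gives 20 segments starting at frames 0, 20, 40, ...
```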
Spatial robustness evaluation. We sample the initial bounding box by shifting or scaling the ground-truth box in the first frame. We use 8 spatial shifts, including 4 center shifts and 4 corner shifts, and 4 scale variations (see the supplementary material). The shift amount is 10% of the target size, and the scale factors are 0.8, 0.9, 1.1, and 1.2 of the ground-truth size. Thus, each tracker is evaluated 12 times for SRE.
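The 12 SRE initializations described above (8 shifts of 10% of the target size plus 4 scale factors) could be generated roughly as follows; this is a sketch under the assumption of (x, y, w, h) boxes and scale changes centered on the ground-truth center, and the function name is hypothetical.

```python
def sre_initializations(gt_box, shift_ratio=0.1, scales=(0.8, 0.9, 1.1, 1.2)):
    """Perturbed first-frame boxes for spatial robustness evaluation (SRE).

    gt_box: ground-truth (x, y, w, h) box in the first frame.
    Returns 8 shifted boxes (4 center shifts, 4 corner shifts) and 4 scaled boxes.
    """
    x, y, w, h = gt_box
    dx, dy = shift_ratio * w, shift_ratio * h
    center_shifts = [(dx, 0), (-dx, 0), (0, dy), (0, -dy)]
    corner_shifts = [(dx, dy), (dx, -dy), (-dx, dy), (-dx, -dy)]
    boxes = [(x + sx, y + sy, w, h) for sx, sy in center_shifts + corner_shifts]
    cx, cy = x + w / 2.0, y + h / 2.0  # scale changes keep the box centered
    boxes += [(cx - s * w / 2.0, cy - s * h / 2.0, s * w, s * h) for s in scales]
    return boxes  # 12 initializations in total
```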
5. Evaluation Results
For each tracker, the default parameters of its source code are used in all evaluations. Table 1 lists the OPE running speed in FPS of each tracker on a PC with an Intel i7 3770 CPU (3.4 GHz). More detailed speed statistics, such as the maximum and minimum, are available in the supplementary material.
For OPE, each tracker is tested on more than 29,000 frames. For SRE, each tracker is evaluated 12 times on each sequence, producing more than 350,000 bounding-box results. For TRE, each sequence is partitioned into 20 segments, so each tracker is tested on about 310,000 frames. To the best of our knowledge, this is the largest performance evaluation of visual tracking to date. In this paper, we report the most important findings; more details and plots are provided in the supplementary material.
5.1. Overall Performance
The overall performance of each tracker is shown in Figure 3 as success and precision plots, where only the top 10 algorithms are listed; clear and complete plots are given in the supplementary material. For the success plots, we use the AUC to summarize and rank the trackers, while for the precision plots we rank by the precision at a threshold of 20 pixels. In the precision plots, the rankings of some trackers differ slightly from those in the success plots, because the two metrics measure different characteristics of a tracker. Since the AUC score of the success plot measures overall performance, it is more accurate than a score at a single threshold; we therefore mainly analyze rankings based on the success plots and use the precision plots as a complement.
The average TRE performance is higher than that of OPE because the number of frames decreases from the first to the last TRE segment, and trackers tend to perform better on shorter sequences, so the average over all TRE results tends to be higher. On the other hand, the average SRE performance is lower than that of OPE: initialization errors cause a tracker to update its model with imprecise appearance information, leading to gradual drift of the tracking box.
Figure 3. Plots of OPE, SRE, and TRE. The performance score of each tracker is shown in the legend. For each plot, only the top 10 trackers are presented for clarity; complete plots are provided in the supplementary material (best viewed at high resolution).
In the success plots, the top-ranked tracker SCM outperforms Struck by 2.6% in OPE but is 1.9% below Struck in SRE. The results also show that OPE is not the best performance indicator, since OPE is only one trial of SRE or TRE. TLD ranks lower in TRE than in OPE and SRE. This is because TLD contains a re-detection module and performs well on long sequences, whereas TRE consists of a large number of short segments. The success plots of Struck in TRE and SRE show that when the overlap threshold is small, the success rate of Struck is higher than those of SCM and ASLA, but when the overlap threshold is large, Struck has a lower success rate than SCM and ASLA. This is because Struck only estimates the target location and does not handle scale variation.
Sparse representations are used in SCM, ASLA, LSK, MTT, and L1APG. These trackers perform well in SRE and TRE, which suggests that sparse representations are effective models for handling appearance changes such as occlusion. We note that SCM, ASLA, and LSK outperform MTT and L1APG, indicating that local sparse representations are more effective than holistic sparse templates. The AUC score of ASLA decreases less than those of the other top five trackers from OPE to SRE, and its ranking rises, which suggests that the alignment-pooling technique used by ASLA is more robust to misalignment and background clutter.
Among the top 10 trackers, CSK has the highest speed, where the proposed circulant structure plays a key role. The VTD and VTS methods adopt mixture models to improve tracking performance. Compared with other higher-ranked trackers, their performance bottleneck can be attributed to the use of holistic templates in the sparse principal component analysis representation. Due to space limitations, the following sections present only SRE plots; further results are included in the supplementary material.
5.2. Attribute-based Performance Analysis
By annotating the attributes of each sequence, we construct subsets with different dominant attributes to analyze the performance of trackers under each challenging factor. Due to space limitations, we only illustrate and analyze the SRE success and precision plots for the OCC, SV, and FM attributes, shown in Figure 4; more results are provided in the supplementary material.
Figure 4. SRE plots for the OCC, SV, and FM subsets. The number in the title of each plot is the number of sequences in that subset. Only the top 10 trackers are presented for clarity; complete plots are provided in the supplementary material (best viewed at high resolution).
When the target moves fast, trackers based on dense sampling (e.g., Struck, TLD, and CXT) perform much better than the others. One reason is that their search ranges are large and their discriminative models are able to distinguish the target from the cluttered background. However, trackers with high overall performance that rely on stochastic search (e.g., SCM and ASLA) do not perform well on this subset due to their simple dynamic models. If the search parameters were set to larger values, many more particles would need to be sampled to maintain stable performance. These trackers could be further improved with more effective particle filters and dynamic models.
On the OCC subset, Struck, SCM, TLD, LSK, and ASLA outperform the other trackers. The results suggest that structured learning and local sparse representations are effective for handling occlusion. On the SV subset, ASLA, SCM, and Struck perform best. The results show that trackers with affine motion models (e.g., ASLA and SCM) generally handle scale variation better than those that account only for translational motion, such as Struck.
5.3. Initialization with Different Scales
Trackers are often sensitive to initialization variations. Figures 5 and 6 show the overall performance of trackers initialized at different scales. When computing the overlap score, we rescale the tracking result boxes so that the overall performance can be compared with the performance at the original scale, i.e., the OPE plot in Figure 3. Figure 6 summarizes the average performance of all trackers at each scale; when the scale factor is large (e.g., 1.2), tracker performance usually degrades significantly, because the initial representation of the target inevitably includes many background pixels. As the initialization scale increases, the performance of TLD, CXT, DFT, and LOT decreases, indicating that these trackers are more sensitive to background clutter. Some trackers perform better when the scale factor is small, such as L1APG, MTT, LOT, and CPF. One reason for L1APG and MTT is that the target template has to be resized to fit the size of the canonical template, which is usually small, so more appearance detail is preserved when the initial template is small. On the other hand, some trackers perform well or even better when the initial bounding box is enlarged, such as Struck, OAB, SemiT, and BSBT. This indicates that Haar-like features are somewhat robust to cluttered backgrounds due to the summation operation used when computing the features. Overall, Struck is less sensitive to scale variation than the other well-performing methods.
Figure 5. SRE of the trackers initialized with bounding boxes of different sizes. The value above each plot is the scale factor. For each plot, only the top 10 trackers are presented; complete plots are provided in the supplementary material.
Figure 6. Performance summary of the trackers initialized with bounding boxes of different sizes. AVG (the last one) shows the average performance of all trackers at each scale.
6. Conclusion
In this paper, we evaluate the performance of recent online tracking algorithms through large-scale experiments. Based on our evaluation results and observations, we highlight several components that are essential for improving tracking performance. First, background information is critical for effective tracking; it can be exploited implicitly by using advanced learning techniques in discriminative models (e.g., Struck) or explicitly as tracking context (e.g., CXT). Second, local models are important for tracking: local sparse representations (e.g., ASLA and SCM) improve performance over holistic sparse representations (e.g., MTT and L1APG), and they are especially useful when the target appearance changes partially, such as under partial occlusion or deformation. Third, motion or dynamic models are also important for object tracking, especially when the target motion is large or abrupt, yet most of the evaluated trackers do not address this component. Location prediction based on a dynamic model can reduce the search range and thus improve tracking efficiency and robustness. Improvements in these components will further advance the state of the art of online object tracking.
The evaluation results show that significant progress has been made in object tracking over the past decade. We propose and demonstrate evaluation metrics for in-depth analysis of tracking algorithms from several perspectives. This large-scale performance evaluation facilitates a better understanding of state-of-the-art online object tracking approaches and provides a platform for evaluating new algorithms. Our ongoing work focuses on extending the dataset and code library to include more fully annotated sequences and trackers.
Online Object Tracking: A Benchmark (Translation)