Abstract
This paper investigates long-term tracking of an unknown target in a video stream. The target is defined by its location and extent in a single (first) frame. In every frame that follows, the task is to determine the target's location and extent or to indicate that the target is not present. We propose a novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection. The tracker follows the target from frame to frame. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary. The learning component estimates the detector's errors and updates it to avoid these errors in the future. We study how to identify the detector's errors and learn from them. We develop a novel learning method (P-N learning) that estimates the errors using a pair of "experts": 1) P-experts estimate missed detections, and 2) N-experts estimate false alarms. The learning process is modeled as a discrete dynamical system, and the conditions under which learning guarantees an improvement are found. We describe a real-time implementation of the TLD framework and P-N learning. Extensive quantitative evaluation shows a significant improvement over state-of-the-art approaches.
1 Introduction
Consider a video stream taken by a handheld camera depicting various objects that move in and out of the camera's field of view. Given a bounding box defining the object of interest in a single frame, our goal is to automatically determine the object's bounding box, or to indicate that the object is not visible, in every frame that follows. The video stream should be processed at frame rate, and the process should run indefinitely. We refer to this task as long-term tracking; it is illustrated in Figure 1.
Figure 1. Given a bounding box defining the target's location and extent in the initial frame (left), our system tracks, learns, and detects the target in real time. A red dot indicates that the target is not visible.
To achieve long-term tracking, a number of problems must be addressed. The key problem is detecting the target when it reappears in the camera's field of view. The problem is aggravated by the fact that the target's appearance may change, making the appearance observed in the initial frame irrelevant. In addition, a successful long-term tracker should handle scale and illumination changes, background clutter, and partial occlusion, and it should operate in real time.
Long-term tracking can be approached either from a tracking or from a detection perspective. Tracking algorithms estimate the target motion between consecutive frames. Trackers require only initialization, are fast, and produce smooth trajectories. On the other hand, they accumulate errors at run time (drift), and they typically fail when the target leaves the camera's field of view. Research in tracking aims at developing increasingly robust trackers that track for longer periods of time; tracking failures are not addressed directly. Detection-based algorithms estimate the target location independently in every frame. Detectors do not drift and do not fail when the target disappears from the camera's field of view. However, they require an offline training stage and therefore cannot be applied to unknown targets.
The starting point of our research is the observation that neither tracking nor detection alone can solve the long-term tracking task. However, if they operate simultaneously, each has the potential to benefit from the other. The tracker can provide weakly labeled training data for the detector and thereby improve it. The detector can reinitialize the tracker and thereby minimize tracking failures.
The first contribution of this paper is the design of a novel framework (TLD) that decomposes the long-term tracking task into three subtasks: tracking, learning, and detection. Each subtask is handled by a separate component, and the components run in parallel. The tracker follows the target from frame to frame. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary. The learning component estimates the detector's errors and updates the detector to avoid these errors in the future.
Despite the existence of a wide variety of trackers and detectors, we are not aware of any learning method suitable for the TLD framework. Such a learning method should: 1) cope with arbitrarily complex video streams in which tracking failures are frequent, 2) never degrade the detector if the video does not contain relevant information, and 3) operate in real time.
To address these challenges, we rely on the various sources of information contained in the video. Consider, for instance, a single image patch that denotes the target location in a single frame. This patch defines the appearance of the target, while the surrounding patches define the appearance of the background. When tracking this patch, one can discover different appearances of the same target as well as additional appearances of the background. This is in contrast to standard machine learning approaches, where a single example is considered independent of the other examples [1]. This opens interesting questions about how the information in a video can be exploited effectively for learning.
The second contribution of this paper is a new learning paradigm called P-N learning. The detector is evaluated on every frame of the video. Its responses are analyzed by two types of experts: 1) the P-expert identifies missed detections, and 2) the N-expert identifies false alarms. The estimated errors augment the detector's training set, and the detector is retrained to avoid these errors in the future. As any other process, the P-N experts themselves make errors. However, if the probability of an expert's error is within certain limits (which will be quantified by our analysis), the errors mutually compensate, leading to stable learning.
The third contribution of this paper is the implementation. We show how to build a real-time long-term tracking system based on the TLD framework and P-N learning. The system tracks, learns, and detects the target in a video stream in real time.
The fourth contribution is an extensive evaluation against state-of-the-art methods on benchmark datasets, on which our approach achieved saturated performance. We therefore collected and annotated new, more challenging datasets and demonstrate a significant improvement over state-of-the-art methods on them.
The rest of the paper is organized as follows: Section 2 reviews work related to long-term tracking. Section 3 introduces the TLD framework. Section 4 proposes the P-N learning algorithm. Section 5 describes the implementation of the TLD algorithm. Section 6 presents comparative experiments. The paper concludes with a summary of contributions and suggestions for future research.
3 Tracking-Learning-Detection
TLD is a framework designed for long-term tracking of an unknown target in a video stream. Its block diagram is shown in Figure 2. The components of the framework are characterized as follows. The tracker estimates the target's motion between consecutive frames under the assumption that the frame-to-frame motion is limited and the target is visible. The tracker is likely to fail, and never recover, if the target moves out of the camera's field of view. The detector treats every frame independently and performs a full scan of the image to localize all appearances that have been observed and learned in the past. As any other detector, the TLD detector makes two types of errors: false positives and false negatives. The learning component observes the performance of both the tracker and the detector, estimates the detector's errors, and generates training examples to avoid these errors in the future. The learning component assumes that both the tracker and the detector can fail. Thanks to the learning component, the detector generalizes to more target appearances and discriminates the target against the background.
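To make the interplay of the three components concrete, the sketch below outlines one possible per-frame loop. It is a minimal illustration only: the class names (Tracker, Detector, Learner), the integrate step, and all method signatures are assumptions made for this sketch, not the authors' implementation.

```python
# Illustrative sketch of a TLD-style per-frame loop (not the authors' code).
# Tracker, Detector, and Learner are hypothetical components with the roles
# described in the text: frame-to-frame tracking, full-frame detection,
# and error-driven updating of the detector.

def tld_loop(frames, init_bbox, tracker, detector, learner):
    """Process a video stream; yield a bounding box or None for each frame."""
    prev_frame = frames[0]
    bbox = init_bbox                      # target defined in the first frame
    for frame in frames[1:]:
        tracked = tracker.track(prev_frame, frame, bbox)   # may fail -> None
        detections = detector.detect(frame)                # scans whole frame

        # Integration: fall back on the detector when the tracker has failed,
        # otherwise keep the tracked bounding box.
        bbox = integrate(tracked, detections)

        if bbox is not None:
            # Learning: use tracker/detector agreement and disagreement to
            # estimate detector errors and generate new training examples.
            learner.update(frame, bbox, tracked, detections, detector)

        yield bbox                        # None means "target not visible"
        prev_frame = frame


def integrate(tracked, detections):
    """Naive integration rule: trust a detection only if the tracker failed."""
    if tracked is not None:
        return tracked
    return detections[0] if detections else None
```

In the actual system the integration step is more involved (it compares confidences and can reinitialize the tracker from a detection); the sketch only shows the data flow between the three components.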
4 P-N Learning
This section studies the learning component of the TLD framework. The goal of the component is to improve the performance of the target detector by online processing of a video stream. In every frame, we wish to evaluate the current detector, identify its errors, and update the detector so that these errors are not repeated in the future. The key idea of P-N learning is that the detector's errors can be identified by two types of "experts": the P-expert identifies only false negatives, and the N-expert identifies only false positives. Both experts make errors themselves; however, their independence enables mutual compensation of their errors.
Section 4.1 formulates P-N learning as a semi-supervised learning method. Section 4.2 models P-N learning as a discrete dynamical system and finds the conditions under which learning improves the detector's performance. Section 4.3 performs a number of experiments with synthetically generated experts. Finally, Section 4.4 applies P-N learning to training an object detector and proposes experts that can be used in practice.
4.1 Formulation
Let x denote an example from a feature space X and y a label from a label space Y. A set of examples X is called an unlabeled set, Y is called a set of labels, and L = {(x, y)} is called a labeled set. The input to P-N learning is a labeled set L_l and an unlabeled set X_u, where l << u. The task of P-N learning is to learn a classifier f: X -> Y from the labeled set L_l and to bootstrap its performance using the unlabeled set X_u. The classifier f is a function from a family F parameterized by θ. The family F is determined by the implementation and is fixed during training; training therefore corresponds to estimating the parameter θ.
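As a concrete, simplified instance of this formulation, the sketch below uses a 1-nearest-neighbor classifier as the function f learned from the small labeled set L_l; for such a classifier the "parameter θ" is simply the set of stored labeled examples, and retraining amounts to extending that set. The class and method names are illustrative assumptions, not part of the paper.

```python
# Minimal illustration of the formulation: a classifier f: X -> Y learned
# from a small labeled set L_l = {(x, y)}. For a nearest-neighbor classifier
# the "parameter theta" is simply the stored set of labeled examples.
import numpy as np

class NearestNeighborClassifier:
    def __init__(self, labeled_set):
        # labeled_set: list of (feature_vector, label) pairs, label e.g. -1 or +1
        self.examples = [np.asarray(x, dtype=float) for x, _ in labeled_set]
        self.labels = [y for _, y in labeled_set]

    def predict(self, x):
        # Assign the label of the closest stored example.
        x = np.asarray(x, dtype=float)
        dists = [np.linalg.norm(x - e) for e in self.examples]
        return self.labels[int(np.argmin(dists))]

    def add_examples(self, new_labeled):
        # "Retraining" here just extends the stored example set.
        for x, y in new_labeled:
            self.examples.append(np.asarray(x, dtype=float))
            self.labels.append(y)
```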
P-N learning consists of the following four blocks (see Figure 3):
1) the classifier to be learned,
2) the training set, a collection of labeled training examples,
3) supervised training, a method that trains the classifier from the training set, and
4) the P-N experts, functions that generate positive and negative training examples during learning.
The training process is initialized by inserting the labeled set L into the training set. The training set is then passed to supervised learning, which trains the classifier, i.e., estimates the initial parameter θ. The learning process then proceeds by iterative bootstrapping. In iteration k, the classifier trained in the previous iteration classifies the entire unlabeled set. The classified examples are analyzed by the P-N experts, which estimate the examples that have been misclassified. These examples are added to the training set with changed labels. The iteration finishes by retraining the classifier. The process continues until convergence or until some other stopping criterion is met.
The crucial element of P-N learning is the estimation of the classifier's errors. The key idea is to separate the estimation of false negatives from the estimation of false positives. For this reason, the unlabeled set is split into two parts based on the classifier's decision, and each part is analyzed by an independent expert. The P-expert analyzes the examples classified as negative, estimates the false negatives, and adds them to the training set with a positive label; in iteration k, the P-expert outputs positive examples. The N-expert analyzes the examples classified as positive, estimates the false positives, and adds them to the training set with a negative label; in iteration k, the N-expert outputs negative examples. The P-expert increases the classifier's generality, while the N-expert increases the classifier's discriminability.
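The following sketch condenses one bootstrapping iteration as described above: classify the unlabeled set, let the P-expert relabel suspected false negatives and the N-expert relabel suspected false positives, then retrain. The functions p_expert and n_expert stand in for the practical experts of Section 4.4 and, like the classifier interface, are assumptions made for this illustration.

```python
# Sketch of one P-N learning iteration (illustrative, not the authors' code).
# classifier: any object with predict(x) and add_examples([(x, y), ...]);
# p_expert / n_expert: functions that inspect the given examples and return
# those they estimate to be misclassified (false negatives / false positives).

def pn_iteration(classifier, unlabeled_set, p_expert, n_expert):
    # 1. Classify the whole unlabeled set with the current classifier.
    predicted_neg = [x for x in unlabeled_set if classifier.predict(x) < 0]
    predicted_pos = [x for x in unlabeled_set if classifier.predict(x) > 0]

    # 2. P-expert: among examples classified as negative, estimate the
    #    false negatives and relabel them as positive.
    new_positives = [(x, +1) for x in p_expert(predicted_neg)]

    # 3. N-expert: among examples classified as positive, estimate the
    #    false positives and relabel them as negative.
    new_negatives = [(x, -1) for x in n_expert(predicted_pos)]

    # 4. Augment the training set and retrain (for the NN classifier above,
    #    retraining simply stores the new examples).
    classifier.add_examples(new_positives + new_negatives)
    return len(new_positives), len(new_negatives)
```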
Relation to supervised bootstrap. To put P-N learning into a broader context, consider a test set X whose labels are known. Under such an assumption, it is straightforward to identify the misclassified examples, correct their labels, and add them to the training set. Such a strategy is commonly called (supervised) bootstrap. A classifier trained with supervised bootstrap focuses on the decision boundary and often outperforms a classifier trained on randomly sampled examples. P-N learning can be viewed as a generalization of standard bootstrap to the unlabeled case, where the labels are not known but are estimated by the P-N experts. As any other process, the P-N experts also make errors when estimating the labels, and such errors propagate through the training. Their effect is analyzed theoretically in the following subsection.
4.2 Stability
This section analyzes the impact of P-N learning on the classifier's performance. We consider an abstract classifier (a nearest-neighbor (NN) classifier), whose performance is measured on the set X_u. In each iteration, the classifier labels the unlabeled examples, the P-N experts estimate the misclassified examples, and these examples are fed back to the training set with corrected labels. For the purpose of the analysis, we consider the case where the true labels of X_u are known.
7 Conclusion
In this paper, we studied the problem of tracking an unknown target in a video stream, where the target frequently changes appearance and moves in and out of the camera's field of view. We designed a new framework that decomposes the task into three components: tracking, learning, and detection. The learning component was analyzed in detail. We showed that, given a single example and an unlabeled video stream, the target detector can be trained by the following strategy: 1) evaluate the detector, 2) estimate its errors with a pair of experts, and 3) update the classifier. Each expert is specialized in identifying a particular type of classifier error, while being allowed to make errors itself. Stability of the learning is achieved by designing the experts so that their errors mutually compensate. The theoretical contribution is the formulation of this process as a discrete dynamical system, which allowed us to specify conditions under which the learning process is guaranteed to improve the classifier's performance. We showed that such experts can exploit the temporal and spatial structure of the video. The paper described a real-time implementation of the framework in detail and reported a wide range of experiments. Compared with related tracking algorithms, the superiority of our method was clearly demonstrated. The implementation of the algorithm and the TLD dataset are available online.
8 Limitations and future work
A number of challenges remain in order to obtain a more reliable and more general tracking system based on TLD. For example, TLD does not perform well under full out-of-plane rotation. In such a case, the Median-Flow tracker can only be reinitialized after the target reappears with a previously observed/learned appearance. In the current implementation, only the detector is trained while the tracker remains fixed; as a result, the tracker keeps making the same mistakes. Training the tracker would therefore be an interesting extension. Multi-target tracking is another interesting direction, raising questions of how to combine the training of the models and share features so that the approach scales. The current version of TLD also does not perform well for articulated targets, such as pedestrians. For restricted scenarios, such as static cameras, an interesting extension of TLD would be to incorporate background subtraction to improve tracking performance.
Tracking-Learning-Detection (TLD): partial translation of the classic paper.