The Principle of the TLD Algorithm -- Study Notes (II)

Source: Internet
Author: User
Tags: tld

As the name implies, the TLD algorithm consists of three modules: a tracker (Tracking), a detector (Detection), and a machine-learning module (Learning).

For video tracking there are two commonly used approaches. One is to use a tracker that predicts the object's position in the current frame from its position in the previous frame; however, errors accumulate over time, and once the object disappears from the image the tracker fails permanently, even if the object later reappears. The other is to use a detector that finds the object's position independently in every frame; however, the detector must be trained offline in advance, so it can only track objects that are known beforehand.

TLD is an algorithm for long-term tracking of an unknown object in a video. An "unknown object" is any object that is not known before tracking starts: the target is specified only at initialization. "Long-term tracking" implies that the algorithm must run in real time, that the object may disappear and reappear during tracking, and that changes in lighting and background, as well as occasional partial occlusion, may greatly change the object's appearance in the image. From these requirements it follows that neither a tracker nor a detector alone can do the job, so the author proposes to combine the two and to add machine learning to improve the accuracy of the results.


  The tracker follows the object's motion between successive frames; it works as long as the object stays visible. It estimates the object's position in the current frame from its known position in the previous frame, which yields a trajectory of the object's motion, and positive samples are generated along this trajectory for the learning module (tracking->learning).
The detector estimates the tracker's error and corrects the tracker's result when that error is large. The detector scans each frame exhaustively, finds every location whose appearance is similar to the target object, and produces positive and negative samples from the detection results for the learning module (detection->learning). The algorithm selects the single most credible location among all positive samples as the TLD output for the frame, and this result is then used to re-initialize the tracker (detection->tracking).
The learning module iteratively trains the classifier with the positive and negative samples produced by the tracker and the detector, improving the detector's accuracy (learning->detection).

---------------------------------------

    • Tracking module

TLD uses the Median-Flow tracking algorithm proposed by the same authors.
The authors assume that a "good" tracking algorithm should have forward-backward consistency: the trajectory should be the same whether the points are tracked forward in time or backward in time. Based on this property, the authors define the forward-backward (FB) error of any tracker: starting from the initial position x(t) at time t, track forward to the position x(t+p) at time t+p, then track backward from x(t+p) to obtain the predicted position x'(t) at time t; the Euclidean distance between the initial position and the predicted position is taken as the FB error of the tracker at time t.
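As a concrete illustration, the FB error can be computed with two passes of a pyramidal Lucas-Kanade tracker. The sketch below uses OpenCV's `cv2.calcOpticalFlowPyrLK`; the function and variable names are illustrative, not taken from the original TLD code.

```python
import cv2
import numpy as np

def fb_error(prev_gray, next_gray, points):
    """Track points forward (t -> t+p) and then backward (t+p -> t);
    the FB error of each point is the Euclidean distance between its
    initial position and its back-tracked position."""
    pts = np.float32(points).reshape(-1, 1, 2)
    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)  # forward pass
    bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(next_gray, prev_gray, fwd, None)  # backward pass
    err = np.linalg.norm(pts - bwd, axis=2).ravel()                           # FB error per point
    valid = (st_f.ravel() == 1) & (st_b.ravel() == 1)
    return fwd.reshape(-1, 2), err, valid
```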


The Median-Flow tracker uses the Lucas-Kanade tracker, commonly known as the optical-flow tracker. Its principle is not explained here; it is enough to know that, given a set of tracking points, the tracker determines the positions of those points in the next frame from the motion of the pixels.

Selection of tracking points:
The authors also give a method for selecting the best tracking points based on a map of the FB error, but since it is not suitable for real-time tracking it is not described in detail here. Only the method used by TLD to determine the tracking points is described.
First, a number of points are generated on a uniform grid inside the object's bounding box in frame t. The Lucas-Kanade tracker then tracks these points forward to frame t+1 and backward to frame t to compute the FB error, and the half of the points with the smallest FB error are kept as the best tracking points. Finally, the position and size of the bounding box in frame t+1 are computed from the changes in the coordinates and pairwise distances of these points: the translation is the median of the point displacements, and the scaling factor is the median ratio of the pairwise distances. Taking the median of the optical-flow results is presumably also the origin of the name Median-Flow.
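A minimal sketch of this point-selection and box-update step is given below (again with OpenCV). The grid size, helper names, and the fallback handling are assumptions for illustration, not the original implementation; it assumes at least a few points are tracked successfully.

```python
import cv2
import numpy as np

def median_flow_step(prev_gray, next_gray, box, grid=10):
    """One Median-Flow step (illustrative sketch): spread a grid of points
    inside the box, track them with LK, keep the half with the smallest
    FB error, and move/scale the box by the median displacement/scale."""
    x, y, w, h = box
    xs = np.linspace(x, x + w, grid)
    ys = np.linspace(y, y + h, grid)
    pts = np.float32([[px, py] for py in ys for px in xs]).reshape(-1, 1, 2)

    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(next_gray, prev_gray, fwd, None)
    fb = np.linalg.norm(pts - bwd, axis=2).ravel()
    ok = (st_f.ravel() == 1) & (st_b.ravel() == 1)

    # keep the points whose FB error is below the median (the "best half")
    keep = ok & (fb <= np.median(fb[ok]))
    p0 = pts.reshape(-1, 2)[keep]
    p1 = fwd.reshape(-1, 2)[keep]

    # translation: median of the per-point displacements
    dx = np.median(p1[:, 0] - p0[:, 0])
    dy = np.median(p1[:, 1] - p0[:, 1])

    # scale: median ratio of pairwise point distances before and after
    d0 = np.linalg.norm(p0[:, None, :] - p0[None, :, :], axis=2)
    d1 = np.linalg.norm(p1[:, None, :] - p1[None, :, :], axis=2)
    iu = np.triu_indices(len(p0), k=1)
    scale = np.median(d1[iu] / np.maximum(d0[iu], 1e-6))

    new_w, new_h = w * scale, h * scale
    new_x = x + dx - (new_w - w) / 2
    new_y = y + dy - (new_h - h) / 2
    return (new_x, new_y, new_w, new_h)
```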

NCC (normalized cross-correlation) and SSD (sum of squared differences) can also be used as measures for filtering the tracking points. In the author's code, the FB error is combined with NCC, so the number of remaining tracking points is less than half the original.

NCC (normalized cross-correlation) between two patches $P_1$ and $P_2$, where $\mu_1$ and $\mu_2$ are their mean gray values:

$$\mathrm{NCC}(P_1,P_2)=\frac{\sum_{x}\big(P_1(x)-\mu_1\big)\big(P_2(x)-\mu_2\big)}{\sqrt{\sum_{x}\big(P_1(x)-\mu_1\big)^2\,\sum_{x}\big(P_2(x)-\mu_2\big)^2}}$$

SSD (sum of squared differences):

$$\mathrm{SSD}(P_1,P_2)=\sum_{x}\big(P_1(x)-P_2(x)\big)^2$$
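A small sketch of both measures on equally sized gray patches (NumPy, using the zero-mean NCC as in the formula above; names are illustrative):

```python
import numpy as np

def ncc(p1, p2):
    """Normalized cross-correlation between two equally sized patches
    (zero-mean form, result in [-1, 1])."""
    a = p1.astype(np.float64) - p1.mean()
    b = p2.astype(np.float64) - p2.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12))

def ssd(p1, p2):
    """Sum of squared differences between two equally sized patches."""
    d = p1.astype(np.float64) - p2.astype(np.float64)
    return float((d * d).sum())
```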

---------------------------------------

    • Learning module

The machine learning method used by TLD is the authors' P-N learning. P-N learning is a semi-supervised machine learning algorithm: two "experts" are used to correct the two kinds of errors the detector makes when classifying samples (for more on P-N learning, see: http://blog.csdn.net/carson2005/article/details/7647519):
                 P-expert: recovers positive samples missed by the detector (false negatives, positive samples classified as negative);
                 N-expert: corrects false detections (false positives, negative samples classified as positive).

  Generating the samples:
The image is scanned with windows of different sizes (a scanning grid). Each window position defines a bounding box, and the image region determined by a bounding box is called a patch; a patch fed into the learning machinery becomes a sample. The samples produced by scanning are unlabeled, and the classifier must classify them to determine their labels.
If the algorithm has already determined the object's position in frame t+1 (which in practice means determining its bounding box), then from the bounding boxes generated by the detector it selects the 10 boxes closest to it (boxes whose overlap with it, the intersection area divided by the union area, is greater than 0.7). Each of these boxes is slightly perturbed by affine transformations (translation within 10%, scaling within 10%, rotation within 10°) to produce 20 patches, giving 200 positive samples. Some boxes that are farther away (overlap less than 0.2) are then selected to produce negative samples. The resulting samples are labeled and are added to the training set to update the classifier's parameters. Part A of the figure shows an example of the scanning windows.
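The sketch below illustrates the two ingredients of this step: the box-overlap test and the small random affine warps. The overlap is computed as intersection over union, matching the description above; the helper names and warp parameters are illustrative, not the original code.

```python
import cv2
import numpy as np

def overlap(a, b):
    """Overlap of two boxes (x, y, w, h): intersection area / union area."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def warp_positives(frame_gray, box, n_warps=20, rng=np.random):
    """Generate warped positive patches from one bounding box: random
    translation within 10%, scaling within 10%, rotation within 10 degrees."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    patches = []
    for _ in range(n_warps):
        angle = rng.uniform(-10, 10)
        scale = 1.0 + rng.uniform(-0.1, 0.1)
        tx = rng.uniform(-0.1, 0.1) * w
        ty = rng.uniform(-0.1, 0.1) * h
        M = cv2.getRotationMatrix2D((cx, cy), angle, scale)
        M[0, 2] += tx
        M[1, 2] += ty
        warped = cv2.warpAffine(frame_gray, M, (frame_gray.shape[1], frame_gray.shape[0]))
        patches.append(warped[int(y):int(y + h), int(x):int(x + w)])
    return patches
```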


The authors require the algorithm's output to be "structured": in each frame the object appears in at most one position, the object's motion between adjacent frames is continuous, and the positions in successive frames form a fairly smooth trajectory. For example, as in part C of the figure, there is only one positive result per frame and the results of consecutive frames form a smooth trajectory, rather than many results and no trajectory as in part B. Note also that the trajectory may be broken into segments over the course of tracking, since the object may disappear and then reappear.
The role of the P-expert is to exploit the temporal structure of the data. It uses the tracker's result to predict the object's position in frame t+1; if the detector classifies that position (bounding box) as negative, the P-expert relabels it as positive. In other words, the P-expert ensures that the object's positions in consecutive frames form a continuous trajectory.
The role of the N-expert is to exploit the spatial structure of the data. It compares all positive samples produced by the detector with the one produced by the P-expert and keeps only the most credible position, ensuring that the object appears in at most one position. This position is used as the tracking result of the TLD algorithm and is also used to re-initialize the tracker. A simplified sketch of how the two experts combine the detector's and tracker's outputs is given below.
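The confidence function would typically be the nearest-neighbor relative similarity described later; all names here are illustrative and the logic is a simplification of the original algorithm.

```python
def pn_experts(detector_positives, tracker_candidate, confidence_fn):
    """Illustrative sketch of one P-N step (not the original implementation).
    detector_positives: boxes the detector classified as positive this frame.
    tracker_candidate: box predicted by the tracker (None if tracking failed).
    confidence_fn: box -> score, e.g. the nearest-neighbor relative similarity."""
    candidates = list(detector_positives)
    if tracker_candidate is not None:
        # P-expert: the tracker's trajectory supplies a positive candidate even
        # if the detector classified that location as negative.
        candidates.append(tracker_candidate)
    if not candidates:
        return None, []
    # N-expert: only the single most credible candidate is kept as this frame's
    # output; the remaining "positives" are relabeled as negative examples.
    best = max(candidates, key=confidence_fn)
    negatives = [b for b in candidates if b is not best]
    return best, negatives
```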


In the example shown, the target is the dark car. In each frame, the black boxes are positive samples found by the detector, the yellow box is the positive sample produced by the tracker, and the red star marks the final tracking result of the frame. In frame t the detector does not find the dark car, but based on the tracker's result the P-expert still treats the dark car as a positive sample; after comparison, the N-expert judges the dark-car sample to be more credible, so the light-colored car is output as a negative sample. Frame t+1 is handled similarly. In frame t+2 the P-expert produces an incorrect result, but after the N-expert's comparison that result is rejected and the algorithm still tracks the correct vehicle.

---------------------------------------

    • Detection module

The detection module uses a cascade classifier to classify the samples obtained from the bounding boxes. The cascade consists of three stages:

Patch variance classifier. This stage computes the variance of the gray values of a patch's pixels and labels as negative any sample whose variance is less than half that of the initial target patch. The paper notes that more than half of the samples can be rejected in this step.
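A sketch of this first stage (in practice it can be evaluated quickly with integral images; plain NumPy is used here for clarity, and the names are illustrative):

```python
import numpy as np

def passes_variance_stage(patch, initial_variance):
    """Reject a scanning-window patch whose gray-value variance is below
    half of the variance of the initial target patch."""
    return np.var(patch.astype(np.float64)) >= 0.5 * initial_variance
```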

Ensemble classifier. This is actually a random fern classifier (Random Ferns Classifier), similar to a random forest; the difference is that in a random forest the nodes within one level of a tree can each use a different decision criterion, whereas in a random fern all nodes at the same level share a single criterion.


As shown in the figure, if every node in one level of the tree on the left uses the same decision criterion, the tree becomes the fern on the right. A fern therefore no longer has a tree structure but a linear one. The random fern classifier classifies a sample according to its feature values: two points a and b are chosen in the patch and their brightness values are compared; if the brightness of a is greater than that of b, the feature bit is 1, otherwise 0. Each new pair of positions yields a new feature bit, and each node of the fern compares one pair of pixels.
For example, take 5 pairs of points (red is a, blue is b). A sample patch passes through a fern containing 5 nodes, the results of the nodes are concatenated in order, and the binary sequence of length 5, say 01011, is converted to the decimal number 11. This 11 is the result of passing the sample through the fern.
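The bit-by-bit feature computation can be sketched as follows (patch size, point pairs, and names are illustrative):

```python
import numpy as np

def fern_code(patch, point_pairs):
    """Compute one fern's feature for a patch: each (a, b) pixel pair yields
    one bit (1 if patch[a] > patch[b]), and the bits form a binary number."""
    code = 0
    for (ay, ax), (by, bx) in point_pairs:
        code = (code << 1) | int(patch[ay, ax] > patch[by, bx])
    return code  # e.g. 5 pairs -> a value in [0, 31]

# Example: 5 random pixel comparisons on a 15x15 patch (illustrative values)
rng = np.random.default_rng(0)
pairs = [((rng.integers(15), rng.integers(15)),
          (rng.integers(15), rng.integers(15))) for _ in range(5)]
patch = rng.integers(0, 256, size=(15, 15), dtype=np.uint8)
print(fern_code(patch, pairs))  # a binary sequence such as 01011 -> 11
```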


When many samples of the same class are passed through the same fern, a histogram of the results is obtained. The height of the histogram represents the probability P(F | C) for that class, where F denotes the fern's result (if the fern has s nodes, there are 2^s possible results) and C denotes the class.

  Different classes of samples passed through the same fern yield different probability distributions.


The procedure above can be regarded as training the classifier. When a new unlabeled sample arrives, suppose its fern result is 00011 (i.e. 3); the class with the largest posterior probability, $P(C \mid F) = P(F \mid C)\,P(C)/P(F)$, is then sought from the known distributions. Since the sample set is fixed, the denominator $P(F)$ is the same for every class, so it suffices to find the class whose probability at F = 3 is highest; that class is assigned to the new sample.


  Classifying with only one fern involves a large element of chance. Taking 5 new feature pairs forms a new fern. By classifying the same sample with many ferns and assigning it the class that receives the most votes, the accuracy of the classifier is greatly improved.
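Putting these pieces together, a toy ensemble of ferns could be trained and evaluated as below. It uses the per-code posterior n_pos / (n_pos + n_neg) and averages it over the ferns, which is one common aggregation; the majority voting described above would be an equally valid alternative. The class layout, sizes, and names are assumptions for illustration.

```python
import numpy as np

class FernEnsemble:
    """Illustrative ensemble of random ferns (not the original TLD code).
    Each fern keeps per-code counts of positive and negative training samples;
    the posterior for a code is n_pos / (n_pos + n_neg), and the ensemble
    averages the posteriors of all ferns."""

    def __init__(self, n_ferns=10, n_pairs=5, patch_size=15, seed=0):
        rng = np.random.default_rng(seed)
        self.pairs = [
            [((rng.integers(patch_size), rng.integers(patch_size)),
              (rng.integers(patch_size), rng.integers(patch_size)))
             for _ in range(n_pairs)]
            for _ in range(n_ferns)
        ]
        n_codes = 2 ** n_pairs
        self.pos = np.zeros((n_ferns, n_codes))
        self.neg = np.zeros((n_ferns, n_codes))

    def _codes(self, patch):
        codes = []
        for fern in self.pairs:
            c = 0
            for (ay, ax), (by, bx) in fern:
                c = (c << 1) | int(patch[ay, ax] > patch[by, bx])
            codes.append(c)
        return codes

    def train(self, patch, label):
        # label == 1 for positive samples, 0 for negative samples
        for i, c in enumerate(self._codes(patch)):
            (self.pos if label == 1 else self.neg)[i, c] += 1

    def posterior(self, patch):
        p = 0.0
        for i, c in enumerate(self._codes(patch)):
            tot = self.pos[i, c] + self.neg[i, c]
            p += self.pos[i, c] / tot if tot > 0 else 0.0
        return p / len(self.pairs)  # classify as positive above ~0.5
```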

  Nearest neighbor classifier (Nearest Neighbor Classifier). This stage computes the relative similarity of a new sample; if it exceeds a threshold such as 0.6, the sample is considered positive. The similarity measures are defined as follows.

Similarity between patches $p_i$ and $p_j$, where NCC is the normalized correlation coefficient, so S takes values in [0, 1]:

$$S(p_i,p_j)=0.5\,\big(\mathrm{NCC}(p_i,p_j)+1\big)$$

Positive nearest-neighbor similarity (similarity to the closest positive template in the model M):

$$S^{+}(p,M)=\max_{p_i^{+}\in M} S(p,p_i^{+})$$

Negative nearest-neighbor similarity (similarity to the closest negative template):

$$S^{-}(p,M)=\max_{p_i^{-}\in M} S(p,p_i^{-})$$

Relative similarity, which also takes values in [0, 1]; the larger the value, the higher the similarity:

$$S^{r}=\frac{S^{+}}{S^{+}+S^{-}}$$
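These definitions translate directly into code. In the sketch below, `pos_model` and `neg_model` stand for the stored positive and negative template patches; the names and the small epsilon guards are assumptions for illustration.

```python
import numpy as np

def similarity(p1, p2):
    """S(p_i, p_j) = 0.5 * (NCC(p_i, p_j) + 1), in [0, 1]."""
    a = p1.astype(np.float64) - p1.mean()
    b = p2.astype(np.float64) - p2.mean()
    ncc = (a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12)
    return 0.5 * (ncc + 1.0)

def relative_similarity(patch, pos_model, neg_model):
    """Relative similarity S^r = S+ / (S+ + S-), where S+ and S- are the
    similarities to the nearest positive and nearest negative template."""
    s_pos = max(similarity(patch, p) for p in pos_model)
    s_neg = max(similarity(patch, n) for n in neg_model)
    return s_pos / (s_pos + s_neg + 1e-12)

# A patch is accepted as positive when relative_similarity(...) exceeds the
# threshold mentioned above (e.g. 0.6).
```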


The detector is therefore the watchdog of the tracker, since it corrects the tracker's errors; at the same time, the tracker acts as the supervisor when the detector is trained, since the tracker's results are used to supervise the detector's classification. Because the training process is supervised by another procedure rather than by a person, P-N learning is described as "semi-supervised" machine learning.
The workflow of TLD is as shown in the figure. First, the detector generates samples from a series of bounding boxes and passes them through the cascade classifier to produce positive samples, which are put into the sample set. The tracker then estimates the object's new position, and the P-expert generates a positive sample at that position. The N-expert selects the most credible of these positive samples and relabels the other positives as negative. Finally, the detector's classifier parameters are updated with the positive sample, and the position of the object's bounding box in the next frame is determined.
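The whole loop can be summarized in one frame-level step. The sketch below only shows the control flow: `detect`, `track`, `confidence`, and `update_detector` stand for the cascade detector, the Median-Flow tracker, the nearest-neighbor similarity, and the classifier update described above, and `pn_experts` is the earlier sketch. All of these interfaces are assumptions for illustration, not the original implementation.

```python
def tld_step(prev_frame, frame, prev_box, detect, track, confidence, update_detector):
    """One TLD iteration, following the workflow above.
    detect(frame) -> candidate positive boxes from the cascade detector
    track(prev, cur, box) -> predicted box from the tracker, or None on failure
    confidence(box) -> nearest-neighbor relative similarity of the box's patch
    update_detector(positives, negatives) -> retrains the cascade classifier"""
    detections = detect(frame)                     # detection: scan the whole frame
    tracked = track(prev_frame, frame, prev_box)   # tracking: motion from the last box
    best, negatives = pn_experts(detections, tracked, confidence)
    if best is not None:
        update_detector([best], negatives)         # learning: update with new samples
    return best                                    # also re-initializes the tracker
```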



