Adaptive Background Mixture Models for Real-Time Tracking
Chris Stauffer, W.E.L. Grimson
The Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
Translation: Bob Kuo

**Abstract** A common method for real-time segmentation of moving regions in image sequences involves "background subtraction", or thresholding the error between an estimate of the image without moving objects and the current image. The numerous approaches to this problem differ in the type of background model used and the procedure used to update the model. This paper discusses modeling each pixel as a mixture of Gaussians and using an online approximation to update the model. The Gaussian distributions of the adaptive mixture model are then evaluated to determine which are most likely to result from a background process. Each pixel is classified based on whether the Gaussian distribution that represents it most effectively is considered part of the background model. This results in a stable, real-time outdoor tracker that reliably deals with lighting changes, repetitive motions from clutter, and long-term scene changes. The system has been run almost continuously for 16 months, 24 hours a day, through rain and snow.

**1 Introduction**
In the past, computational barriers have limited the complexity of real-time video processing applications. As a consequence, most systems were either too slow to be practical or succeeded only by restricting themselves to tightly controlled situations. Recently, faster computers have enabled researchers to consider more complex, robust models for real-time analysis of streaming data. These new methods allow researchers to begin modeling real-world processes under varying conditions. Consider the problem of video surveillance and monitoring. A robust system should not depend on careful placement of cameras. It should also be robust to whatever is in its visual field and to whatever lighting effects occur. It should be capable of dealing with movement through cluttered areas, objects overlapping in the visual field, shadows, lighting changes, effects of moving elements of the scene (e.g. swaying trees), slow-moving objects, and objects being introduced into or removed from the scene. Traditional approaches based on background modeling typically fail in these general situations. Our goal is to create a robust, adaptive tracking system that is flexible enough to handle variations in lighting, moving scene clutter, multiple moving objects, and other arbitrary changes to the observed scene. The resulting tracker is primarily geared toward scene-level video surveillance applications.

**1.1 Previous Work and Current Shortcomings**
Most researchers have abandoned non-adaptive background modeling methods because they require manual initialization. Without re-initialization, errors in the background accumulate over time, making this approach useful only in highly supervised, short-term tracking applications without significant changes in the scene. A standard method of adaptive background modeling is averaging the images over time, creating a background approximation that is similar to the current static scene except where motion occurs. While this is effective when objects move continuously and the background is visible a significant portion of the time, it is not robust to scenes with many moving objects, particularly if they move slowly. It also cannot handle bimodal backgrounds, recovers slowly when the background is uncovered, and has a single, predetermined threshold for the entire scene. Changes in scene lighting cause problems for many background modeling methods. Ridder et al. [5] modeled each pixel with a Kalman filter, which made their system more robust to lighting changes in the scene. While this method does have a per-pixel automatic threshold, it still recovers slowly and does not handle bimodal backgrounds well. Koller et al. [4] have successfully integrated this method into an automatic traffic monitoring application. Pfinder [7] uses a multi-class statistical model for the tracked objects, but the background model is a single Gaussian per pixel. After an initialization period during which the room is empty, the system reports good results; there have been no reports of this tracker's success in outdoor scenes. Friedman and Russell [2] have recently implemented a pixel-wise EM framework for detecting vehicles that bears the most similarity to our work. Their method attempts to explicitly classify pixel values into three separate, predetermined distributions corresponding to the road color, the shadow color, and the colors of vehicles. Their attempt to mitigate the effect of shadows appears to be somewhat successful, but it is not clear how their system would behave for pixels whose values do not follow these three distributions.
For example, a pixel may show a single background color, or multiple background colors resulting from repetitive motion, shadows, or reflectance.

**1.2 Our Method**
Rather than explicitly modeling the values of all the pixels as one particular type of distribution, we simply model the values of a particular pixel as a mixture of Gaussians. Based on the persistence and the variance of each of the Gaussians of the mixture, we determine which Gaussians may correspond to background colors. Pixel values that do not fit the background distributions are considered foreground until there is a Gaussian that includes them with sufficient, consistent evidence supporting it. Our system adapts to deal robustly with lighting changes, repetitive motions of scene elements, tracking through cluttered regions, slow-moving objects, and objects being introduced into or removed from the scene. Slowly moving objects take longer to be incorporated into the background, because their color has a larger variance than the background. Also, repetitive variations are learned, and a model for the background distribution is generally maintained even if it is temporarily replaced by another distribution, which leads to faster recovery when objects are removed. Our background method contains two significant parameters: $\alpha$, the learning constant, and $T$, the proportion of the data that should be accounted for by the background. Without needing to alter these parameters, our system has been used in an indoor human-computer interface application and, for the past 16 months, has been continuously monitoring outdoor scenes.
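The paragraph names $T$, but this translation stops before explaining how it is applied. The sketch below is one reading under stated assumptions, following the original paper's approach as we recall it (not confirmed by this translation): components are ranked by $\omega_k / \sigma_k$, which favors persistent, low-variance Gaussians and so matches the "persistence and variance" criterion above, and the top-ranked ones are labeled background until their cumulative weight exceeds $T$. The function name `background_components` and the example numbers are hypothetical.

```python
import math
from typing import List, Sequence


def background_components(weights: Sequence[float],
                          variances: Sequence[float],
                          T: float) -> List[int]:
    """Return the indices of the components treated as background.

    Assumption (not stated in this translation): rank components by
    omega_k / sigma_k, so persistent (high weight) and tight (low variance)
    Gaussians come first, then take them in that order until their
    cumulative weight exceeds T.
    """
    order = sorted(range(len(weights)),
                   key=lambda k: weights[k] / math.sqrt(variances[k]),
                   reverse=True)
    chosen: List[int] = []
    total = 0.0
    for k in order:
        chosen.append(k)
        total += weights[k]
        if total > T:
            break
    return chosen


# Example: one stable, tight component (index 1) and two weaker ones.
weights = [0.2, 0.7, 0.1]
variances = [400.0, 25.0, 64.0]       # sigma = 20, 5, 8
print(background_components(weights, variances, T=0.6))   # [1]
print(background_components(weights, variances, T=0.75))  # [1, 2]
```

A larger $T$ admits more components into the background, which is what lets a bimodal background (for example, a flickering monitor) be absorbed rather than reported as foreground.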

**2 Method**
If each pixel resulted from a particular surface under particular lighting, a single Gaussian would be sufficient to model the pixel value while accounting for acquisition noise. If only the lighting changed over time, a single adaptive Gaussian per pixel would be sufficient. In practice, multiple surfaces often appear in the view frustum of a particular pixel, and the lighting conditions change. Thus, multiple adaptive Gaussians are necessary. We use a mixture of adaptive Gaussians to approximate this process. Each time the parameters of the Gaussians are updated, the Gaussians are evaluated using a simple heuristic to hypothesize which are most likely to be part of the "background process". Pixel values that do not match one of the pixel's "background" Gaussians are grouped using connected components. Finally, the connected components are tracked from frame to frame using a multiple-hypothesis tracker. The process is illustrated in Figure 1.
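A minimal sketch of the grouping step, assuming a binary foreground mask (True where a pixel matched none of its background Gaussians) and 8-connectivity; the text does not specify the connectivity, and `label_foreground` is a hypothetical name. The multiple-hypothesis tracker that consumes these regions is not shown.

```python
from collections import deque
from typing import List


def label_foreground(mask: List[List[bool]]) -> List[List[int]]:
    """Group foreground pixels into connected components.

    Returns a label image: 0 for background, 1..N for the N regions.
    8-connectivity is an assumption; the text does not say which is used.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or labels[y][x]:
                continue                      # background or already labelled
            next_label += 1
            labels[y][x] = next_label
            queue = deque([(y, x)])           # breadth-first flood fill
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels


mask = [[True, False, False],
        [False, False, True],
        [False, False, True]]
print(label_foreground(mask))  # [[1, 0, 0], [0, 0, 2], [0, 0, 2]]
```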

Figure 1: The execution of the program. (a) The current image, (b) an image composed of the means of the most probable Gaussians in the background model, (c) the foreground pixels, (d) the current image with tracking information superimposed. Note: while the shadows are foreground in this case, if a surface were covered by shadow a significant amount of the time, a Gaussian representing those pixel values might be significant enough to be considered background.

**2.1 Online Mixture Model**
We consider the values of a particular pixel over time as a "pixel process": a time series of pixel values, for example scalars for grayscale images or vectors for color images. At any time $t$, what is known about a particular pixel $\{x_0, y_0\}$ is its history

$$\{X_1, \ldots, X_t\} = \{ I(x_0, y_0, i) : 1 \le i \le t \}$$

where $I$ is the image sequence.
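As a concrete picture of the definition, the sketch below pulls one pixel's history out of an image sequence. The nested-list layout `frames[i - 1][y][x]` and the name `pixel_process` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, ...]   # (gray,) for grayscale, (R, G, B) for color


def pixel_process(frames: Sequence[Sequence[Sequence[Pixel]]],
                  x0: int, y0: int, t: int) -> List[Pixel]:
    """Return the history {X_1, ..., X_t} = {I(x0, y0, i) : 1 <= i <= t}.

    Assumed layout: frames[i - 1][y][x] holds I(x, y, i), i.e. the text's
    1-based time index i maps to the 0-based list index i - 1.
    """
    return [frames[i - 1][y0][x0] for i in range(1, t + 1)]


# Three 1x2 color frames; follow the pixel at x0=1, y0=0 for t=2 frames.
frames = [[[(0.0, 0.0, 0.0), (10.0, 20.0, 30.0)]],
          [[(0.0, 0.0, 0.0), (11.0, 19.0, 31.0)]],
          [[(0.0, 0.0, 0.0), (90.0, 90.0, 90.0)]]]
print(pixel_process(frames, x0=1, y0=0, t=2))
# [(10.0, 20.0, 30.0), (11.0, 19.0, 31.0)]
```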

Some "pixel processes" are shown by the (R, G) scatter plots in Figure 2(a)-(c), which illustrate the need for adaptive systems with automatic thresholds. Figure 2(b) and (c) also highlight the need for a multimodal representation.

Figure 2: The red and green values of a single pixel over time, illustrating some of the difficulties involved in real environments. (a) Two scatter plots of the same pixel taken 2 minutes apart. (b) A bimodal distribution of pixel values resulting from specular reflections on the surface of water. (c) Another bimodality, resulting from monitor flicker.

The value of each pixel represents a measurement of the radiance, in the direction of the sensor, of the first object intersected by the pixel's optical ray. With a static background and static lighting, this value would be relatively constant. If we assume that independent Gaussian noise is incurred in the sampling process, its density can be described by a single Gaussian distribution centered at the mean pixel value. Unfortunately, most interesting video sequences involve lighting changes, scene changes, and moving objects. If lighting changes occurred in a static scene, the Gaussian would need to track those changes. If a static object were added to the scene and not incorporated into the background until it had been there longer than the previous object, the corresponding pixels could be considered foreground for arbitrarily long periods. This would lead to accumulated errors in the foreground estimate and poor tracking behavior. These factors suggest that more recent observations should carry more weight in determining the Gaussian parameter estimates.

An additional source of variation arises when moving objects are present in the scene. Even a moving object of relatively consistent color is generally expected to produce more variance than a static one. Also, in general, there should be more data supporting the background distributions, because background values are repeated, whereas pixel values from different objects are often different colors. These are the guiding factors in our choice of model and update procedure.
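The recent history of each pixel, $\{X_1, \ldots, X_t\}$, is modeled by a mixture of $K$ Gaussian distributions. The probability of observing the current pixel value is

$$P(X_t) = \sum_{i=1}^{K} \omega_{i,t}\, \eta(X_t, \mu_{i,t}, \Sigma_{i,t})$$

where $K$ is the number of distributions, $\omega_{i,t}$ is an estimate of the weight of the $i$-th Gaussian in the mixture at time $t$ (the portion of the data accounted for by this Gaussian), $\mu_{i,t}$ is the mean value of the $i$-th Gaussian at time $t$, $\Sigma_{i,t}$ is its covariance matrix, and $\eta$ is the Gaussian probability density function

$$\eta(X_t, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(X_t - \mu)^{T} \Sigma^{-1} (X_t - \mu)\right)$$

with $n$ the dimension of $X_t$.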

$K$ is determined by the available memory and computational power; currently, values from 3 to 5 are used. Also, for computational reasons, the covariance matrix is assumed to be of the form

$$\Sigma_{k,t} = \sigma_k^2 I$$

This assumes that the red, green, and blue pixel values are independent and have the same variance. While this is certainly not true in general, the assumption avoids a costly matrix inversion at the expense of some accuracy.
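The payoff of the isotropic-covariance assumption is that $\eta$ needs only a squared distance, with no determinant or matrix inverse. A minimal sketch of the mixture probability $P(X_t)$ under that assumption; the function names are hypothetical and pixel values are plain sequences of channel values.

```python
import math
from typing import Sequence


def isotropic_gaussian_pdf(x: Sequence[float], mean: Sequence[float],
                           var: float) -> float:
    """eta(x, mu, sigma^2 I) for an n-dimensional pixel value x.

    With Sigma = sigma^2 I, |Sigma|^(1/2) = sigma^n and the quadratic form
    (x - mu)^T Sigma^-1 (x - mu) is just ||x - mu||^2 / sigma^2, so no
    matrix inverse or determinant has to be computed.
    """
    n = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return math.exp(-0.5 * sq_dist / var) / (2.0 * math.pi * var) ** (n / 2)


def mixture_probability(x: Sequence[float], weights: Sequence[float],
                        means: Sequence[Sequence[float]],
                        variances: Sequence[float]) -> float:
    """P(X_t) = sum_i omega_i * eta(X_t, mu_i, sigma_i^2 I)."""
    return sum(w * isotropic_gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

For an RGB value such as `[120.0, 118.0, 121.0]` with component means `[120.0]*3`, `[40.0]*3`, and `[200.0, 10.0, 10.0]`, the sum is dominated by the first component, since the other two densities are vanishingly small at that distance.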

Thus, the distribution of recently observed values of each pixel in the scene is characterized by a mixture of Gaussians. A new pixel value will, in general, be represented by one of the major components of the mixture model and used to update the model.

If the pixel process could be considered a stationary process, a standard method for maximizing the likelihood of the observed data would be expectation maximization (EM). Unfortunately, each pixel process varies over time as the state of the world changes, so we use an approximate method that essentially treats each new observation as a sample set of size 1 and uses standard learning rules to integrate the new data.
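To illustrate why a constant learning rate suits a non-stationary pixel, the sketch below (hypothetical names, scalar pixel values) contrasts the rate $1/t$, which reproduces the batch mean and is what one would want for a stationary process, with a constant rate $\alpha$, which weights recent samples more and keeps tracking a pixel whose value changes. This shows only the mean-update idea, not the full online mixture update.

```python
from typing import Iterable, Optional


def running_estimate(samples: Iterable[float],
                     alpha: Optional[float] = None) -> float:
    """Fold in one observation at a time ("a sample set of size 1").

    alpha=None uses the rate 1/t, which reproduces the batch mean exactly
    and suits a stationary process. A constant alpha gives an exponentially
    weighted average that keeps tracking a process whose mean drifts.
    """
    it = iter(samples)
    estimate = float(next(it))            # the first sample initialises the estimate
    for t, x in enumerate(it, start=2):
        rate = 1.0 / t if alpha is None else alpha
        estimate += rate * (x - estimate)
    return estimate


# A pixel that sits at 10 for 50 frames, then at 50 (e.g. lighting changed).
samples = [10.0] * 50 + [50.0] * 50
print(running_estimate(samples))             # close to 30: the stale batch mean
print(running_estimate(samples, alpha=0.1))  # close to 50: follows the change
```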

Because there is a mixture model for every pixel in the image, implementing an exact EM algorithm on a window of recent data would be costly. Instead, we implement an online K-means approximation. Every new pixel value $X_t$ is checked against the existing $K$ Gaussian distributions until a match is found. A match is defined as a pixel value within 2.5 standard deviations of a distribution. This threshold can be perturbed with little effect on performance. It is effectively a per-pixel, per-distribution threshold, which is extremely useful when different regions have different lighting (see Figure 2(a)), because objects in shaded regions generally exhibit less noise than objects in lighted regions. A uniform threshold often causes objects to disappear when they enter shaded regions.
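A sketch of the matching test under the isotropic-covariance assumption $\Sigma_k = \sigma_k^2 I$. It reads "within 2.5 standard deviations" as a Euclidean distance of at most $2.5\sigma_k$ and checks components in list order; both are interpretations, and `find_match` is a hypothetical name.

```python
import math
from typing import Optional, Sequence

MATCH_SIGMAS = 2.5   # "within 2.5 standard deviations", from the text


def find_match(x: Sequence[float],
               means: Sequence[Sequence[float]],
               variances: Sequence[float]) -> Optional[int]:
    """Return the index of the first Gaussian that x matches, or None.

    Assumptions: components are checked in list order (the caller keeps them
    in the desired priority order), and with Sigma_k = sigma_k^2 I a match
    means ||x - mu_k|| <= 2.5 * sigma_k. Because sigma_k is learned per pixel
    and per component, the effective threshold differs across the image.
    """
    for k, (mean, var) in enumerate(zip(means, variances)):
        dist = math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, mean)))
        if dist <= MATCH_SIGMAS * math.sqrt(var):
            return k
    return None


# A quiet (shaded) background component and a noisier (sunlit) one.
means = [[30.0, 30.0, 30.0], [180.0, 180.0, 180.0]]
variances = [4.0, 100.0]                                    # sigma = 2 and 10
print(find_match([33.0, 30.0, 30.0], means, variances))     # 0 (3 <= 5)
print(find_match([180.0, 200.0, 180.0], means, variances))  # 1 (20 <= 25)
print(find_match([40.0, 30.0, 30.0], means, variances))     # None (10 > 5)
```

With a single global threshold of 25 (sized for the noisier sunlit component), the last value would have matched the shaded background and the object would vanish, which is the failure the text attributes to a uniform threshold.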

If none of the $K$ distributions matches the current pixel value, the least probable distribution is replaced with a distribution that has the current value as its mean, an initially high variance, and a low prior weight.
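A sketch of the replacement step. The text gives no concrete numbers, so `INIT_VARIANCE` and `INIT_WEIGHT` are hypothetical values (the variance is deliberately large so the new component starts out uncertain), and "least probable" is interpreted here as the component with the smallest weight; ranking by $\omega_k/\sigma_k$ would be another reasonable reading.

```python
from typing import List, Sequence

# Hypothetical initial values: the text gives no concrete numbers
# for the new component's variance and weight.
INIT_VARIANCE = 900.0   # sigma = 30 grey levels (assumed, deliberately large)
INIT_WEIGHT = 0.05      # assumed, deliberately small


def replace_least_probable(x: Sequence[float], weights: List[float],
                           means: List[List[float]],
                           variances: List[float]) -> int:
    """Replace the least probable component (in place) with one centred on x.

    Assumption: "least probable" is read as the smallest weight omega_k.
    The weights are renormalised so they still sum to 1.
    Returns the index of the replaced component.
    """
    k = min(range(len(weights)), key=lambda i: weights[i])
    weights[k] = INIT_WEIGHT
    means[k] = list(x)
    variances[k] = INIT_VARIANCE
    total = sum(weights)
    for i in range(len(weights)):
        weights[i] /= total
    return k
```

For example, with weights `[0.6, 0.3, 0.1]` an unmatched value replaces component 2, and after renormalization its weight is $0.05/0.95 \approx 0.053$.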

The prior weights of the $K$ distributions at time $t$, $\omega_{k,t}$, are adjusted as follows:

$$\omega_{k,t} = (1 - \alpha)\,\omega_{k,t-1} + \alpha\, M_{k,t}$$

where $\alpha$ is the learning rate and $M_{k,t}$ is 1 for the model that matched and 0 for the remaining models. After this approximation, the weights are renormalized. $1/\alpha$ defines the time constant that determines the speed at which the distribution's parameters change. $\omega_{k,t}$ is effectively a causal low-pass filtered average of the (thresholded) posterior probability that pixel values have matched model $k$, given observations from time 1 through $t$. This is equivalent to the expectation of this value with an exponential window on past values.
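A sketch of the weight update and renormalization, with hypothetical names. When exactly one component matches and the weights already sum to 1, the update preserves the sum, so renormalization only changes anything after a replacement or when nothing matched.

```python
from typing import List, Optional, Sequence


def update_weights(weights: Sequence[float], matched: Optional[int],
                   alpha: float) -> List[float]:
    """omega_k,t = (1 - alpha) * omega_k,t-1 + alpha * M_k,t, then renormalise.

    M_k,t is 1 for the matched component and 0 for the others. Passing
    matched=None (no component matched) sets every M_k,t to 0; the
    renormalisation then restores a total weight of 1.
    """
    new = [(1.0 - alpha) * w + (alpha if k == matched else 0.0)
           for k, w in enumerate(weights)]
    total = sum(new)
    return [w / total for w in new]


w = [0.5, 0.3, 0.2]
print(update_weights(w, matched=0, alpha=0.1))  # approximately [0.55, 0.27, 0.18]
```

Unrolling the recursion over frames in which exactly one component matches gives $\omega_{k,t} = (1-\alpha)^t\,\omega_{k,0} + \sum_{i=1}^{t} \alpha (1-\alpha)^{t-i} M_{k,i}$: each past match indicator is weighted by a factor that decays geometrically with its age, which is the exponential window described above.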

The $\mu$ and $\sigma$ parameters of unmatched distributions remain the same. The parameters of the distribution that matches the new observation are updated as follows:
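The translated text breaks off before the update equations. The sketch below assumes an update of the form $\rho = \alpha\,\eta(X_t \mid \mu_k, \sigma_k)$, $\mu_t = (1-\rho)\mu_{t-1} + \rho X_t$, $\sigma_t^2 = (1-\rho)\sigma_{t-1}^2 + \rho (X_t-\mu_t)^T(X_t-\mu_t)$, as we recall the original paper's rule; it is not confirmed by this translation, and `update_matched` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def update_matched(x: Sequence[float], mean: Sequence[float], var: float,
                   alpha: float) -> Tuple[List[float], float]:
    """Update the mean and isotropic variance of the component that matched x.

    Assumed rule (not given in this translation):
        rho       = alpha * eta(x | mu, sigma^2 I)
        mu_t      = (1 - rho) * mu_{t-1} + rho * x
        sigma^2_t = (1 - rho) * sigma^2_{t-1} + rho * ||x - mu_t||^2
    """
    n = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    eta = math.exp(-0.5 * sq_dist / var) / (2.0 * math.pi * var) ** (n / 2)
    rho = alpha * eta                                     # second learning rate
    new_mean = [(1.0 - rho) * mi + rho * xi for xi, mi in zip(x, mean)]
    resid = sum((xi - mi) ** 2 for xi, mi in zip(x, new_mean))
    new_var = (1.0 - rho) * var + rho * resid
    return new_mean, new_var
```

Because $\eta$ is a probability density, $\rho$ is typically much smaller than $\alpha$ for realistic grey-level variances, so the mean and variance adapt more slowly than the weights. The squared residual sums over all $n$ channels, whose expected value under the $\sigma^2 I$ model is $n\sigma^2$; dividing it by $n$ would keep $\sigma^2$ a per-channel variance, but the sketch keeps the literal form.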

(The translator was unable to continue the translation beyond this point.)