BING: Binarized Normed Gradients for Objectness Estimation at 300fps
Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, Philip Torr, IEEE CVPR, 2014
Objectness estimation based on binarized normed gradient features
Summary:
Training a generic objectness measure to produce a small set of candidate object windows has been shown to speed up the traditional sliding-window object detection approach. We observe that generic objects have well-defined closed contours, and that after resizing the corresponding image windows to a small fixed size they can be discriminated by the norm of their gradients. Based on this observation and on complexity considerations, we resize each window to 8×8 and use the norms of its gradients as a simple 64-dimensional feature describing the window, from which the objectness measure is trained.
We further show how the binarized version of this feature, the binarized normed gradients (BING), can be used for efficient objectness estimation that requires only a few atomic operations (such as addition and bitwise shift). Experiments on the challenging PASCAL VOC 2007 dataset show that our method very efficiently produces a set of category-independent, high-quality object windows: with 1,000 proposal windows, the object detection rate reaches 96.2%. The detection rate rises to 99.5% when the number of proposal windows is increased and additional color spaces are used to compute the BING features.
1. Introduction
Object detection, an important field of computer vision, has made great progress. However, most state-of-the-art detectors require a specifically designed classifier for each class and must evaluate many image windows [17, 25]. To reduce the number of windows the classifier has to examine, training a generic, category-independent objectness measure has become popular [2, 3, 21, 22, 48, 49, 57]. Objectness is usually expressed as a value reflecting how likely an image window is to contain an object of any category. A generic objectness measure is a convenient preprocessing step that improves detection in two ways: 1) it reduces the search space, and 2) it allows a strong classifier to be used, which improves detection accuracy. Designing a good generic objectness measure is difficult, however. It should: achieve a high detection rate, finding all foreground objects; produce a small number of proposals, to reduce the computation time of object detection; be computationally efficient, so that it extends easily to real-time and large-scale applications; and generalize well, so that it can be used with detectors for various categories and reduce their computation.
As far as we know, no existing method meets all of these requirements at the same time.
Studies in cognitive psychology and neurobiology show that humans have a strong ability to perceive objects. Based on careful study of human reaction times and of signal transmission speeds along biological pathways, theories of human attention hypothesize that the visual system processes only parts of an image in detail and leaves the rest almost unprocessed. This suggests that, before an object is recognized, simple mechanisms in the human visual system locate possible objects.
In this article, we present a very simple and robust feature (BING) that helps detect objects by means of an objectness score. Our motivation is that objects are generally stand-alone things with well-defined closed contours [3, 26, 32]. We observe that when image windows are normalized to the same small scale (e.g., 8×8), the closed contours of generic objects correlate strongly with the norm of the gradients (see Figure 1(c)). To quantify the objectness of an image window efficiently, we resize it to 8×8 and use the norms of its pixel gradients as a 64-dimensional feature, from which a generic objectness measure is learned with a cascaded SVM framework. We further show how the binarized version of this feature (BING) can be used for efficient objectness estimation that requires only a few atomic CPU operations (such as addition and bitwise shift). Most existing state-of-the-art methods rely on complex features and need additional acceleration to keep computation time manageable; by comparison, the BING feature is simple and direct.
We evaluated our approach extensively on the PASCAL VOC2007 data set. The results show that it very efficiently (up to 300fps on an ordinary desktop CPU) produces a data-driven, category-independent set of high-quality object windows, reaching a detection rate of 96.2% with 1,000 windows (about 0.2% of all sliding windows). With 5,000 proposal windows and 3 different color spaces, the detection rate reaches 99.5%. Following [3, 22, 48], we also verified the generalization ability of the method: we trained on 6 categories and tested on the 14 unseen categories, with good results (Figure 3). Compared with other popular methods, BING achieves a better detection rate while running more than 1,000 times faster, and thus meets the requirements for a good objectness measure listed above.
2. Related work
The ability to perceive objects before recognizing them is closely related to bottom-up visual saliency. Depending on how saliency is defined, we broadly divide the related research into three categories: fixation prediction, salient object detection, and objectness proposal generation.
Fixation prediction: These models aim to predict the salient points of human eye movement [4, 37]. Inspired by neurobiological studies of the early visual system, Itti et al. [36] presented the first computational model for saliency detection, which uses center-surround differences across multi-scale image features. Ma and Zhang [42] proposed an alternative local contrast analysis to produce saliency maps and extended it with a fuzzy growth model. Harel et al. [29] proposed normalizing center-surround feature maps to highlight conspicuous parts. Although fixation prediction models have developed remarkably, they tend to produce high values at edges rather than uniformly highlighting entire objects, so they are not well suited to object detection.
Salient object detection: These models aim to detect the most attention-grabbing object in a scene and then segment its whole extent [5, 40]. Liu et al. [41] combined local, regional, and global saliency measurements in a CRF framework. Achanta et al. [1] proposed a frequency-tuned method. Cheng et al. [11, 14] proposed salient object detection based on global contrast analysis and iterative graph cuts. More recent work has also tried to produce high-quality saliency maps within a filtering framework [46], using more effective data representations [12], or using hierarchical structures [55]. Such salient object segmentation achieves good results for simple images in image scene analysis [15, 58] and content-aware editing [13, 56, 60]. It can also serve as a cheap tool for processing large numbers of internet images or for building robust applications [7, 8, 16, 31, 34, 35] by automatically filtering the results. However, these methods rarely apply to complex images containing multiple objects, even though such images are the most meaningful in practice (e.g., VOC [23]).
Objectness proposal generation: These methods do not make a decision, but instead propose a number (e.g., 1,000) of windows [3, 22, 48] intended to cover objects of all categories. Generating rough segmentations [6, 21] as objectness proposals has been shown to be an effective way to reduce the classifier's search space, and it allows a strong classifier to be used to improve accuracy. However, these two methods are computationally expensive, taking 2-7 minutes per image on average. Alexe et al. [3] proposed a cue integration approach that gives better and more efficient predictions. Zhang et al. [57] proposed a cascaded ranking SVM method using oriented gradient features. Uijlings et al. [48] proposed a selective search approach with better prediction results. We present a simple and intuitive method that achieves a better detection rate than these methods and is more than 1,000 times faster than other popular ones.
In addition, for sliding-window object detection to be efficient, it is very important to keep the amount of computation manageable [43, 51]. Lampert et al. [39] presented an elegant branch-and-bound scheme for detection. However, it can only speed up classifiers for which the user has provided a good bound. Other efficient classifiers [17] and approximate kernel methods [43, 51] have also been proposed. These methods aim to reduce the computation for a single window and can naturally be combined with objectness proposals to reduce the cost further.
3. Methods
Inspired by the human vision system, which is able to perceive objects before recognizing them [20, 38, 47, 54], we introduce a 64-dimensional normed gradient (NG) feature and its binarized version (BING), which efficiently capture the objectness of an image window.
To find generic objects in an image, we scan over a predefined set of quantized window sizes (varying in scale and aspect ratio). Each window is scored with a linear model $w \in \mathbb{R}^{64}$:
$s_l = \langle w, g_l \rangle$ (1)
$l = (i, x, y)$ (2)
where $s_l$ is the filter score, $g_l$ is the NG feature, $l$ is the location, $i$ is the size (scale and aspect ratio), and $(x, y)$ is the window position. Using non-maximum suppression (NMS), we select a small set of proposal windows for each size. Some sizes (e.g., 10×500) are less likely to contain an object than others (e.g., 100×100). We therefore define the objectness score as a calibrated filter score:
$o_l = v_i \cdot s_l + t_i$ (3)
where $v_i, t_i \in \mathbb{R}$ are coefficients learned separately for each quantized size $i$. Applying the calibration (3) is very fast, and it is usually only needed when the final proposal windows are re-ranked.
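As a rough illustration of formulas (1) and (3), the following Python sketch scores a few windows and re-ranks them by calibrated objectness. The function names, the tuple layout of a window, and the `calib` dictionary are assumptions made for this example and do not come from the paper; NMS is omitted.

```python
def filter_score(w, g):
    """Formula (1): inner product of the linear model w and the NG feature g."""
    return sum(wi * gi for wi, gi in zip(w, g))


def objectness(w, g, v_i, t_i):
    """Formula (3): filter score calibrated with the per-size coefficients."""
    return v_i * filter_score(w, g) + t_i


def rank_windows(windows, w, calib):
    """Re-rank windows by objectness, highest first.

    windows: list of (i, x, y, g) tuples; this layout is hypothetical.
    calib:   dict mapping a size index i to its (v_i, t_i) pair.
    """
    scored = []
    for i, x, y, g in windows:
        v_i, t_i = calib[i]
        scored.append((objectness(w, g, v_i, t_i), (i, x, y)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

In a real system `w` and `g` would have 64 entries each; the functions above work for any matching length, which keeps small hand-checked examples easy to write.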
3.1 Normed gradients (NG) and objectness
Objects are generally stand-alone things with well-defined closed contours and centers [3, 26, 32]. Resizing a window is equivalent to shrinking the real object to a small fixed size. In such an abstracted view closed contours vary little, so the norm of the image gradients becomes a good discriminative feature. In Figure 1, for example, the boat and the person differ greatly in color, shape, texture, and lighting, yet they have much in common in normed gradient space. To exploit this observation, we first resize the input image to different quantized sizes and compute the normed gradients at each size. The values in an 8×8 region of the resulting normed gradient map are then taken as the 64-dimensional NG feature of the corresponding window.
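A minimal sketch of the NG feature extraction on an image that has already been resized is given below. Several details are assumptions made for this illustration rather than statements from the text above: the gradient is taken with a simple [-1, 0, 1] difference, the norm is approximated by |gx| + |gy| clamped to 255 so that it fits in a byte, and border pixels are replicated.

```python
def normed_gradients(gray):
    """Return a map of min(|gx| + |gy|, 255) for a 2D list of 0..255 values."""
    height, width = len(gray), len(gray[0])
    ng = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Central differences; indices are clamped at the image border.
            gx = gray[y][min(x + 1, width - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, height - 1)][x] - gray[max(y - 1, 0)][x]
            ng[y][x] = min(abs(gx) + abs(gy), 255)
    return ng


def ng_feature(ng, x, y):
    """64-D NG feature of the 8x8 block whose top-left corner is (x, y)."""
    return [ng[y + dy][x + dx] for dy in range(8) for dx in range(8)]
```

Because the image is resized once per quantized window size, every 8×8 block of the normed gradient map corresponds to one candidate window at that size.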
The NG feature we use is a dense and compact objectness feature with the following advantages. First, because the support region is normalized, the NG feature stays roughly unchanged however the object window changes in position, scale, and aspect ratio. In other words, the NG feature is insensitive to position, scale, and aspect ratio, which is useful for detecting objects of any category; such insensitivity is a property that a good generic object detection method should have. Second, the compactness of the NG feature makes it very efficient to compute and to verify, so it can be used in real-time applications.
Figure 1. Although objects (red) and background (green) differ greatly in image space (a), once windows of appropriate scale and aspect ratio are resized to a fixed size (b), their NG features (c) show strong commonality. Based on the NG features, we learn a simple 64-D linear model (d) that is used to select object windows.
The drawback of the NG feature is its limited discriminative ability. In general, however, the resulting false positives are removed by the subsequent category-specific detectors. In Section 4 we show experimentally that, on the very challenging VOC2007 data set, our proposals cover 96.2% of the object windows.
3.2 Objectness Metric
To learn the objectness of image windows, we use a two-stage cascaded SVM. In the first stage, we learn the model $w$ of formula (1) with a linear SVM [24], using windows that contain foreground objects and randomly sampled background windows as positive and negative training samples. In the second stage, we learn $v_i$ and $t_i$ with a linear SVM: formula (1) is evaluated on the training images at each size $i$, the selected windows are used as training samples with their filter scores as one-dimensional features, and their labels are checked against the training image annotations.
Discussion: As shown in Figure 1, the linear model $w$ resembles the multi-scale center-surround patterns of biologically plausible primate vision architectures [27, 38, 54], with larger weights along the borders to separate the object from the surrounding background. Compared with manually designed center-surround patterns [36], our learned model $w$ captures a more sophisticated and more natural prior. For example, the lower part of an object is more often occluded than the upper part, and this is reflected in $w$ placing less confidence in the lower regions.
3.3 Binarized normed gradients (BING)
To take advantage of binary model approximation [28, 59], we propose an accelerated version of the NG feature, the binarized normed gradients (BING), which speeds up both feature extraction and testing. The learned linear model $w \in \mathbb{R}^{64}$ can be approximated by a combination of basis vectors:
$w \approx \sum_{j=1}^{N_w} \beta_j \alpha_j$
where $N_w$ is the number of basis vectors, $\alpha_j \in \{-1, 1\}^{64}$ is a basis vector, and $\beta_j \in \mathbb{R}$ is the corresponding coefficient. Each $\alpha_j$ can be further represented by a binary vector and its complement:
$\alpha_j = \alpha_j^+ - \overline{\alpha_j^+}$, where $\alpha_j^+ \in \{0, 1\}^{64}$.
A binarized feature $b$ can then be tested directly against the model using only bitwise AND and bit count operations [28]:
$\langle w, b \rangle \approx \sum_{j=1}^{N_w} \beta_j \left( 2 \langle \alpha_j^+, b \rangle - |b| \right)$ (4)
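The binary approximation of the model and the AND/bit-count test can be sketched as follows. How the basis vectors are obtained is not spelled out above; the greedy residual scheme used here (take the sign of the current residual, project, subtract) is one common choice and should be read as an assumption, as should all function names. Binary vectors are packed into Python integers, with bit i holding element i.

```python
def binarize_model(w, num_bases):
    """Greedy approximation w ~ sum_j beta_j * alpha_j, alpha_j in {-1, +1}^n.

    Returns (betas, masks), where bit i of masks[j] is set iff alpha_j[i] == +1.
    """
    n = len(w)
    residual = list(w)
    betas, masks = [], []
    for _ in range(num_bases):
        alpha = [1 if r >= 0 else -1 for r in residual]
        # Projection of the residual onto alpha; ||alpha||^2 equals n.
        beta = sum(a * r for a, r in zip(alpha, residual)) / n
        residual = [r - beta * a for r, a in zip(residual, alpha)]
        mask = 0
        for i, a in enumerate(alpha):
            if a == 1:
                mask |= 1 << i
        betas.append(beta)
        masks.append(mask)
    return betas, masks


def popcount(value):
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


def binary_dot(betas, masks, b):
    """Approximate <w, b> for a packed binary vector b: sum_j beta_j * (2<alpha_j^+, b> - |b|)."""
    ones = popcount(b)
    return sum(beta * (2 * popcount(mask & b) - ones)
               for beta, mask in zip(betas, masks))
```

The identity behind `binary_dot` is that the complement of the positive part contributes |b| minus the overlap with the positive part, so each basis vector needs only one AND and one bit count.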
The key challenge is how to binarize the NG features and compute them efficiently. We approximate each normed gradient value (stored as a byte) by its top $N_g$ bits.
Figure 2. Illustration of variables: a BING feature $\mathbf{b}_{x,y}$, its last row $\mathbf{r}_{x,y}$, and its last element $b_{x,y}$. Note that the subscripts $i, x, y, l, k$ introduced in formulas (2) and (5) locate a whole vector rather than index an element within a vector. We can use simple atomic variables (INT64 and BYTE) to represent a BING feature and its last row, which makes feature computation more efficient.
Therefore, the 64-D NG feature $g_l$ can be approximated by $N_g$ binarized normed gradient (BING) features:
$g_l = \sum_{k=1}^{N_g} 2^{8-k} b_{k,l}$ (5)
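To make the top-bits approximation concrete, the sketch below splits byte-valued normed gradients into bit planes (most significant bit first) and recombines them with weights 2^(8-k). The function names and the list-based representation are assumptions made for illustration only.

```python
def top_bit_planes(ng_bytes, num_bits):
    """Return num_bits binary planes; plane k holds bit (8 - k) of every byte."""
    return [[(value >> (8 - k)) & 1 for value in ng_bytes]
            for k in range(1, num_bits + 1)]


def reconstruct(planes):
    """Recombine the planes with weights 2^(8 - k), k starting at 1."""
    length = len(planes[0])
    return [sum(plane[i] << (8 - k) for k, plane in enumerate(planes, start=1))
            for i in range(length)]
```

With four planes, reconstruction keeps the upper four bits of each byte, so the per-value error is at most 15 out of 255.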
Note that these BING features carry different weights, depending on the bit position they were taken from. Obtaining an 8×8 BING feature naively requires looping over 64 positions. By exploiting two properties of the 8×8 BING feature, we propose a fast feature computation that avoids this loop and uses only a few simple atomic operations (bitwise OR and bitwise shift).
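One way such a shift-and-OR computation can work is sketched below. The recurrences are an assumption for this illustration: the last row of a window is the last row of its left neighbour shifted by one bit with the new bit appended, and the whole feature is the feature of the window one pixel above shifted by eight bits with the new last row appended. Python integers are unbounded, so explicit masks emulate the BYTE and INT64 widths.

```python
ROW_MASK = 0xFF                  # emulates an 8-bit BYTE
FEATURE_MASK = (1 << 64) - 1     # emulates a 64-bit INT64


def bing_features(bit_map):
    """For each pixel, pack the 8x8 window ending at that pixel into 64 bits.

    bit_map is a 2D list of 0/1 values (one bit plane of the normed gradients).
    Values at pixels with x < 7 or y < 7 cover only a partial window.
    """
    height, width = len(bit_map), len(bit_map[0])
    rows = [[0] * width for _ in range(height)]
    feats = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            left = rows[y][x - 1] if x > 0 else 0
            rows[y][x] = ((left << 1) | bit_map[y][x]) & ROW_MASK
            above = feats[y - 1][x] if y > 0 else 0
            feats[y][x] = ((above << 8) | rows[y][x]) & FEATURE_MASK
    return feats
```

Each pixel costs two shifts and two ORs regardless of the window size, instead of a 64-step loop per window.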