The previous blog post on accelerating Deformable Parts Model (DPM) detection mentioned the following paper in its introduction:
[1] Exact Acceleration of Linear Object Detectors, ECCV
By using the FFT, the paper turns the convolution of the model with the HOG features in the spatial domain into element-wise multiplication in the frequency domain, which accelerates DPM detection. The paper also provides a C++ implementation of the entire DPM detection algorithm, called Fast Fourier Linear Detector (FFLD for short), which is a very valuable reference. This post briefly reviews the DPM detection process, then describes how FFLD [1] accelerates the convolution with the FFT, and finally explains how to accelerate detection further on the basis of the FFLD code.
Algorithm Review
The DPM detection process is reviewed first, as shown in Figure 1.
Figure 1 Standard DPM detection process
FFLD: Accelerating convolution operations with FFT
[1] explains in detail, and in an accessible way, why the FFT can accelerate the convolution of the model with the HOG features, so that explanation is not repeated here. The following is a simple illustration using one-dimensional signals.
Suppose there are two one-dimensional vectors x and y of length n. Computing their convolution z directly in the time domain has complexity O(n*n). In the frequency domain the situation is different, because convolution in the time domain corresponds to multiplication in the frequency domain. Applying the FFT to x and y gives X and Y at a cost of O(n log n). Multiplying X and Y element by element (complex multiplication) gives the frequency-domain sequence Z at a cost of O(n). Applying the IFFT to Z gives the time-domain z at a cost of O(n log n). Using the FFT therefore reduces the total complexity from O(n*n) to O(n log n), and the saving grows with the sequence length.
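The one-dimensional case can be sketched in a few lines of Python. This is only an illustration and not code from FFLD: the function names are made up for this example, the FFT is a simple recursive radix-2 version, and both inputs are zero-padded to a power of two so that the result is a linear rather than a circular convolution.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def direct_convolve(x, y):
    """Time-domain convolution, O(n*n)."""
    z = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            z[i + j] += xi * yj
    return z

def fft_convolve(x, y):
    """Same result via FFT, element-wise product and inverse FFT, O(n log n)."""
    m = len(x) + len(y) - 1
    size = 1
    while size < m:          # zero-pad to the next power of two
        size *= 2
    X = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    Y = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    Z = [a * b for a, b in zip(X, Y)]      # multiplication in the frequency domain
    z = fft(Z, invert=True)
    return [v.real / size for v in z[:m]]  # divide by size to normalize the inverse
```

For short vectors the direct loop is faster in practice; the FFT route pays off only as the sequences grow.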
Suppose we have a model F and three HOG feature layers HOG0, HOG1 and HOG2, and denote the convolution results of F with these three layers as R0, R1 and R2, as shown in Figure 2. Note that during the convolution F never extends beyond the HOG feature, so each result is smaller than the corresponding HOG feature layer.
Figure 2 Direct convolution
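The size relationship can be checked with a small sketch of the direct computation. It is a simplification made for illustration: it uses a single feature channel instead of the 31 HOG channels, and the name valid_response is hypothetical. For a feature layer of H x W cells and a filter of h x w cells, the response has (H - h + 1) x (W - w + 1) entries.

```python
def valid_response(feature, filt):
    """Slide filt over feature without leaving its borders.

    feature and filt are lists of rows (single channel, for illustration).
    Returns a response map of size (H - h + 1) x (W - w + 1).
    """
    H, W = len(feature), len(feature[0])
    h, w = len(filt), len(filt[0])
    out = []
    for y in range(H - h + 1):
        row = []
        for x in range(W - w + 1):
            s = 0.0
            for dy in range(h):
                for dx in range(w):
                    s += feature[y + dy][x + dx] * filt[dy][dx]
            row.append(s)
        out.append(row)
    return out
```

With real HOG features the inner sum would also run over the feature channels.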
Below we use the FFT to accelerate the convolution. Because the HOG features and the model F have different dimensions, transforming each at its own size would not allow element-wise multiplication in the frequency domain. Since F is the smaller of the two, it is padded with zeros before its FFT is taken. [1] gives three ways to implement this.
The first method is shown in Figure 3. F is padded to the sizes of HOG0, HOG1 and HOG2 in turn and transformed with the FFT. Each result is multiplied in the frequency domain with the FFT of the matching HOG layer, the IFFT is taken, and the appropriate region of the inverse transform is selected to obtain R0, R1 and R2. The disadvantage of this approach is that F must be padded to many different sizes. DPM detection involves many HOG layer sizes as well as a large number of root and part models, so this method has to keep many model FFTs in memory, and memory consumption is large.
Figure 3 The first method of FFT implementation
The second method is shown in Figure 4. F is padded only to the HOG0 size before the FFT. HOG0 is transformed directly without padding, while HOG1 and HOG2 are padded to the HOG0 size before their FFTs. The transforms are multiplied in the frequency domain, the IFFT is taken, and R0, R1 and R2 are extracted from the appropriate regions of the inverse transforms. The advantage of this approach is that each model needs an FFT at only one size, so memory consumption is small. The problem is that every frequency-domain multiplication and every IFFT is carried out at the size of the largest HOG layer, so there are more multiplications and the IFFTs are more expensive. Overall, the amount of computation is large.
Figure 4 The second method of FFT implementation
The third method is shown in Figure 5. F is padded to the HOG0 size before the FFT, and HOG0 is transformed directly without padding. The smaller HOG layers are stitched together, without overlapping, into rectangles of the HOG0 size, which are then transformed. After multiplication in the frequency domain, the IFFT is taken and R0, R1 and R2 are extracted from the appropriate regions of the inverse transform. This approach avoids the disadvantages of the first two. On the one hand, each model is padded only to the HOG0 size, so the model FFTs consume less memory than in the first method. On the other hand, because several HOG layers are packed into one HOG0-sized rectangle before the FFT, frequency-domain multiplication and IFFT, fewer HOG0-sized computations are needed than in the second method. In practice, therefore, this stitching approach is the one used to accelerate the convolution with the FFT.
Figure 5 The third method of FFT implementation
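To make the stitching idea concrete, here is a minimal sketch of packing layers into HOG0-sized planes without overlap. It uses a simple shelf-packing heuristic chosen for brevity; this is an assumption for illustration and not necessarily the placement strategy used in FFLD, and the name pack_layers is hypothetical.

```python
def pack_layers(sizes, plane_h, plane_w):
    """Place layers of the given (height, width) sizes into planes of plane_h x plane_w.

    Returns one (plane_index, y, x) tuple per layer, in input order.
    Layers are placed tallest first, left to right on horizontal shelves.
    """
    order = sorted(range(len(sizes)), key=lambda i: sizes[i][0], reverse=True)
    placements = [None] * len(sizes)
    planes = []  # each plane is a list of shelves [y, shelf_height, next_free_x]
    for i in order:
        h, w = sizes[i]
        if h > plane_h or w > plane_w:
            raise ValueError("layer is larger than the plane")
        placed = False
        for p, shelves in enumerate(planes):
            # Try to append to an existing shelf of this plane.
            for shelf in shelves:
                if h <= shelf[1] and shelf[2] + w <= plane_w:
                    placements[i] = (p, shelf[0], shelf[2])
                    shelf[2] += w
                    placed = True
                    break
            if placed:
                break
            # Otherwise try to open a new shelf below the last one.
            used_h = shelves[-1][0] + shelves[-1][1]
            if used_h + h <= plane_h:
                shelves.append([used_h, h, w])
                placements[i] = (p, used_h, 0)
                placed = True
                break
        if not placed:
            # No existing plane has room: start a new plane.
            planes.append([[0, h, w]])
            placements[i] = (len(planes) - 1, 0, 0)
    return placements
```

Each plane then needs one forward FFT, and the response for a layer is read back from the region where that layer was placed.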
The DPM detection algorithm using FFT acceleration is shown in Figure 6, which is also a flowchart of the FFLD program.
Figure 6 An FFT-accelerated DPM detection process
Further optimization of the FFLD program
FFLD provides only a sample program, which has some shortcomings if applied directly to a real detection project. Other papers also describe many methods for accelerating detection, so the natural idea is to optimize the FFLD program further.
Caching the model FFT
In the sample test program provided with FFLD, every step in Figure 6 is executed for each image. If all images to be detected have the same size, the model FFT actually needs to be computed only once, which saves a lot of detection time.
In real applications, however, images may come in many sizes, and we cannot always scale them to one fixed size. In such cases we need another way to avoid recomputing the model FFT.
One solution is to cache the model FFTs. When an image is to be detected, first check whether the model FFT of the required size is already in memory. If it is, return it directly for the subsequent detection steps. If not, compute the model FFT at that size, store it, and return it. Each time a model FFT of a new size is created, check whether the memory used by the cached model FFTs has reached a preset maximum, and if so, delete model FFTs of sizes that have not been used for a long time. The detection flow with this optimization is shown in Figure 7.
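The caching policy described above amounts to a least-recently-used cache keyed by size. The following is a minimal sketch under that reading, not FFLD code: the class name ModelFFTCache and the compute_fft callback (which is assumed to return the transformed model together with its memory footprint in bytes) are hypothetical.

```python
from collections import OrderedDict

class ModelFFTCache:
    """Keeps model FFTs per plane size and evicts the least recently used ones."""

    def __init__(self, compute_fft, max_bytes):
        # compute_fft is a hypothetical callback: (rows, cols) -> (fft_data, n_bytes)
        self._compute = compute_fft
        self._max_bytes = max_bytes
        self._used = 0
        self._entries = OrderedDict()  # (rows, cols) -> (fft_data, n_bytes)

    def get(self, rows, cols):
        key = (rows, cols)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key][0]
        data, n_bytes = self._compute(rows, cols)
        # Evict the oldest entries until the new one fits the memory budget.
        while self._entries and self._used + n_bytes > self._max_bytes:
            _, (_, old_bytes) = self._entries.popitem(last=False)
            self._used -= old_bytes
        self._entries[key] = (data, n_bytes)
        self._used += n_bytes
        return data
```

A detector would call get(rows, cols) once per image with the plane size chosen for that image, so repeated sizes cost only a dictionary lookup.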
Figure 7 Accelerated DPM detection process with a model FFT cache
HOG pyramid acceleration
The HOG pyramid is built with a parameter λ: the resolution of the HOG feature at layer i is twice that of the HOG feature at layer i+λ. As a result, the 31-dimensional feature at one position of layer i+λ corresponds to the 31-dimensional features at four adjacent positions of layer i, as shown in Figure 8. The low-resolution feature is clearly correlated with the four high-resolution ones, and the red boxes in Figure 8 actually cover the same area of the original image. Given this factor-of-two relationship, it is natural to approximate one value of the low-resolution HOG feature from four values of the high-resolution HOG feature.
Figure 8 Correspondence between HOG features at different resolutions
Figure 9 shows one implementation. First compute the HOG features of layer i, where 0 ≤ i < λ. Then repeatedly approximate each lower-resolution value from the four neighbouring higher-resolution values to obtain the HOG features of layer i+λ, layer i+2λ, and so on. The low-resolution HOG features computed this way are less accurate, because the full HOG computation includes nonlinear normalization and truncation steps.
Figure 9 A fast method for computing HOG features
Figure 10 shows another implementation. First compute the soft-binning histogram of layer i, where 0 ≤ i < λ. Then repeatedly approximate each lower-resolution value from the four neighbouring higher-resolution values to obtain the soft-binning histograms of layer i+λ, layer i+2λ, and so on. The HOG features of each layer are then obtained from its soft-binning histogram by normalization and truncation. This method approximates the low-resolution HOG features more accurately and still builds the HOG feature pyramid quickly. It should be pointed out that the most time-consuming step in building the HOG feature pyramid is computing the gradients, so any time saved in this step makes pyramid construction much faster.
Figure 10 Another fast method for computing HOG features
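The second scheme can be sketched as follows. This is an illustration under stated assumptions, not FFLD code: the four high-resolution histogram cells are combined by summing (the text does not say how the four values are combined), the histogram of a layer is stored as hist[y][x][bin], and the function names are hypothetical. Normalization and truncation would be applied to each layer afterwards.

```python
def downsample_histogram(hist):
    """Approximate the histogram at half resolution by summing 2x2 blocks of cells.

    hist[y][x] is a list of orientation-bin values for one cell.
    """
    rows, cols = len(hist) // 2, len(hist[0]) // 2
    bins = len(hist[0][0])
    out = []
    for y in range(rows):
        row = []
        for x in range(cols):
            cell = [0.0] * bins
            for dy in (0, 1):
                for dx in (0, 1):
                    src = hist[2 * y + dy][2 * x + dx]
                    for b in range(bins):
                        cell[b] += src[b]
            row.append(cell)
        out.append(row)
    return out

def extend_pyramid(base_layers, total_layers):
    """base_layers holds the fully computed histograms of layers 0 .. lam-1.

    Layer i (i >= lam) is derived from layer i - lam, which has twice its resolution.
    """
    lam = len(base_layers)
    layers = list(base_layers)
    for i in range(lam, total_layers):
        layers.append(downsample_histogram(layers[i - lam]))
    return layers
```

Only the first λ layers require gradient computation; every later layer is derived from an existing histogram.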
With the methods above, only the λ highest-resolution HOG layers need to be computed in full. The HOG features of the remaining layers can be computed quickly by downsampling.
Layered detection
Here we review the roles of the root model and the part models in DPM. The root model mainly locates potential object regions and gives the approximate position of the object. To confirm whether the expected object is really there, however, the part models must also be evaluated.
This leads to the following fast detection method. First, the root model is convolved with the HOG feature layers at the root level (optionally using the FFT to accelerate the convolution) to obtain the root responses. Non-maximum suppression is then applied to the responses of each layer, and the positions whose score exceeds a preset threshold become the candidate positions of that layer. For each candidate position, go to the corresponding part layer (note that the part models have twice the resolution of the root model, so the part layer index is the root layer index minus λ) and, in the neighbourhood of each part's expected position, find the maximum of the part-HOG convolution response minus the deformation penalty. Combining these values gives the final score of the root and part models at the candidate position, as shown in Figure 11. If the score is greater than the threshold, an object is detected at that position. After all computations are complete, the final detections are obtained by removing lower-scoring positions that overlap heavily with higher-scoring detections.
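The two-stage scoring can be sketched as follows. This is a simplified illustration, not FFLD code: the function names, the search radius, and the single quadratic penalty weight are assumptions (a full DPM uses separate deformation coefficients per part), the part responses are assumed to be precomputed maps at twice the root resolution, and the non-maximum suppression steps are omitted.

```python
def best_part_score(part_response, anchor, radius, penalty_weight):
    """Maximum of response minus a quadratic deformation penalty near the anchor."""
    ay, ax = anchor
    best = float("-inf")
    for y in range(max(0, ay - radius), min(len(part_response), ay + radius + 1)):
        for x in range(max(0, ax - radius), min(len(part_response[0]), ax + radius + 1)):
            penalty = penalty_weight * ((y - ay) ** 2 + (x - ax) ** 2)
            best = max(best, part_response[y][x] - penalty)
    return best

def layered_detect(root_response, part_responses, part_offsets,
                   root_threshold, final_threshold, radius=2, penalty_weight=0.1):
    """Evaluate the parts only where the root score passes root_threshold.

    part_offsets[k] is the expected (dy, dx) of part k relative to the root
    position, expressed in part-layer cells.
    Returns a list of (score, root_y, root_x) detections.
    """
    detections = []
    for ry, row in enumerate(root_response):
        for rx, root_score in enumerate(row):
            if root_score <= root_threshold:
                continue  # not a candidate: skip all part computations
            score = root_score
            for resp, (oy, ox) in zip(part_responses, part_offsets):
                # The part layer has twice the resolution of the root layer.
                anchor = (2 * ry + oy, 2 * rx + ox)
                score += best_part_score(resp, anchor, radius, penalty_weight)
            if score > final_threshold:
                detections.append((score, ry, rx))
    return detections
```

The saving comes from the early `continue`: the part models are evaluated only at the few positions that the root model accepts.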
Figure 11 Starting from the position of the root model, search the part layer for the position where the part-HOG convolution response minus the deformation penalty reaches its maximum. The root model and the part models are shown in the upper right corner of the figure.
Actual running results
With the original FFLD detection program and a mixture DPM with two components, pedestrian detection on a 400x300 image takes about 100 milliseconds, of which computing the HOG features and the subsequent calculations take about 45 and 55 milliseconds respectively. With the fast HOG pyramid construction described above (section 3.2), the HOG computation time drops to about 20 milliseconds, the subsequent calculations drop to about 15 milliseconds, and the total detection time drops to about 35 milliseconds.