Fast, Accurate Detection of 100,000 Object Classes on a Single Machine (reprint)

Source: Internet
Author: User
Tags: cos

Reading notes on "Fast, Accurate Detection of 100,000 Object Classes on a Single Machine"

This paper won the Best Paper award at CVPR 2013.

This paper's headline numbers are impressive: it detects 100,000 object classes, roughly 20,000 times faster than the baseline. The drawbacks: processing a single image on one machine still takes about 20 s, and the mAP over the 100,000 classes is only 0.16, so despite the nice-looking numbers there is still some distance to practical use.

The selling point is speed, specifically for the multi-class detection problem: the detection cost can be made essentially independent of the number of categories.

For object detection with C classes, a basic framework trains C classifiers, evaluates every classifier at every candidate location, and then fuses the results in post-processing. The drawback is speed: the processing time grows linearly with the number of classes, i.e., complexity O(C).

The baseline referenced in the paper is the DPM model: each object is modeled by several parts (say P), and each part model is a dot product between a filter (the weights) and the features at a location, which as a whole can be viewed as a convolution; the object's position is then determined from constraints on the candidate part positions. In practice the convolution dominates the runtime: every object classifier's filters must be dotted with the features at every candidate location. With W candidate windows and M-dimensional window features, the cost is O(W*C*P*M).

The paper builds on an earlier result that converts the dot product of two vectors into the Hamming distance between two hash values. (Dot product and cosine distance are closely related: if the two vectors are L2-normalized, their cosine similarity and dot product coincide, so the similarity comparison can be converted into a Hamming distance between two hashes.)

The conversion from a feature to an LSH hash (the resulting hash is an LSH hash, specifically the WTA hash) works as follows. Suppose the feature has M dimensions, and fix K, the number of leading elements to keep. Generate a random permutation of the indices 0..M-1 (each position of the permutation says which original dimension the corresponding temporary dimension comes from) and rearrange the feature accordingly; keep the first K dimensions of the permuted feature as a new, temporary feature; record the index of the largest element in this temporary feature, encoded as a log2(K)-bit binary string. Repeat with N random permutations in total and concatenate the codes, the code from the first permutation occupying the lowest bits, to obtain a single hash value: an N*log2(K)-bit integer.

Correspondingly, the dot product between two features is converted into the Hamming distance between the two corresponding hashes.

Intuitively, because the resulting codes depend only on the relative order of the values, and each code retains only the position of the maximum, the hash is very robust to perturbations of the individual numbers. The Hamming distance between two such hashes is therefore a more robust and more efficient similarity measure for the features (for whether this actually holds, see J. Yagnik, D. Strelow, D. A. Ross, and R.-S. Lin, "The Power of Comparative Reasoning," IEEE International Conference on Computer Vision, 2011).
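As a concrete illustration of the hashing scheme just described, here is a minimal Python sketch (assuming NumPy). The names wta_hash, hamming_similarity, and the parameters are illustrative choices, not the paper's code, and for simplicity the similarity counts matching band codes rather than matching bits, a common simplification of the Hamming distance described above.

```python
import numpy as np

def wta_hash(x, perms, k):
    """WTA hash of an M-dimensional feature vector x.

    perms: (N, M) array, each row a random permutation of 0..M-1.
    k:     number of leading elements kept after each permutation.
    Returns N band codes, each in [0, k): the index of the maximum among
    the first k entries of the permuted feature, i.e. log2(k) bits each.
    """
    codes = np.empty(len(perms), dtype=np.int64)
    for i, perm in enumerate(perms):
        # Only the *ordering* of the values matters here, which is what
        # makes the code robust to perturbations of the magnitudes.
        codes[i] = int(np.argmax(x[perm[:k]]))
    return codes

def hamming_similarity(codes_a, codes_b):
    """Count agreeing band codes between two WTA hashes. This stands in
    for the dot product between the original vectors: more agreeing
    bands ~ larger dot product."""
    return int(np.sum(codes_a == codes_b))
```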
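And a quick check of the robustness claim, reusing the helpers above; the numbers are made up purely for illustration:

```python
rng = np.random.default_rng(0)
M, N, k = 128, 64, 4                 # 64 permutations, 2 bits per band
perms = np.stack([rng.permutation(M) for _ in range(N)])

x = rng.standard_normal(M)
x_noisy = x + 0.1 * rng.standard_normal(M)   # slightly perturbed copy of x
y = rng.standard_normal(M)                   # unrelated vector

hx = wta_hash(x, perms, k)
hn = wta_hash(x_noisy, perms, k)
hy = wta_hash(y, perms, k)
print(hamming_similarity(hx, hn))   # high: most of the 64 bands agree
print(hamming_similarity(hx, hy))   # low: about 64/k = 16 agree by chance
```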
Since computing the Hamming distance between two hashes is very fast (it can even be done by table lookup), the most time-consuming part becomes computing the feature and its hash for each window, and that cost is independent of the number of classes.

Any feature whose similarity is measured by a dot product can be used this way; in object detection the most common choice is the HOG feature.

Taking HOG as the example, here is the comparison between the baseline algorithm and the improved computation proposed in the paper. The baseline algorithm:

1. Compute multi-scale edge strength and edge orientation images.
2. For every window, compute the Gaussian-weighted HOG histogram and take the dot product of this HOG feature with each of the C*P filters.
3. Keep the windows with locally maximal responses as candidates and fuse them into the final detections.

The improved algorithm precomputes the hash values of the C*P filters in advance, then:

1. Compute multi-scale edge strength and edge orientation images.
2. For every window, compute the hash of the Gaussian-weighted HOG histogram and the Hamming distance between this feature hash and each of the C*P filter hashes.
3. Keep the windows with locally maximal responses as candidates, accumulate their votes for possible object centers, and fuse them into the final detections.

Comparing the two, in the improved algorithm the Hamming-distance computations are so fast that they can be ignored, so the runtime of the multi-class detector ends up independent of the number of classes (a sketch of this loop follows below).

For further speed, the Hamming-distance computation can be turned into pure table lookups, with early termination once the accumulated similarity exceeds a threshold. The hash is split into several parts so that each table stays small: the N*log2(K)-bit hash is divided into N/m groups (bands), each an m*log2(K)-bit integer (m is written in lowercase here to avoid clashing with the feature dimension M). For every band of every filter (i.e., of every trained model), a lookup table is built: the key is the current window feature's hash value on that band, and the stored value is the similarity between the feature hash and the model hash on that band. This avoids the explicit Hamming-distance computation: each filter simply accumulates the values of its N/m lookup tables as its dot-product (similarity) estimate. While accumulating over the N/m groups, as soon as the similarity exceeds the threshold the remaining lookups are skipped and the window's vote is added directly to the estimated distribution of object positions (also sketched below).

Finally, the object is located where the accumulated position votes peak.

The paper contains some further details (such as the root filter, and re-scoring the surviving candidates with exact dot products after the fast pass); they are not repeated here.
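Here is a minimal sketch of the hashed detection loop described above, reusing wta_hash and hamming_similarity from the earlier snippet; detect, windows, filter_hashes, and threshold are hypothetical names, not the paper's API, and the part-geometry constraints and center-voting are omitted for brevity.

```python
def detect(windows, filter_hashes, perms, k, threshold):
    """windows:       iterable of (window_id, hog_feature) pairs.
    filter_hashes: dict mapping (class_id, part_id) to the precomputed
                   WTA hash of that part filter.
    Returns (window_id, class_id, part_id, similarity) candidates."""
    candidates = []
    for window_id, feat in windows:
        h = wta_hash(feat, perms, k)         # the expensive step: once per window
        for key, fh in filter_hashes.items():
            sim = hamming_similarity(h, fh)  # cheap integer comparisons
            if sim >= threshold:
                class_id, part_id = key
                candidates.append((window_id, class_id, part_id, sim))
    return candidates
```

The expensive feature hashing happens once per window regardless of C; the remaining loop over filters is the part the band lookup tables accelerate further.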
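And a minimal sketch of the band lookup-table trick with early termination, under the same assumed names; the table layout and the early-out policy follow the description above, but the details are guesses, not the paper's implementation.

```python
import numpy as np

def pack_bands(codes, m, k):
    """Pack each consecutive group of m band codes into one integer key."""
    groups = codes.reshape(-1, m)            # (N/m, m) band codes
    keys = np.zeros(len(groups), dtype=np.int64)
    for j in range(m):
        keys = keys * k + groups[:, j]       # base-k positional encoding
    return keys

def build_tables(filter_hash, m, k):
    """One lookup table per group for a single filter: the key is the packed
    window code for that band group, the value is how many of the m band
    codes agree with the filter's codes (its similarity contribution)."""
    tables = []
    for g in filter_hash.reshape(-1, m):
        table = np.zeros(k ** m, dtype=np.int32)
        for key in range(k ** m):
            # Decode the key back into its m base-k digits (most significant
            # first, matching pack_bands) and count agreements with g.
            digits, rest = [], key
            for _ in range(m):
                digits.append(rest % k)
                rest //= k
            digits.reverse()
            table[key] = sum(int(d == fg) for d, fg in zip(digits, g))
        tables.append(table)
    return tables

def similarity_with_early_out(window_keys, tables, threshold):
    """Accumulate per-group similarities; stop as soon as the running total
    clears the threshold, as described above."""
    total = 0
    for key, table in zip(window_keys, tables):
        total += int(table[key])
        if total >= threshold:
            break                            # accept early, skip the rest
    return total

# Usage with the earlier snippet's parameters: m = 4 bands per group, so
# each table has k**m = 256 entries.
# tables = build_tables(wta_hash(filter_weights, perms, k), m=4, k=4)
# keys   = pack_bands(wta_hash(window_feature, perms, k), m=4, k=4)
# sim    = similarity_with_early_out(keys, tables, threshold=40)
```

Each table has k**m entries, and there is one set of tables per filter, which is where the memory goes; that is consistent with the ~20 GB figure mentioned below.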
One point worth discussing: the training data for the 100,000 classes was crawled via a search engine without manual labeling, so the results contain some inaccuracy. But qualitatively, the method is much faster.

Of course, compared with the baseline, the algorithm in this paper still loses some precision (see the VOC 2007 comparison in the paper: mAP drops from 0.26 to 0.24), and it consumes about 20 GB of memory, presumably for the lookup tables.

Inspiration:

1. This idea accelerates the basic dot-product operation w*x, which is extremely common: linear SVMs, cosine distance, the Wx terms inside neural networks and logistic regression, and so on. A relatively obvious application is multi-model detection frameworks (multi-class object detection, multi-pose face/car detection, etc.).
2. For multi-model detection, speed is a very important aspect. The usual idea is to increase feature sharing across models (e.g., LAB feature images, Vector Boosting) while also speeding up each single model (its feature and classifier computation). The most thorough feature sharing would be a deep learning model: only the last layer differs per class, all other layers are shared, every hidden node can be viewed as a feature shared by all categories, and each class only adds one Wh+b computation at the output layer, which is ideal sharing. Unfortunately a single deep learning model is still too slow; when traversing many candidate detection windows the overall speed looks hopeless for now. Interested readers may want to think about this problem.

Addendum: this method might in fact be used to speed up neural network models (including deep learning). The difficulty is that the error of approximating a dot product by a hash distance may be fairly large, so it may not work well; a sketch of the idea follows for anyone who wants to try it.
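Below is a hedged sketch of that generalization, reusing wta_hash and hamming_similarity from the first snippet; approx_scores and all parameter names are illustrative, this is the note's suggestion rather than the paper's method, and the approximation may be coarse.

```python
def approx_scores(x, weight_hashes, perms, k):
    """Approximate the linear score w_c . x for every class c (linear SVM,
    LR, or one output unit of a neural network) by the band-agreement
    count between WTA hashes of w_c and x.

    weight_hashes: dict mapping a class id to the precomputed WTA hash
    of that class's weight vector w_c."""
    h = wta_hash(x, perms, k)   # hash the input once, reuse for all classes
    return {c: hamming_similarity(h, wh) for c, wh in weight_hashes.items()}
```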
