Source Address: http://blog.sina.com.cn/s/blog_6ae183910101gily.html
Yesterday, Baidu launched its new similar-image search. I tried queries of several different types: scenery, people, text, and so on, and the results felt very good. Person search in particular returns results that closely match the query in both color and pose. Feed it a photo of a pretty girl in a striking pose, and it returns a whole series of pretty girls in similar poses; truly a blessing for otaku. In the spirit of fun, here is one such search result for everyone to enjoy.
We know that the technology under this product comes from Baidu's multimedia image group led by Kai Yu, but how exactly is it done? I'm sure many of you are curious. Below, following my own understanding, I lay out my guess at the technical scheme behind it; I'd love to discuss it in more depth with you so we can all make progress together.
First, Baidu image search (shitu.baidu.com) takes an image as input (either a local upload or an image URL) and returns identical, similar, and same-face images. If a face is detected in the query image (that is, if some region is determined to be a human face rather than, say, a cat's face), the results page shows three tabs: "All", "Similar images", and "Face search", for example:
If there is no face in the query image, only the first two tabs, "All" and "Similar images", appear.
In fact, the underlying content-based image technology supports three distinct functions: identical-image search (near-duplicate search), similar-image search, and face search, and the three differ substantially. From a functional point of view, the "All" tab combines two of them: identical-image search (which also exploits the text surrounding each copy of the image, treating the image as an information carrier) and the similar-image search described below.
Below I'll look at some of the techniques behind each of the three features: near-duplicate search, similar-image search, and face search.
The most intuitive idea for image search is to represent each image as a feature vector, compute the similarity (or distance) between the query's feature and the features of every image in the index, and sort to produce the results. In the early days of academic image retrieval this was generally good enough. The problem is that it only handles databases of modest size (say, up to the millions); beyond that, the speed becomes unacceptable.
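To make that brute-force baseline concrete, here is a minimal sketch (all names are illustrative, and plain Python lists stand in for real image features): every indexed image is scored against the query by cosine similarity, then sorted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def brute_force_search(query, index, top_k=5):
    """Score every indexed image against the query and sort.

    Fine for small collections; a linear scan over millions of
    images per query is exactly what the inverted-index and hashing
    tricks discussed below are meant to avoid.
    """
    scored = [(cosine(query, feat), img_id) for img_id, feat in index.items()]
    scored.sort(reverse=True)
    return scored[:top_k]
```

The point is only to show where the cost lives: the scoring loop touches every indexed image, so the per-query time grows linearly with the database.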
To speed up search, one can borrow from text retrieval, whose basic framework rests on TF-IDF weighting and an inverted index. TF is a term's frequency within a document, and IDF is the inverse document frequency. If we map visual words onto the terms of text retrieval, an image likewise has TF and IDF statistics. Practical large-scale image search systems therefore generally adopt a text-search-like framework: first express each image as visual-word frequencies, or convert its features into binary codes with some hashing method, so that search becomes fast. This is the most basic and central idea.
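A minimal sketch of that text-style framework, with integer visual-word ids standing in for quantized local features (the function names are mine, not any real system's): each visual word maps to the images containing it, so a query only touches images that share at least one word with it.

```python
import math
from collections import defaultdict

def build_inverted_index(images):
    """images: {image_id: [visual word ids]} -> posting lists + IDF.

    Mirrors text search: each visual word points to the images that
    contain it, together with its term frequency in each image.
    """
    postings = defaultdict(dict)  # word -> {image_id: term frequency}
    for img, words in images.items():
        for w in words:
            postings[w][img] = postings[w].get(img, 0) + 1
    n = len(images)
    idf = {w: math.log(n / len(docs)) for w, docs in postings.items()}
    return postings, idf

def tfidf_search(query_words, postings, idf):
    """Accumulate TF-IDF scores only over images sharing a query word."""
    scores = defaultdict(float)
    for w in query_words:
        for img, tf in postings.get(w, {}).items():
            scores[img] += tf * idf.get(w, 0.0)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Note that images sharing no visual word with the query are never even scored, which is the entire source of the speedup.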
For identical-image search, the earliest system was probably tineye.com; domestically, both Sogou and Baidu offer it, and before this I considered image.google.com the best. The main goal of this technology is to find different variants of the same image (brightness changes, partial crops, watermarks, and so on), so a key measure of quality is robustness to these deformations; moreover, to provide good recall, the index must be very large. The main applications are finding a higher-quality version of the same image and tracking image copyright.
Below, let me talk about the basic framework.
There are many choices for interest-point detection and local feature description; SIFT is the most common feature. After an image is represented as SIFT descriptors at many interest points, a vocabulary is trained offline (generally with hierarchical k-means, though other methods such as random projection also work). In short, this is vector quantization: it turns the image into word-frequency statistics (which can also be viewed as accumulating a histogram), after which the rest of the pipeline can borrow directly from text retrieval. Since this process ignores interest-point locations, a re-rank module is generally used to filter out false matches by applying geometric constraints on interest-point positions, discarding results whose layout differs from the query's. And since the candidate set at that stage is far smaller than the full index, more expensive methods can be used to reorder the results.
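The vector-quantization step can be sketched as follows. Here `vocabulary` stands in for the cluster centers an offline (hierarchical) k-means would learn; each local descriptor is assigned to its nearest visual word, accumulating the histogram that the inverted index then consumes.

```python
def quantize(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word.

    descriptors: list of descriptor vectors extracted at interest points
    vocabulary:  list of centroid vectors (the learned "visual words")
    Returns the word-frequency histogram for the image.
    """
    def dist2(u, v):
        # Squared Euclidean distance; square root not needed for argmin.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    hist = [0] * len(vocabulary)
    for d in descriptors:
        word = min(range(len(vocabulary)), key=lambda i: dist2(d, vocabulary[i]))
        hist[word] += 1
    return hist
```

A real system would use a hierarchical vocabulary tree so the nearest-word lookup is logarithmic rather than linear in vocabulary size; the flat scan here is just for clarity.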
To improve recall, a good method is query expansion: take the first-pass results whose similarity to the query exceeds some threshold, merge them with the query into a new query, and search the index again. A simple way to do the merging is feature-level averaging. This yields a noticeably higher recall rate.
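A sketch of feature-level query expansion under those assumptions (the threshold and all names are illustrative):

```python
def expand_query(query_feat, top_results, threshold=0.8):
    """Average the query feature with its most confident matches.

    top_results: list of (similarity, feature) pairs from a first
    search pass; only results scoring above `threshold` are trusted.
    The averaged feature is re-issued as a new, richer query.
    """
    trusted = [feat for sim, feat in top_results if sim >= threshold]
    pool = [query_feat] + trusted
    dim = len(query_feat)
    return [sum(f[i] for f in pool) / len(pool) for i in range(dim)]
```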
Of course, the above is only a general review of the near-duplicate detection framework; specific implementations may differ and may use special tricks to improve response speed, precision, and recall. References [1-3] are recommended reading.
Next, the technical scheme for similar-image search. Similar-image search is harder than identical-image search, and the very definition of "similar" is not well settled. Before Baidu's launch, Google's similar-image search was the best available, but Baidu's new version is genuinely impressive. At the moment, what everyone is doing is visual similarity, not true semantic similarity.
Identical-image detection counts as a relatively mature technique, whereas there is no settled method for similarity search. So next, let me guess at Baidu's scheme.
First, Baidu's similar-image search should be based on a full-image (global) feature. Unlike identical-image detection, it presumably skips interest-point detection: the image is resolution-normalized and then directly expressed as some feature representation. This feature directly determines the quality of the subsequent search results. Since each image becomes a single fixed-length feature, a visual-word representation no longer applies (or at least I don't see how it would). For fast search, I guess this feature is then converted into a hash signature and mapped into multiple buckets based on the hash values, so that only images falling into the same bucket as the query need to be processed, shrinking the candidate set and speeding up search. Concretely, an obvious approach is an LSH method such as MinHash: represent the feature with K independent hashes, map them into m bands, and process only the indexed images that fall into at least one band bucket shared with the query, which greatly reduces the computation per search. The basis for this is that the similarity between two images can be converted into the similarity between their MinHash signatures: by tuning the number of hash functions and the number of bands, one can guarantee that any two images whose signature similarity exceeds a threshold fall into at least one common bucket, which bounds the loss of recall. For the underlying principle, look up Google's SimHash.
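The banding trick can be sketched as follows, assuming each image already has a signature of K hash values (the function names and ids are illustrative, not any real system's):

```python
def band_keys(signature, num_bands):
    """Split a signature of K hash values into `num_bands` bands.

    Two images whose signatures agree on every hash inside at least
    one band get the same bucket key for that band, and so become
    candidates for each other without any pairwise comparison.
    """
    rows = len(signature) // num_bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows]))
            for b in range(num_bands)]

def candidates(query_sig, index_sigs, num_bands):
    """Return the ids of images sharing at least one band bucket
    with the query; only these need to be scored precisely."""
    buckets = {}
    for img, sig in index_sigs.items():
        for key in band_keys(sig, num_bands):
            buckets.setdefault(key, set()).add(img)
    hits = set()
    for key in band_keys(query_sig, num_bands):
        hits |= buckets.get(key, set())
    return hits
```

Raising the number of bands (with fewer hashes per band) lowers the effective similarity threshold and raises recall at the cost of more candidates; that is the tuning knob mentioned above.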
Based on the above guess, I drew a block diagram:
Because of the approximation involved in hashing and the possibility of collisions, after obtaining the candidate set we still need a re-rank step to reorder the results. Since the amount of data to process has been greatly reduced by this point, a richer, more primitive feature representation can be used (and I'd guess there is a lot of trickery in this step; it is a decisive factor in the final quality).
Let's go back and guess at the global representation used in similar-image search. There are many traditional global image descriptors: color histograms, texture, edges, shape, and so on. But Baidu's similar-image search should be using deep learning. Why? One reason is that the results are so good that traditional methods would struggle to match them (forgive me, I'm a deep-learning fanboy); the other, more important reason is that Kai Yu himself admitted it, haha. Even before he admitted it, many people strongly suspected deep learning was behind it.
So how might it be implemented, concretely? A few things seem clear. First, the input must be a color image (not grayscale), since the results are visibly similar in color. Second, the features show good robustness to shape while remaining position-sensitive: as features are built up layer by layer, the global location-related structure of the image is preserved, while pooling within small local regions provides local robustness. Also, similar-image search has no notion of categories, so one can infer that the deep network learns an abstract, discriminative representation of the image in an unsupervised way. As for the exact architecture, there are different options; my guess is a deep CNN structure: adjacent layers are connected through overlapping local receptive fields, small local patches share weights, and pooling reduces the feature dimension layer by layer. Of course, other DNN structures on top should also be feasible.
For similar-image search, if only visual similarity is required (ignoring semantic similarity), then once the global representation problem is solved, fast search can rely on relatively mature techniques. Deep learning is a very promising way to build that global representation. Of course, the concrete implementation surely involves many tricks and difficulties; those who have done it know, and since I haven't, I can only speculate.
OK, let's look at face search. Baidu's launch of web-scale face search has no precedent; it still has some problems, but it is already a very impressive attempt. In fact, I think face search and similar-image search will be very similar at the framework level: both split into two steps, representation and fast search. For faces, the representation takes a few steps. First, face detection finds the number and positions of faces in the image; when an image contains multiple faces, the largest or the most confident one can be chosen as the query. Second, face alignment locates feature points such as the eye centers, the mouth center, and the cheek-contour points; based on some subset of these (many choices are possible, but the eye centers plus the mouth center are typical), the face region is cropped and normalized, and identity features are then extracted from it, yielding the representation of the face. Here is a simple flowchart.
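The face-representation steps above can be sketched as a pipeline. Here `detect`, `align`, and `extract` are placeholders for real models (not any actual Baidu API); the sketch only shows how the stages compose and how the query face is chosen.

```python
def face_query_feature(image, detect, align, extract):
    """Sketch of the face-search front end described above.

    detect(image)  -> list of (confidence, face_region) pairs
    align(region)  -> normalized face crop via eye/mouth landmarks
    extract(crop)  -> fixed-length identity feature

    Returns None when no face is found (in which case the product
    would show only the "All" and "Similar images" tabs).
    """
    faces = detect(image)
    if not faces:
        return None
    # With several faces, take the most confident detection as the query.
    _, region = max(faces, key=lambda f: f[0])
    return extract(align(region))
```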
There are fairly mature algorithms for face detection and landmark localization, for example references [5][6]. As for the face representation, it is worth mentioning that, according to Kai Yu's public remarks, Baidu again used a deep architecture, that is, deep learning. Among published deep-learning results on face recognition, one of the best I have seen [4] fuses 4 features to reach 92% on LFW, which is not good enough; for example, the latest MSRA work reaches 93% with a single feature [7]. Baidu's method is its own creation and cannot be found in the public literature. Still, I think all roads lead to Rome: other deep learning frameworks should also be able to reach good results, though that will take some more exploration.
After obtaining a face representation, the next question is how to search faces quickly. As in similar-image search, each face is a fixed-length feature, so the natural approach is again a MinHash-style LSH method. And since celebrity queries return many copies of the same face among the results, I tend to think face search also uses query expansion to raise recall. Here is my guess at the face search flowchart:
Likewise, re-ranking can be used to improve the precision of the results.
Okay, guessing time is over. If you think I guessed wrong or have a better idea, please raise it so we can discuss it and I can improve.
A few more words of speculation.
We can see the following. First, in image search, domestic industry, with Baidu as its representative, has reached a very high standard and can go head to head with the world's best (I say this seriously). Second, deep learning is producing excellent results across image understanding: face recognition, OCR, and similar-image search. Third, large-scale image processing methods share much in common; text search methods inspired image search, which in turn can inspire face search. Fourth, large-scale image data plus deep learning are bringing new ideas and methods to traditional image understanding.
Technically, on the one hand, this is an exciting time for computer vision people: large-scale data and deep learning have brought significant improvements to many previously slow-moving applications. On the other hand, we must also recognize that no intelligent-image application has yet become a clear commercial success; even Baidu's new face search and similar-image search have not yet found a killer application. The revolution has not yet succeeded; comrades in computer vision must keep working.
On the application side, I think two directions are worth watching. One is image applications on mobile: voice has already become an important means of interaction; can images grab an entry point too? The other is hardware paired with intelligent image technology as interactive devices: with help from hardware sensors and light sources, the technology can be relatively mature and mass-producible (Kinect, Leap Motion, and so on). Could better interactive devices emerge in combination with phones and smart TVs?
BTW: it has been pointed out that MinHash-style methods may run into problems at very large scale; perhaps visual words plus an inverted index could solve face indexing instead. I haven't verified this; take it for reference only.
In addition, regarding similar-image search: given how well it works for certain object categories, I guess the system also contains an object recognition module that outputs the probability that the image contains each of N object classes, giving an N-dimensional feature whose i-th dimension is the probability that the image contains a class-i object. At search time, this feature can then measure the similarity of two images at the semantic level rather than just the appearance level, improving results for common categories.
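As a sketch of that idea, the hypothetical recognizer's N-dimensional class-probability vectors can be compared directly, for example by cosine similarity, so that two images of the same object category score high even when their colors and textures differ:

```python
import math

def semantic_similarity(probs_a, probs_b):
    """Compare two images by their N-way object-class probability
    vectors (output of a hypothetical recognition module).

    Cosine similarity over class probabilities scores images of the
    same dominant category highly regardless of their appearance.
    """
    dot = sum(a * b for a, b in zip(probs_a, probs_b))
    na = math.sqrt(sum(a * a for a in probs_a))
    nb = math.sqrt(sum(b * b for b in probs_b))
    return dot / (na * nb) if na and nb else 0.0
```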
Since I will formally join Baidu on August 21, 2013, I will not edit this article further.
Reference papers:
1. J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. ICCV 2003.
2. Herve Jegou, Matthijs Douze, Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. ECCV 2008.
3. Relja Arandjelovic, Andrew Zisserman. Three things everyone should know to improve object retrieval. CVPR 2012.
4. Xinyuan Cai, Chunheng Wang, Baihua Xiao, Xue Chen, Ji Zhou. Deep nonlinear metric learning with independent subspace analysis for face verification. ACM Multimedia 2012: 749-752.
5. Jianguo Li, Tao Wang, Yimin Zhang. Face detection using SURF cascade. ICCV Workshops 2011: 2183-2190.
6. Xudong Cao, Yichen Wei, Fang Wen, Jian Sun. Face alignment by explicit shape regression. CVPR 2012.
7. Dong Chen, Xudong Cao, Fang Wen, Jian Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. CVPR 2013.