In the Web 2.0 era, and especially with the popularity of social networking sites such as Flickr and Facebook, heterogeneous data (images, video, audio, text) are growing at a staggering rate every day. Facebook, for example, has more than one billion registered users, who upload more than one billion images a month; Flickr received 728 million uploaded images in 2015, an average of roughly 2 million per day; and the back-end system of Taobao, China's largest e-commerce platform, stores more than 28.6 billion images. Given these massive collections of visually rich images, how to search a vast image library conveniently, quickly, and accurately for the images a user needs or is interested in has become a research hotspot in multimedia information retrieval. Content-based image retrieval exploits the computer's strength at repetitive tasks and frees people from the human, material, and financial cost of manual handling. After more than a decade of development, it is now widely used in search engines, e-commerce, medicine, textiles, the leather industry, and many other areas of daily life.
Image retrieval methods fall into two broad categories: text-based image retrieval (TBIR) and content-based image retrieval (CBIR).
Text-based image retrieval dates back to the 1970s. It describes the content of an image with textual annotations, producing keywords for each image, such as the objects or scenes it contains; the annotation can be done manually, or semi-automatically with the help of image recognition. At query time, users supply keywords reflecting their interests, the system looks up the images whose labels match those keywords, and the results are returned to the user. This approach is easy to implement, and because a human intervenes in the annotation, its precision is relatively high; some small and medium-sized image search applications still use it today. Its drawbacks are equally obvious. First, the labeling process requires human intervention, so the approach suits only small-scale image data: on large collections, labeling consumes enormous human and financial resources, and images added to storage later always require further manual work. Second, since "a picture is worth a thousand words", users who need a precise query often find it hard to describe the image they really want with a few short keywords. Third, manual annotation is inevitably influenced by the annotator's knowledge, choice of words, and subjective judgment, so different people may describe the same picture differently.
With the rapid growth of image data and the problems of text-based retrieval in mind, the U.S. National Science Foundation reached a consensus in 1992 on a new direction for image database management systems: the most effective way to index image information is to base the index on the image content itself. Content-based image retrieval has been established and has developed rapidly in the decades since. The basic framework of a typical CBIR system is shown in Figure 1.1: the computer analyzes each image, builds a feature-vector description, and stores it in a feature library. When a user submits a query image, the same feature extraction method is applied to obtain a query vector; the similarity between the query vector and every feature in the library is then computed under some similarity measure, the results are sorted, and the corresponding images are output in order of similarity. CBIR hands both the content representation and the similarity measurement over to the computer for automatic processing. This overcomes the defects of text-based retrieval and gives full play to the computer's strength in computation, greatly improving retrieval efficiency and opening a new door to searching massive image libraries. Its shortcomings remain, however, chiefly the hard-to-bridge "semantic gap" between the feature description and high-level semantics, a gap that cannot be fully eliminated.
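The offline-index and online-query loop of that framework can be sketched end to end. In the toy example below, the gray-level histogram feature, the library size, and the Euclidean similarity measure are all illustrative stand-ins, not the descriptors discussed later:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_feature(image):
    """Toy feature: a normalized 32-bin gray-level histogram.
    (A stand-in for the SIFT/CNN descriptors discussed later.)"""
    hist, _ = np.histogram(image, bins=32, range=(0, 256))
    return hist / max(hist.sum(), 1)

def build_index(images):
    """Offline stage: extract and store one feature vector per library image."""
    return np.stack([extract_feature(img) for img in images])

def search(query_image, index, top_k=5):
    """Online stage: extract the query feature with the SAME method,
    measure similarity (Euclidean distance here), and sort ascending."""
    q = extract_feature(query_image)
    dists = np.linalg.norm(index - q, axis=1)
    return np.argsort(dists)[:top_k]

library = [rng.integers(0, 256, (64, 64)) for _ in range(100)]
index = build_index(library)
print(search(library[3], index, top_k=3))  # library image 3 ranks first
```

Note that the query must pass through exactly the same feature extractor as the library images, otherwise the distances are meaningless.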
Content-based image retrieval has broad application prospects in e-commerce, leather and textiles, copyright protection, medical diagnosis, public safety, street-view maps, and other industries. In e-commerce, photo-shopping applications such as Google Goggles and Alibaba's Pailitao let users snap a picture and upload it to a server, where an image retrieval application finds identical or similar clothes and provides links to shops selling them. In the leather and textile industry, manufacturers can photograph fabric samples, so that when a clothing maker needs leather with a particular texture, identical or similar fabrics can be retrieved from the library, making sample management far more convenient. In copyright protection, service providers can apply image retrieval to manage the authentication of registered trademarks. In medical diagnosis, doctors can search a medical image library for similar regions across many patients, which helps in diagnosing disease. Content-based image retrieval has penetrated many fields and brought great convenience to people's lives and work.

Content-based image retrieval technology
Same-object image retrieval
Same-object image retrieval means that, given an object in a query image, we look for the images in the library that contain that same object. The user is interested in a specific object or target, and the retrieved images should be those that contain it. As shown in Figure 1.3, given a portrait of the Mona Lisa, the goal is to retrieve the images containing the "Mona Lisa" from the library and, after similarity ranking, to place them as near the top of the results as possible. In the English literature this task is usually called object retrieval; near-duplicate search or detection can also be classed as same-object retrieval, and same-object methods apply to it directly. Same-object retrieval is valuable both in research and in the commercial image search industry, for example clothes and shoe search, or face search, in shopping applications.
Same-object retrieval is easily affected by the imaging conditions: illumination changes, scale changes, viewpoint changes, occlusion, and background clutter all strongly influence the result. Figure 1.3 shows examples of these changes. In addition, for non-rigid objects, deformation of the object itself also has a large effect on the retrieval result.
Because environmental interference is relatively large, same-object retrieval usually relies on invariant local features with good robustness, such as SIFT [1], SURF [2], and ORB [3], on top of which different encoding methods build a global description of the image. Representative schemes include the bag-of-words model (BoW) [4], the vector of locally aggregated descriptors (VLAD) [5], and the Fisher vector (FV) [6]. Because these SIFT-based methods combine the invariance of SIFT-like features with a local-to-global representation, and because SIFT extraction can be accelerated in practice with SiftGPU, they achieve good retrieval quality overall. Their feature dimensionality, however, tends to be very high: as Figure 1.2 shows, to reach high precision on the Oxford Buildings dataset the number of clusters is usually set to several hundred thousand, so the final representation has hundreds of thousands of dimensions, and an efficient indexing method must be designed for it.

Same-category image retrieval
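As a rough illustration of the BoW pipeline just described, the sketch below clusters local descriptors into a tiny visual vocabulary and pools one image's descriptors into a global histogram. The random descriptors, the 16-word vocabulary, and the mini k-means are all illustrative assumptions; real systems cluster real SIFT descriptors into vocabularies of 10^4 to 10^6 words:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=10):
    """Tiny k-means used to build the visual vocabulary (cluster centers)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bow_encode(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word, then count
    word frequencies: one L1-normalized global histogram per image."""
    words = np.argmin(((descriptors[:, None] - vocabulary) ** 2).sum(-1), axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Pretend these are 128-D SIFT descriptors pooled from many training images.
all_descriptors = rng.normal(size=(500, 128))
vocab = kmeans(all_descriptors, k=16)      # real vocabularies: 10^4..10^6 words
image_desc = rng.normal(size=(40, 128))    # descriptors of one image
v = bow_encode(image_desc, vocab)
print(v.shape)                             # dimension equals the vocabulary size
```

The global vector's dimension equals the vocabulary size, which is why setting hundreds of thousands of clusters yields the very high-dimensional representations discussed above.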
Given a query image, the goal of same-category retrieval is to find images in the library that belong to the same category as the query. Here the user is interested in the category of the object or scene, and wants images with the same class attributes. To distinguish the two retrieval modes, take the "Mona Lisa" of Figure 1.3 as an example again: if the user is interested in the painting "Mona Lisa" itself, the system should perform same-object retrieval; but if the user is interested not in that particular painting but in "portraits" as a kind of picture, that is, in the abstract category behind the specific painting, the system should perform same-category retrieval. Same-category retrieval is widely used in image search engines, medical image retrieval, and other fields.
The main difficulty in same-category retrieval is that images within one category can vary greatly while images in different categories may differ only slightly. As Figure 1.3 shows, images in the "lake" category differ widely in appearance; conversely, the "dog" and "woman" images shown on the right belong to different classes, yet low-level features such as color, texture, and shape can hardly separate them, so the inter-class difference is very small. Same-category retrieval therefore faces large intra-class variation and small inter-class differences in its feature description. In recent years, features learned automatically with deep learning (DL) have been applied to same-category retrieval and have greatly improved its accuracy, so that the feature representation side of same-category retrieval is now comparatively well solved. Feature representations dominated by convolutional neural networks (CNNs) are also beginning to be applied to same-object retrieval, with some corresponding work [7]; however, because class-like training samples are harder to construct for the same-object setting than for same-category retrieval, CNN model training and automatic feature extraction for same-object retrieval still need further exploration.
Whether for same-object or same-category retrieval, features extracted with a CNN model are typically 4096-dimensional, which is relatively high. Reducing the dimension directly with PCA does shrink the features, but the reduction achievable while preserving the necessary retrieval precision is limited. An efficient and well-designed fast retrieval mechanism is therefore needed to make such retrieval scale to large or massive image collections.

Characteristics of large-scale image retrieval
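As a minimal sketch of the PCA reduction just mentioned (the 4096-D random vectors stand in for real CNN activations, and 128 output dimensions is an arbitrary choice), projection onto the top principal components looks like this:

```python
import numpy as np

def pca_reduce(features, out_dim):
    """Project high-dimensional descriptors onto the top principal components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:out_dim]
    return centered @ components.T, mean, components

rng = np.random.default_rng(1)
cnn_feats = rng.normal(size=(200, 4096))   # stand-in for 4096-D CNN activations
reduced, mean, comps = pca_reduce(cnn_feats, out_dim=128)
print(reduced.shape)                       # (n_images, 128)
```

A query feature must be reduced with the same `mean` and `comps` learned from the library before distances are computed.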
Whether for same-object or same-category retrieval, large-scale image datasets exhibit three typical characteristics: a large volume of image data, high feature dimensionality, and a demand for short response times. Each is described below:
(1) Large volume of image data. Thanks to advances in multimedia capture, transmission, and storage, and to faster computers, content-based image retrieval has developed for more than a decade, and the scale of the collections it must handle has grown from small image libraries to large-scale and even massive datasets. In the early 1990s, researchers commonly used Corel1K, with only 1,000 images, to validate retrieval algorithms; compared with the ImageNet dataset popular today, which can also be used for retrieval, the scale has grown by a factor of tens of thousands. Image retrieval must therefore meet the demands of the big-data era and be scalable on large image datasets.
(2) High feature dimensionality. As the cornerstone that directly describes the visual content of an image, the feature representation determines the ceiling on achievable retrieval precision. If the upstream features are poorly expressed, later stages not only complicate model construction and lengthen query response time, but can improve precision only marginally; higher-level, more expressive features should therefore be chosen deliberately at extraction time. The descriptive power of a feature correlates strongly with its dimensionality, so large-scale retrieval exhibits markedly high-dimensional descriptions, whether with the BoW model, VLAD, the Fisher vector, or CNN features. To get a quantitative sense of these dimensions, this article takes the BoW feature vector as an example and tests, on the Oxford Buildings dataset, the influence of the feature dimension (equal to the number of clustered visual words) on retrieval accuracy. As Figure 1.2 shows, the feature dimensions of the BoW model are very high. A second typical characteristic of large-scale retrieval, then, is the high dimensionality of the image description vector.
(3) Fast response required. An image retrieval system should be able to respond to user queries quickly, yet with large-scale data and high-dimensional features, a brute-force index strategy (also called linear scan) can hardly meet the system's real-time requirements. Figure 1.2 shows the average time per query on the Oxford Buildings dataset: even with only 4,063 images, a query takes about 1 second with a vocabulary of one million words and a re-ranking depth of 1,000, and the whole program runs on a high-performance server. Large-scale image retrieval therefore has to solve the problem of real-time system response.
The concrete framework of hashing-based image retrieval is shown in Figure 1.4. It can be divided into four steps: feature extraction, hash coding, Hamming distance ranking, and re-ranking:
(1) Feature extraction. A feature is extracted from every image in the database and added to the feature library, with image filenames and image features kept in one-to-one correspondence.
(2) Hash coding. Hash coding splits into two phases: before features can be encoded, a set of hash functions must first be obtained through learning, so the two phases are hash function learning and formal hash encoding. In the learning phase, the feature library is divided into a training set and a test set, and the hash function set H(x) = {h1(x), h2(x), ..., hK(x)} is trained on the training set. In the formal encoding phase, each original feature xi (i = 1, 2, ..., N) is passed through the learned hash function set H(x) to obtain the corresponding hash code. Notably, once the design of the hashing algorithm has been validated experimentally, a practical system may use the whole image library both as the training set and as the image database when partitioning the data, so that the hash functions learned on the large-scale images generalize better;
(3) Hamming distance ranking. Given a query image, the Hamming distance between its hash code and every other hash code is computed, and the results are sorted from small to large to obtain the retrieval results.
(4) Re-ranking. From the Hamming-ranked results of step (3), the top m results (m << n) can be selected, or all results whose Hamming distance is below a preset threshold dc can be re-ranked. Re-ranking generally uses Euclidean distance as the similarity measure. The hashing stage can thus be seen as a candidate-filtering, or coarse ranking, step. Practical large-scale retrieval systems usually include a re-ranking step, but when a hashing algorithm itself is designed, its performance is evaluated on Hamming distance alone, so the re-ranking step is unnecessary in that evaluation.
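Steps (2) through (4) can be sketched together. The example below uses random-projection hash functions purely for illustration (a learned hash function set would take their place), then performs Hamming ranking and a Euclidean re-ranking of the top-m candidates:

```python
import numpy as np

rng = np.random.default_rng(2)

d, bits, n = 64, 32, 1000
X = rng.normal(size=(n, d))                   # feature library
W = rng.normal(size=(d, bits))                # stand-in hash functions h_k(x) = sign(x . w_k)

def hash_code(x):
    """Step (2): map features to binary hash codes."""
    return (x @ W > 0).astype(np.uint8)

codes = hash_code(X)                          # offline: encode the whole library

def retrieve(query, top_m=50, top_k=5):
    qc = hash_code(query)
    # Step (3): Hamming ranking -- count differing bits, sort ascending.
    hamming = (codes != qc).sum(axis=1)
    candidates = np.argsort(hamming)[:top_m]  # coarse filter, m << n
    # Step (4): re-rank the m candidates by exact Euclidean distance.
    d2 = np.linalg.norm(X[candidates] - query, axis=1)
    return candidates[np.argsort(d2)][:top_k]

print(retrieve(X[7]))  # sample 7 is its own nearest neighbour
```

Only the m candidates that survive the Hamming filter ever touch the original high-dimensional features, which is what makes the coarse-to-fine scheme fast.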
With the rapid growth of visual data, content-based image retrieval over large-scale visual data has attracted great attention in both commercial applications and the computer vision community. The traditional brute-force search method (also called linear scan) computes the similarity from the query to every point in the database and sorts the results. Though simple, crude, and easy to implement, its cost grows with the database size and the feature dimension, so brute-force search suits only small image databases. On large-scale libraries it consumes enormous computational resources, and its per-query response time grows with the number of samples and the feature dimension. To reduce the space and time complexity of the search, researchers have turned over the past decade to an alternative scheme, approximate nearest neighbor (ANN) search, and proposed many efficient retrieval techniques, the most successful being image retrieval methods based on tree structures, on hashing, and on vector quantization.

Approximate nearest neighbor search
Tree-based and hashing-based nearest neighbor search are very active fields across theoretical computer science, machine learning, and computer vision. These methods partition the feature space into many small cells, shrinking the region that must be searched and thereby achieving sublinear computational complexity.
Tree-based image retrieval organizes image features into a tree structure, reducing the query-time complexity to logarithmic in the number of library samples. Tree-based search methods include the KD-tree [8], the M-tree [9], and others. Among them the KD-tree is the most widely used: during tree construction, the space is repeatedly split along the dimension of largest variance, the corresponding tree structure grows downward, and the tree is kept in memory. Figure 2.1 illustrates a simple KD-tree partitioning process. During search, the query descends from the root to a leaf, the data under that leaf are compared with the query, and backtracking is used to find the nearest neighbor. Although tree-based indexing greatly shortens single-query response time, for high-dimensional features (hundreds of dimensions) its retrieval performance degrades dramatically, sometimes to the level of, or below, brute-force search. As Table 2.1 shows, when indexing 512-D GIST features on the LabelMe dataset, a single query with a spill tree (a KD-tree variant) takes longer than brute-force search. Moreover, a tree index can occupy far more storage than the original data and is sensitive to the data distribution, so tree-based retrieval faces memory limits on large-scale image databases.
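A minimal KD-tree, splitting on the largest-variance dimension and backtracking during search as described above, can be sketched as follows (the 2-D random points are illustrative; the pruning rule is the standard distance-to-splitting-plane check):

```python
import numpy as np

def build_kdtree(points, idx=None):
    """Build phase: recursively split on the dimension of largest variance."""
    if idx is None:
        idx = np.arange(len(points))
    if len(idx) <= 1:
        return {"leaf": idx}
    dim = int(np.argmax(points[idx].var(axis=0)))     # widest-spread axis
    order = idx[np.argsort(points[idx, dim])]
    mid = len(order) // 2
    return {"dim": dim, "split": points[order[mid], dim], "point": order[mid],
            "left": build_kdtree(points, order[:mid]),
            "right": build_kdtree(points, order[mid + 1:])}

def nearest(node, points, q, best=(np.inf, -1)):
    """Search phase: descend to a leaf, then backtrack, pruning subtrees
    that cannot contain a point closer than the best found so far."""
    if "leaf" in node:
        for i in node["leaf"]:
            d = np.linalg.norm(points[i] - q)
            if d < best[0]:
                best = (d, int(i))
        return best
    d = np.linalg.norm(points[node["point"]] - q)
    if d < best[0]:
        best = (d, int(node["point"]))
    near, far = (("left", "right") if q[node["dim"]] <= node["split"]
                 else ("right", "left"))
    best = nearest(node[near], points, q, best)
    if abs(q[node["dim"]] - node["split"]) < best[0]:  # backtracking test
        best = nearest(node[far], points, q, best)
    return best

rng = np.random.default_rng(3)
pts = rng.normal(size=(500, 2))
tree = build_kdtree(pts)
dist, i = nearest(tree, pts, pts[42])
print(i)  # a library point's nearest neighbour is itself
```

In low dimensions the backtracking test prunes most of the tree; in hundreds of dimensions the test almost never prunes, which is exactly why performance collapses toward brute-force search.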
Compared with tree-based methods, hashing-based image retrieval encodes the original features into compact binary hash codes, greatly reducing memory consumption; and because computing a Hamming distance reduces to an XOR operation that computers execute natively, it can be completed on a microsecond scale, sharply cutting single-query response time. As Table 2.1 shows on the LabelMe dataset, by encoding the image features and searching over the codes, the hashing-based method cuts the single-query time by orders of magnitude relative to both brute-force and tree-based search, while the feature dimension drops from the original 512-D to 30-D, greatly improving retrieval efficiency.
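The XOR trick just mentioned looks like this when hash codes are stored as machine integers: one XOR isolates the differing bits, and a population count tallies them (the 16-bit codes below are made up for illustration):

```python
def hamming(a, b):
    """Hamming distance of two binary codes stored as integers:
    a single XOR isolates the differing bits, then count them."""
    return bin(a ^ b).count("1")

code_q = 0b1011_0010_1100_0101   # hypothetical 16-bit code of the query
code_x = 0b1011_0110_0100_0101   # hypothetical 16-bit code of a database image
print(hamming(code_q, code_x))   # prints 2: the codes differ in two bits
```

On real hardware the count is a single `popcnt` instruction per machine word, which is what makes Hamming ranking over millions of codes feasible in microseconds.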
The crux of hashing-based retrieval is designing an effective set of hash functions, so that after the original data are mapped through them, the similarity between data points is well preserved, or even enhanced, in Hamming space. Since the unencoded features are continuous-valued while the hash codes are binary, the hash function set must transform values from a continuous domain to a discrete one, which makes its optimization hard to solve [10] and the design of a valid hash function set extremely difficult. Despite this great challenge, researchers have proposed many hashing-based retrieval methods over the past decade, the most classic being locality sensitive hashing (LSH).
Locality sensitive hashing is regarded as an important breakthrough for fast nearest neighbor search in high-dimensional spaces (for example, hundreds of dimensions). It constructs its hash functions from random hyperplanes: the hyperplanes divide the space into many subregions, each of which acts as a "bucket", as shown on the right of Figure 2.1. In the construction phase, LSH only needs to generate random hyperplanes, so it has no training process. In the indexing phase, each sample is mapped to a binary hash code, and samples with identical codes are stored in the same bucket. In the query phase, the query sample is mapped the same way to locate its bucket, and it is then compared only against the samples inside that bucket to obtain the final neighbors. The effectiveness of LSH is guaranteed by theoretical analysis; but because its hash functions are built without using any data, very long codes are needed to reach high precision. Long codes, in turn, lower the collision probability of similar samples during hashing, causing recall to drop sharply, which motivated LSH with multiple hash tables.
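The construction, indexing, and query phases described above can be sketched as follows, using sign-of-random-projection hash bits (the hyperplane count, dimensions, and data are illustrative assumptions; this is the cosine-similarity flavor of LSH rather than the p-stable variant cited in the references):

```python
import numpy as np

rng = np.random.default_rng(4)
d, bits = 32, 8
planes = rng.normal(size=(d, bits))       # construction: random hyperplanes, no training

def bucket(x):
    """One bit per hyperplane (which side the point falls on) -> bucket key."""
    return tuple((x @ planes > 0).astype(int))

# Indexing phase: samples with identical codes land in the same bucket.
library = rng.normal(size=(1000, d))
buckets = {}
for i, x in enumerate(library):
    buckets.setdefault(bucket(x), []).append(i)

# Query phase: hash the query, then compare only within its bucket.
q = library[10]
candidates = buckets[bucket(q)]
best = min(candidates, key=lambda i: np.linalg.norm(library[i] - q))
print(best)  # the query itself is its own nearest candidate
```

Longer codes shrink the buckets and raise precision, but also raise the chance that a true neighbor lands in a different bucket, which is the recall problem the next paragraph addresses with multiple tables.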
At the same total code length, compared with LSH using only one hash table (single-table LSH), the code length of each table in multi-table LSH shrinks to 1/L of the single-table code length (where L is the number of hash tables). Multi-table LSH can therefore achieve higher recall than single-table LSH at the same total code length. However, whether single-table or multi-table, LSH codes are not compact, so their memory usage efficiency is not high.
Besides hashing, another family of methods for large-scale image retrieval is vector quantization, whose most typical representative is product quantization (PQ). PQ decomposes the feature space into the Cartesian product of several low-dimensional subspaces and quantizes each subspace separately. In the training phase, clustering in each subspace yields K centroids (that is, a quantizer); the Cartesian product of all these centroids densely partitions the whole space while keeping the quantization error relatively small. After quantization, for a given query sample, the asymmetric distances between the query and the database samples can be computed by table lookup [12]. Although PQ approximates inter-sample distances rather accurately, its data structures are usually more complex than binary hash codes and it yields no low-dimensional feature representation. Moreover, good performance requires the asymmetric distance, and it also requires the variance of each dimension to be fairly balanced; when the variances are unbalanced, PQ performs poorly.

References
[1] Lowe D G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis., 2004, 60(2): 91–110.
[2] Bay H, Tuytelaars T, Van Gool L. SURF: Speeded Up Robust Features. Proc. Eur. Conf. Comput. Vis., 2006: 404–417.
[3] Rublee E, Rabaud V, Konolige K, et al. ORB: An Efficient Alternative to SIFT or SURF. Proc. IEEE Int. Conf. Comput. Vis., 2011: 2564–2571.
[4] Csurka G, Dance C, Fan L, et al. Visual Categorization with Bags of Keypoints. Workshop on Statistical Learning in Computer Vision, Eur. Conf. Comput. Vis., 2004, 1: 1–2.
[5] Jégou H, Douze M, Schmid C, et al. Aggregating Local Descriptors into a Compact Image Representation. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010: 3304–3311.
[6] Perronnin F, Sánchez J, Mensink T. Improving the Fisher Kernel for Large-Scale Image Classification. Proc. Eur. Conf. Comput. Vis., 2010: 143–156.
[7] Kiapour M H, Han X, Lazebnik S, et al. Where to Buy It: Matching Street Clothing Photos in Online Shops. Proc. IEEE Int. Conf. Comput. Vis., 2015: 3343–3351.
[8] Bentley J L. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM, 1975, 18(9): 509–517.
[9] Uhlmann J K. Satisfying General Proximity/Similarity Queries with Metric Trees. Inf. Process. Lett., 1991, 40(4): 175–179.
[10] Ge T, He K, Sun J. Graph Cuts for Supervised Binary Coding. Proc. Eur. Conf. Comput. Vis., 2014: 250–264.
[11] Datar M, Immorlica N, Indyk P, et al. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. Proc. Symp. Comput. Geom., 2004: 253–262.
[12] Dong W, Charikar M, Li K. Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces. Proc. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2008: 123–130.

from: http://yongyuan.name/blog/cbir-technique-summary.html