Search by image is the technology behind tools such as Google's image search: you give it a picture and it finds matching images. This article introduces how search-by-image technology has developed over the past decade or so, and how deep learning does search by image today.
Search by image is also known as content-based image retrieval (CBIR), as shown in the figure:
  
Image retrieval needs a database of images first: where does your image data come from? Then you need a query image, which you look up in the image database. Comparing images at the pixel level is not realistic, because the computation is far too large. So instead we extract features from each image and turn them into a feature vector. The comparison then only has to be done in the feature vector space, by computing a cosine distance, an L2 (Euclidean) distance, and so on, which gives the result of the image comparison. The most important point of CBIR is therefore feature extraction. Different lines of work really differ in how the features are extracted.
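The comparison in feature vector space can be sketched in a few lines. This is only a minimal illustration: the three-dimensional vectors, the image names and the `retrieve` helper are made up for the example, and real features would come from a feature extractor and have far more dimensions.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def retrieve(query, database, distance=l2_distance, top_k=3):
    """Return the names of the top_k database images closest to the query."""
    ranked = sorted(database, key=lambda name: distance(query, database[name]))
    return ranked[:top_k]

# Hypothetical toy database: image name -> feature vector.
database = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0]
print(retrieve(query, database, top_k=2))  # ['cat_1', 'cat_2']
```

Swapping in `distance=cosine_distance` gives the same ranking on this toy data; which distance works better in practice depends on how the features were produced.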
As mentioned above, each image is converted into a feature vector. But how does industry get fast, real-time responses? When you drop a picture into Baidu or Google, the results come back quickly, basically at the millisecond level. But think about it: the image database of a search engine is huge, counted in the billions at least. Computing pairwise distances between feature vectors at that scale is certainly time-consuming, so industry uses an indexing technique called hash indexing, as shown in the figure:
      
Hash indexing maps the feature vector of an image, say 2048-dimensional, to a smaller subspace, say 128-dimensional. Each of those 128 dimensions is a binary value, either 0 or 1, and the result is called a hash code. This cuts the computation dramatically: first the dimension is lower, and second the values are binary. Such techniques cost little accuracy (it might drop from 95% to 94%, for example) while speed improves by hundreds of times.
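To make the idea concrete, here is a rough sketch of turning a 2048-dimensional vector into a 128-bit code and comparing codes by Hamming distance. The article does not say which hashing method is used, so the random-hyperplane (sign of a random projection) scheme below is an assumption chosen for simplicity, and all function names are hypothetical.

```python
import random

def make_hyperplanes(input_dim, code_bits, seed=0):
    """One random hyperplane per output bit (assumed scheme, not from the article)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
            for _ in range(code_bits)]

def hash_code(vector, hyperplanes):
    """Pack the sign bit of each projection into a single integer."""
    code = 0
    for plane in hyperplanes:
        projection = sum(w * x for w, x in zip(plane, vector))
        code = (code << 1) | (1 if projection >= 0 else 0)
    return code

def hamming_distance(code_a, code_b):
    """Number of bit positions where the two codes differ."""
    return bin(code_a ^ code_b).count("1")

rng = random.Random(1)
feature = [rng.gauss(0.0, 1.0) for _ in range(2048)]   # stand-in image feature
near = [x + rng.gauss(0.0, 0.01) for x in feature]     # slightly perturbed copy
opposite = [-x for x in feature]                       # flips every projection sign

planes = make_hyperplanes(input_dim=2048, code_bits=128)
code = hash_code(feature, planes)
print(hamming_distance(code, hash_code(near, planes)))      # small
print(hamming_distance(code, hash_code(opposite, planes)))  # large
```

Comparing two codes is one XOR and a bit count, which is where the speedup over computing distances between 2048 floating-point values comes from.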
Those are the key technologies in CBIR. In general, you not only have to extract features, you also have to make the features small, light, and fast. The rest of this article is about these two points.
      
The figure above shows how the performance of traditional machine learning algorithms and deep learning algorithms changes as the amount of data grows. This figure comes up often in interviews, where you may be asked to draw it, because traditional machine learning still has the advantage when the amount of data is small.
Let's look at the approach that dominated computer vision for the ten years from 2000 to 2010, much as "deep learning" has in recent years. It is called BoW or BoVW, the bag of visual words. So what is the bag-of-visual-words model?
    
Let's set the "visual" part aside for a moment and look at the plain bag-of-words model. Here is a question: how do you measure the similarity of two articles? You look at the main feature words that appear in the two articles and count how often they occur. For example:
        
So you count how often each word occurs and arrange the counts as a histogram. This histogram is a feature vector. That is the bag-of-words model.
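As a small illustration of the word-count histogram, assuming a tiny hand-picked vocabulary and two made-up documents:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["image", "search", "feature", "guitar"]
doc_a = "image search uses image feature"
doc_b = "guitar guitar search"

print(bag_of_words(doc_a, vocabulary))  # [2, 1, 1, 0]
print(bag_of_words(doc_b, vocabulary))  # [0, 1, 0, 2]
```

Words outside the vocabulary (here "uses") are simply ignored, and both documents end up as vectors of the same length, which is what makes them comparable.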
In computer vision the counterpart is the bag-of-visual-words model, also called the visual word model. A moment ago we were modeling a document; in computer vision we are characterizing a picture. How do we model it, and how do we characterize it?
            
How would you describe a person like this? Her face is fairly slender, her hair is distinctive, she wears a headband and a black necklace, her nose is slender, she has a small cherry mouth, and so on. In other words, you divide the image into small details that characterize it. Now we want the computer to do what was just described. How can that be done?
      
  
We compute a histogram for each image, but a histogram of what? You probably want to count how often different parts appear in the image, and for that you need a unified, comparable standard to count against. Think of this standard as something like the Xinhua Dictionary. Say the Xinhua Dictionary contains about 3,000 words; in computer vision we would likewise have 3,000 or more visual words, each of which may represent a concept. Take the first picture and compute its word-frequency histogram. Which concepts have relatively high frequency? Parts of the nose, parts of the eyes. Which are relatively low? The bicycle seat, the lower edge of the guitar. So only the words that correspond to the face have relatively high frequencies. What about the second picture? By the same logic, the words for bicycle parts should have relatively high frequencies and the words for guitar parts relatively low ones. The third picture works the same way.
To sum up the idea of the bag-of-words model: in the feature space of computer vision, we want to build a visual dictionary similar to the Xinhua Dictionary, and for every image obtain a histogram of how often each entry of this visual dictionary occurs in it. Once these statistics are done, each image is represented by its histogram, and the distance between two images becomes a distance between histograms.
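One way to compare two such histograms is histogram intersection on normalized counts. The article does not say which histogram measure is used, so this choice is an assumption, and the four visual words and their counts below are invented to mirror the face and bicycle example.

```python
def normalize(histogram):
    """Scale counts so they sum to 1 (images may have different numbers of features)."""
    total = sum(histogram)
    return [count / total for count in histogram] if total else list(histogram)

def histogram_intersection(hist_a, hist_b):
    """Similarity in [0, 1]: the shared mass of two normalized histograms."""
    a, b = normalize(hist_a), normalize(hist_b)
    return sum(min(x, y) for x, y in zip(a, b))

# Hypothetical visual words: [nose, eye, bike seat, guitar edge]
face_1 = [5, 4, 0, 1]
face_2 = [4, 5, 1, 0]
bike = [0, 1, 6, 3]

print(round(histogram_intersection(face_1, face_2), 3))  # 0.8
print(round(histogram_intersection(face_1, bike), 3))    # 0.2
```

If a distance is needed rather than a similarity, `1 - histogram_intersection(a, b)` is one simple option.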
What is a visual word? A visual word is essentially a local feature. There are many kinds of basic local features, such as SIFT, but none of them gets away from two essential parts. The first part is an (x, y) coordinate, the location information: where does the feature appear, such as the red and blue parts in the figure? The second part is the local feature descriptor. Say the top of the tower is a local feature: how do you tell that it is similar to the top of another tower?
You always need a descriptor for it. The descriptor might be a 128-dimensional feature vector or a 256-dimensional one; different local features have different dimensions, but that doesn't matter. As long as there is a feature vector, the similarity or distance between two such parts can be quantified, and this lets you measure which regions of two images match.
          
The most important local feature in computer vision is SIFT (scale-invariant feature transform), and it too has the two components just mentioned. One is the detector, which finds the location. The other is the descriptor, the feature vector that describes it, as shown in the figure. Note that SIFT is not learned by machine learning. It was designed by researchers and is computed from image statistics such as gradients.
      
So the SIFT function in OpenCV returns two things: the location of each local feature, that is, its (x, y) coordinates, and the local feature descriptor, for example a 128-dimensional vector that describes the image region at that location.
How is SIFT matching done? Take the two pictures shown and run SIFT matching on them.
        
How does it work? For each local feature in the left image, you look for the closest point in the right image. SIFT matching, however, looks for the two closest points: the nearest neighbor and the second-nearest neighbor. What does that mean? For one point on the left, you find its two closest points on the right and take the ratio of the two distances. The match is accepted only if the nearest point is clearly closer than the second nearest, that is, if the ratio of the nearest distance to the second-nearest distance is below a certain threshold. The advantage of doing this is that it removes some mismatches. With OpenCV you can adjust this threshold.
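The nearest versus second-nearest check can be sketched as follows. This is a brute-force illustration on made-up 2-D descriptors, not OpenCV's implementation; the function name and the default ratio of 0.75 are assumptions chosen for the example.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_matches(desc_left, desc_right, ratio=0.75):
    """Keep (left_index, right_index) pairs whose nearest neighbor is clearly
    closer than the second nearest: nearest < ratio * second_nearest."""
    matches = []
    for i, descriptor in enumerate(desc_left):
        ranked = sorted((l2_distance(descriptor, other), j)
                        for j, other in enumerate(desc_right))
        if len(ranked) < 2:
            continue  # the test needs two candidates
        (best_dist, best_j), (second_dist, _) = ranked[0], ranked[1]
        if best_dist < ratio * second_dist:
            matches.append((i, best_j))
    return matches

# Toy descriptors (real SIFT descriptors are 128-dimensional).
left = [[0.0, 0.0], [5.0, 5.0]]
right = [[0.1, 0.0], [10.0, 10.0], [5.0, 5.2], [5.2, 5.0]]

# left[0] has one clear match; left[1] has two equally good candidates,
# so it is ambiguous and gets rejected.
print(ratio_test_matches(left, right))  # [(0, 0)]
```

A lower ratio keeps fewer but more reliable matches; a higher ratio keeps more matches at the cost of more mismatches.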
Look closely and you will see one mismatch: a point on the person's clothes is matched to the person's forehead. How do we correct this error? Most of the matches are correct, and their connecting lines are mostly parallel, running horizontally. Only the black line, the wrong match, does not agree with the direction of the majority. The fix is a method called geometric verification, and one such method is random sample consensus (RANSAC).
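The idea of keeping only the matches that agree with the majority can be sketched with a very simplified RANSAC. The assumption here is that the two images differ only by a translation, so a single match is enough to propose a model; real pipelines usually fit a richer model such as a homography. The function name, tolerance and coordinates are made up for the example.

```python
import random

def ransac_translation(matches, tolerance=3.0, iterations=100, seed=0):
    """matches: list of ((x1, y1), (x2, y2)) point pairs.
    Returns the indices of the largest set of matches that agree on one shift."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # 1. Sample a minimal set: one match defines a candidate translation.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # 2. Count the matches consistent with that translation.
        inliers = [
            i for i, ((ax, ay), (bx, by)) in enumerate(matches)
            if abs((bx - ax) - dx) <= tolerance and abs((by - ay) - dy) <= tolerance
        ]
        # 3. Keep the candidate with the most support.
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers

# Four matches shift right by about 100 pixels; the last one is a mismatch.
matches = [
    ((10, 10), (110, 10)),
    ((20, 35), (121, 36)),
    ((40, 50), (139, 49)),
    ((55, 80), (155, 81)),
    ((60, 90), (120, 10)),   # clothes matched to forehead
]
print(ransac_translation(matches))  # [0, 1, 2, 3]
```

The mismatch is dropped because no other match supports its displacement, which is the same reasoning as noticing that its line is not parallel to the others.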
Let's summarize how visual words are constructed:
(1) Feature extraction
  
(2) Codebook construction
          
Suppose your database has 10,000 images and you can extract 1,000 feature points per image. How many feature points do you have? That's 10,000 * 1,000 = 10 million feature points. Could you use these 10 million feature points directly as your codebook? First, that is too many. Second, the 10 million descriptors are certainly not identical: each has 128 dimensions, so even if two are mostly alike, an exact coincidence is unlikely. If you did use all 10 million as the codebook, the horizontal axis of the histogram would have 10 million bins. Each image has 1,000 points and each point is unique, so each point adds 1 to its own bin, while the points of another image add 1 to entirely different bins. How would you compare two such images? Their histograms have no overlapping region, so the overlap is 0 and no two images could ever match.
So how do we make them overlap? We build a dictionary of about the same size as the Xinhua Dictionary, say 3,000 or 5,000 entries, and project the local features of every image onto it, so that images share common bins. And how do you turn 10 million into 5,000? Use clustering to reduce the number: cluster the 10 million points into 5,000 classes. Each class has a center, and these class centers are the codebook. This way the codebook still represents the 10 million feature points, the dimension is no longer so high, and images have shared regions.
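The clustering step can be sketched with a plain k-means loop. The article only says "clustering", so k-means is an assumption here (it is a common choice for building visual codebooks), and the tiny 2-D descriptors and the `build_codebook` name are invented for illustration; a real codebook would be built from millions of 128-dimensional descriptors with an optimized library.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_codebook(descriptors, num_words, iterations=20, seed=0):
    """Cluster descriptors with k-means; the cluster centers are the codewords."""
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(iterations):
        # Assignment step: each descriptor goes to its nearest center.
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            nearest = min(range(num_words),
                          key=lambda k: squared_distance(d, centers[k]))
            clusters[nearest].append(d)
        # Update step: each center moves to the mean of its members.
        for k, members in enumerate(clusters):
            if members:  # keep the old center if a cluster goes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Six toy descriptors forming two obvious groups.
descriptors = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [10.0, 10.0], [10.2, 9.9], [9.9, 10.1],
]
codebook = sorted(build_codebook(descriptors, num_words=2))
print(codebook)  # one center near (0.1, 0.13), one near (10.03, 10.0)
```

Six descriptors become two codewords here; the same procedure is what turns 10 million feature points into 5,000.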
(3) Vector quantization
      
Each black x represents a local feature, or key point. The red dots are the cluster centers, or codewords. In the yellow region, the 4 key points can all be represented by the point Y5. That is the work of the third step: for each feature point of the input image, find the codeword that represents it and add 1 to that codeword's count. Here the four points are all represented by Y5, so the final count for Y5 is 4. This is how the histogram is computed.
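The counting described above can be sketched like this. The five codewords and the key point coordinates are made up so that the last codeword plays the role of Y5; `quantize` is a hypothetical helper name.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and count the hits."""
    histogram = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(d, codebook[k]))
        histogram[nearest] += 1
    return histogram

# Codewords Y1..Y5 (index 4 is Y5).
codebook = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]]
# Four key points fall near Y5 and one falls near Y1.
key_points = [[5.1, 4.9], [4.8, 5.2], [5.3, 5.1], [4.9, 4.7], [0.2, 0.1]]

print(quantize(key_points, codebook))  # [1, 0, 0, 0, 4]
```

The resulting list is the bag-of-visual-words histogram for the image, with one bin per codeword.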
Summary:
  
Visual words in image search technology