Source Address: http://grunt1223.iteye.com/blog/828192
I. Introduction
Multimedia recognition is one of the more difficult and demanding problems in information retrieval. Taking images as an example, retrieval approaches can be divided into two categories according to the information they use: text-based image retrieval and content-based image retrieval (CBIR). Text-based retrieval does not analyze or use the content of the image itself at all; its quality depends entirely on how well the text associated with an image correlates with the image's actual content. This is why content-based image retrieval was introduced, and it is the latter that this article mainly discusses.
In computer vision, image content is usually described by image features. Image retrieval based on computer vision can in fact be divided into the same three steps as a text search engine: extracting features, building an index, and querying. This article is organized around these three steps.
II. Image Feature Extraction
At present, image recognition on the internet can be reduced to two types of problems. The first is near-duplicate search: recognizing variants of the same source image under different transformations (illumination, watermarks, scaling, local cropping or replacement, etc.), or recognizing generally similar objects. It is mainly used for copyright protection, detection of illegal content, image deduplication, and basic similarity search. The second is local search: two images should match as long as some object in them recurs. For example, different product listings may use different photos, but as long as they show the same LV bag, the images can be considered similar; this is image retrieval in the true sense.
Correspondingly, image features can also be divided into two categories: global features and local features. Most image-signature algorithms use the global features of an image to describe its content, such as the color histogram, color distribution, shape, or edge information, and use a string or an array as the hash value of the image.
Overall, a global feature is a high-level abstraction of an image's content; it only answers "what the image looks like," whereas in most situations users want to know "what is in the image." When retrieving an image, a user is often more concerned with the scenes, objects, or specific people it contains. A single global feature cannot distinguish such information, which is why local features were introduced. The most famous example is image retrieval based on the Scale-Invariant Feature Transform, the well-known SIFT. The basic idea is to decompose an image into many high-dimensional feature points, so that images on the internet can be stored in the form of a visual vocabulary. Because the SIFT descriptor is invariant to scale change and rotation, and is insensitive to image noise, affine deformation, illumination change, and 3D viewpoint change, it is highly discriminative and is widely used in object recognition, video tracking, scene recognition, image retrieval, and so on.
For simplicity, this article mainly discusses image-similarity retrieval based on global features; local features can be added on top of this foundation.
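To make "global feature" concrete before turning to MPEG-7, here is a minimal sketch of one of the global features mentioned above, a quantized color histogram. This is an illustrative toy, not the method used later in the article: the `color_histogram` helper and the pixel lists are assumptions, and a real system would read pixels with an imaging library.

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` levels and count pixel
    occurrences, yielding a bins**3-dimensional global feature."""
    hist = [0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = float(len(pixels)) or 1.0
    return [h / total for h in hist]  # normalize to a distribution

# Two toy "images" as pixel lists: one mostly red, one mostly blue.
red_img = [(200, 10, 10)] * 90 + [(10, 10, 200)] * 10
blue_img = [(10, 10, 200)] * 90 + [(200, 10, 10)] * 10
h1, h2 = color_histogram(red_img), color_histogram(blue_img)
print(len(h1))  # 64: the whole image collapses to one short vector
```

The single 64-dimensional vector summarizes the entire image, which is exactly the abstraction (and the limitation) of a global feature.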
MPEG (the Moving Picture Experts Group) is best known for international compression standards such as ISO 11172 (MPEG-1). Strictly speaking, however, MPEG-7 is not a compression or coding method but a multimedia content description interface. After MPEG-4, the contradiction to be resolved was the management and rapid retrieval of ever-growing volumes of image and audio information, and MPEG-7 is the solution to that contradiction. MPEG-7 aims to let users search quickly and efficiently for the different types of multimedia material they need, for example finding clips of the Three Gorges in a collection of footage. The standard was scheduled to be finalized and published in early 2001. Although it ships no implementation code, MPEG-7 defines a number of image description interfaces and criteria such as color distribution, texture, edges, and dominant color. This article mainly introduces the principle of the edge histogram descriptor used later. The main steps for computing the edge histogram are as follows:
- First, the original image is divided into 4x4 = 16 sub-images, and a local edge histogram is then computed for each sub-image. Each local edge histogram is built with five edge operators and has five bins, so the final result is an 80-dimensional vector that identifies the image.
- Each sub-image is divided into a series of image blocks whose size varies with the area of the image; the number of blocks per sub-image is fixed. Refer to Figure 1.
- Each image block is classified into one of five edge types (horizontal, vertical, 45°, 135°, and non-directional) using the five edge detection operators recommended by MPEG-7, giving at most five edge orientations per block; the counts per type are accumulated into the sub-image's histogram.
- The values of the resulting edge histogram are normalized and quantized. To account for the non-uniformity of human vision, the 80 normalized bins are quantized non-linearly; each bin is encoded with a fixed length of 3 bits (i.e. 8 quantization levels, 0–7), so a total of 240 bits represents the edge histogram.
- Given two edge histogram descriptors, the similarity of the two textures is obtained by computing the Euclidean distance between the histograms. This is very intuitive: a distance of 0 means the edge textures of the two images are identical, and a larger distance means less similarity.
III. Building the Image Feature Index and Querying by Image
Finding a matching algorithm with sub-linear time complexity over a massive (millions-scale) set of image features is very challenging, especially because the retrieval is approximate: we need a number of non-exact matches. Consider the candidate methods:
- Linear scan: an exhaustive sequential scan of the entire sample vector set, computing each vector's Euclidean distance to the query image and sorting the output. Accuracy is 100%, but the time complexity is far too high for practical use.
- Tree-based indexes: such as the kd-tree and the SR-tree recommended by the author of SIFT. However, because of the curse of dimensionality, once the vector dimension exceeds roughly 10 to 20 a tree-based index has to scan most of the vector set, and with the overhead of the algorithm itself the result can be even worse than the brute-force scan above.
- Clustering: having abandoned tree-structured indexes, many researchers turned to vector quantization based on k-means (or hierarchical) clustering, which essentially maps vectors to scalars, with some success. However, the time complexity of this method is closely tied to the number of images; at larger scales, the build and query time overhead still falls short of what online use requires.
- Hash-table-based indexes: similar to the above, these also essentially convert vectors to scalars for matching. A hash table has two main benefits: query time is independent of the size of the data structure, essentially O(1), and incremental builds are more convenient than with other methods. Of course, storing image features directly in a hash table, or in a database field, only supports exact matching and can only return identical images; for image retrieval its value is almost zero, so a plain hash table cannot meet our needs.
- Commonly used hash functions (CRC, MD5, etc.) are essentially fragile, cryptography-style hashes: even a slight difference in the input should change the output as much as possible. This article instead uses Locality-Sensitive Hashing (LSH), which outperforms traditional tree structures and clustering algorithms by several orders of magnitude for vector matching and indexing, and supports non-exact search; in my view it is currently the algorithm best suited to multimedia retrieval.
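A tiny standard-library demonstration of why fragile hashes fail here (the byte values are arbitrary toy feature vectors): two inputs that differ in a single element produce completely unrelated MD5 digests, so an exact-match hash table can never bring near-duplicates together.

```python
import hashlib

# Two toy feature vectors that differ in exactly one element.
v1 = bytes([10, 20, 30, 40, 50, 60, 70, 80])
v2 = bytes([10, 20, 30, 41, 50, 60, 70, 80])

d1 = hashlib.md5(v1).hexdigest()
d2 = hashlib.md5(v2).hexdigest()
print(d1)
print(d2)
# The digests share no useful structure: a hash table keyed on d1/d2
# treats these near-identical vectors as totally unrelated images.
print(d1 == d2)  # False
```

LSH inverts this design goal: similar inputs should collide with high probability, as described next.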
LSH is mainly used to solve the k-nearest-neighbor (k-NN) problem for multidimensional vectors, that is, finding the k vectors most similar to a given one. It is a probabilistic algorithm: like a Bloom filter, it admits some false positives, but in exchange retrieval efficiency improves by leaps and bounds.
The main idea of LSH is to build several hash tables to hold the index. Each hash table Ti contains m buckets and has two function families, gi and hi, associated with it. The locality-sensitive hashing algorithm maps similar vectors to the same bucket in a probabilistic sense, which enables non-exact matching of image features. To minimize the probability of error, multiple hash functions are used to map into different hash tables, spreading the error, as shown in Figure 2.
The specific process for indexing image features using LSH is as follows:
- The vector p is converted into a binary vector in Hamming space (each dimension is 0 or 1): a coordinate with value x and maximum value C is expressed as x ones followed by C − x zeros, i.e. a C-dimensional binary vector. The distance between vectors in the original space is therefore consistent with their Hamming distance.
- A hash function g is applied to the result of the previous step; g is defined as selecting k dimensions of the binary vector and concatenating them. The more similar two target vectors are, the higher the probability that their resulting hash values are equal. This is the key to non-exact search.
- Based on the result of step 2, a conventional hash function (e.g. MD5) is used to hash that k-bit value a second time into a bucket of a hash table; the next function g is then applied, and so on in a loop. As noted earlier, the multiple hash functions are there to reduce the error of the similarity search. In this way similar images land in the same bucket and dissimilar images in different buckets, as shown in Figure 3.
- At query time, the query image's feature is computed, and the results looked up in the multiple tables built earlier are merged, as shown in Figure 4.
- For the approximate matches returned in the previous step, the Euclidean distance is computed and the results are sorted, further removing the small number of false positives.
In this way the time complexity of multidimensional approximate retrieval is reduced to the sub-linear level, while the number and type of hash functions can be chosen sensibly to balance retrieval precision and recall.
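The indexing and query steps above can be sketched in a few dozen lines of standard-library Python. This is a minimal illustration under stated assumptions, not the article's implementation: the class name `BitSamplingLSH`, the toy feature vectors, and the parameters k=8 and num_tables=4 are all invented for the example, and Python dicts stand in for the second-level conventional hash.

```python
import random

def unary_embed(vec, c):
    """Map each coordinate x (0 <= x <= c) to x ones followed by
    c - x zeros, so L1 distance becomes Hamming distance."""
    bits = []
    for x in vec:
        bits.extend([1] * x + [0] * (c - x))
    return bits

class BitSamplingLSH:
    """L hash tables; table i samples k random bit positions (its
    function g_i) and uses the resulting k-bit tuple as bucket key."""
    def __init__(self, dim_bits, k=8, num_tables=4, seed=0):
        rng = random.Random(seed)
        self.samples = [rng.sample(range(dim_bits), k)
                        for _ in range(num_tables)]
        self.tables = [{} for _ in range(num_tables)]

    def _keys(self, bits):
        return [tuple(bits[i] for i in idx) for idx in self.samples]

    def insert(self, image_id, bits):
        for table, key in zip(self.tables, self._keys(bits)):
            table.setdefault(key, set()).add(image_id)

    def query(self, bits):
        # Union the buckets the query falls into across all tables.
        candidates = set()
        for table, key in zip(self.tables, self._keys(bits)):
            candidates |= table.get(key, set())
        return candidates

C = 7  # max bin value after 3-bit quantization (0..7)
feats = {
    "img_a": [7, 0, 3, 3, 1, 0, 2, 5],
    "img_b": [7, 0, 3, 2, 1, 0, 2, 5],  # near-duplicate of img_a
    "img_c": [0, 7, 0, 0, 6, 7, 0, 1],  # very different image
}
index = BitSamplingLSH(dim_bits=len(feats["img_a"]) * C)
for name, f in feats.items():
    index.insert(name, unary_embed(f, C))

cands = index.query(unary_embed(feats["img_a"], C))
print(sorted(cands))  # img_b collides with img_a with high probability
# A final Euclidean-distance re-rank over `cands` removes any
# remaining false positives, as described in the last step above.
```

The feature vectors here have 8 bins rather than 80 to keep the example short; the structure is identical for the full edge histogram.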
IV. Experimental Results
To verify the performance of the MPEG-7 edge histogram combined with locality-sensitive hashing, this article uses the banned-image database from the Hidden Net project for testing. The test environment is a company Dell PC, and the test conditions are as follows:
Number of samples: 14085
Sample categories: national security, cultural media, restricted items, pharmaceutical equipment
Persisted index file size: 3.07 MB
Build time from images: 406 ms
Build time from index file: 15 min
Query time: 0 ms
Example test results:
Query image:
Similar images returned by the query:
V. Follow-up Work
1. Introduce SIFT features and implement local image retrieval
2. Automatically optimize the parameters of the LSH algorithm
3. Test on millions-scale data
4. Analyze strategies for different scenarios and categories
Internet similar-image recognition and retrieval engine -- based on the image signature method