In practical visual SLAM, loop-closure detection commonly adopts the DBoW2 model (https://github.com/dorian3d/DBoW2), whose bag of words is built with the K-means clustering algorithm from data mining. This article only explains how the bag-of-words model is used for image representation; it does not go deeply into its application to loop-closure detection in SLAM.
1. Introduction to the bag-of-words model
The bag-of-words (BoW) model is a common document representation in the information retrieval field. In information retrieval, the BoW model assumes that a document can be treated as a mere collection of words, ignoring word order, grammar, syntax, and other elements; each word occurs in the document independently, regardless of whether other words appear. That is, a word appearing anywhere in the document is treated as independent of the document's semantics. An example makes this easy to understand.
For example, there are two documents:
1:bob likes to play basketball, Jim likes too.
2:bob also likes to play football games.
Based on these two text documents, construct a dictionary:
Dictionary = {1: "Bob", 2: "likes", 3: "to", 4: "play", 5: "basketball", 6: "also", 7: "football", 8: "games", 9: "Jim", 10: "too"}.
This dictionary contains 10 different words. Using the dictionary's index numbers, each of the two documents above can be represented by a 10-dimensional vector, where each entry is a non-negative integer giving the number of occurrences of the corresponding word in the document:
1:[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
2:[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Each element of the vector is the number of times the corresponding dictionary word appears in the document. However, as the construction shows, this representation discards the order in which the words appear in the original sentences.
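As a minimal sketch of this counting step (plain Python; the simple punctuation stripping is an assumption for this toy example), the two vectors above can be reproduced as follows:

```python
from collections import Counter

# Dictionary order follows the example above.
dictionary = ["bob", "likes", "to", "play", "basketball",
              "also", "football", "games", "jim", "too"]

def bow_vector(document, dictionary):
    # Lowercase, strip simple punctuation, and count word occurrences.
    counts = Counter(document.lower().replace(",", "").replace(".", "").split())
    return [counts[word] for word in dictionary]

print(bow_vector("Bob likes to play basketball, Jim likes too.", dictionary))
# -> [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
print(bow_vector("Bob also likes to play football games.", dictionary))
# -> [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```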
Applying the bag-of-words model to image representation:
To represent an image, we treat the image as a document, that is, a collection of several "visual words"; as with text, there is no order among the visual words.
The process of generating a visual dictionary:
Since an image's words are not readily available the way they are in a text document, we must first extract independent visual words from the image. This usually takes three steps: (1) feature detection, (2) feature representation, and (3) codebook generation. The first two steps extract independent visual words from the image:
Observation shows that although different instances of the same class of object differ from one another, we can still find features they share. For example, although faces vary considerably from person to person, the smaller parts such as the eyes, mouth, and nose show little variation between individuals. We can extract these common parts from different instances as visual words for recognizing that class of object.
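As a sketch of the feature detection and representation steps, assuming OpenCV is available (cv2.SIFT_create is in opencv-python 4.4 and later) and using hypothetical image paths:

```python
import cv2  # opencv-python >= 4.4, where SIFT is in the main module

def extract_sift_descriptors(image_path):
    """Detect SIFT keypoints in one image and return their 128-D descriptors."""
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    return descriptors  # ndarray of shape (num_keypoints, 128)

# Hypothetical paths; any set of training images works the same way.
all_descriptors = [extract_sift_descriptors(p)
                   for p in ["face.jpg", "bike.jpg", "guitar.jpg"]]
```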
Step Two: build the BoW codebook:
The word table (codebook) is constructed using the K-means algorithm, applied to the n SIFT features extracted in the previous step. K-means is a clustering method based on a similarity measure between samples: taking k as a parameter, it partitions the n objects into k clusters, so that similarity within a cluster is high while similarity between clusters is low. The k cluster centers (in the BoW model we call them visual words) form a codebook of length k. For each image, we compute the distance from each of its SIFT features to the k visual words and map each feature to the nearest visual word (incrementing that word's frequency by 1). After this step, each image becomes a word-frequency vector over the sequence of visual words.
Suppose we set k to 4; the construction of the word table then proceeds as sketched below:
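A minimal sketch of this step with scikit-learn's KMeans, assuming the `all_descriptors` list from the extraction sketch above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stack the SIFT descriptors of all training images into one (n, 128) array.
features = np.vstack(all_descriptors)

# Partition the n features into k clusters; the k cluster centers are the
# visual words. k=4 only mirrors the example; real codebooks use hundreds
# to thousands of words.
k = 4
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
codebook = kmeans.cluster_centers_  # shape (k, 128)
```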
Step Three:
Use the words in the word table to represent an image. With the SIFT algorithm, many feature points can be extracted from each image; each feature point is replaced by the nearest word in the word table. By counting the number of occurrences of each word-table word in the image, the image is represented as a k=4-dimensional numerical vector. The features are thus mapped to a codebook vector, the codebook vector is normalized, and finally its distance to the training images' codebook vectors is computed; the training image at the nearest distance is considered the match for the test image. Consider the following example:
We extract visual words from three target images: a human face, a bicycle, and a guitar. The constructed vocabulary merges visual words of similar meaning into the same class; after merging, the vocabulary contains only four visual words, indexed 1, 2, 3, and 4. They belong, respectively, to the bicycle, face, guitar, and face classes. Counting the occurrences of these words in the different target classes, each image can be represented by a histogram:
Human face: [3, 30, 3, 20]
Bike: [20, 3, 3, 2]
Guitar: [8, 12, 32, 7]
In fact, the process is very simple: for the face, bicycle, and guitar "documents", we extract the similar parts (merging visual words of similar meaning into the same class) and construct a dictionary containing 4 visual words, namely Dictionary = {1: "Bicycle", 2: "Human face", 3: "Guitar", 4: "Face"}. The three documents (face, bicycle, guitar) can then each be represented by a 4-dimensional vector, drawn as the histograms above according to how often each part appears in each document. In general, the value of k ranges from the hundreds to the thousands; k=4 is used here only for convenience of explanation. A sketch of this quantization and matching step follows:
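A minimal NumPy sketch of the quantization and nearest-neighbor matching just described, under stated assumptions (Euclidean distance to the visual words, histogram normalized by its sum, reusing `codebook` and `all_descriptors` from the sketches above):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Map each SIFT descriptor to its nearest visual word and count hits."""
    # Pairwise distances: (num_features, k).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nearest_word = dists.argmin(axis=1)
    hist = np.bincount(nearest_word, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)  # normalized word-frequency vector

def nearest_training_image(test_hist, train_hists):
    """Index of the training histogram closest to the test histogram."""
    return int(np.argmin([np.linalg.norm(test_hist - h) for h in train_hists]))

train_hists = [bow_histogram(d, codebook) for d in all_descriptors]
```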
To summarize the steps:
Step one: use the SIFT algorithm to extract visual-word vectors from images of different categories; these vectors represent locally invariant feature points in the images;
Step two: gather all the feature-point vectors together and use the K-means algorithm to merge visual words of similar meaning, constructing a word table containing k words;
Step three: count the number of occurrences of each word-table word in an image, thereby representing the image as a k-dimensional numerical vector.
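Putting the three steps together for a query image (a hypothetical path, reusing the sketches above):

```python
query_hist = bow_histogram(extract_sift_descriptors("query.jpg"), codebook)
best = nearest_training_image(query_hist, train_hists)
print("query matches training image", best)
```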