Image Retrieval (4): TF-IDF, RootSIFT, VLAD

    • TF-IDF
    • RootSIFT
    • VLAD
TF-IDF


TF-IDF is a weighting technique commonly used in information retrieval. In text retrieval, it evaluates how important a word is to one document in a document collection. The importance of a word increases in proportion to how often it appears in the document, but decreases with how often it appears across the whole collection. Very common words, such as 'we', appear with high frequency in every article and do not represent the content of a document well.



TF-IDF weights are introduced into image retrieval in the same way:


    • Term frequency (TF): if a visual word occurs very frequently in an image, it represents the content of that image well. \[TF = \frac{\text{number of occurrences of the word in the image}}{\text{total number of words in the image}}\]

    • Inverse document frequency (IDF): some common words appear with high frequency in every image, but they do not represent the content of an image well, so they should be given a lower weight. IDF describes the general importance of a word: a word that appears in many images is given a lower weight. \[IDF = \log\left(\frac{\text{total number of images}}{\text{number of images containing the word} + 1}\right)\]
      The +1 in the denominator prevents division by zero. As the formula shows, the more images contain the current word, the smaller its IDF value and the less important the word. Conversely, the fewer images contain it, the more important the word.


Once TF and IDF have been calculated, the TF-IDF weight is their product:
\[TF\text{-}IDF = TF \times IDF\]



As the formulas for TF and IDF show, IDF is defined over the entire image database and can be computed once, after training is completed. TF is defined for a specific image, so it must be computed separately for each image.



The TF-IDF weights are applied to the BoW vector, and the result is \(l_2\)-normalized to obtain the vector used for image retrieval.
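
The whole pipeline (IDF over the database, TF per image, weighting, \(l_2\) normalization) can be sketched in a few lines of pure Python. This is only an illustrative sketch: the function names `compute_idf` and `tfidf_vector` and the toy word counts are assumptions made for this example, not part of the original implementation.

```python
import math

def compute_idf(bows):
    """IDF per visual word: log(N / (number of images containing the word + 1))."""
    n_images = len(bows)
    n_words = len(bows[0])
    idf = []
    for j in range(n_words):
        containing = sum(1 for bow in bows if bow[j] != 0)
        idf.append(math.log(n_images / (containing + 1)))
    return idf

def tfidf_vector(bow, idf):
    """Weight a BoW histogram by TF-IDF, then l2-normalize it."""
    total = sum(bow)
    if total == 0:
        return [0.0] * len(bow)  # image without any visual word
    weighted = [(count / total) * w for count, w in zip(bow, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm > 0 else weighted

# Toy database: 4 images, vocabulary of 3 visual words (made-up counts).
bows = [
    [3, 0, 1],
    [2, 1, 0],
    [4, 0, 0],
    [1, 0, 2],
]
idf = compute_idf(bows)           # computed once for the whole database
vec = tfidf_vector(bows[0], idf)  # computed per image
```

In this toy database, word 0 occurs in all four images and gets the lowest IDF, log(4/5), while word 1 occurs in only one image and gets the highest, log(4/2).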


C++ implementation

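#include <cmath>
#include <vector>
using namespace std;

// Compute the IDF of every visual word from the BoW histograms of all images.
void compute_idf(const vector<vector<int>> &bow, vector<float> &idf){

    idf.clear();
    if(bow.empty()) return; // no images, nothing to compute

    const size_t img_count = bow.size();
    const size_t clu_count = bow[0].size();

    // Start each count at 1: this is the "+1" in the IDF denominator
    idf = vector<float>(clu_count, 1.0f);

    // Count the images that contain each visual word
    for(size_t i = 0; i < img_count; i ++){
        for(size_t j = 0; j < clu_count; j ++){
            if(bow[i][j] != 0) idf[j] ++;
        }
    }

    // idf = log(total images / (images containing the word + 1))
    for(size_t i = 0; i < idf.size(); i ++){
        idf[i] = log(img_count / idf[i]);
    }
}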


The code above computes the IDF for the image library (IDF is defined over the entire library).
TF, in contrast, must be computed once for each image. The TF formula is \(TF = \frac{\text{number of occurrences of the word in the image}}{\text{total number of words in the image}}\), so TF can be seen as the \(l_1\) normalization of the image's BoW vector.



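// Compute the TF of every visual word in one image (the l1-normalized BoW histogram).
void compute_tf(const vector<int> &bow, vector<float> &tf){

    tf = vector<float>(bow.size(), 0);

    int sum = 0; // Total number of words in the image
    for(size_t i = 0; i < bow.size(); i ++){
        sum += bow[i];
    }
    if(sum == 0) return; // no words: leave every TF at 0 and avoid dividing by zero

    for(size_t i = 0; i < bow.size(); i ++){
        tf[i] = (float)bow[i] / sum;
    }
}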
RootSIFT


RootSIFT was introduced by Arandjelovic and Zisserman in their 2012 paper "Three things everyone should know to improve object retrieval" (www.robots.ox.ac.uk/~vgg/publications/2012/arandjelovic12/arandjelovic12.pdf).



When comparing histograms, Euclidean distance usually performs worse than the chi-square or Hellinger kernels, so why is Euclidean distance always used with SIFT features?
Whether SIFT features are matched, clustered to build a visual vocabulary, or used to BoW-encode an image, Euclidean distance is the measure used. But a SIFT descriptor is itself a histogram, so why compare SIFT descriptors with Euclidean distance, and is there a more accurate way to compare them?


A SIFT descriptor is a histogram of the gradients in the neighborhood of a keypoint. For a more detailed introduction, see Image Retrieval (1): Revisiting SIFT, Based on a VLFeat Implementation.


Zisserman argues that Euclidean distance is used to measure the similarity of SIFT features only because it was the measure used when SIFT was first proposed, and that a more accurate measure than Euclidean distance can be found. Arandjelovic and Zisserman therefore proposed RootSIFT as an extension of the SIFT feature.



After extracting the SIFT descriptor \(x\), RootSIFT is obtained with the following steps:
1. \(l_1\)-normalize the descriptor \(x\) to get \(x'\)
2. Take the square root of each element of \(x'\)
3. \(l_2\)-normalize the result (optional)


Sources disagree on whether the final \(l_2\) normalization step is needed. The paper does not state that \(l_2\) normalization is required, but the accompanying presentation includes an \(l_2\) normalization step. Another view is that explicit \(l_2\) normalization is unnecessary: SIFT components are non-negative, so after \(l_1\) normalization they sum to 1, and the squares of their square roots also sum to 1. The vector therefore already has unit \(l_2\) norm and needs no further normalization.
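
The claim about the \(l_2\) norm, and the link to the Hellinger kernel, can be checked numerically. The sketch below is a pure-Python illustration under stated assumptions: the helper names (`root_sift`, `hellinger_kernel`, and so on) and the 4-bin vectors are made up for this example, and real SIFT descriptors have 128 non-negative bins.

```python
import math

def root_sift(desc, eps=1e-7):
    """l1-normalize a non-negative descriptor, then take element-wise square roots."""
    total = sum(desc) + eps
    return [math.sqrt(v / total) for v in desc]

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def euclidean_sq(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def hellinger_kernel(a, b):
    """Hellinger kernel of two histograms after l1 normalization: sum_i sqrt(a_i * b_i)."""
    sa, sb = sum(a), sum(b)
    return sum(math.sqrt((p / sa) * (q / sb)) for p, q in zip(a, b))

# Made-up 4-bin "descriptors" standing in for 128-bin SIFT descriptors.
x = [4.0, 0.0, 1.0, 3.0]
y = [1.0, 2.0, 2.0, 3.0]
rx, ry = root_sift(x), root_sift(y)

norm_rx = l2_norm(rx)                     # close to 1 without an explicit l2 step
lhs = euclidean_sq(rx, ry)                # squared Euclidean distance between RootSIFT vectors
rhs = 2.0 - 2.0 * hellinger_kernel(x, y)  # 2 - 2 * H(x, y)
```

Up to the small `eps` term, `norm_rx` equals 1 and `lhs` equals `rhs`, so comparing RootSIFT vectors with Euclidean distance amounts to comparing the original descriptors with the Hellinger kernel.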

Python implementation

# import the necessary packages
import numpy as np
import cv2

class RootSIFT:
    def __init__(self):
        # initialize the SIFT feature extractor (this is the OpenCV 2.4 API;
        # OpenCV 4.4 and later provide cv2.SIFT_create() instead)
        self.extractor = cv2.DescriptorExtractor_create("SIFT")

    def compute(self, image, kps, eps=1e-7):
        # compute SIFT descriptors
        (kps, descs) = self.extractor.compute(image, kps)

        # if there are no keypoints or descriptors, return an empty tuple
        if len(kps) == 0:
            return ([], None)

        # apply the Hellinger kernel by first L1-normalizing and taking the
        # square-root
        descs /= (descs.sum(axis=1, keepdims=True) + eps)
        descs = np.sqrt(descs)
        #descs /= (np.linalg.norm(descs, axis=1, ord=2) + eps)

        # return a tuple of the keypoints and descriptors
        return (kps, descs)

From www.pyimagesearch.com/2015/04/13/implementing-rootsift-in-python-and-opencv/

C++ implementation

for(int i = 0; i < siftFeature.rows; i ++){
    // Convert to float type
    Mat f;
    siftFeature.row(i).convertTo(f,CV_32FC1);

    normalize(f,f,1,0,NORM_L1); // l1-normalize
    sqrt(f,f); // element-wise square root gives RootSIFT
    rootSiftFeature.push_back(f);
}
VLAD


Vector of Locally Aggregated Descriptors (VLAD)



The BoW method described above has been widely used in image classification and retrieval. Through clustering, BoW re-encodes the local features of an image and has strong representational power. Combined with a margin-based classifier such as SVM, it also achieves good classification results. However, for large image collections, the limited size of the visual vocabulary makes the BoW encoding increasingly coarse: more image information is lost after encoding, and retrieval precision drops.



In 2010, the paper "Aggregating local descriptors into a compact image representation" presented a new image representation, VLAD, with improvements in three areas:


    • Using VLAD to represent the local features of an image
    • Dimensionality reduction with PCA
    • Building the index with ADC (asymmetric distance computation)


BoW counts how often each visual word appears in the image. VLAD instead accumulates, for each cluster center, the residuals between that center and the features assigned to it. The formula is:
\[v_{i,j} = \sum_{x \text{ such that } NN(x) = c_i} x_j - c_{i,j}\]
Here \(x_j\) is the \(j\)-th component of a local descriptor \(x\) of the image, \(c_i\) is the cluster center nearest to \(x\), and \(x_j - c_{i,j}\) is the \(j\)-th component of the difference between the descriptor and its nearest cluster center. If SIFT features are used and the visual vocabulary has \(k\) words, this gives \(k\) 128-dimensional vectors \(v_i\).
These \(k\) vectors are then concatenated into a single vector of length \(k \times d\) (\(d\) is the dimension of the local feature, for example 128 for SIFT), and this vector is \(l_2\)-normalized to obtain the VLAD representation of the image.
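
The aggregation described above can be sketched in pure Python. This is a minimal illustration, assuming a brute-force nearest-center search; the function names and the toy 2-D data are made up for this example and stand in for a real vocabulary of 128-D SIFT cluster centers.

```python
import math

def nearest_center(x, centers):
    """Index of the cluster center closest to x (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    return dists.index(min(dists))

def vlad(descriptors, centers):
    """Accumulate residuals per center, flatten to k*d values, then l2-normalize."""
    k, d = len(centers), len(centers[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        i = nearest_center(x, centers)
        for j in range(d):
            v[i][j] += x[j] - centers[i][j]  # residual component x_j - c_{i,j}
    flat = [val for row in v for val in row]
    norm = math.sqrt(sum(val * val for val in flat))
    return [val / norm for val in flat] if norm > 0 else flat

# Toy data: k = 2 centers and d = 2 dimensions (SIFT would use d = 128).
centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
code = vlad(descriptors, centers)
```

Here the first two descriptors fall on center 0 and contribute the residual sum (1, 2), the third falls on center 1 and contributes (-1, 0), so the unnormalized vector is (1, 2, -1, 0) before division by its \(l_2\) norm.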



Because VLAD stores the residuals between the features and their nearest cluster centers, many components of the vector are 0. In other words, the vector is sparse (a few components carry most of the energy), so dimensionality reduction (for example, PCA) can be applied to VLAD to further reduce the size of the vector.


Implementation

void Vocabulary::transform_vlad(const cv::Mat &f,cv::Mat &vlad)
{
    // Find the nearest cluster center for each feature (each row of f)
    Ptr<FlannBasedMatcher> matcher = FlannBasedMatcher::create();
    vector<DMatch> matches;
    matcher->match(f,m_voc,matches);

    // Accumulate the residuals per cluster center: one row per center
    Mat responseHist(m_voc.rows,f.cols,CV_32FC1,Scalar::all(0));
    for( size_t i = 0; i < matches.size(); i++ ){
        int queryIdx = matches[i].queryIdx; // feature index
        int trainIdx = matches[i].trainIdx; // cluster index
        Mat residual;
        subtract(f.row(queryIdx),m_voc.row(trainIdx),residual,noArray());
        add(responseHist.row(trainIdx),residual,responseHist.row(trainIdx),noArray(),responseHist.type());
    }

    // l2-normalize; skip the division when the norm is 0 (for example, no features)
    double l2 = norm(responseHist,NORM_L2);
    if(l2 > 0) responseHist /= l2;

    vlad = responseHist.reshape(0,1); // Reshape the matrix to a 1 x (k*d) vector
}


With OpenCV, the implementation is fairly simple. The match method of FlannBasedMatcher is used here to find the nearest cluster center (visual word) for each feature. The computation can be divided into the following three steps:


    • subtract computes the difference between a feature and its nearest cluster center, and add accumulates the differences that belong to the same cluster center
    • The resulting matrix responseHist is \(l_2\)-normalized
    • Using the reshape method, the matrix is flattened into a one-dimensional vector of length \(k \times d\), which is the VLAD vector.
Summary


BoW is usually used together with TF-IDF, but because of the limit on vocabulary size, VLAD is a good alternative. RootSIFT is an extension of the original SIFT.

