In their ICCV 2003 paper [10], Josef Sivic and Andrew Zisserman borrowed the methods of document retrieval for object retrieval in video. They first quantize the local feature descriptors of image frames into words, building a visual word dictionary on top of SIFT features, and combine the ideas of stop words, TF-IDF weighting and cosine similarity to retrieve the frames containing the same object. Finally, object matching is completed with local feature matching and spatial consistency checks. Document retrieval and computer vision are thus deeply connected: in the CV field we often face the problem of fusing many local feature descriptors into a single feature vector, as in the commonly used BOVW, VLAD and Fisher Vector representations. Below, we take document retrieval as the entry point and briefly study these local feature aggregation methods.
Document Retrieval
A document retrieval system typically involves the following steps:
- Segment the document into words. Word segmentation is easy for English documents, because words are separated by spaces, but Chinese word segmentation is a technical problem in its own right. Well-regarded Chinese word segmentation projects include ICTCLAS, Paoding, FudanNLP, IKAnalyzer and Jieba.
- Represent each word by its stem (this step is not needed for Chinese). For example, walk, walking and walks are all represented by walk.
- Use a stop word list to filter out the most common words, such as "the", "a" and "ah". However, in some cases the words we need are exactly those in the stop word list, so some systems now filter only a subset of the stop words.
- The remaining words make up the dictionary of the whole corpus, and each word is assigned an ID. For very large corpora the dictionary contains at least millions of words, so efficient data structures such as hash tables or trees are needed to build the dictionary and speed up lookups.
- Each document is represented by a vector \(f=\{f_1,\cdots,f_k\}\), where each dimension of the vector is the number of occurrences of the corresponding word in the document. Because the vector is sparse, it can be stored as a sparse vector in which each non-zero element is stored as a (word id, frequency) pair.
- Build an inverted index, as shown in Figure 1.
- Compute a weight for each dimension of the vector to form a new vector \(v=\{w_1,\cdots,w_k\}\): \begin{equation} w_i=\frac{f_i}{f}\log\frac{n}{n_i} \end{equation} where \(f_i\) is the number of occurrences of word \(i\) in the document, \(f\) is the total number of words in the document after filtering, \(n\) is the number of documents in the whole corpus, and \(n_i\) is the number of documents containing word \(i\). The idea behind TF-IDF is straightforward: the words that appear most frequently in a document are more likely to be its main keywords, while a word that appears in many documents carries little information.
- Given the query keywords, fetch the vectors of the documents containing those keywords from the inverted index, compute their cosine similarity with the query vector, and return the documents to the user sorted by similarity and by the documents' own weights. A minimal sketch of this pipeline is given after the list.
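To make these steps concrete, here is a minimal sketch in Python. The toy corpus, stop word list and function names are illustrative assumptions, not part of the original text; the sketch only demonstrates the inverted index, the TF-IDF weighting from the equation above, and cosine-similarity ranking.

```python
# Toy document retrieval pipeline: stop word filtering, inverted index,
# TF-IDF weighting and cosine-similarity ranking (all data illustrative).
import math
from collections import Counter, defaultdict

docs = {
    0: "the cat sat on the mat",
    1: "the dog sat on the log",
    2: "cats and dogs",
}
stop_words = {"the", "on", "and"}

# Term frequencies per document, after stop word filtering.
tf = {d: Counter(w for w in text.split() if w not in stop_words)
      for d, text in docs.items()}

# Inverted index: word -> list of (doc id, frequency) pairs.
inverted = defaultdict(list)
for d, counts in tf.items():
    for w, f in counts.items():
        inverted[w].append((d, f))

n_docs = len(docs)

def tfidf_vector(counts):
    """Sparse TF-IDF vector: w_i = (f_i / f) * log(n / n_i)."""
    total = sum(counts.values()) or 1
    return {w: (f / total) * math.log(n_docs / len(inverted[w]))
            for w, f in counts.items() if w in inverted}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def search(query):
    q = tfidf_vector(Counter(w for w in query.split() if w not in stop_words))
    # Only documents sharing at least one query word need to be scored.
    candidates = {d for w in q for d, _ in inverted.get(w, [])}
    scores = {d: cosine(q, tfidf_vector(tf[d])) for d in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("cat mat"))  # documents ranked by cosine similarity
```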
Bag of Visual Words
Large-scale image retrieval must consider not only retrieval accuracy but also time and space cost. The most popular image representation is the bag of visual words (BOVW), which borrows the bag-of-words idea from natural language processing (NLP), as shown in Figure 2. The BOVW pipeline, illustrated in Figure 3, proceeds as follows:
- Extract SIFT feature descriptors \(\{d_i\in\mathbb{R}^n\mid i=1,\cdots,N\}\) from the image set;
- Run k-means or GMM clustering on these feature descriptors; the resulting cluster centers \(\{c_k\in\mathbb{R}^n\mid k=1,\cdots,K\}\) are the visual words of the visual dictionary, which completes the construction of the dictionary;
- For a given image, extract its feature descriptors \(\{d_j\in\mathbb{R}^n\mid j=1,\cdots,M\}\), use hard or soft quantization [11] to compute the probability \(\gamma_j=[\gamma_{j1},\cdots,\gamma_{jK}]\) that each descriptor falls onto each visual word, and represent the image by the histogram \(\sum_{j=1}^M\gamma_j\);
- Finally, normalize the BOVW vector with the \(l_2\) norm and weight each visual word using the idea of IDF (inverse document frequency). A sketch of this pipeline follows the list.
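The following sketch shows the BOVW pipeline under stated assumptions: random arrays stand in for SIFT descriptors, the dictionary size and all names are illustrative, and scikit-learn's KMeans is used for the clustering step.

```python
# Minimal BOVW sketch, assuming SIFT descriptors are already extracted
# (here replaced by random data); the dictionary size K is illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
K = 64  # number of visual words

# Stand-in for SIFT descriptors of the training image set (N x 128).
train_descriptors = rng.normal(size=(5000, 128))

# 1) Build the visual dictionary by clustering descriptors.
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(train_descriptors)

def bovw_histogram(descriptors):
    """Hard-quantize descriptors to their nearest visual word and count."""
    words = kmeans.predict(descriptors)
    return np.bincount(words, minlength=K).astype(float)

# 2) Encode a collection of images (each a set of descriptors).
images = [rng.normal(size=(rng.integers(100, 300), 128)) for _ in range(20)]
hists = np.stack([bovw_histogram(d) for d in images])

# 3) IDF weighting: down-weight visual words that occur in many images.
df = (hists > 0).sum(axis=0)                       # document frequency
idf = np.log(len(images) / np.maximum(df, 1))
weighted = hists * idf

# 4) L2 normalization, so cosine similarity reduces to a dot product.
bovw = weighted / np.maximum(np.linalg.norm(weighted, axis=1, keepdims=True), 1e-12)
print(bovw.shape)  # (num_images, K)
```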
So, based on BOVW, how do we do image search? Can we get some inspiration from web search? Inverted indexes commonly used by search engines can also be useful here.
The success of BOVW mainly stems from two points: 1) strong local feature descriptors such as SIFT [7,12,13] laid a solid foundation for BOVW; 2) the BOVW representation is a fixed-length vector whose dimension equals the size of the visual dictionary, so the similarity between two samples is easy to compute with Euclidean or cosine distance, and mature classification or regression algorithms such as SVM or logistic regression can then handle the downstream task. However, in large-scale image search, BOVW feature vectors typically reach millions of dimensions. To reduce storage overhead, some researchers have proposed further compressing BOVW into binary form [4,10], i.e., discarding the frequency of each visual word and keeping only a 1 or 0 to indicate whether it appears. The storage overhead then drops to one eighth of the original, at the cost of somewhat lower accuracy than the original BOVW. According to the experiments in [4], the binary form of BOVW becomes comparable to the original BOVW feature once the visual dictionary grows to about 30,000 words, as shown in Figure 4. In practice, high-dimensional BOVW vectors are usually sparse, so another strategy for reducing storage is to store only the non-zero elements as (location, value) pairs; for binary BOVW, only the location needs to be stored. In addition, the inverted lists of [10] can greatly speed up retrieval. A small sketch of these storage ideas is given below.
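The binarization and sparse-storage ideas above can be sketched as follows; the dictionary size, the random stand-in histograms and the use of SciPy's CSR format are assumptions for illustration only.

```python
# Sketch of binary / sparse BOVW storage: keep only the identities of the
# visual words that occur, stored as sparse (location, value) pairs.
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
K = 30_000                                       # large visual dictionary
hists = rng.poisson(0.001, size=(5, K))          # stand-in for sparse BOVW counts

# Binary BOVW: drop the frequencies, keep presence/absence only.
binary = hists > 0

# Sparse storage: a CSR matrix keeps (column index, value) pairs per row;
# for the binary case only the column indices of the non-zeros matter.
sparse_counts = csr_matrix(hists)
nonzero_locations = [row.indices for row in sparse_counts]

# Inverted index over visual words: word id -> list of images containing it.
inverted = {w: np.flatnonzero(binary[:, w]).tolist()
            for w in np.unique(sparse_counts.indices)}

print(sparse_counts.nnz, [len(loc) for loc in nonzero_locations])
```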
Vector of locally aggregated descriptors
VLAD (vector of locally aggregated descriptors) [5] aggregates local features such as SIFT into a compact representation, with computational cost, storage overhead and effectiveness lying between BOVW and the Fisher Vector [9]. The dictionary \(\mathcal{C}=\{c_1,\cdots,c_K\}\) is built in the same way as in BOVW; the difference is that VLAD accumulates, for each center \(c_i\), the residuals between \(c_i\) and all the descriptors assigned to it. VLAD can be seen as a simplified version of the Fisher Vector, describing richer first-order statistics of the feature descriptors relative to each visual word. Assuming the local feature descriptor has dimension \(d\), the VLAD feature has dimension \(D=K\times d\), where the sub-vector associated with visual word \(c_i\) is \begin{equation} v_i=\sum_{x:\,\mathrm{NN}(x)=c_i}(x-c_i) \end{equation} Stacking these residual sums yields the VLAD descriptor shown in Figure 5: \begin{equation} V=\begin{bmatrix} v_1\\ \vdots\\ v_K \end{bmatrix}\in\mathbb{R}^{K\times d} \end{equation} Finally, the resulting VLAD vector needs to be normalized. The most common scheme is to divide each \(v_i\) by its \(l_2\) norm [2], or to apply the transform \(\mathrm{sign}(z)\sqrt{|z|}\) to each element \(z\) of \(V\) and then normalize with the global \(l_2\) norm. Experimental results show that even with a relatively small number of visual words, \(K\in[16,256]\), VLAD still achieves good results. A minimal sketch is given below.
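A minimal VLAD sketch under stated assumptions: random arrays stand in for the descriptors, scikit-learn's KMeans builds the dictionary, and each \(v_i\) is intra-normalized [2] before a global \(l_2\) normalization (one of the normalization choices mentioned above).

```python
# Minimal VLAD sketch: accumulate residuals x - c_i for descriptors assigned
# to each visual word, then normalize. Dictionary and data are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
K, d = 16, 64

# Stand-in for SIFT descriptors used to learn the dictionary.
train = rng.normal(size=(3000, d))
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(train)
centers = kmeans.cluster_centers_                      # (K, d)

def vlad(X):
    """Encode descriptors X (N x d) into a K*d VLAD vector."""
    assignments = kmeans.predict(X)                    # nearest visual word
    V = np.zeros((K, d))
    for k in range(K):
        members = X[assignments == k]
        if len(members):
            V[k] = (members - centers[k]).sum(axis=0)  # residual sum v_k
    # Intra-normalization: L2-normalize each v_k separately [2].
    V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    v = V.ravel()
    return v / max(np.linalg.norm(v), 1e-12)           # global L2 normalization

print(vlad(rng.normal(size=(200, d))).shape)           # (K*d,)
```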
Fisher Vector
In image classification and retrieval applications, BOVW-based feature representations are usually sparse and high-dimensional. The FV (Fisher Vector) yields a more compact feature description that is better suited to image classification and retrieval. The FV can be derived as a special case of the Fisher kernel [3]. Like other kernel functions, the Fisher kernel computes the similarity between two samples. Unlike common kernels such as the polynomial kernel or the RBF kernel, however, the Fisher kernel measures the similarity between samples \(x\) and \(x'\) under a generative model \(p(x|\theta)\). The model parameters \(\theta\) are usually obtained by maximum likelihood estimation on a batch of training samples, so that the assumed probability model fits the data distribution as well as possible. Once the probability model \(p(x|\theta)\) has been learned, each sample is represented in the kernel space as \begin{equation} \hat{\phi}(x)=\nabla_{\theta}\log p(x|\theta) \end{equation} where \(\hat{\phi}(x)\) can be interpreted as how much the sample influences the model parameters. Next, \(\hat{\phi}(x)\) is whitened (whitening transform) [1], so that each dimension has zero mean and finite variance and the corresponding covariance matrix becomes the identity; the aim is to further remove correlations between dimensions and reduce redundancy in the data. Because the model parameters \(\theta\) are obtained by maximum likelihood, \(\hat{\phi}(x)\) is the gradient of the log-likelihood function and therefore \(E[\hat{\phi}(x)]=0\). Since \(\hat{\phi}(x)\) is already centered, its covariance matrix is \begin{equation} H=E_{x\sim p(x|\theta)}\left[\hat{\phi}(x)\hat{\phi}(x)^\top\right] \end{equation} The final FV encoding \(\phi(x)\) is the whitened gradient of the log-likelihood: \begin{equation} \phi(x)=H^{-\frac{1}{2}}\nabla_\theta\log p(x|\theta). \end{equation} The Fisher kernel then computes the similarity between samples as \begin{equation} K(x,x')=\langle\phi(x),\phi(x')\rangle=\nabla_\theta\log p(x|\theta)^\top H^{-1}\nabla_\theta\log p(x'|\theta) \end{equation} The Fisher Vector is designed to encode the local feature descriptors of an image into a vector that is easy to learn from and to measure distances on. We assume the \(d\)-dimensional local features \(\{x_1,\cdots,x_N\}\) are generated by a Gaussian mixture model (GMM) with \(K\) components, whose parameters are \(\theta=(\mu_k,\Sigma_k,\pi_k;\,k=1,\cdots,K)\). To simplify the model, assume diagonal covariances \(\Sigma_k=\operatorname{diag}(\sigma_k^2)\) with \(\sigma_k^2\in\mathbb{R}^d_+\).
The generative model for the local feature descriptors is given by the density of the Gaussian mixture: \begin{equation} p(x|\theta)=\sum_{k=1}^K\pi_k\,p(x|\theta_k) \end{equation} where \(\theta_k=(\mu_k,\sigma_k)\) and \begin{equation} \begin{split} p(x|\theta_k)&=\frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma_k|}}\exp\left[-\frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)\right]\\ &=\frac{1}{(2\pi)^{\frac{d}{2}}\prod_{j}\sigma_{jk}}\exp\left[-\sum_{j}\frac{(x_j-\mu_{jk})^2}{2\sigma_{jk}^2}\right] \end{split} \end{equation} The Fisher Vector requires the derivatives of the log-likelihood with respect to the parameters of each component; in the following we only consider the parameters \(\theta_k\) of one particular component. Because the Gaussian density contains an exponential, its gradient can be written as \begin{equation} \nabla_{\theta_k}p(x|\theta_k)=p(x|\theta_k)\,g(x|\theta_k) \end{equation} and the gradient of the log-likelihood then takes the form \begin{equation} \nabla_{\theta_k}\log p(x|\theta)=\frac{\pi_k\,p(x|\theta_k)}{\sum_{t=1}^K\pi_t\,p(x|\theta_t)}\,g(x|\theta_k)=q_k(x)\,g(x|\theta_k) \end{equation} where \(q_k(x)\) is the soft-assignment probability that sample \(x\) belongs to the \(k\)-th component. Following [8], we make the approximation that if \(x\) is sampled from the \(k\)-th component then \(q_k(x)\approx 1\), and otherwise \(q_k(x)\approx 0\), which gives \begin{equation} \begin{split} &E_{x\sim p(x|\theta)}\left[\nabla_{\theta_k}\log p(x|\theta)\,\nabla_{\theta_t}\log p(x|\theta)^\top\right]\\ &\approx\begin{cases} \pi_k\,E_{x\sim p(x|\theta_k)}\left[g(x|\theta_k)\,g(x|\theta_k)^\top\right], & t=k,\\ 0, & t\not=k. \end{cases} \end{split} \end{equation} Under this approximation, the parameters of different Gaussian components are uncorrelated. A small sketch of the soft assignment \(q_k(x)\) is given below.
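As a small illustration of the soft assignment, the following sketch evaluates \(q_k(x)\) directly from the mixture density above. The GMM parameters are random placeholders, and the computation is done in log space for numerical stability (an implementation detail, not something from the text).

```python
# Sketch of the soft assignment q_k(x) = pi_k p(x|theta_k) / sum_t pi_t p(x|theta_t),
# computed from illustrative diagonal-covariance GMM parameters.
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 8
pi = np.full(K, 1.0 / K)                 # mixture weights pi_k
mu = rng.normal(size=(K, d))             # component means mu_k
var = np.ones((K, d))                    # diagonal variances sigma_k^2

def log_gaussian_diag(x, mu_k, var_k):
    """log p(x | theta_k) for a diagonal-covariance Gaussian."""
    return -0.5 * (np.sum(np.log(2 * np.pi * var_k))
                   + np.sum((x - mu_k) ** 2 / var_k))

def soft_assignment(x):
    """q_k(x), evaluated in log space and normalized over components."""
    log_terms = np.array([np.log(pi[k]) + log_gaussian_diag(x, mu[k], var[k])
                          for k in range(K)])
    log_terms -= log_terms.max()          # guard against underflow
    w = np.exp(log_terms)
    return w / w.sum()

x = rng.normal(size=d)
print(soft_assignment(x))                 # posterior over the K components, sums to 1
```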
Next, compute \(g(x|\theta_k)\): \begin{equation} g(x|\theta_k)=\begin{bmatrix} g(x|\mu_k)\\ g(x|\sigma_k^2) \end{bmatrix} \end{equation} where \begin{equation} [g(x|\mu_k)]_j=\frac{x_j-\mu_{jk}}{\sigma_{jk}^2} \end{equation} \begin{equation} [g(x|\sigma_k^2)]_j=\frac{1}{2\sigma_{jk}^2}\left[\left(\frac{x_j-\mu_{jk}}{\sigma_{jk}}\right)^2-1\right] \end{equation} The corresponding entries of the covariance matrix are \begin{equation} \begin{split} H_{\mu_{jk}}&=\pi_k\,E\left[g(x|\mu_{jk})\,g(x|\mu_{jk})\right]\\ &=\frac{\pi_k}{\sigma_{jk}^2}\int\left(\frac{x_j-\mu_{jk}}{\sigma_{jk}}\right)^2 p(x|\theta_k)\,dx_j\\ &=\frac{\pi_k}{\sigma_{jk}^2}\int\frac{t^2}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}\,dt\\ &=\frac{\pi_k}{\sigma_{jk}^2\sqrt{2\pi}}\left[\left.-te^{-\frac{t^2}{2}}\right|^{+\infty}_{-\infty}+\int e^{-\frac{t^2}{2}}\,dt\right]\\ &=\frac{\pi_k}{\sigma_{jk}^2} \end{split} \end{equation} Similarly, \begin{equation} H_{\sigma_{jk}^2}=\frac{\pi_k}{2\sigma_{jk}^4} \end{equation} Then, for a single feature descriptor \(x\) of the image, the FV encoding is \begin{equation} \Phi_{\mu_{jk}}(x)=H_{\mu_{jk}}^{-\frac{1}{2}}\,q_k(x)\,g(x|\mu_{jk})=q_k(x)\,\frac{x_j-\mu_{jk}}{\sqrt{\pi_k}\,\sigma_{jk}} \end{equation} \begin{equation} \Phi_{\sigma^2_{jk}}(x)=H_{\sigma^2_{jk}}^{-\frac{1}{2}}\,q_k(x)\,g(x|\sigma^2_{jk})=\frac{q_k(x)}{\sqrt{2\pi_k}}\left[\left(\frac{x_j-\mu_{jk}}{\sigma_{jk}}\right)^2-1\right] \end{equation} According to the study in [9], the prior probability parameters \(\pi_k\) contribute little to the representation, so they are ignored here. Assuming all local feature descriptors are generated independently, then for an image with \(N\) feature descriptors \(\{x_1,\cdots,x_N\}\) the overall FV encoding is \begin{equation} \begin{split} u_{jk}&=\frac{1}{N\sqrt{\pi_k}}\sum_{i=1}^{N}q_{ik}\,\frac{x_{ji}-\mu_{jk}}{\sigma_{jk}},\\ v_{jk}&=\frac{1}{N\sqrt{2\pi_k}}\sum_{i=1}^{N}q_{ik}\left[\left(\frac{x_{ji}-\mu_{jk}}{\sigma_{jk}}\right)^2-1\right] \end{split} \end{equation} where \(j=1,\cdots,d\) indexes the dimensions of the feature descriptor. The overall feature description of an image \(I\) is the vector obtained by stacking the \(\mathbf{u}_k\) and \(\mathbf{v}_k\): \begin{equation} \Phi(I)=\begin{bmatrix} \vdots\\ \mathbf{u}_k\\ \vdots\\ \mathbf{v}_k\\ \vdots \end{bmatrix} \end{equation} To further improve the performance of the Fisher Vector, normalization is also needed; commonly used schemes include the L2 norm and the power norm. In addition, we can combine it with spatial pyramid matching [6]: the basic idea is to divide the image into windows of several different sizes, compute the feature histogram of each window, combine them by pooling, and finally take a weighted combination across granularities as the final feature description, as shown in Figure 6. A minimal FV encoding sketch is given after this paragraph.
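Putting the pieces together, here is a minimal Fisher Vector sketch that follows the \(u_{jk}\) and \(v_{jk}\) formulas above. The GMM is fit with scikit-learn's GaussianMixture, the data are random stand-ins for local descriptors, and power plus \(l_2\) normalization are applied at the end as suggested in the text; all sizes and names are illustrative.

```python
# Minimal Fisher Vector sketch following the u_{jk}, v_{jk} formulas above.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
K, d = 16, 64  # number of Gaussian components and descriptor dimension

# Stand-in for local descriptors of the training set; fit a diagonal GMM.
train = rng.normal(size=(4000, d))
gmm = GaussianMixture(n_components=K, covariance_type="diag",
                      random_state=0).fit(train)
pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (K,), (K,d), (K,d)
sigma = np.sqrt(var)

def fisher_vector(X):
    """Encode descriptors X (N x d) into a 2*K*d Fisher Vector."""
    N = X.shape[0]
    q = gmm.predict_proba(X)                                  # q_ik, shape (N, K)
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (N, K, d)
    # First-order (mean) and second-order (variance) statistics.
    u = (q[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(pi)[:, None])
    v = (q[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * pi)[:, None])
    fv = np.concatenate([u.ravel(), v.ravel()])
    # Power normalization followed by global L2 normalization.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / max(np.linalg.norm(fv), 1e-12)

print(fisher_vector(rng.normal(size=(300, d))).shape)   # (2*K*d,)
```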
According to the experiments of [9], the results of combining the L2 norm, the power norm and spatial pyramid matching are compared in Figure 7.
References
- [1] Whitening transformation. http://en.wikipedia.org/wiki/Whitening_transformation.
- [2] Relja Arandjelović and Andrew Zisserman. All about VLAD. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1578–1585. IEEE, 2013.
- [3] Tommi Jaakkola, David Haussler, et al. Exploiting generative models in discriminative classifiers. Advances in Neural Information Processing Systems, pages 487–493, 1999.
- [4] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Packing bag-of-features. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2357–2364. IEEE, 2009.
- [5] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3304–3311. IEEE, 2010.
- [6] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178. IEEE, 2006.
- [7] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
- [8] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1–8. IEEE, 2007.
- [9] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In Computer Vision – ECCV 2010, pages 143–156. Springer, 2010.
- [10] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1470–1477. IEEE, 2003.
- [11] Jan C van Gemert, Cor J Veenman, Arnold WM Smeulders, and J-M Geusebroek. Visual word ambiguity. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(7):1271–1283, 2010.
- [12] Simon Winder, Gang Hua, and Matthew Brown. Picking the best DAISY. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 178–185. IEEE, 2009.
- [13] Simon AJ Winder and Matthew Brown. Learning local image descriptors. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1–8. IEEE, 2007.
Aggregating local features for Image retrieval