Information retrieval models and evaluation (natural language)


The premise of information retrieval is the indexing of information content; an index is the set of items used to identify that content. Index terms can be obtained in two ways: defined manually, or extracted automatically. The data sources we face are either structured or semi-structured languages, such as HTML, or unstructured natural language, which may be context-sensitive or context-free. An index term may be a single word or a phrase. (The discussion here assumes English data sources; to apply an existing retrieval model to Chinese, Chinese word segmentation must be performed first.)
The key issue in indexing is determining which words qualify as index terms, and which method to use to mark them.
The effectiveness of information retrieval can be quantified by two parameters:
Recall: the fraction of documents relevant to the query that are actually retrieved.
Precision: the fraction of retrieved documents that are actually relevant to the query.
Of course, we want both values to be as high as possible, so the index built for each document, and the recall and precision it yields, are the focus of our attention.
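The two measures above can be sketched in a few lines. This is a minimal illustration, not part of the original text; the function name and the document IDs are my own:

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of document ids.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 3 documents actually relevant, 2 in common:
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(p, r)  # 0.5 0.6666666666666666
```

Note that the two measures trade off against each other: retrieving everything drives recall to 1 while precision collapses, which is why both must be tracked together.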
But we must pay extra attention to which parts of speech should be indexed. Obviously, function words such as conjunctions and prepositions should be avoided as far as possible, while semantically rich content words are suitable for indexing.
The following focuses on concrete information retrieval models.
A frequency-based retrieval model: TF-IDF
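A simple way to apply this rule is to drop a fixed list of function words before indexing. The stop-word list below is a small hypothetical sample for illustration; real systems use larger curated lists:

```python
# Hypothetical stop-word list; production systems use larger curated lists.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "with"}

def index_terms(tokens):
    """Keep only content-bearing words as index-term candidates."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(index_terms("the retrieval of information in large collections".split()))
# ['retrieval', 'information', 'large', 'collections']
```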

An index based on this model should first exclude function words; that is, function words are not included in the vocabulary used for computation.
Compute tf[i,j], the frequency with which word t[j] appears in each document D[i]. The value of tf[i,j] is the number of occurrences (or the relative frequency) of word j in document i. (Using relative frequencies masks the uneven distribution of word counts across documents, but may introduce errors.)
Select a frequency threshold to filter out high-frequency words in the documents.
The goal of this step is to keep a set of index terms that identify each document well. By filtering index terms this way, we can distinguish the documents we are looking for from many others, which protects recall; and when a word's frequency is low in the other documents, precision is also protected.
But measuring by raw frequency alone has a problem: word counts are unevenly distributed across documents, which affects the accuracy of search results. There are usually two remedies: normalize the counts (compute relative frequencies), or introduce another parameter, such as the inverse document frequency.
For word j, the inverse document frequency is calculated as follows:
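The raw count and the length-normalized variant mentioned above can be sketched as follows; the function names are my own, and length normalization is only one of several possible schemes:

```python
from collections import Counter

def term_frequencies(tokens):
    """Raw count of each term in one document."""
    return Counter(tokens)

def normalized_tf(tokens):
    """Relative frequencies: raw count divided by document length.
    One way to offset uneven document lengths (not the only scheme)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

doc = "the cat sat on the mat".split()
print(term_frequencies(doc)["the"])   # 2
print(normalized_tf(doc)["the"])      # 2/6 ≈ 0.333
```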
idf_j = \log\frac{N}{df_j}

Here df[j] is defined as the number of documents in which word j appears, and N is the total number of documents. So we can see that when word j appears in only one document, idf[j] takes its largest value, log(N), and when word j appears in every document, idf[j] = 0.
In other words, the higher a word's idf value, the more it contributes to retrieval precision.
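The idf formula above can be sketched directly; this is a minimal illustration in which the function name and the toy documents are my own, and each document is represented as a set of its terms:

```python
import math

def inverse_document_frequency(term, documents):
    """idf_j = log(N / df_j), where df_j is the number of documents
    containing the term at least once. Returns 0.0 for unseen terms."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(n / df) if df else 0.0

docs = [{"information", "retrieval"},
        {"information", "theory"},
        {"retrieval", "evaluation"}]

# "evaluation" appears in 1 of 3 documents -> the maximum, log(3)
print(inverse_document_frequency("evaluation", docs))
# "information" appears in 2 of 3 documents -> log(3/2)
print(inverse_document_frequency("information", docs))
```

As the text notes, a term confined to a single document reaches the maximum log(N), and a term present in every document scores 0.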
So we can calculate the weight of word j in document i:

w_{ij} = tf_{ij} \times idf_j

We can judge whether a document is relevant by the weights of its words. One key purpose of indexing is to differentiate documents, so we can build a document's index from its word weights.
Going further, a document can be abstracted as a point in a high-dimensional space. From this point of view, when two points are close together in the space, the two documents are very similar. A high-frequency index word that does not separate documents well increases the density of the document space. Our goal is to distinguish documents pairwise to improve retrieval precision, so the problem to solve is choosing index words that keep the document space density from becoming too high.
Evaluating the discrimination power of words
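Putting tf and idf together gives the full weighting scheme. The sketch below computes w[i,j] for every term in every document; the function name and the toy corpus are my own:

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """w_ij = tf_ij * idf_j for every term j in every document i.
    documents: list of token lists. Returns one weight dict per document."""
    n = len(documents)
    df = Counter()                      # document frequency of each term
    for doc in documents:
        df.update(set(doc))             # count each term once per document
    weights = []
    for doc in documents:
        tf = Counter(doc)               # raw term counts in this document
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights

docs = [["information", "retrieval", "retrieval"],
        ["information", "theory"],
        ["retrieval", "evaluation"]]
w = tfidf_weights(docs)
# "retrieval": tf = 2 in doc 0, df = 2 of 3 docs -> 2 * log(3/2)
print(w[0]["retrieval"])
```

A term that appears in every document gets idf = log(1) = 0, so its weight vanishes regardless of how often it occurs, which matches the goal of discounting undiscriminating words.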

We define the discrimination value dv[j] = Q[j] - Q to mark the discrimination power of word j. Here Q is the space density of the collection, i.e. the average pairwise similarity between documents, and Q[j] is the same average computed after removing word j from the index.
Q = \frac{\sum^n_{i=1}\sum^n_{k=1, k \neq i} sim(d_i, d_k)}{n(n-1)}

Whether a word is a good index term is determined primarily by the sign and size of dv[j]. If dv[j] > 0, word j is a good index term: removing it packs the documents closer together, so it was separating them. If dv[j] < 0, word j is a bad index term.
This yields another way to weight words: define the weight of word j in document i as w[i,j] = tf[i,j] * dv[j].
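The density and discrimination computations can be sketched end to end. Note one detail: since Q here is an average *similarity*, a good discriminator raises the density when removed, so the sketch computes dv[j] as Q[j] - Q (with a distance-based Q the sign would flip). Cosine similarity stands in for sim; all names are my own:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def space_density(vectors):
    """Q: average similarity over all unordered document pairs."""
    pairs = list(combinations(range(len(vectors)), 2))
    return sum(cosine(vectors[i], vectors[k]) for i, k in pairs) / len(pairs)

def discrimination_value(term, vectors):
    """dv_j = Q_j - Q: change in density when term j is dropped
    from every document vector. Positive -> good discriminator."""
    q = space_density(vectors)
    without = [{t: w for t, w in v.items() if t != term} for v in vectors]
    return space_density(without) - q

docs = [{"common": 1.0, "alpha": 1.0},
        {"common": 1.0, "beta": 1.0},
        {"common": 1.0, "gamma": 1.0}]
# "alpha" occurs in one document only: removing it raises density
print(discrimination_value("alpha", docs) > 0)   # True
# "common" occurs everywhere: removing it scatters the documents
print(discrimination_value("common", docs) < 0)  # True
```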
