TF-IDF: A relevance ranking technique for traditional IR

Source: Internet
Author: User
Tags: query, idf

Back then, "chrysanthemum" meant only the flower, a 2B pencil was just what you used for exams, and a cucumber was nothing more than a vegetable; information retrieval (IR) technology was simply something used in libraries, databases, and similar places.

It was also back then that TF-IDF was the most popular ranking technique in information retrieval.

At this point you probably want to ask: what is TF-IDF? Hold on; before looking for the answer to that question, let's look at a problem.

In a pile of books, you want to find information related to some topic (no dirty thoughts, please). What criterion do you use to decide that book A in this pile matches your topic better than book B?

Think for a minute.

You might say: look at the titles of the books, see which ones mention the topic I'm looking for, and then skim the rest of each book to see which one best matches what I want.

That's a good idea.

That is how people think, and an information retrieval system has to do the same to give us the results we most want. But this exposes a problem: a program cannot understand the text, so it cannot make that judgment.

Here, take another minute to figure out how a program could solve this problem.

Well, as you may have noticed, the topic you are trying to query contains words that intersect with the vocabulary of a subset of this pile of books.

Yes: using the dictionary-based word segmentation technique we discussed in the previous article on search engine principles, we can find that intersection.
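As a rough sketch of that idea in Python (the book texts and query here are made up, and a plain whitespace split stands in for a real dictionary-based segmenter):

```python
# Rough sketch: find which query words each book shares.
# A real system would use a dictionary-based segmenter, not split().
books = {
    "book_a": "cooking with cucumber and other vegetables",
    "book_b": "pencil drawing techniques for the exam",
}
query = "cucumber vegetables recipes"

query_words = set(query.split())
for title, text in books.items():
    overlap = query_words & set(text.split())
    print(title, overlap)  # book_a shares words with the query; book_b does not
```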

Let's start with a dictionary, which is a collection of n words.

Σ = {T1, T2, ..., Tn}

Then, for your query Q and any one book D in the pile, you can write, in terms of this dictionary:

Q = {q1, q2, ..., qn}

D = {d1, d2, ..., dn}

where q1 is the number of times term T1 appears in your query Q, q2 is the number of times term T2 appears in Q, and so on. If qn is zero, then the n-th word does not appear in Q at all.

Let w1 = d1 / Σi di; then w1 is the frequency with which term T1 appears in D, and now D can be expressed as:

D = {w1, w2, ..., wn}, where wi (i = 1, 2, 3, ..., n) is the word frequency (term frequency, TF) of term Ti in D.
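A minimal sketch of these definitions in Python (toy data; split() again stands in for proper word segmentation):

```python
from collections import Counter

# Dictionary Σ: the vocabulary of n terms.
vocabulary = ["cucumber", "vegetable", "pencil", "exam", "we"]

document = "cucumber cucumber vegetable we we we"
counts = Counter(document.split())          # di: raw count of each term
total = sum(counts[t] for t in vocabulary)  # Σi di

# wi = di / Σi di: the term frequency of each dictionary term in D.
tf = [counts[t] / total for t in vocabulary]
print(tf)  # [0.333..., 0.166..., 0.0, 0.0, 0.5]
```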

For high-quality material (books, literature, and so on), word frequency is a good way to express the weight of a word within a document, and one that can easily be implemented in a programming language.

But now a question arises: words such as "we" and "everyone" will certainly appear in many articles, and if we measure by word frequency alone, the conclusion above clearly no longer holds.

Congratulations on getting this far: this kind of vocabulary carries little meaning for identifying the content of a document.

So let's look for a distinguishing feature of these words and get rid of their effect.

Ah, yes: these words appear in many articles at the same time.

Let ki (i = 1, 2, 3, ..., n) denote the number of books in the collection that contain term Ti, and let M denote the total number of books in the collection. Then the value ki / M tells us something useful; we define it as the document frequency of Ti.

Obviously, the higher a word's document frequency, the lower its weight should be.

For ease of calculation, we commonly use a quantity that decreases as document frequency grows, which we call the inverse document frequency (IDF), defined as:

IDFi = lg(M / ki)
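A small sketch of this definition, assuming lg means log base 10 and using a made-up four-book corpus:

```python
import math

# Toy corpus of M books, each reduced to its set of distinct terms.
corpus = [
    {"cucumber", "vegetable"},
    {"pencil", "exam", "we"},
    {"we", "cucumber"},
    {"we", "vegetable", "exam"},
]
M = len(corpus)

def idf(term):
    # ki: number of books containing the term (assumed > 0 here).
    k = sum(1 for book in corpus if term in book)
    return math.log10(M / k)  # IDFi = lg(M / ki)

print(idf("we"))        # common word -> low IDF  (~0.125)
print(idf("cucumber"))  # rarer word  -> higher IDF (~0.301)
```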

With this, wi becomes the product of the two quantities (a formula I found on the Internet):

wi = TFi × IDFi = (di / Σj dj) × lg(M / ki)
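Putting the two parts together (a sketch reusing the toy numbers from above; the weight helper is purely illustrative):

```python
# wi = TFi * IDFi: the final weight of term Ti in document D.
def weight(tf, idf):
    return tf * idf

# A very frequent but common word can end up weighing less than a
# rarer, more distinctive one.
print(weight(0.5, 0.125))    # "we": high TF, low IDF        -> 0.0625
print(weight(1 / 3, 0.301))  # "cucumber": lower TF, high IDF -> ~0.100
```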

Once each weight has been quantified in this way, the relevance between a document and a query becomes a distance between the vectors D and Q; the most commonly used measure is the cosine (cos) distance (a sentence I admittedly did not fully understand, so I reproduce it verbatim):

cos(D, Q) = (D · Q) / (|D| × |Q|)
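A minimal sketch of the cosine measure between two weight vectors (the vectors here are made up):

```python
import math

def cosine(d, q):
    # cos(D, Q) = (D . Q) / (|D| * |Q|)
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)

# Vectors pointing in a similar direction score close to 1;
# vectors with no shared terms score 0.
print(cosine([0.1, 0.0, 0.3], [0.2, 0.0, 0.5]))  # ~0.998
print(cosine([0.1, 0.0, 0.0], [0.0, 0.4, 0.0]))  # 0.0
```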

Although in theory the algorithm above may look rather crude (it ignores the meaning of an article and treats it as a mere collection of words), in practice its value has been widely recognized (especially for book search as described above).

Of course, for today's web full of mixed-quality pages, relying solely on TF-IDF is not enough (it is easy to get a good ranking by creating pages stuffed with keywords), which also contributed to the birth of a series of link-based algorithms.

Original address: http://www.seosos.cn/search-engine/tf-idf.html


