TF-IDF: A relevance ranking technique for traditional IR

Source: Internet
Author: User
Tags: query, idf

Back then, "chrysanthemum" meant only the flower, a 2B pencil was just what you used for exams, and a cucumber was nothing more than a vegetable; information retrieval (IR) technology was simply something used in libraries, databases, and similar places.

It was also back then that TF-IDF was the most popular ranking technique in information retrieval.

At this point you probably want to ask: what is TF-IDF? Hold on; before looking for the answer to that question, let's look at a problem.

In a pile of books, you want to find information related to some topic (no dirty thoughts, please). What criterion do you use to decide that book A in this pile matches your topic better than book B?

Think for a minute.

You might say: look at the titles of the books, see which ones mention the topic I'm looking for, and then skim the rest of each book to see which one best matches what I want.

That's a good idea.

That is how people think, and an information retrieval system has to do the same to give us the results we most want. But this exposes a problem: a program cannot understand the text, so it cannot make that judgment.

Here, take another minute to figure out how a program could solve this problem.

Well, as you may have noticed, the topic you are trying to query contains words that intersect with the vocabulary of a subset of this pile of books.

Yes: using the dictionary-based word segmentation technique we discussed in the previous article on search engine principles, we can find that intersection.
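As a rough sketch of that idea in Python (the book texts and query here are made up, and a plain whitespace split stands in for a real dictionary-based segmenter):

```python
# Rough sketch: find which query words each book shares.
# A real system would use a dictionary-based segmenter, not split().
books = {
    "book_a": "cooking with cucumber and other vegetables",
    "book_b": "pencil drawing techniques for the exam",
}
query = "cucumber vegetables recipes"

query_words = set(query.split())
for title, text in books.items():
    overlap = query_words & set(text.split())
    print(title, overlap)  # book_a shares words with the query; book_b does not
```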

Let's start with a dictionary, which is a collection of n words.

Σ = {T1, T2, ..., Tn}

Then, for your query Q and any one book D in the pile, you can write, in terms of this dictionary:

Q = {q1, q2, ..., qn}

D = {d1, d2, ..., dn}

where q1 is the number of times term T1 appears in your query Q, q2 is the number of times term T2 appears in Q, and so on. If qn is zero, then the n-th word does not appear in Q at all.

Let w1 = d1 / Σi di; then w1 is the frequency with which term T1 appears in D, and now D can be expressed as:

D = {w1, w2, ..., wn}, where wi (i = 1, 2, 3, ..., n) is the word frequency (term frequency, TF) of term Ti in D.
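A minimal sketch of these definitions in Python (toy data; split() again stands in for proper word segmentation):

```python
from collections import Counter

# Dictionary Σ: the vocabulary of n terms.
vocabulary = ["cucumber", "vegetable", "pencil", "exam", "we"]

document = "cucumber cucumber vegetable we we we"
counts = Counter(document.split())          # di: raw count of each term
total = sum(counts[t] for t in vocabulary)  # Σi di

# wi = di / Σi di: the term frequency of each dictionary term in D.
tf = [counts[t] / total for t in vocabulary]
print(tf)  # [0.333..., 0.166..., 0.0, 0.0, 0.5]
```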

For high-quality material (books, literature, and so on), word frequency is a good way to express the weight of a word within a document, and one that can easily be implemented in a programming language.

But now a question arises: words such as "we" and "everyone" will certainly appear in many articles, and if we measure by word frequency alone, the conclusion above clearly no longer holds.

Congratulations on getting this far: this kind of vocabulary carries little meaning for identifying the content of a document.

So let's look for a distinguishing feature of these words and get rid of their effect.

Ah, yes: these words appear in many articles at the same time.

Let ki (i = 1, 2, 3, ..., n) denote the number of books in the collection that contain term Ti, and let M denote the total number of books in the collection. Then the value ki / M tells us something useful; we define it as the document frequency of Ti.

Obviously, the higher a word's document frequency, the lower its weight should be.

For ease of calculation, we commonly use a quantity that decreases as document frequency grows, which we call the inverse document frequency (IDF), defined as:

IDFi = lg(M / ki)
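A small sketch of this definition, assuming lg means log base 10 and using a made-up four-book corpus:

```python
import math

# Toy corpus of M books, each reduced to its set of distinct terms.
corpus = [
    {"cucumber", "vegetable"},
    {"pencil", "exam", "we"},
    {"we", "cucumber"},
    {"we", "vegetable", "exam"},
]
M = len(corpus)

def idf(term):
    # ki: number of books containing the term (assumed > 0 here).
    k = sum(1 for book in corpus if term in book)
    return math.log10(M / k)  # IDFi = lg(M / ki)

print(idf("we"))        # common word -> low IDF  (~0.125)
print(idf("cucumber"))  # rarer word  -> higher IDF (~0.301)
```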

With this, wi becomes the product of the two quantities (a formula I found on the Internet):

wi = TFi × IDFi = (di / Σj dj) × lg(M / ki)
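Putting the two parts together (a sketch reusing the toy numbers from above; the weight helper is purely illustrative):

```python
# wi = TFi * IDFi: the final weight of term Ti in document D.
def weight(tf, idf):
    return tf * idf

# A very frequent but common word can end up weighing less than a
# rarer, more distinctive one.
print(weight(0.5, 0.125))    # "we": high TF, low IDF        -> 0.0625
print(weight(1 / 3, 0.301))  # "cucumber": lower TF, high IDF -> ~0.100
```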

Once each weight has been quantified in this way, the relevance between a document and a query becomes a distance between the vectors D and Q; the most commonly used measure is the cosine (cos) distance (a sentence I admittedly did not fully understand, so I reproduce it verbatim):

cos(D, Q) = (D · Q) / (|D| × |Q|)
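A minimal sketch of the cosine measure between two weight vectors (the vectors here are made up):

```python
import math

def cosine(d, q):
    # cos(D, Q) = (D . Q) / (|D| * |Q|)
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)

# Vectors pointing in a similar direction score close to 1;
# vectors with no shared terms score 0.
print(cosine([0.1, 0.0, 0.3], [0.2, 0.0, 0.5]))  # ~0.998
print(cosine([0.1, 0.0, 0.0], [0.0, 0.4, 0.0]))  # 0.0
```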

Although in theory the algorithm above may look rather crude (it ignores the meaning of an article and treats it as a mere collection of words), in practice its value has been widely recognized (especially for book search as described above).

Of course, for today's web full of mixed-quality pages, relying solely on TF-IDF is not enough (it is easy to get a good ranking by creating pages stuffed with keywords), which also contributed to the birth of a series of link-based algorithms.

Original address: http://www.seosos.cn/search-engine/tf-idf.html


