TF-IDF, vector space model, and cosine correlation, used in search engines

Source: Internet
Author: User
Tags: idf

1. TF-IDF

TF-IDF is a weighting technique commonly used in information retrieval and data mining. It is a statistical measure of how important a word is to one document in a collection or corpus.

The main idea of TF-IDF is this: if a word or phrase appears frequently in one article but rarely in other articles, it is considered to discriminate well between categories and is suitable for classification.

Term frequency (TF) is the frequency with which a given word appears in a given document.

Inverse document frequency (IDF) is a measure of the general importance of a word.
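
The formulas that accompanied these two definitions are missing from this copy. One common textbook form, written with the same index convention used later in this article (document $i$, word $j$), is sketched below. The raw count $n_{i,j}$ is a symbol introduced here for illustration, and other variants (for example smoothed logarithms) are equally common.

```latex
tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{i,k}}
\qquad
idf_j = \log \frac{N}{df_j}
\qquad
\text{tf-idf}_{i,j} = tf_{i,j} \cdot idf_j
```

Here $n_{i,j}$ is the number of times word $j$ occurs in document $i$, $N$ is the total number of documents, and $df_j$ is the number of documents that contain word $j$.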

In general:

The more documents in a collection that contain a given term, the less that term helps to distinguish between document categories, and the smaller its weight.

Conversely, the more often a term occurs within a single document, the better it characterizes that document's content, and the greater its weight.
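
The two tendencies above can be checked on a toy corpus. The following is a minimal Python sketch, not the author's code: the function name `tf_idf`, the three sample sentences, and the choice of relative frequency times `log(N / df)` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # relative term frequency times inverse document frequency
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample corpus, already split into words
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(corpus)
```

In this corpus "the" occurs in every document, so its weight is zero everywhere, while "mat" occurs only in the first document and receives the largest weight there.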

In my experience, a TF-IDF value above 0.01 already gives good discriminating power. In a text classification task, I selected words with TF-IDF >= 0.01 or 0.02 as features and the classification results were good: the first test reached 90% accuracy.

2. Vector space model

The vector space model (VSM) is an algebraic model for representing text documents as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevance ranking.

The vector space model maps a document $d$ to a feature vector $V(d) = (t_1, \omega_1(d); \ldots; t_n, \omega_n(d))$, where the $t_i$ ($i = 1, 2, \ldots, n$) are mutually distinct terms and $\omega_i(d)$ is the weight of $t_i$ in $d$. The weight is generally defined as a function of $tf_i(d)$, the frequency with which $t_i$ appears in $d$.

Here $w_{i,j}$ denotes the weight of the $j$-th feature item (term) of the feature space in the vector of document $i$.

The formula for calculating $w_{i,j}$ has many variants. Below is a common one.
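
The formula itself did not survive in this copy. One widely used variant that is consistent with the symbols $K$, $tf_{i,j}$, $N$ and $df_j$ used in this section is length-normalized TF-IDF, shown below. This is a reconstruction under that assumption and not necessarily the exact formula the author used.

```latex
w_{i,j} = \frac{tf_{i,j} \cdot \log\left(N / df_j\right)}
               {\sqrt{\sum_{k=1}^{K} \left( tf_{i,k} \cdot \log\left(N / df_k\right) \right)^{2}}}
```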


$K$ is the dimension of the dictionary (the number of distinct terms);

$tf_{i,j}$ is the frequency with which feature word $j$ appears in document $i$ (0 if it does not appear);

$N$ is the total number of documents in the corpus;

$df_j$ is the number of documents in the corpus that contain word $j$.
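
To make the roles of these four quantities concrete, here is a small Python sketch that turns one row of term frequencies into a weight vector. It assumes the length-normalized TF-IDF variant (TF times `log(N / df)`, divided by the Euclidean norm of the row), which is only one of the many variants mentioned above. The function name `weight_vector` and the sample numbers are hypothetical.

```python
import math


def weight_vector(tf_row, df, n_docs):
    """Length-normalized TF-IDF weights for one document.

    tf_row[j] is the frequency of term j in the document,
    df[j] is the number of documents containing term j,
    n_docs is the total number of documents (N).
    """
    # Raw TF-IDF per dictionary entry; a term no document contains gets 0
    raw = [tf * math.log(n_docs / d) if d else 0.0
           for tf, d in zip(tf_row, df)]
    # Euclidean norm over all K dictionary entries
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw


# Hypothetical example: K = 3 terms, N = 4 documents
w = weight_vector(tf_row=[2, 0, 1], df=[1, 4, 2], n_docs=4)
```

With this normalization every non-empty document vector has length 1, regardless of how long the document is.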

3. Cosine correlation

The similarity between two documents can be expressed by the cosine of the angle between the corresponding vectors.

That is, the dot product of the two vectors divided by the product of their lengths.
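
Written out for two document vectors $d_1$ and $d_2$ with weights $w_{1,j}$ and $w_{2,j}$ over a $K$-term dictionary (the names $d_1$ and $d_2$ are chosen here for illustration), the statement above reads:

```latex
\cos\theta
  = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}
  = \frac{\sum_{j=1}^{K} w_{1,j} \, w_{2,j}}
         {\sqrt{\sum_{j=1}^{K} w_{1,j}^{2}} \; \sqrt{\sum_{j=1}^{K} w_{2,j}^{2}}}
```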

It can be simplified:
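
The simplified formula is missing from this copy. A likely reading, assuming each document vector has already been scaled to unit length (for example by dividing every weight by the vector's Euclidean norm), is that both norms in the denominator equal 1 and the cosine reduces to the dot product:

```latex
\cos\theta = \sum_{j=1}^{K} w_{1,j} \, w_{2,j}
\qquad \text{when } \lVert d_1 \rVert = \lVert d_2 \rVert = 1
```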

Using the techniques above, I built a simple search engine over two days on a weekend, indexing the 430,000 web pages I had crawled earlier. Although the data set is small, some of the search results are quite reliable.
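
The author's engine is not shown, so the following is only a rough Python sketch of how the three pieces can fit together: TF-IDF weights, unit-length document vectors stored in an inverted index, and ranking by cosine similarity. The class `TinyIndex`, its methods, whitespace tokenization, and the three sample pages are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index over unit-length TF-IDF document vectors."""

    def __init__(self, docs):
        self.n_docs = len(docs)
        tokenized = [text.lower().split() for text in docs]
        # df[t]: number of documents containing term t
        self.df = Counter(t for toks in tokenized for t in set(toks))
        # postings[term] -> list of (doc_id, normalized weight)
        self.postings = defaultdict(list)
        for doc_id, toks in enumerate(tokenized):
            for term, w in self._unit_vector(Counter(toks)).items():
                self.postings[term].append((doc_id, w))

    def _unit_vector(self, counts):
        """TF-IDF weights scaled to Euclidean length 1."""
        raw = {}
        for term, count in counts.items():
            if term in self.df:  # ignore words the corpus never saw
                w = count * math.log(self.n_docs / self.df[term])
                if w > 0:  # drop terms that occur in every document
                    raw[term] = w
        norm = math.sqrt(sum(w * w for w in raw.values()))
        return {t: w / norm for t, w in raw.items()} if norm else {}

    def search(self, query, top_k=3):
        """Rank documents by cosine similarity to the query."""
        q = self._unit_vector(Counter(query.lower().split()))
        scores = defaultdict(float)
        # Both vectors have unit length, so the cosine is a dot product
        for term, qw in q.items():
            for doc_id, dw in self.postings.get(term, []):
                scores[doc_id] += qw * dw
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_k]


# Hypothetical sample pages
pages = [
    "tf idf weighting for search",
    "cosine similarity of document vectors",
    "search engines rank document vectors by cosine similarity",
]
index = TinyIndex(pages)
results = index.search("cosine similarity")
```

Because only the postings of the query terms are visited, documents that share no term with the query are never scored. In this example the second page ranks above the third: both contain the two query words, but the third page spreads its weight over more terms.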


