Lucene TF-IDF Relevance Formula
By default (in versions before 6.0, where BM25 became the default), Lucene uses the TF-IDF algorithm in keyword queries to calculate the relevance between keywords and documents, and sorts the results by that score.
TF: term frequency; IDF: inverse document frequency. TF-IDF is a statistical weighting method, closely associated with the Vector Space Model. The name sounds complicated, but

TF-IDF values are often computed during text clustering, text classification, or when comparing the similarity of two documents.

very high, and a large number of dimensions are 0, so computing the angle between the vectors does not work well. In addition, the heavy computation makes the vector model practically infeasible for massive data sets such as those handled by Internet search engines.

TF-IDF model

At present, the TF-IDF model is widely used in real applications such as search engines.

TF/IDF (Term Frequency/Inverse Document Frequency) is recognized as the most important invention in information retrieval.
1. TF/IDF describes the relevance between a single term and a specific document
Term Frequency: indicates the relevance between a term and a document. Formula: number of times this term appears in the
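The formula above is cut off in the source, so the following is only a minimal sketch under an assumption: it uses one common definition of term frequency, the number of times the term appears in a document divided by the total number of tokens in that document. The function name term_frequency and the whitespace tokenization are illustrative choices, not taken from the original article.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Return the share of tokens in the document that equal `term`.

    Assumption: TF = (occurrences of term) / (total tokens in document).
    Other variants (raw counts, log-scaled counts) are also in common use.
    """
    if not document_tokens:
        return 0.0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)


# Naive whitespace tokenization, for illustration only.
tokens = "the cat sat on the mat".split()
tf_the = term_frequency("the", tokens)  # "the" is 2 of the 6 tokens
tf_dog = term_frequency("dog", tokens)  # term does not occur
```

Dividing by the document length keeps long documents from scoring higher merely because they contain more words.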

TF-IDF and its algorithm
Concept
Analysis of TF-IDF:
TF-IDF algorithms play an important role in two areas: 1. extracting the keywords of an article; 2. searching for highly relevant text based on keywords. The algorithm is recognized as the most important invention in the information retrieval field and is the basis of many algorithms and models.
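As a rough illustration of the first use, keyword extraction, the sketch below scores every term of a document by TF multiplied by IDF and keeps the highest-scoring terms. It assumes the common definitions TF = count / document length and IDF = log(N / df); the helper names tf_idf_scores and top_keywords and the toy corpus are hypothetical and not from the original article.

```python
import math
from collections import Counter


def tf_idf_scores(doc_tokens, corpus):
    """Map each term of doc_tokens to tf * idf over the given corpus.

    Assumptions: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)
        # df: number of corpus documents that contain the term.
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df) if df else 0.0
        scores[term] = tf * idf
    return scores


def top_keywords(doc_tokens, corpus, k=3):
    """Return the k terms with the highest TF-IDF score (ties by name)."""
    scores = tf_idf_scores(doc_tokens, corpus)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


corpus = [
    "the bee keeps the hive".split(),
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]
keywords = top_keywords(corpus[0], corpus, k=3)
the_score = tf_idf_scores(corpus[0], corpus)["the"]
```

In this toy corpus the word "the" appears in every document, so its IDF is log(1) = 0 and it never ranks as a keyword, however often it occurs.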
What is TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is

This title seems very complicated. In fact, I want to talk about a very simple question.
There is a long article. I want to use a computer to extract its key words (automatic keyphrase extraction) without manual intervention. How can I do it correctly?
This problem involves many cutting-edge computer fields such as data mining, text processing, and information retrieval. Surprisingly, however, there is a very simple classical algorithm that can pro

1. TF-IDF
The main idea of IDF is that the fewer documents contain the term t, that is, the smaller n is, the larger the IDF and the better the term t distinguishes between classes. If the number of documents in a class C that contain the term t is m, and the total number of documents in the other classes that contain t is k, it is clear that
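The relationship between n and IDF described above can be sketched as follows. This is a minimal example that assumes the widely used form IDF = log(N / n), where N is the total number of documents and n is the number of documents containing the term; the snippet itself does not state the exact formula, and the function name and sample corpus are illustrative.

```python
import math


def inverse_document_frequency(term, corpus):
    """Return log(N / n) for `term` over a corpus of token collections.

    Assumption: N is the number of documents in the corpus and n is the
    number of documents that contain the term. A term that occurs in no
    document gets 0.0 here to avoid dividing by zero.
    """
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0
    return math.log(n_docs / n_containing)


corpus = [
    {"search", "engine", "index"},
    {"search", "query", "ranking"},
    {"search", "lucene", "index"},
    {"search", "document", "term"},
]
idf_rare = inverse_document_frequency("lucene", corpus)    # n = 1
idf_medium = inverse_document_frequency("index", corpus)   # n = 2
idf_common = inverse_document_frequency("search", corpus)  # n = 4 = N
```

The smaller n is, the larger the resulting value, and a term that appears in every document scores log(1) = 0, which matches the intuition that such a term cannot distinguish one document from another.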

The more often a


