1. TF-IDF
TF-IDF is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus.
The main idea of TF-IDF is: if a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good discriminating power and is suitable as a classification feature.
Term frequency (TF) is how often a given word appears in a document.
Inverse document frequency (IDF) is a measure of a word's general importance across the corpus.
In general:
The more documents in a collection that contain a given term, the weaker that term is at distinguishing document categories, and the smaller its weight;
Conversely, the more frequently a term appears within a single document, the better it distinguishes the content of that document, and the greater its weight.
In my experience, terms with TF-IDF above 0.01 already discriminate well. In my text classification work, selecting words with TF-IDF >= 0.01 or 0.02 as features gave good results: a first test reached 90% accuracy.
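The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the author's original code; the tokenized toy corpus is made up for the example, and TF is taken as raw count divided by document length:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the count of a term in a document divided by the document
    length; IDF is log(N / df), where df is the number of documents
    containing the term.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each document at most once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]
w = tf_idf(docs)
# "cat" occurs in only one of the three documents, so it gets a higher
# weight than "sat", which occurs in two.
```

Note how a term that appears in every document gets IDF = log(1) = 0, i.e. no discriminating power at all, which matches the intuition stated above.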
2. Vector space model
The vector space model (VSM) is an algebraic model that represents text documents as vectors of identifiers (such as index terms). It is used in information filtering, information retrieval, indexing, and relevance ranking.
The vector space model maps a document d to a feature vector V(d) = (t_1, w_1(d); ...; t_n, w_n(d)), where t_i (i = 1, 2, ..., n) are distinct terms and w_i(d) is the weight of t_i in d, usually defined as a function of the frequency tf_i(d) with which t_i appears in d.
Here w_{i,j} denotes the weight of the j-th feature term (word) of the feature space in the document vector.
There are many variants of the formula for w_{i,j}; a common one is:
w_{i,j} = tf_{i,j} * log(N / df_j)
where:
k is the dictionary dimension (the length of the document vector);
tf_{i,j} is the frequency with which the feature word appears in the document (0 if it does not appear);
N is the total number of documents in the corpus;
df_j is the number of documents in the corpus that contain the word.
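Mapping a document to such a k-dimensional vector can be sketched as follows. This is an assumed toy setup (the dictionary, document frequencies, and corpus size are invented for the example), using the raw-count variant of tf_{i,j}:

```python
import math

# Hypothetical toy corpus statistics: a fixed dictionary of k = 4 terms,
# per-term document frequencies df_j, and N = 3 documents in total.
dictionary = ["cat", "dog", "mat", "log"]
df = {"cat": 1, "dog": 1, "mat": 1, "log": 1}
n_docs = 3

def doc_vector(tokens, dictionary, df, n_docs):
    """Return the k-dimensional vector with w_{i,j} = tf_{i,j} * log(N / df_j).

    tf_{i,j} is the raw count of the term in the document, so the weight
    is 0 for dictionary terms that do not appear.
    """
    return [tokens.count(term) * math.log(n_docs / df[term])
            for term in dictionary]

v = doc_vector("the cat sat on the mat".split(), dictionary, df, n_docs)
# Only "cat" and "mat" from the dictionary occur, so the "dog" and "log"
# components are 0.
```

Words outside the dictionary ("the", "sat", "on" here) are simply ignored, which is why the dictionary dimension k fixes the vector length regardless of document length.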
3. Cosine similarity
The similarity between two documents can be measured by the cosine of the angle between their vectors:
sim(d_1, d_2) = cos(theta) = (V(d_1) . V(d_2)) / (|V(d_1)| * |V(d_2)|)
That is, the dot product of the two vectors divided by the product of their lengths. Written out in components, this is:
sim(d_1, d_2) = sum_k(w_{1,k} * w_{2,k}) / sqrt(sum_k(w_{1,k}^2) * sum_k(w_{2,k}^2))
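The component form translates directly into code. A minimal sketch (the zero-vector guard is my own addition, since an all-zero document vector would otherwise divide by zero):

```python
import math

def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm1 * norm2)

# Vectors pointing the same way score close to 1.0;
# orthogonal vectors (no shared terms) score 0.0.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
disjoint = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because all TF-IDF weights are non-negative, document similarities always fall in [0, 1], which makes the score easy to use directly for ranking search results.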
Using the above, I built a simple search engine over the weekend, indexing the 430,000 web pages I had crawled earlier. Although that is not much data, some of the search results are quite reliable.