1. Preface
In natural language processing we often need to measure the similarity between two texts. We all know that text lives in a high-dimensional semantic space, so the question is how to abstract it in a way that lets us quantify its similarity from a mathematical point of view.
Once we can measure the similarity between texts, we can cluster them with partition-based methods such as K-means, density-based methods such as DBSCAN, or model-based probabilistic methods. We can also use text similarity to deduplicate a large-scale corpus, or to find related names for an entity (fuzzy matching).
There are many ways to measure the similarity of two strings: the most direct is to compare hash codes, there are the classic topic models, or we can use word vectors to abstract the texts into vector representations and then measure the Euclidean distance or Pearson correlation between the feature vectors. This article gives a brief introduction to the first step of Chinese text similarity calculation in NLP: text vectorization.
2. Textual vectorization
Whether the text is Chinese or English, we must first turn it into a form the computer can recognize. The process of turning text into such a form is called text vectorization.
Vectorization can be carried out at several levels of granularity:
- By character: in Chinese the unit is a single character; in English it can be a single word.
- By word: this requires an additional word segmentation step. Word segmentation algorithms are an important basic topic in NLP and are not covered in detail here.
- By sentence: extract the high-level semantics of a whole sentence, in short, look for a topic model. Of course, if we already have vectors for all the words in a sentence, we can also represent the sentence simply by averaging them or by other aggregation methods.
Below we mainly introduce the word-level vectorization methods, the set-of-words model, the bag-of-words model, N-gram, TF-IDF and Word2vec, as well as sentence-level topic models such as LSA, NMF, pLSA and LDA.
2.1 Set-of-words model and bag-of-words model
Both the set-of-words model and the bag-of-words model first collect all the words in the text into a dictionary (vocab), and then count word occurrences against that dictionary. The difference is:
- In the set-of-words model, if a dictionary word appears in a text, its value is set to 1, regardless of how many times it occurs.
- In the bag-of-words model, every occurrence of a dictionary word in a text adds 1 to its value, so the value equals the number of occurrences.
Both models take the independence of words as a precondition and ignore any correlation between them. This makes counting easy, but it also loses the information about the relationships between words within a text.
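Below is a minimal sketch of the difference between the two models using scikit-learn's `CountVectorizer`; the toy corpus is an assumption made up for illustration. The default setting counts occurrences like the bag-of-words model, while `binary=True` only records presence, like the set-of-words model.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love China and I love NLP", "NLP is fun"]

# Bag-of-words: each dimension holds the number of occurrences of a word.
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())    # "love" gets value 2 in the first text
print(bow.get_feature_names_out())            # the learned dictionary (vocab)

# Set-of-words: binary=True records only presence (1) or absence (0).
sow = CountVectorizer(binary=True)
print(sow.fit_transform(corpus).toarray())
```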
2.2 N-gram
N-gram is an algorithm based on the statistical language model. Its basic idea is to slide a window of size N over the text, producing a sequence of fragments of length N.
Take "I love China" as an example:
The unigram model divides it into "I", "love", "China"
The bigram model divides it into "I love", "love China"
The trigram model divides it into "I love China"
And so on. After the text has been split into N-grams, they can be processed like the bag-of-words model, counting occurrences against the vocabulary. N-grams record the relationships between adjacent units of a sentence well; the larger N is, the more complete the information that is kept, but the resulting vocabulary dimension grows exponentially. So in practice N is usually 2 or 3.
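A minimal sketch of N-gram extraction in plain Python; the tokenized sentence follows the example above.

```python
# Extract N-grams by sliding a window of size n over a token sequence.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love China".split()
print(ngrams(tokens, 1))  # ['I', 'love', 'China']      (unigram)
print(ngrams(tokens, 2))  # ['I love', 'love China']    (bigram)
print(ngrams(tokens, 3))  # ['I love China']            (trigram)
```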
2.3 TF-IDF
TF-IDF is the abbreviation of Term Frequency-Inverse Document Frequency, i.e. "word frequency-inverse document frequency". It consists of two parts, TF and IDF.
TF is the word frequency we talked about earlier: the vectorization we did before simply uses the number of occurrences of each word in a text as a feature, which is easy to understand. The key is the IDF part: how should "inverse document frequency" be understood? As noted earlier, a word like "of" appears in almost every text with a very high frequency, yet its importance should be lower than that of rarer words such as "watermelon" or "China". IDF is there to reflect the importance of a word and to correct the feature value that would otherwise be expressed by word frequency alone.
Therefore a more reasonable quantification of a word is (word frequency x word weight):
\[\text{TF-IDF}(x) = TF(x) \times IDF(x)\]
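A minimal sketch of TF-IDF weighting with scikit-learn's `TfidfVectorizer`; the corpus is a made-up assumption, and scikit-learn's default IDF is a smoothed, log-scaled version of the plain formula above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love watermelon",
    "I love China",
    "China is a big country",
]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)       # one TF-IDF vector per text
print(tfidf.get_feature_names_out())       # the learned dictionary
print(matrix.toarray().round(2))           # words common to many texts get lower weights
```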
2.4 Word2vec
Word2vec, an NLP tool released by Google in 2013, is characterized by turning every word into a vector, so that the relationship between words can be measured quantitatively and the links between them can be mined. Word2vec comes in two flavours, the CBOW and Skip-gram models.
The training input of the CBOW model is the word vectors of the context words surrounding a particular target word, and the output is that target word; fixed-dimension word vectors are learned by training a neural network. The Skip-gram model is the reverse of CBOW: the input is the center word and the output is its context.
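A minimal sketch of training CBOW word vectors with gensim's `Word2Vec` (gensim >= 4.0 parameter names are assumed); the tiny corpus is made up, and for Chinese text a word segmentation step would come first.

```python
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "china"],
    ["i", "love", "nlp"],
    ["china", "loves", "nlp"],
]
# sg=0 selects CBOW (predict the center word from its context);
# sg=1 would select Skip-gram (predict the context from the center word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["china"][:5])                 # first 5 dimensions of the word vector
print(model.wv.similarity("china", "nlp"))   # cosine similarity between two words
```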
2.5 Topic model
Imagine a problem: suppose we have two texts that express exactly the same meaning with completely different words (for example, two different Chinese expressions for "Sunday"). From the point of view of words they have no overlap, so the statistical-frequency methods above have a hard time handling them, even though we can see at first sight that their meanings are identical. A topic model can deal with this: suppose we find two hidden topics such as "holiday" and "rest", and then calculate the similarity between each text and these hidden topics.
The main topic models are the following: LSA, NMF, pLSA and LDA.
- LSA decomposes the document-word matrix by singular value decomposition as below, where \(u_{il}\) corresponds to the relatedness of the \(i\)-th text and the \(l\)-th topic, \(v_{jm}\) corresponds to the relatedness of the \(j\)-th word and the \(m\)-th word meaning, and \(\sigma_{lm}\) corresponds to the relatedness of the \(l\)-th topic and the \(m\)-th word meaning.
\[A_{m \times n} \approx U_{m \times k}\Sigma_{k \times k} V^T_{k \times n}\]
- NMF is also a matrix decomposition, but it uses a different idea: its goal is to decompose the matrix into two non-negative matrices. It is faster, and because there are none of the negative values that appear in LSA, the result is much easier to interpret.
\[A_{m \times n} \approx W_{m \times k}H_{k \times n}\]
- pLSA is also a topic model, but it can be explained from a probabilistic perspective:
  - Select a document \(d_m\) with probability \(p(d_m)\).
  - Given the selected document \(d_m\), select a latent topic \(z_k\) from its topic distribution with probability \(p(z_k|d_m)\) (i.e. \(\theta_{mz}\)).
  - Given the selected topic \(z_k\), select a word \(w_j\) from its word distribution with probability \(p(w_j|z_k)\) (i.e. \(\varphi_{zw}\)).
- The LDA model additionally considers prior knowledge of the topic probability distribution; for example, the probability that a sports topic appears in a text is higher than that of a philosophy topic, which comes from our prior knowledge. We will discuss it in detail in a later article.
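A minimal sketch of LSA (truncated SVD) and NMF topic extraction on a TF-IDF matrix with scikit-learn; the corpus and the choice of two topics are illustrative assumptions, not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF

corpus = [
    "holiday rest sunday weekend",
    "sunday weekend relax rest",
    "sports football match goal",
    "football goal team match",
]
X = TfidfVectorizer().fit_transform(corpus)

lsa = TruncatedSVD(n_components=2)       # A ~ U Sigma V^T, keeping k = 2 topics
print(lsa.fit_transform(X).round(2))     # document-topic strengths (may be negative)

nmf = NMF(n_components=2)                # A ~ W H with non-negative factors
print(nmf.fit_transform(X).round(2))     # easier to read as topic proportions
```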
3. Summary
This article mainly introduced the first step of text similarity calculation: text vectorization. After vectorization, we can compute the similarity between texts with some common distance formulas, as sketched below.
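For example, a minimal sketch of computing pairwise cosine similarity between vectorized texts with scikit-learn; the corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I love China", "I love NLP", "China is a big country"]
X = TfidfVectorizer().fit_transform(corpus)     # vectorize the texts first

print(cosine_similarity(X).round(2))            # 3 x 3 matrix of text-to-text similarities
```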
(Reprints are welcome; please indicate the source. Feel free to get in touch: [email protected])