Text Mining Model Structure
1. Word Segmentation
Word segmentation example:
The phrase 提高人民生活水平 ("improve people's living standards") contains many overlapping dictionary words, such as 提高 (improve), 人民 (people), 民生 (people's livelihood), 生活 (life), 活水 (living water), and 水平 (level), so several different segmentations are possible.
Basic word segmentation methods:
Maximum matching, maximum probability, and shortest path
1.1 Max Matching Method
Chinese word segmentation is the most fundamental task in Chinese information processing: machine translation, information retrieval, and other related applications all depend on it, so it occupies a very important position. The forward maximum matching algorithm works as follows:
Example: S1 = "计算语言学课程是三个课时" ("the computational linguistics course is three class periods"), with the maximum word length maxlen = 5 and S2 = "".
(1) S2 = ""; S1 is not empty, so the candidate substring W = "计算语言学" (the leftmost maxlen characters) is taken from S1.
(2) Look W up in the word list. It is in the list, so W is appended to S2 (S2 = "计算语言学/") and removed from S1; now S1 = "课程是三个课时".
(3) S1 is not empty, so the candidate substring W = "课程是三个" is taken from the left of S1.
(4) Look W up in the word list. It is not in the list, so the rightmost character is removed: W = "课程是三".
(5) Look W up in the word list. It is not in the list, so the rightmost character is removed: W = "课程是".
(6) Look W up in the word list. It is not in the list, so the rightmost character is removed: W = "课程".
(7) Look W up in the word list. It is in the list, so W is appended to S2 (S2 = "计算语言学/课程/") and removed from S1; now S1 = "是三个课时".
(8) S1 is not empty, so the candidate substring W = "是三个课时" is taken from the left of S1.
(9) Look W up in the word list. It is not in the list, so the rightmost character is removed: W = "是三个课".
(10) Look W up in the word list. It is not in the list, so the rightmost character is removed: W = "是三个".
(11) Look W up in the word list. It is not in the list, so the rightmost character is removed: W = "是三".
(12) Look W up in the word list. It is not in the list, so the rightmost character is removed: W = "是". W is now a single character and is taken as a word, so it is appended to S2 (S2 = "计算语言学/课程/是/") and removed from S1; now S1 = "三个课时".
......
(21) S2 = "计算语言学/课程/是/三个/课时/", and S1 = "".
(22) S1 is empty, so S2 is output as the segmentation result and the segmentation process ends.
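The original post gives a C++ implementation; as a concrete illustration, here is a minimal Python sketch of forward maximum matching instead. The word list and maxlen value are assumptions chosen to match the example above, not the author's code:

def forward_max_match(sentence, word_list, maxlen=5):
    """Forward maximum matching: repeatedly take the longest dictionary word from the left."""
    result = []
    while sentence:
        # Candidate W: at most maxlen characters from the left of the remaining string.
        w = sentence[:maxlen]
        # Shrink W from the right until it is in the word list or only one character remains.
        while len(w) > 1 and w not in word_list:
            w = w[:-1]
        result.append(w)              # a single leftover character is accepted as a word
        sentence = sentence[len(w):]
    return result

# Illustrative word list; a real segmenter would load a large lexicon.
words = {"计算语言学", "课程", "是", "三个", "课时"}
print("/".join(forward_max_match("计算语言学课程是三个课时", words)))
# prints: 计算语言学/课程/是/三个/课时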
Other matching-based word segmentation methods:
Maximum matching (MM): matches from left to right.
Reverse maximum matching (RMM): matches from right to left, in the opposite direction to MM. Experiments show that for Chinese, reverse maximum matching is more effective than forward maximum matching.
Bi-directional matching: compares the MM and RMM segmentation results to decide on the correct segmentation.
Optimum matching (OM): sorts the words in the dictionary by their frequency of occurrence in text, placing high-frequency words first and low-frequency words last, to speed up matching.
Association-backtracking method: uses association and backtracking mechanisms for matching.
1.2 Maximum Probability Word Segmentation
The basic idea: (1) a Chinese character string to be segmented may have several candidate segmentations; (2) the one with the highest probability is chosen as the segmentation result. For example:
S: 有意见分歧 ("there is a difference of opinion")
W1: 有/意见/分歧/ ("have / opinion / divergence")
W2: 有意/见/分歧/ ("intentionally / see / divergence")
By Bayes' rule, P(W|S) = P(S|W)P(W)/P(S). Here P(S|W) can be treated as a constant equal to 1, because any candidate segmentation W reproduces the sentence S exactly once the word boundaries are dropped; and P(S) is the same for every candidate segmentation, so it does not affect the comparison. Therefore, maximizing P(W|S) is equivalent to maximizing P(W).
Maximum Probability Method word segmentation example:
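As an illustration, here is a minimal Python sketch of maximum-probability segmentation: dynamic programming keeps, for every prefix of the sentence, the segmentation with the highest log-probability under a unigram model. The word probabilities below are illustrative assumptions only:

import math

def max_prob_segment(sentence, probs, maxlen=5):
    n = len(sentence)
    best = [float("-inf")] * (n + 1)   # best log-probability of any segmentation of sentence[:i]
    best[0] = 0.0
    prev = [0] * (n + 1)               # start index of the last word on the best path to i
    for i in range(1, n + 1):
        for j in range(max(0, i - maxlen), i):
            word = sentence[j:i]
            if word in probs and best[j] + math.log(probs[word]) > best[i]:
                best[i] = best[j] + math.log(probs[word])
                prev[i] = j
    # Recover the best segmentation by walking back through prev.
    words, i = [], n
    while i > 0:
        words.append(sentence[prev[i]:i])
        i = prev[i]
    return list(reversed(words))

probs = {"有": 0.018, "有意": 0.0005, "意见": 0.001, "见": 0.01, "分歧": 0.0002}
print("/".join(max_prob_segment("有意见分歧", probs)))
# prints: 有/意见/分歧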
1.3 Shortest Path Segmentation
Basic idea: on the word graph, select the path with the fewest words.
Advantage: performs better than one-directional maximum matching. For example, for 独立自主和平等互利的原则 ("the principle of independence, autonomy, equality and mutual benefit"):
Maximum matching: 独立自主/和平/等/互利/的/原则 (6 words)
Shortest path: 独立自主/和/平等互利/的/原则 (5 words)
Disadvantage: most ambiguities cannot be resolved.
For example, 结合成分子时 ("when combining into molecules") is segmented as 结合/成分/子时, even though this is not the intended reading.
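The shortest-path idea can be sketched with the same kind of dynamic program, using a cost of 1 per word; the word list below is an illustrative assumption and reproduces the 5-word segmentation from the example above:

def shortest_path_segment(sentence, word_list, maxlen=5):
    n = len(sentence)
    best = [float("inf")] * (n + 1)    # fewest words needed to cover sentence[:i]
    best[0] = 0
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - maxlen), i):
            w = sentence[j:i]
            # Single characters are always allowed, so a path always exists.
            if (w in word_list or len(w) == 1) and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    words, i = [], n
    while i > 0:
        words.append(sentence[prev[i]:i])
        i = prev[i]
    return list(reversed(words))

words = {"独立自主", "和平", "平等互利", "和", "等", "互利", "的", "原则"}
print("/".join(shortest_path_segment("独立自主和平等互利的原则", words)))
# prints: 独立自主/和/平等互利/的/原则 (5 words)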
2. Document Model
There are three models: Boolean model, vector space model, and probability model.
2.1 Boolean Model
The Boolean model is based on classical set theory and Boolean algebra. Each word's weight in a document is either 0 or 1 depending on whether the word appears in the document, and document retrieval is determined by Boolean logic operations on these values.
Advantages: simple, easy to understand, and easy to formalize.
Disadvantages: only exact matching is supported, and the information need cannot be expressed with fine-grained (weighted) terms.
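As a minimal illustration of the 0/1 weights and Boolean operations (the documents and query below are toy assumptions):

# Each document is reduced to the set of words it contains (weight 1 = present, 0 = absent).
docs = {
    "d1": {"text", "mining", "model"},
    "d2": {"boolean", "model", "retrieval"},
    "d3": {"vector", "space", "model"},
}

def boolean_query(docs, must_have, must_not_have=()):
    """Return documents containing all words in must_have (AND) and none in must_not_have (NOT)."""
    hits = []
    for name, words in docs.items():
        if set(must_have) <= words and not (set(must_not_have) & words):
            hits.append(name)
    return hits

# Query: model AND retrieval AND NOT vector
print(boolean_query(docs, ["model", "retrieval"], ["vector"]))   # prints: ['d2']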
2.2 Vector Space Model (VSM)
In the vector space model, the document is expressed as a vector, which is considered as a point in the vector space.
(1) Word weight
The contribution of each word in a sentence is different when determining the meaning of the sentence, that is, the weight of each word is different. For example, in the following sentence:
"Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north ."
Important Words: butterflies, monarchs, scientists, compass
Unimportant words: Most, think, kind, sky
Word weight is a measure that reflects the importance of each word.
(2) Word Frequency (TF)
The more often a word appears in a text, the greater the word's contribution to describing the meaning of that text. The word weight can therefore be computed from the term frequency, for example as the raw frequency or as the frequency normalized by the count of the most frequent word in the document.
(3) Inverse Document Frequency (IDF)
Generally, the more documents a word appears in, the smaller that word's contribution to any particular document, that is, the weaker its ability to discriminate between documents. Inverse document frequency (IDF) is used to measure this notion. First define another concept, document frequency (DF): the number of documents that contain the word. The inverse document frequency is then typically computed as idf = log(N / DF), where N is the total number of documents.
Sometimes a variant of this formula is used so that the IDF value falls within the range [0, 1].
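A minimal Python sketch of TF and IDF on a toy collection; the particular formulas (term frequency normalized by the most frequent word, and idf = log(N / DF)) are the common choices mentioned above, and the documents are illustrative assumptions:

import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

N = len(docs)
df = Counter()                 # document frequency: number of documents containing each word
for d in docs:
    df.update(set(d))

def tf_idf(doc):
    counts = Counter(doc)
    max_freq = max(counts.values())
    weights = {}
    for word, freq in counts.items():
        tf = freq / max_freq                # normalized term frequency
        idf = math.log(N / df[word])        # inverse document frequency
        weights[word] = tf * idf
    return weights

# Words occurring in many documents (e.g. "sat", "on") get lower weights than rarer ones ("cat", "mat").
print(tf_idf(docs[0]))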
The VSM is easy to compute and represents word weights in a straightforward way. Its disadvantage is that it must assume that words are independent of one another.
2.3 Probability Model
The probabilistic retrieval model is another commonly used information retrieval model. It computes the similarity between a document and a query from probabilities. Typically the retrieval units (terms) are used as clues: from statistics we obtain, for each retrieval unit, the probability of its occurrence (and non-occurrence) in the set of documents relevant to a query and in the set of documents irrelevant to the query, and these probability values are then combined to compute the similarity between the document and the query. Suppose document D contains t retrieval units with weights (ω1, ω2, ..., ωt), where ωi is the weight of the i-th retrieval unit. It can be understood as the "contribution" that the occurrence of that retrieval unit makes to the match between document D and query Q; the similarity between D and Q is then a combination of the contributions of the t retrieval units contained in D.
In information retrieval research, some assumptions are usually made for the probabilistic retrieval model to keep the computation tractable. For example, we assume that the retrieval units are distributed independently of one another in the relevant documents, and likewise independently in the irrelevant documents. This assumption does not match reality exactly: for instance, "China" and "Beijing" often appear in a document together, so the two retrieval units can hardly be considered independent. Modelling such dependencies, however, would make the probability computation very complex, so in practice the independence assumption is kept. Experience shows that, despite this shortcoming, the probabilistic retrieval model can still achieve fairly satisfactory retrieval results.
Specifically, under the independence assumption, we consider both the probability that a retrieval unit appears in documents and the probability that it does not. For a given query Q and a retrieval unit ωi, its weight wi can be defined as
wi = log [ r (N - n - R + r) / ((R - r)(n - r)) ]
where
N: total number of documents in the document set;
R: total number of documents relevant to query Q;
n: total number of documents containing the retrieval unit ωi;
r: number of documents relevant to Q that contain the retrieval unit ωi.
Because the information available (for example, from a training set) is never complete, the formula above is usually corrected by adding 0.5 to each term:
wi = log [ (r + 0.5)(N - n - R + r + 0.5) / ((R - r + 0.5)(n - r + 0.5)) ]
Now that we have the weight of each retrieval unit, the next step is to use these weights to compute the similarity between a document and a query. Given the independence assumption, the similarity can simply be taken as the product of the weights of the retrieval units that occur, i.e.
SC(Q, D) = log ∏i wi = Σi log wi
So far we have discussed only the most basic form of the probabilistic retrieval model. In practice it is considerably more complex: the weight of a retrieval unit usually also takes into account the frequency of the retrieval unit in the document (TF), its frequency in the query (QTF), and the document length (DL). BM25 is a common retrieval algorithm of this kind, proposed by Robertson at TREC-3 in 1994. BM25 computes the similarity between a document D and a query Q as follows: for each retrieval unit ωi in Q, three weights are associated with it:
u = (k2 + 1) qtf / (k2 + qtf), where k2 is a user-specified parameter and qtf (within-query term frequency) is the frequency of the retrieval unit ωi in Q.
v = (k + 1) tf / (k (1 - b + b L) + tf), where k and b are user-specified parameters, tf (term frequency) is the frequency of the retrieval unit ωi in D, and L is the normalized document length, i.e. the length of the document divided by the average document length in the document set.
w is the 0.5-corrected weight given above. The BM25 score of query Q and document D is then SC(Q, D) = Σi ui vi wi.
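A minimal Python sketch of this BM25 scoring, SC(Q, D) = Σ u·v·w. The toy collection and parameter values are illustrative assumptions; R and r are set to 0, as is usual when no relevance information is available (w then reduces to an IDF-like weight):

import math

def bm25_score(query_terms, doc_terms, doc_count, doc_freq, avg_dl,
               k=1.2, b=0.75, k2=100.0, R=0, r=0):
    L = len(doc_terms) / avg_dl                       # normalized document length
    score = 0.0
    for term in set(query_terms):
        n = doc_freq.get(term, 0)                     # number of documents containing the term
        tf = doc_terms.count(term)                    # term frequency in the document
        qtf = query_terms.count(term)                 # term frequency in the query
        # w: the 0.5-corrected weight from the probability model above.
        w = math.log((r + 0.5) * (doc_count - n - R + r + 0.5)
                     / ((R - r + 0.5) * (n - r + 0.5)))
        v = (k + 1) * tf / (k * (1 - b + b * L) + tf) # document-side weight
        u = (k2 + 1) * qtf / (k2 + qtf)               # query-side weight
        score += u * v * w
    return score

docs = [["text", "mining", "model"], ["boolean", "retrieval", "model"], ["vector", "space"]]
doc_freq = {"text": 1, "mining": 1, "model": 2, "boolean": 1, "retrieval": 1, "vector": 1, "space": 1}
avg_dl = sum(len(d) for d in docs) / len(docs)
for d in docs:
    print(d, bm25_score(["text", "mining"], d, len(docs), doc_freq, avg_dl))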
3. Calculation of Similarity Between Texts
3.1 Relevance Based on the Probability Model
wi = log [ r (N - n - R + r) / ((R - r)(n - r)) ]
SC(Q, D) = log ∏i wi = Σi log wi
See the probability model above.
3.2 Relevance Based on the VSM
Commonly used measures based on the vector space model are the Euclidean distance, the vector inner product, and the cosine of the vector angle; a sketch of these measures follows the list below.
(1) Euclidean distance
(2) Vector inner product
(3) Cosine of the vector angle
(4) Jaccard similarity
(5) Comparison of several methods based on the vector inner product
(6) Several methods based on set computation
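A minimal Python sketch of these measures on two toy term-weight vectors (the vectors are illustrative assumptions):

import math

def inner_product(x, y):
    return sum(a * b for a, b in zip(x, y))

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    # Inner product normalized by the two vector lengths.
    return inner_product(x, y) / (math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y)))

def jaccard(x, y):
    # Set-based: compare the sets of terms with non-zero weight.
    a = {i for i, v in enumerate(x) if v > 0}
    b = {i for i, v in enumerate(y) if v > 0}
    return len(a & b) / len(a | b)

d1 = [0.5, 0.8, 0.0, 0.3]   # term weights of document 1
d2 = [0.4, 0.0, 0.9, 0.3]   # term weights of document 2
print(euclidean_distance(d1, d2), inner_product(d1, d2), cosine(d1, d2), jaccard(d1, d2))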
4. Feature Space Transformation
The main difficulty in machine learning over text is the gap between the lexical expressions in the description and the semantics they are meant to express. The main causes of this problem are: 1. polysemy: a word may have several meanings and usages; 2. synonymy: depending on the context and other factors, different words may express the same meaning. LSA is a well-known technique for dealing with such problems. Its main idea is to map high-dimensional vectors into a latent semantic space, reducing the dimensionality. Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), is a mathematical and statistical method that extracts the words in a text, infers the semantic relationships between them, builds a semantic index, and organizes the documents into a semantic space structure. Its starting point is that there are latent semantic relations between the feature terms of documents; by eliminating the correlations between words, it simplifies the text vectors. It maps feature terms and documents into the same semantic space through SVD: the document matrix is decomposed, the k largest singular values are kept, and the result approximately represents the original documents. The mapping is strictly linear and is based on the singular value decomposition of the co-occurrence (term-document) table.
Problem: polysemy (one word with several meanings) and synonymy
Central idea: replace words with concepts (or features)
Basic method: apply the singular value decomposition (SVD) technique from matrix theory to the word frequency matrix, keeping the k × k matrix of the largest singular values.
4.1 Singular Value Decomposition
Eigenvalue decomposition is a good way to extract the features of a matrix, but it applies only to square matrices. In the real world, most matrices we encounter are not square: for example, if there are n students, each with m scores, the resulting n × m matrix is in general not square. How can we describe the important features of such a general matrix? Singular value decomposition does exactly this; it is a method that can be applied to any matrix:
Suppose A is an n × m matrix; its singular value decomposition is A = U Σ V^T. Here U is an n × n matrix whose column vectors are orthonormal (they are called the left singular vectors); Σ is an n × m matrix whose entries are all zero except on the diagonal (the diagonal entries are called the singular values); and V^T (V transposed) is an m × m matrix whose vectors are also orthonormal (the columns of V are called the right singular vectors).
How do singular values relate to eigenvalues? First form the square matrix A^T A and compute its eigenvalues and eigenvectors: (A^T A) v_i = λ_i v_i. The vectors v_i obtained in this way are exactly the right singular vectors mentioned above. In addition, σ_i = sqrt(λ_i) and u_i = A v_i / σ_i, where σ_i are the singular values and u_i the left singular vectors.
The singular values σ behave much like eigenvalues: in the matrix Σ they are arranged from large to small, and they decrease very quickly. In many cases, the sum of the largest 10% or even 1% of the singular values accounts for more than 99% of the sum of all singular values. In other words, we can use the largest r singular values to describe the matrix approximately. This gives the truncated (partial) singular value decomposition:
A(n×m) ≈ U(n×r) Σ(r×r) V^T(r×m)
Here r is a number much smaller than m and n, so the product of the three matrices on the right is a matrix close to A, and the closer r is to n, the closer the product is to A. The combined size of these three matrices (in storage terms, the smaller the matrices, the less space they need) is much smaller than that of the original matrix A, so if we want a compressed representation of A, we can store the three matrices U, Σ, and V instead.
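A minimal numpy sketch of the decomposition and its rank-r approximation on a small random matrix (the matrix and r are illustrative); it also checks the relation between the singular values of A and the eigenvalues of A^T A described above:

import numpy as np

np.random.seed(0)
A = np.random.rand(6, 4)                       # a general (non-square) n x m matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The singular values equal the square roots of the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]    # eigenvalues sorted in descending order
print(np.allclose(s, np.sqrt(eigvals)))        # prints: True

# Rank-r approximation: keep only the r largest singular values.
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
print(np.linalg.norm(A - A_r))                 # the approximation error shrinks as r grows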
4.2 Latent Semantic Analysis (LSA)
Input: Term-by-document Matrix
Output:
U: Concept-by-term Matrix
V: Concept-by-document Matrix
S: Elements assign weights to concepts
Procedure
1. Create a word frequency matrix
2. Compute the singular value decomposition of the frequency matrix
Decompose the frequency matrix into three matrices U, S, and V, where U and V are orthogonal (U^T U = I, V^T V = I) and S is the k × k diagonal matrix of singular values.
3. For each document d, replace its original vector with a new vector in the reduced space, dropping the dimensions eliminated by the SVD.
4. Index the converted documents and compute similarities in the reduced space.
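A minimal numpy sketch of this procedure on a toy term-by-document matrix (the matrix and k are illustrative assumptions): decompose it, keep the k largest singular values, and compare documents in the reduced concept space:

import numpy as np

# Toy term-by-document frequency matrix: rows = terms, columns = documents.
X = np.array([
    [1, 1, 0, 0],   # "stock"
    [1, 1, 0, 0],   # "market"
    [0, 1, 1, 0],   # "investing"
    [0, 0, 1, 1],   # "real"
    [0, 0, 1, 1],   # "estate"
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                        # number of latent concepts kept
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]     # each column: one document in concept space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(doc_vectors[:, 0], doc_vectors[:, 1]))   # high: documents 0 and 1 share vocabulary
print(cos(doc_vectors[:, 0], doc_vectors[:, 3]))   # much lower: documents 0 and 3 do not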
Earlier, Mr. Wu discussed matrix computation and text classification:
"The three matrices have very clear physical meanings. Each row of the first matrix X represents a class of words that are related in meaning, and each non-zero element in it represents the importance (or relevance) of one word in that class; the larger the value, the more relevant. Each column of the last matrix Y represents a class of articles on the same topic, and each element represents the relevance of one article to that class. The matrix in the middle represents the correlation between the word classes and the article classes. Therefore, we only need to perform one singular value decomposition of the correlation matrix A to obtain both the synonym (word) classification and the document classification, as well as the correlation between each class of articles and each class of words."
The passage above may not be easy to follow, but it is the essence of LSI. The example below comes from an LSA tutorial; the specific URL is given in the references at the end:
This is again a term matrix, but organized differently: here each row represents a word (the one-dimensional features mentioned earlier) and each column represents a title (document). (This matrix is in fact the transpose of the sample-by-feature form used earlier; this swaps the meaning of the left and right singular vectors but does not affect the computation.) For example, title T1 contains the words guide, investing, market, and stock, each appearing once. Applying SVD to this matrix yields the following matrices:
The left singular vectors represent features of the words, the right singular vectors represent features of the documents, and the singular value matrix in the middle indicates how important a row of the left singular vectors is with respect to the corresponding column of the right singular vectors; the larger the value, the more important.
Looking at these matrices, we can notice some interesting things. First, the first column of the left singular vectors roughly tracks how frequently each word occurs; the relationship is not linear, but it can be taken as a rough description. For example, book (0.15) appears in 2 of the titles, investing (0.74) appears in 9, and rich (0.36) appears in 3.
Second, the first row of the right singular vectors roughly tracks the number of indexed words appearing in each document: for example, T6 (0.49) contains 5 of the words, while T2 (0.22) contains 2.
Next, we can keep only the last two of the three dimensions of the left singular vectors and of the right singular vectors (they were three-dimensional matrices) and project them onto a plane, obtaining:
In this plot, each red point represents a word and each blue point represents a document. We can then cluster the words and the documents: for example, stock and market can be put in one class because they almost always appear together, and real and estate can be put in another; dads and guide look somewhat isolated, so we do not merge them. From such clusters, the synonyms in the document collection can be extracted, so that when users retrieve documents the search works at the semantic level (synonym sets) rather than at the level of individual words as before. On the one hand this reduces retrieval and storage cost, because the document collection is compressed in the same spirit as PCA; on the other hand it improves the user experience: for a query word we can search within its synonym set, which is not possible with a traditional index.
(Sina Weibo: @ quanliang _ machine learning)
References:
http://www.cnblogs.com/LeftNotEasy/archive/2011/01/19/svd-and-applications.html
Yang Jianwu, Text Mining courseware, Computer College, Peking University