NLP | Natural Language Processing: Tagging Problems and Hidden Markov Models
What is tagging? A common task in natural language processing is tagging (annotation). (1) Part-of-speech tagging: mark each word in a sentence with its part of speech, such as noun or verb. (2) Named entity tagging: mark the special words in a sentence, such as addresses, dates, and names of people.
Here is a part-of-speech tagging case: given an input sentence, the computer automatically marks the part of speech of each word.
Here is a named entity tagging case: given an input sentence, the computer automatically marks the entity category of each special word.
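As a concrete illustration (the sentence, tag names, and entity labels below are invented for this sketch, not taken from the original article's figures), the two kinds of output look roughly like this:

```python
# Illustrative example only: the sentence, POS tags, and entity labels are
# invented for this sketch, not taken from the original article's figures.

# Part-of-speech tagging: every word gets a part-of-speech tag.
pos_tagged = [("Alice", "NOUN"), ("visited", "VERB"),
              ("Paris", "NOUN"), ("yesterday", "ADV")]

# Named entity tagging: only the special words get an entity category,
# the rest are marked "O" (outside any entity).
entity_tagged = [("Alice", "PERSON"), ("visited", "O"),
                 ("Paris", "LOCATION"), ("yesterday", "DATE")]

for word, tag in pos_tagged:
    print(f"{word}/{tag}")
```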
Even at a rough glance, this is not a simple problem. First, a word can have several meanings (and parts of speech), and which one applies depends on the situation. Second, the meaning or part of speech of a word is also influenced by the words around it.
Before looking for a solution, we had better describe the problem in mathematical language. A sentence can be regarded as a sequence: suppose the sentence s has n words in total, and the i-th word is x_i, so s = x_1, x_2, ..., x_n. The problem can then be described as: for each word x_i we need to assign a tag y_i, which gives the tag sequence y = y_1, y_2, ..., y_n for the sentence.
To sum up, when training the model we want, for any sentence s, the probability p(y | s) of every possible tag sequence y; the y with the highest probability is the result we need. The final expression is tagging(s) = arg max_y p(y | s).
Next, we need to consider how to build a training set and learn the above model. First, we need a corpus that has already been tagged: it contains a number of sentences, and every word in every sentence carries a tag. From the corpus we can then estimate, for each sentence s and its tag sequence y, the joint probability p(y, s), i.e. the probability that the sentence and its tags occur together. Second, because the corpus cannot contain all possible sentences, we want a more general formulation. By the chain rule, p(y, s) = p(y) * p(s | y), and by Bayes' formula, p(y | s) = p(y) * p(s | y) / p(s). Since we only need to compare values of p(y | s) and take the maximum, the specific value of p(s) does not matter, so we only need to consider tagging(s) = arg max_y p(y) * p(s | y).
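A minimal sketch of this decision rule, assuming hypothetical placeholder functions prior(y) for p(y) and likelihood(s, y) for p(s | y) (neither is defined in the original); the brute-force enumeration here only makes the arg max explicit, and the efficient dynamic-programming version appears later in the article:

```python
from itertools import product

def tagging(sentence, tagset, prior, likelihood):
    """Return arg max_y p(y) * p(s | y) by enumerating every tag sequence.

    `prior` and `likelihood` are hypothetical placeholders for p(y) and
    p(s | y); `tagset` is the set of possible tags.
    """
    best_y, best_score = None, -1.0
    for y in product(tagset, repeat=len(sentence)):  # all candidate tag sequences
        score = prior(y) * likelihood(sentence, y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```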
Because the corpus cannot contain all possible sentences, we must find a way to estimate p(y) and p(s | y). One of the most famous methods is the hidden Markov model.
The hidden Markov model. Returning to the problem above: given a sentence s = x_1, x_2, ..., x_n, we want the tag sequence y = y_1, y_2, ..., y_n such that y = arg max p(y) * p(s | y) = arg max p(x_1, x_2, ..., x_n, y_1, y_2, ..., y_n).
As with the language model in the previous chapter, we make two adjustments to each sequence: 1) add a start symbol "*" and define that every sequence begins with it, so that the tags before the sentence are y_{-1} = y_0 = *; 2) add an end symbol "STOP" and define that every sequence ends with it.
At the same time, the hidden Markov model requires some extra assumptions to simplify the model: 1) the tag y_k depends only on the few tags immediately before it, i.e. a Markov assumption on the tag sequence; 2) each word x_k depends only on its own tag y_k, i.e. the emission probabilities p(x_i | y_i) are independent of the other words and tags.
After these simplifications, take the trigram hidden Markov model as an example. The expression is p(x_1, x_2, ..., x_n, y_1, y_2, ..., y_n) = p(y_1, y_2, ..., y_n) * p(x_1, x_2, ..., x_n | y_1, y_2, ..., y_n) = ∏_j q(y_j | y_{j-2}, y_{j-1}) * ∏_i e(x_i | y_i). Clearly, in the simplified model, individual words and short tag trigrams appear in the corpus far more often than whole sentences do, so these factors can be estimated reliably.
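A minimal sketch of this factorization, assuming the parameters q and e have already been estimated and are stored as plain dictionaries (a hypothetical data layout, not specified in the original):

```python
def joint_probability(words, tags, q, e, start="*", stop="STOP"):
    """Compute p(x_1..x_n, y_1..y_n) for a trigram HMM.

    `q` maps (y_{j-2}, y_{j-1}, y_j) -> transition probability and
    `e` maps (tag, word) -> emission probability; both are assumed given.
    """
    padded = [start, start] + list(tags) + [stop]
    p = 1.0
    # transition terms q(y_j | y_{j-2}, y_{j-1}), including the final step into STOP
    for j in range(2, len(padded)):
        p *= q.get((padded[j - 2], padded[j - 1], padded[j]), 0.0)
    # emission terms e(x_i | y_i), one per word
    for word, tag in zip(words, tags):
        p *= e.get((tag, word), 0.0)
    return p
```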
With the hidden Markov model in place, all we need to do is estimate the parameters q(y_j | y_{j-2}, y_{j-1}) and e(x_i | y_i). Estimating q(y_j | y_{j-2}, y_{j-1}) was explained in detail in the previous chapter on language models, and e(x_i | y_i) can be obtained simply by counting how often each word appears with each tag in the corpus. However, if a word does not appear in the corpus at all, then e(x_i | y_i) = 0, which drives the probability of the whole sentence to 0. To solve this problem, we can adopt a simple scheme:
1) First, divide all words in the corpus into frequent and infrequent words, using a count threshold; 2) for frequent words, estimate e(x_i | y_i) directly from corpus statistics; 3) map the infrequent words into a small number of groups according to predefined rules, and estimate e(x_i | y_i) from the word frequencies of the group.
Common grouping rules map, for example, four-digit numbers, words containing digits, all-capital abbreviations, and capitalized words each to their own pseudo-word class (the original article shows a figure of such rules here). This method works well for special words such as dates, names, and abbreviations.
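A minimal sketch of this scheme; the threshold and the grouping rules below are illustrative assumptions, not the original article's rules:

```python
from collections import Counter

RARE_THRESHOLD = 5  # illustrative cut-off between frequent and infrequent words

def pseudo_word(word):
    """Map an infrequent word to a coarse group using simple, invented rules."""
    if word.isdigit() and len(word) == 4:
        return "_FOUR_DIGIT_NUM_"       # e.g. the year in a date
    if any(ch.isdigit() for ch in word):
        return "_CONTAINS_DIGIT_"
    if word.isupper():
        return "_ALL_CAPS_"             # e.g. abbreviations
    if word[:1].isupper():
        return "_INIT_CAP_"             # e.g. names
    return "_RARE_"

def estimate_emissions(tagged_sentences):
    """Estimate e(x | y) from a corpus given as a list of [(word, tag), ...] sentences.

    Frequent words are counted as themselves; infrequent words are first
    replaced by their pseudo-word group, so an emission probability is not
    zero just because a particular rare word was never seen with a tag.
    """
    word_counts = Counter(w for sent in tagged_sentences for w, _ in sent)
    pair_counts, tag_counts = Counter(), Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            if word_counts[word] < RARE_THRESHOLD:
                word = pseudo_word(word)
            pair_counts[(tag, word)] += 1
            tag_counts[tag] += 1
    return {(tag, word): c / tag_counts[tag] for (tag, word), c in pair_counts.items()}
```

At tagging time, any input word whose corpus count falls below the threshold would be passed through the same pseudo_word mapping before looking up e(x | y).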
Decoding and algorithm complexity. Suppose we have trained the model and obtained q(y_j | y_{j-2}, y_{j-1}) and e(x_i | y_i). Given a sentence s = x_1, x_2, ..., x_n, how do we find y = y_1, y_2, ..., y_n? Method 1: brute force. Enumerate all possible combinations y_1, y_2, ..., y_n, compute the probability of each, and take the one with the maximum probability. Obviously, the time complexity of brute force, exponential in the sentence length, is unacceptable. Method 2: dynamic programming. Define the dynamic-programming quantity m(k, u, v), where k is the position and u, v are the tags of the last two words among the first k words. The recurrence is m(k, u, v) = max_w ( m(k-1, w, u) * q(v | w, u) * e(x_k | v) ). There are many problems on LeetCode that illustrate this style of dynamic programming.
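A minimal sketch of this dynamic program (commonly known as the Viterbi algorithm for trigram HMMs), assuming q and e are dictionaries in the same hypothetical format as the earlier sketches:

```python
def viterbi(words, tagset, q, e, start="*", stop="STOP"):
    """Return the most probable tag sequence under a trigram HMM.

    m[(k, u, v)] is the best probability of any tagging of the first k words
    that ends with tags u, v; bp stores back-pointers to recover the sequence.
    Unseen transitions/emissions get probability 0.0 here, so in practice the
    rare-word grouping above (or smoothing) is needed to avoid all-zero scores.
    """
    n = len(words)
    tags = lambda k: [start] if k <= 0 else list(tagset)  # positions before the sentence hold "*"
    m = {(0, start, start): 1.0}
    bp = {}
    for k in range(1, n + 1):
        for u in tags(k - 1):
            for v in tags(k):
                best_w, best_p = None, 0.0
                for w in tags(k - 2):
                    p = (m.get((k - 1, w, u), 0.0)
                         * q.get((w, u, v), 0.0)
                         * e.get((v, words[k - 1]), 0.0))
                    if p > best_p:
                        best_w, best_p = w, p
                m[(k, u, v)] = best_p
                bp[(k, u, v)] = best_w
    # choose the best final tag pair, including the transition into STOP
    best_u, best_v, best_p = None, None, 0.0
    for u in tags(n - 1):
        for v in tags(n):
            p = m.get((n, u, v), 0.0) * q.get((u, v, stop), 0.0)
            if p > best_p:
                best_u, best_v, best_p = u, v, p
    # follow the back-pointers to recover y_1 .. y_n
    y = [None] * (n + 1)
    y[n], y[n - 1] = best_v, best_u
    for k in range(n - 2, 0, -1):
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return y[1:]
```

Each step only looks at the previous two tags, so the running time is O(n * |T|^3) for a tag set of size |T|, instead of the |T|^n cost of brute force.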