Code download: Supervised Part-of-Speech Tagging Based on the Hidden Markov Model

Part-of-speech tagging (POS tagging) means assigning the proper part of speech to each word in a sentence, that is, determining whether each word is a noun, verb, adjective, or another part of speech. It is also known as word-class tagging. Part-of-speech tagging is a basic task in natural language processing and plays an important role in applications such as speech recognition and information retrieval.

In essence, part-of-speech tagging is a classification problem: for every word w in a sentence, find the proper part-of-speech category t. Unlike plain classification, however, part-of-speech tagging is judged by the overall labeling quality, so the tag sequence of the entire sentence is assigned as a whole. Many existing mathematical models and frameworks can be used for this, such as the HMM, the maximum entropy model, the conditional random field (CRF), and the SVM. In this blog, we introduce the Hidden Markov Model (HMM).

1 Hidden Markov Model (HMM)

What is a hidden Markov model (HMM)? Put bluntly, it is a mathematical model described by a handful of symbols and parameters: a set of hidden states, a set of observed states, an initial probability vector π, a state transition matrix A, and a confusion matrix B.

The HMM example on Wikipedia introduces the basic concepts and problems of HMMs in an accessible way; if this is your first contact with HMMs, it is worth a look. The Hidden Markov Models website introduces HMMs in more detail. Here, we use the examples and diagrams from that website to introduce HMMs further.

Imagine a scene where a poet is persecuted by the authorities and thrown into a dungeon. In the dark underground dungeon, the poet does not want to sit idle, so he writes poems on the wall every day to express his emotions. One day, he finds some moss in a corner of the dungeon. Discovering another form of life in the lifeless dungeon pleases him greatly, and he talks to the moss every day. A few days later, he notices that the moss is sometimes wet and sometimes dry, and he guesses that this may be related to the unseen weather outside.

Based on the above description, we can construct an HMM and use the state of the moss to predict the weather. First, the weather is unknown and can only be inferred, so it is the hidden state of the HMM. To simplify matters, assume that there are only three weather conditions outside the dungeon where the poet is held: sun, cloud, and rain, as shown in Figure 1.

Figure 1 Weather state transitions

State transitions can be deterministic or non-deterministic. The transitions of a traffic light are deterministic: after the red light, we know for certain that the next state is green. Weather changes, however, are non-deterministic: if today is sunny, we cannot be sure what tomorrow's weather will be (even a very accurate forecast cannot tell us tomorrow's weather with 100% certainty). Non-deterministic state transitions are described with a probability model. Figure 2 shows the state transition matrix of the weather outside the dungeon:

Figure 2 Transition matrix of weather conditions

The matrix above is row stochastic: the probabilities in each row sum to 1. This means that whatever the weather was yesterday, today's weather is certainly one of (sun, cloud, rain); only the probability of each differs. If there are N hidden states, the state transition matrix is an N × N matrix, usually called A.

In addition, we need a prior probability for each weather condition, that is, the probabilities of the three kinds of weather obtained from long-term statistics outside the dungeon. This vector is usually called π. Assume that each of (sun, cloud, rain) has a known prior probability.

Now we have two of the HMM's parameters; the one describing the observed states is still missing. In the dungeon, the poet can observe only the state of the moss. To simplify matters, assume that the moss has only four states: soggy, damp, dryish, and dry. The observed states are related to the hidden weather, as shown in Figure 3. Each hidden weather condition may produce any of the four moss states, but with different probabilities. To describe these probabilities, we introduce a confusion matrix, also known as the emission matrix. It gives the probability of each moss state under each weather condition, as shown in Figure 4.

Figure 3 Relationship between weather and observation

The confusion matrix is the third parameter of the HMM, usually called B. If there are M observable states, the confusion matrix is an N × M matrix in which each row sums to 1, indicating that under any given weather condition the moss must be in one of the states (soggy, damp, dryish, dry).

Figure 4 Confusion matrix of the HMM
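The three parameters introduced so far can be held in plain arrays. The following is only a minimal sketch: the class name `WeatherHmm` is hypothetical, and every number in it is a made-up placeholder rather than a value taken from the figures. It also shows a simple check of the row-stochastic property described above.

```java
/** Illustrative container for the three HMM parameters (pi, A, B). */
class WeatherHmm {
    // Hidden states: 0 = sun, 1 = cloud, 2 = rain.
    // Observations:  0 = dry, 1 = dryish, 2 = damp, 3 = soggy.
    // All numbers are placeholders chosen for illustration only.
    static final double[] PI = {0.6, 0.2, 0.2};
    static final double[][] A = {
        {0.5, 0.3, 0.2},
        {0.3, 0.3, 0.4},
        {0.2, 0.4, 0.4}
    };
    static final double[][] B = {
        {0.60, 0.20, 0.15, 0.05},
        {0.25, 0.25, 0.25, 0.25},
        {0.05, 0.10, 0.35, 0.50}
    };

    /** Returns true when every row sums to 1 within a small tolerance. */
    static boolean isRowStochastic(double[][] m) {
        for (double[] row : m) {
            double sum = 0.0;
            for (double p : row) {
                sum += p;
            }
            if (Math.abs(sum - 1.0) > 1e-9) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("A row stochastic: " + isRowStochastic(A));
        System.out.println("B row stochastic: " + isRowStochastic(B));
    }
}
```

Here A is N × N and B is N × M with N = 3 and M = 4, matching the dimensions given in the text.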

The entire HMM is defined by the three parameters above and can be written as the triple (π, A, B). Once these three parameters are known, the HMM is fully specified. An HMM can be used to solve three problems:

- Given a model, how do we calculate the probability of a specific observation sequence?
- Given a model and a specific observation sequence, how do we find the sequence of hidden states most likely to have generated it?
- Given sufficient observed data, how do we estimate the three parameters of the HMM?

Speech recognition is mainly concerned with the first and third problems, while part-of-speech tagging is mainly concerned with the second and third. Solving the first problem makes it possible to select, among several HMMs, the one under which the observations are most probable. In speech recognition, for example, an HMM is built for each word, and an utterance is recognized as the word whose HMM gives it the highest probability. Solving the second problem yields the most likely hidden state sequence for an observed sequence; this is the problem that part-of-speech tagging solves. The third problem, training the model parameters, matters to everyone who uses HMMs, but it is also the most difficult. The three parameters of an HMM are not created out of thin air; they are trained.

The first problem can be solved efficiently with the forward algorithm. The second problem is solved with the Viterbi algorithm. The third problem can be solved in two ways: supervised or unsupervised. Supervised training obtains the parameters from statistics over a labeled training set. Unsupervised training obtains them iteratively with the Baum-Welch algorithm, which is very difficult. Here we introduce supervised part-of-speech tagging, in which the HMM parameters are obtained from corpus statistics.
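The forward algorithm mentioned for the first problem is not shown in this blog, so here is a minimal sketch of its standard form. The class and method names are hypothetical, states and observations are assumed to be encoded as array indices, and the observation sequence is assumed to be non-empty.

```java
/** Sketch of the forward algorithm: P(observations | pi, A, B). */
class ForwardSketch {
    static double likelihood(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length;
        // alpha[i] = probability of the observations so far, ending in state i
        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) {
            alpha[i] = pi[i] * b[i][obs[0]];
        }
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    sum += alpha[j] * a[j][i]; // arrive in state i from any state j
                }
                next[i] = sum * b[i][obs[t]];
            }
            alpha = next;
        }
        double total = 0.0;
        for (double v : alpha) {
            total += v;
        }
        return total;
    }
}
```

Each time step costs on the order of N² operations, which is what makes the computation fast compared with summing over every possible hidden sequence.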

2 Part-of-Speech Tagging

The purpose of part-of-speech tagging is to first segment a given sentence into words and then assign a part of speech to each word. In HMM terms, the observation sequence is the word segmentation of the given sentence, the hidden states are the parts of speech, and the prior probabilities of the parts of speech form the parameter π. To tag a sentence, we therefore first train an HMM on a corpus and then segment and tag the sentence.

2.1 Chinese Word Segmentation

We first introduce Chinese word segmentation, because the user inputs a complete sentence from which the observation sequence cannot be read off directly. Chinese word segmentation based on statistical language models is already very effective and can be considered a solved problem. However, it requires training another Markov model, which is beyond the scope of this blog. Here, we implement the simplest form of Chinese word segmentation: scan the sentence from left to right, look up the dictionary, take the longest matching word, and break unknown strings into single characters.

In the code, we have a dictionary of nearly 35 million words, sorted by Unicode code point for easy lookup. For word segmentation, the dictionary is first read into memory and then searched according to the left-to-right matching principle. Because the dictionary is sorted by Unicode code point, phrases can be looked up with binary search. To search, we first read the first character of the remaining sentence and locate the start and end positions of the entries beginning with that character in the dictionary; binary search is then performed within that range. The maximum length of all words between the start and end positions is recorded, and the lookup starts from that maximum length, decreasing by one each time until a match is found. Figure 5 describes the word segmentation process:

Figure 5 Chinese Word Segmentation
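The left-to-right longest-match idea can be sketched as follows. This is not the program's actual code: it swaps the sorted dictionary and binary search for a `HashSet` lookup to stay short, uses one global maximum word length instead of a per-character maximum, and uses Latin letters as stand-ins for Chinese characters. The class name `MaxMatchSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of greedy left-to-right longest-match segmentation. */
class MaxMatchSketch {
    static List<String> segment(String sentence, Set<String> dict, int maxWordLen) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < sentence.length()) {
            int end = Math.min(sentence.length(), pos + maxWordLen);
            // Try the longest candidate first, shrinking by one character each time.
            while (end > pos + 1 && !dict.contains(sentence.substring(pos, end))) {
                end--;
            }
            // If nothing matched, end == pos + 1 and a single character is emitted.
            words.add(sentence.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("ab", "abc", "de"));
        System.out.println(segment("abcde", dict, 3));
    }
}
```

A hash set gives constant-time lookups at the cost of more memory; the sorted array with binary search described above is the more compact choice for a very large dictionary.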

2.2 HMM Parameter Training

An HMM has three parameters to be trained: π represents the prior probabilities of the parts of speech, A represents the state transition matrix between parts of speech, and B represents the emission matrix (confusion matrix) between parts of speech and words. This blog trains these three parameters in a supervised way, that is, from statistics gathered over an annotated corpus. Figure 6 shows part of the corpus we use. Each line is a complete labeled sentence.

Figure 6 Corpus

HMM parameter training obtains the three parameters by analyzing this corpus. From the corpus we can count the number of occurrences of each part of speech, the number of times each part of speech is followed by each other part of speech, and the number of times each word occurs with each part of speech. Once these counts are available, relative frequencies are used in place of probabilities to obtain the values of the three parameters.
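The counting described above might look roughly like this. It is a sketch under stated assumptions, not the blog's implementation: tokens are assumed to have the shape `word/tag` and to be separated by whitespace, the class `CountTrainerSketch` and its methods are hypothetical, and the English tokens in `main` are invented sample data.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: estimates HMM parameters as relative frequencies from a tagged corpus. */
class CountTrainerSketch {
    private final Map<String, Integer> tagCount = new HashMap<>();    // occurrences of each tag
    private final Map<String, Integer> followCount = new HashMap<>(); // key "prevTag nextTag"
    private final Map<String, Integer> emitCount = new HashMap<>();   // key "tag word"
    private int totalTags = 0;

    /** Adds one labeled sentence; tokens that are not word/tag are skipped. */
    void addSentence(String line) {
        String prev = null;
        for (String token : line.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                continue;
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            tagCount.merge(tag, 1, Integer::sum);
            emitCount.merge(tag + " " + word, 1, Integer::sum);
            if (prev != null) {
                followCount.merge(prev + " " + tag, 1, Integer::sum);
            }
            prev = tag;
            totalTags++;
        }
    }

    /** Prior: share of all tag occurrences that carry this tag. */
    double prior(String tag) {
        return totalTags == 0 ? 0.0 : tagCount.getOrDefault(tag, 0) / (double) totalTags;
    }

    /** Transition: how often `to` follows `from`, relative to occurrences of `from`. */
    double transition(String from, String to) {
        return ratio(followCount.getOrDefault(from + " " + to, 0), from);
    }

    /** Emission: how often `word` carries `tag`, relative to occurrences of `tag`. */
    double emission(String tag, String word) {
        return ratio(emitCount.getOrDefault(tag + " " + word, 0), tag);
    }

    private double ratio(int count, String tag) {
        int denom = tagCount.getOrDefault(tag, 0);
        return denom == 0 ? 0.0 : count / (double) denom;
    }

    public static void main(String[] args) {
        CountTrainerSketch trainer = new CountTrainerSketch();
        trainer.addSentence("I/r love/v NLP/n");
        trainer.addSentence("I/r run/v");
        System.out.println(trainer.transition("v", "n"));
    }
}
```

Unseen events get probability zero here, which is exactly the situation where a smoothing method becomes necessary on a real corpus.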

The key to gathering these counts is parsing the corpus, which is done with the following regular expressions:

// Split the corpus into phrases (separated by whitespace), each with its part-of-speech tag
text = content.toString().split("\\s{1,}");
// Strip the part-of-speech tags and keep only the phrases:
// "/" followed by zero or more letters, then zero or more spaces
phrase = content.toString().split("(/[A-Z]*\\s{0,})");
// Extract the part-of-speech tags of all phrases in the corpus:
// a leading date followed by "/", or whitespace plus non-letters, acts as the separator
characters = content.toString().split("[0-9|-]*/|\\s{1,}[^A-Z]*");

The comments above explain the regular expressions, so they are not repeated here. With this information in hand, the counts are easy to compute, and frequencies are used as probabilities. The prior probability of each part of speech is straightforward to compute. The hidden-state transition matrix follows the formula:

$$a_{ij} = \frac{C(t_i, t_j)}{C(t_i)}$$

where $C(t_i, t_j)$ is the number of times part of speech $t_i$ is immediately followed by part of speech $t_j$, and $C(t_i)$ is the number of occurrences of part of speech $t_i$. The emission matrix of the observed states follows the formula:

$$b_i(w_k) = \frac{C(w_k, t_i)}{C(t_i)}$$

where $C(w_k, t_i)$ is the number of times word $w_k$ appears with part of speech $t_i$. Because some of these frequencies are very small, we multiply the results by 100 to avoid underflow in the subsequent calculations. I cannot personally guarantee the reliability of this method. In fact, when a frequency is zero or very small, it should be re-estimated with Good-Turing estimation (The Beauty of Mathematics, p. 34), and logarithms should be used when searching for the optimal hidden sequence. For convenience, these details are ignored here. Assume that, by analyzing the corpus, we finally obtain N parts of speech and M phrases; then π is a vector of length N, A is an N × N matrix, and B is an N × M matrix. When tagging a sentence, make sure that all the words produced by segmentation are among the M phrases; otherwise the sentence is beyond what the HMM can handle.
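To see why logarithms are the safer choice, the following small sketch compares multiplying many small probabilities with summing their logarithms. It illustrates the log approach mentioned above rather than the scaling by 100 used in the program; the class name and the sample values (400 factors of 0.1) are assumptions chosen only to trigger the effect.

```java
import java.util.Arrays;

/** Sketch: products of small probabilities underflow, sums of logs do not. */
class LogSpaceSketch {
    static double product(double[] probs) {
        double p = 1.0;
        for (double x : probs) {
            p *= x;
        }
        return p;
    }

    static double logSum(double[] probs) {
        double s = 0.0;
        for (double x : probs) {
            s += Math.log(x);
        }
        return s;
    }

    public static void main(String[] args) {
        double[] probs = new double[400];
        Arrays.fill(probs, 0.1);
        // The true product is 1e-400, far below the smallest positive double.
        System.out.println("product = " + product(probs));
        System.out.println("log sum = " + logSum(probs));
    }
}
```

Because the logarithm is monotonic, maximizing a sum of log probabilities selects the same sequence as maximizing the product, so the search for the optimal hidden sequence is unchanged.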

2.3 Word Segmentation

Generally, once HMM parameter training is complete, the HMM can be put to work. Before that, however, our part-of-speech tagging system needs a further word segmentation step. Our segmenter uses left-to-right maximum matching, whereas the corpus used in the program tends toward minimum matching, so words produced by the first segmentation pass may not appear in the corpus. Here, we split the words that the corpus does not recognize once more so that the algorithm can find more of them.

The algorithm is simple. Since training has already collected all M observed states of the HMM, each segmented word is looked up among those states. A word that is not found is divided into two parts, which become the new segmentation.

2.4 Viterbi Algorithm

Finally we come to the well-known Viterbi algorithm. It is far less difficult than training the model parameters, so it is actually quite simple. To describe the algorithm more mathematically, we first declare several symbols:

- π: the prior probabilities of the hidden states; $\pi_i$ is the prior probability of state $i$;

- A: the transition matrix of the hidden states; each entry $a_{ij}$ is the probability of a transition from state $i$ to state $j$;

- B: the emission matrix (confusion matrix) from hidden states to observed states; each entry $b_i(O_t)$ is the probability that hidden state $i$ produces the observation $O_t$.

Before looking at how the Viterbi algorithm computes the hidden state sequence efficiently, consider the exhaustive algorithm, using the poet's weather problem from the beginning. Suppose the poet has observed the moss states (dry, damp, soggy) on three consecutive days and wants the most likely weather over those days. The simplest but most naive approach is to list every possible combination of weather over the three days, calculate the probability of each combination, and select the one with the highest probability, as shown in Figure 7.

Figure 7 potential hidden sequence combinations of observed Sequences

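With the exhaustive algorithm, the most likely state sequence is the one that maximizes the joint probability of the hidden sequence $X = (X_0, \ldots, X_{T-1})$ and the observations $O = (O_0, \ldots, O_{T-1})$, where $\pi_i$, $a_{ij}$, and $b_i(O_t)$ are entries of π, A, and B:

$$P(X, O) = \pi_{X_0}\, b_{X_0}(O_0) \prod_{t=1}^{T-1} a_{X_{t-1} X_t}\, b_{X_t}(O_t)$$

Assume that there are T observations. For a given sequence of hidden states, the cost of this computation is *O(2T)*, and there are $N^T$ possible hidden sequences, so the overall complexity is *O(2T·N^T)*. This complexity is exponential and impractical, which is where the Viterbi algorithm, based on dynamic programming, comes in.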
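For comparison, the exhaustive search can be sketched in a few lines. The class name `BruteForceSketch` is hypothetical, states and observations are assumed to be array indices, and the code is only usable for very small N and T because it enumerates every hidden sequence.

```java
/** Sketch of exhaustive search over all N^T hidden state sequences. */
class BruteForceSketch {
    static int[] bestPath(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length;
        int len = obs.length;
        long total = 1;
        for (int k = 0; k < len; k++) {
            total *= n; // number of candidate sequences, N^T
        }
        int[] path = new int[len];
        int[] best = new int[len];
        double bestP = -1.0;
        for (long c = 0; c < total; c++) {
            // Decode the counter c as a base-n number: one digit per time step.
            long code = c;
            for (int k = 0; k < len; k++) {
                path[k] = (int) (code % n);
                code /= n;
            }
            // Joint probability of this hidden sequence and the observations.
            double p = pi[path[0]] * b[path[0]][obs[0]];
            for (int k = 1; k < len; k++) {
                p *= a[path[k - 1]][path[k]] * b[path[k]][obs[k]];
            }
            if (p > bestP) {
                bestP = p;
                best = path.clone();
            }
        }
        return best;
    }
}
```

With N = 3 weather states and T = 3 days this checks only 27 sequences, but the count grows by a factor of N with every additional observation.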

Since we want the most likely hidden state sequence, that sequence must have the maximum probability of occurrence, and it has optimal substructure: every prefix *X0, X1, ..., Xt* of the optimal sequence must itself be the most probable sequence ending in its last state. Otherwise, the prefix could be replaced with a more probable one to produce a better overall sequence, which contradicts the premise. A dynamic programming algorithm has two key points: the recurrence and the initialization. Assume that we have obtained, for each state, the most likely sequence of the first *t* hidden states. To compute the state at time *t + 1*, we need to choose the optimal state at time *t*. Because there are *N* possible hidden states at time *t*, the computation at time *t + 1* selects, from these *N* states, the one that gives the state at time *t + 1* the highest probability. Initialization depends mainly on the prior probabilities. The steps of the Viterbi algorithm are as follows:

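- Let $\delta_0(i) = \pi_i\, b_i(O_0)$ for *i = 0, 1, ..., N-1*, where $\delta_t(i)$ denotes the highest probability of any hidden state sequence that ends in state *i* at time *t*;

- For *t = 1, 2, ..., T-1* and *i = 0, 1, ..., N-1*, compute:

$$\delta_t(i) = \max_{0 \le j \le N-1} \left[ \delta_{t-1}(j)\, a_{ji}\, b_i(O_t) \right]$$

- At time *T-1* this gives the probabilities of the best sequences ending in each of the *N* states; select the highest:

$$\max_{0 \le i \le N-1} \delta_{T-1}(i)$$

The maximum probability itself is not the goal; the goal is the hidden sequence that attains it. This requires saving the optimal state chosen at each step of the calculation and then tracing back.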

The computation of the Viterbi algorithm is illustrated in Figure 8. The yellow column is the column to be initialized. The value of the red square depends on the green column. The final result is the maximum value in the blue column. After the computation is complete, the optimal hidden state sequence is found by backtracking.

Figure 8 Viterbi algorithm matrix calculation process

With the Viterbi algorithm, we can obtain the optimal hidden sequence quickly. The matrix in Figure 8 has *N × T* elements, and computing each element costs *O(N)*, so the overall complexity is *O(TN²)*. In an actual implementation, it is better to swap the hidden-state and observation axes, that is, to transpose the matrix. With the layout shown in Figure 8, the elements of each column are not adjacent in memory, which causes frequent cache misses and reduces performance. The figure is drawn that way only for ease of description.
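The whole procedure, including backtracking, can be sketched as follows. The class name `ViterbiSketch` is hypothetical, states and observations are assumed to be array indices, and the observation sequence is assumed to be non-empty. The table is stored with one row per time step, which is the transposed layout recommended above, and plain probabilities are multiplied as in the description, so long sequences would need scaling or logarithms.

```java
/** Sketch of the Viterbi algorithm with backtracking. */
class ViterbiSketch {
    static int[] decode(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length;
        int len = obs.length;
        double[][] delta = new double[len][n]; // delta[t][i]: best probability ending in state i at time t
        int[][] back = new int[len][n];        // back[t][i]: best predecessor of state i at time t

        // Initialization from the prior probabilities.
        for (int i = 0; i < n; i++) {
            delta[0][i] = pi[i] * b[i][obs[0]];
        }
        // Recurrence: each cell looks at the whole previous row.
        for (int t = 1; t < len; t++) {
            for (int i = 0; i < n; i++) {
                int arg = 0;
                double best = delta[t - 1][0] * a[0][i];
                for (int j = 1; j < n; j++) {
                    double v = delta[t - 1][j] * a[j][i];
                    if (v > best) {
                        best = v;
                        arg = j;
                    }
                }
                delta[t][i] = best * b[i][obs[t]];
                back[t][i] = arg;
            }
        }
        // Pick the most probable final state, then trace the saved choices back.
        int last = 0;
        for (int i = 1; i < n; i++) {
            if (delta[len - 1][i] > delta[len - 1][last]) {
                last = i;
            }
        }
        int[] path = new int[len];
        path[len - 1] = last;
        for (int t = len - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }
}
```

For part-of-speech tagging, `obs` would hold the indices of the segmented words and the returned path the indices of their parts of speech.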

3 Conclusion

For part-of-speech tagging with an HMM, two problems must be solved: training the three parameters of the HMM and searching for the optimal hidden sequence. Many annotated corpora are available in the part-of-speech tagging field, so we use supervised training to obtain the HMM parameters and then use the Viterbi algorithm to find the optimal hidden sequence. The key to the whole approach is understanding the HMM; once it is truly understood, all the subsequent tasks become easy.

4 References

[1] The Beauty of Mathematics, chapters 4, 5, and 26;

[2] Hidden Markov Model;

[3] A Revealing Introduction to Hidden Markov Models;

[4] HMM application in natural language processing 1: part-of-speech tagging.