I recently wrote a simulated Chinese pinyin input method in Java, because it only has to accept input in one specific text box (itself written in Java). To recognize continuous pinyin sequences I use a hidden Markov model (HMM), decoded with the well-known Viterbi algorithm, which is itself a classic application of dynamic programming. I will not explain that algorithm here; there are plenty of good explanations online. To obtain the observation sequence for the HMM (the first step in building the model), we need to split the continuous letter sequence typed by the user into the syllables the user most likely intended.
The first solution that comes to mind is forward maximum matching. It is simple to implement, but it has a serious flaw. For example, if the user types xianguo, forward maximum matching finds that xiang is a legal pinyin and xiangu is not, so it puts a separator after xiang. The remaining uo, however, cannot be segmented, because neither uo nor u'o is valid pinyin. The split the user presumably wanted, xian'guo, is never considered.
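The failure on xianguo can be reproduced with a minimal sketch of forward maximum matching. The class name ForwardMaxMatchDemo and the tiny SYLLABLES table are assumptions made for illustration only; a real table would list every legal pinyin syllable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ForwardMaxMatchDemo {
    // Hypothetical mini syllable table, just enough for the examples below.
    static final Set<String> SYLLABLES = new HashSet<>(Arrays.asList(
            "xi", "xia", "xian", "xiang", "gu", "guo"));

    /** Greedy forward maximum matching; returns null when it gets stuck. */
    static List<String> segment(String input) {
        List<String> result = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int end = input.length();
            // Shrink the candidate from the right until it is a legal syllable.
            while (end > start && !SYLLABLES.contains(input.substring(start, end))) {
                end--;
            }
            if (end == start) {
                return null; // no legal syllable starts at this position
            }
            result.add(input.substring(start, end));
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(segment("xiguo"));   // [xi, guo]
        System.out.println(segment("xianguo")); // null: greedy takes "xiang", then "uo" fails
    }
}
```

The greedy choice is never revisited, which is exactly why a legal split such as xian'guo is missed.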
So what kind of segmentation counts as a good one for an input letter sequence?
First, there should be as few syllables as possible. The input danteng could of course be split into d'a'n't'e'n'g, which is technically valid but hardly useful. What the user most likely meant is dan'teng.
Second, each syllable should be as complete as possible. How do we define completeness? For a candidate syllable py we define a function costs with three possible values: costs(py) = 0 when py is a complete legal syllable; costs(py) = 1 when py is not a complete legal syllable but is a prefix of at least one legal syllable; costs(py) = 2 when py is not a prefix of any legal syllable. We use the costs value of a syllable as a measure of its completeness; the lower the value, the better. For example, for the input gonga, both gong'a and gon'ga satisfy the first requirement. Under the second requirement, however, gon is not a complete legal pinyin, only a prefix of one, whereas in the first split both parts are complete legal pinyin. So we choose gong'a as the best segmentation.
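The three-valued costs function can be sketched as follows. The class name CostsDemo and the mini SYLLABLES table are hypothetical, and the linear scan is only for brevity; a trie over the full syllable table would answer the prefix test faster.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CostsDemo {
    // Hypothetical mini syllable table covering the examples in the text.
    static final Set<String> SYLLABLES = new HashSet<>(Arrays.asList(
            "gong", "ga", "a", "dan", "teng"));

    /** 0 = complete legal syllable, 1 = prefix of a legal syllable, 2 = neither. */
    static int costs(String py) {
        if (SYLLABLES.contains(py)) {
            return 0;
        }
        for (String syllable : SYLLABLES) {
            if (syllable.startsWith(py)) {
                return 1;
            }
        }
        return 2;
    }

    public static void main(String[] args) {
        System.out.println(costs("gong"));  // 0
        System.out.println(costs("gon"));   // 1
        System.out.println(costs("gonga")); // 2
    }
}
```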
This gives us a formal definition of the problem:
Input: a pinyin letter sequence that may contain multiple syllables
Output: a segmentation into syllables that meets the conditions above (the fewest syllables and the most complete syllables)
If we can somehow find the first separator for this input sequence, we get two shorter subsequences. Processing each of them in the same way gives the segmentation of both subsequences, and merging those gives the segmentation of the whole sequence. This is a basic divide-and-conquer strategy. How, then, do we find a separation point that satisfies the conditions above? The natural idea is to try every possible separation point, compare the results, and pick the best one as the separation point of the sequence. This looks like a problem dynamic programming can solve. Having identified the subproblems, we can write a recurrence that assigns a score to each subproblem and use that score to choose the segmentation later. Here we define the score of the substring from position i to position j of the original input sequence as follows:
M(i, j) = costs(input(i, j)), if i = j
M(i, j) = min over i <= t < j of (M(i, t) + M(t + 1, j)) + costs(input(i, j)), if i != j
We have split the problem into subproblems and found a recurrence, which is what dynamic programming needs, so this looks promising.
However, an input sequence of length n can be split into at most n syllables (using n - 1 separators) and at least one syllable. Each of the n - 1 gaps between letters either holds a separator or does not, so there are 2^(n-1) possible segmentations, and enumerating all of them does not take polynomial time.
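The 2^(n-1) count can be checked by brute force. The sketch below (the class and method names are made up for this illustration) counts segmentations by choosing the length of the first piece and recursing on the rest.

```java
class SplitCountDemo {
    /** Counts the ways to cut a string of length n into non-empty pieces. */
    static long countSplits(int n) {
        if (n == 0) {
            return 1; // nothing left to cut: one finished segmentation
        }
        long total = 0;
        for (int first = 1; first <= n; first++) {
            total += countSplits(n - first); // first piece has length "first"
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 6; n++) {
            System.out.println(n + " letters: " + countSplits(n)
                    + " segmentations, 2^(n-1) = " + (1L << (n - 1)));
        }
    }
}
```

For a six-letter input such as xuanbu this gives 32 segmentations, and the count doubles with every extra letter.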
Dynamic programming also searches the whole space of possibilities, but it finishes in polynomial time. The reason is that it stores the results of subproblems it has already computed, so when the same subproblem comes up again the stored result is reused instead of being recomputed. Take the input xuanbu: any substring, for example an, is contained in many longer substrings, such as xuan, uan, xuanb, uanb, uanbu, anb, anbu, and xuanbu, and these longer substrings in turn contain one another. So instead of recursing, we can iterate in a suitable order, starting from the smallest subproblems and computing each new result from earlier ones.
Computing M for a substring of length l requires the M values of its shorter substrings, while substrings of the same length do not depend on one another. We therefore start with all substrings of length 1 and proceed in order of increasing length until the length equals n, the length of the input sequence.
First we create a matrix M (storing the optimal M value of each substring) and a matrix P of the same size (storing the optimal split position). The row index i is the starting position of the substring and the column index j is its ending position. Each matrix is n by n, where n is the length of the input sequence. For the input sequence xuanbu, we have the following matrix (the number in each cell is the order in which it is computed):
  | x  | u  | a  | n  | b  | u
x | 1  | 7  | 12 | 16 | 19 | 21
u |    | 2  | 8  | 13 | 17 | 20
a |    |    | 3  | 9  | 14 | 18
n |    |    |    | 4  | 10 | 15
b |    |    |    |    | 5  | 11
u |    |    |    |    |    | 6
P(i, j) = i, if i = j
P(i, j) = argmin over i <= t < j of (M(i, t) + M(t + 1, j) + costs(input(i, j))), if i != j
Using the recurrences for M and P above, we can fill in both matrices.
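The computing order numbered in the matrix for xuanbu can be reproduced with two nested loops, the outer one over substring length and the inner one over the starting position. This is only a sketch of the traversal order; the class and method names are hypothetical, and the cell values are step numbers rather than M or P values.

```java
class FillOrderDemo {
    /** Returns order[i][j] = step at which substring i..j is computed (0 = unused). */
    static int[][] fillOrder(int n) {
        int[][] order = new int[n][n];
        int step = 1;
        for (int len = 1; len <= n; len++) {          // shorter substrings first
            for (int i = 0; i + len - 1 < n; i++) {   // every start position
                int j = i + len - 1;                  // matching end position
                order[i][j] = step++;
            }
        }
        return order;
    }

    public static void main(String[] args) {
        String input = "xuanbu";
        int[][] order = fillOrder(input.length());
        for (int i = 0; i < order.length; i++) {
            StringBuilder row = new StringBuilder(input.charAt(i) + " |");
            for (int j = 0; j < order.length; j++) {
                row.append(order[i][j] == 0 ? "    |" : String.format(" %2d |", order[i][j]));
            }
            System.out.println(row);
        }
    }
}
```

Every cell on one diagonal depends only on cells from earlier diagonals, so each value is ready by the time it is needed.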
At this point we seem to have a split position for every substring. But the recurrence only takes the second requirement (completeness) into account, so how do we turn the matrices into the final segmentation? If we always split wherever P says, we end up with single letters. Instead, I first look up the best split position in P for the whole sequence, split there, and then examine the two subsequences. A piece that is a complete legal syllable, or a prefix of one, is not split any further. A piece that is neither is split again at its own position in P and handled the same way, until every piece is a complete legal syllable or a prefix of one. Finally, the pieces are returned in the order in which they appear in the input sequence.
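The whole procedure, filling M and P and then splitting top-down, can be sketched compactly on the gonga example. This follows the recurrences in this article, including breaking ties in favor of the later split position, but the class name, the three-entry syllable table, the empty-input check, and the single-letter guard are assumptions added for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SplitDemo {
    // Hypothetical mini syllable table, just enough for "gonga".
    static final Set<String> SYLLABLES = new HashSet<>(Arrays.asList("gong", "ga", "a"));

    // 0 = complete syllable, 1 = prefix of a syllable, 2 = neither.
    static int costs(String py) {
        if (SYLLABLES.contains(py)) return 0;
        for (String s : SYLLABLES) if (s.startsWith(py)) return 1;
        return 2;
    }

    static List<String> separate(String input) {
        List<String> result = new ArrayList<>();
        int n = input.length();
        if (n == 0) return result;
        int[][] m = new int[n][n];
        int[][] p = new int[n][n];
        for (int len = 1; len <= n; len++) {
            for (int i = 0; i + len - 1 < n; i++) {
                int j = i + len - 1;
                int c = costs(input.substring(i, j + 1));
                if (i == j) { m[i][j] = c; p[i][j] = i; continue; }
                int best = Integer.MAX_VALUE;
                for (int t = i; t < j; t++) {
                    int value = m[i][t] + m[t + 1][j] + c;
                    if (value <= best) { best = value; p[i][j] = t; } // ties: later split wins
                }
                m[i][j] = best;
            }
        }
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[] {0, n - 1});
        while (!stack.isEmpty()) {
            int[] span = stack.pop();
            int st = span[0], end = span[1];
            String piece = input.substring(st, end + 1);
            // Stop at usable pieces; a single letter cannot be split further (added guard).
            if (st == end || costs(piece) != 2) { result.add(piece); continue; }
            int sp = p[st][end];
            stack.push(new int[] {sp + 1, end}); // right half, handled after the left
            stack.push(new int[] {st, sp});      // left half, handled first
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(separate("gonga")); // [gong, a]
    }
}
```

With this table the candidates gon'ga and gong'a tie on the M score for the whole string, and the later split position is kept, which yields gong'a.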
The Java source code of the algorithm (GitHub link here):
PYSeparator.java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map.Entry;
import java.util.AbstractMap.SimpleEntry;
// import Building.trieTree; // project-specific import, not used in this class

public class PYSeparator implements Separator {
    Dict dict;

    public PYSeparator(Dict d) {
        this.dict = d;
    }

    public ArrayList<String> separate(String str) {
        ArrayList<String> ret = new ArrayList<String>();
        // The whole input is already a syllable or a syllable prefix: nothing to split.
        if (dict.costs(str) != 2) {
            ret.add(str);
            return ret;
        }
        int size = str.length();
        int[][] vMat = new int[size][size], posMat = new int[size][size];
        int row, col;
        // Fill both matrices diagonal by diagonal (substring length i + 1).
        for (int i = 0; i < size; i++) {
            int ceil = size - i;
            col = i;
            row = 0;
            for (int j = 0; j < ceil; j++) {
                if (row == col) {
                    vMat[row][col] = dict.costs("" + str.charAt(row));
                    posMat[row][col] = row;
                } else {
                    int min = Integer.MAX_VALUE, pos = row;
                    for (int t = row; t < col; t++) {
                        int value = vMat[row][t] + vMat[t + 1][col]
                                + dict.costs(str.substring(row, col + 1));
                        if (value <= min) {
                            min = value;
                            pos = t;
                        }
                    }
                    vMat[row][col] = min;
                    posMat[row][col] = pos;
                }
                col++;
                row++;
            }
        }
        // Split top-down, always handling the left part first to keep input order.
        LinkedList<Entry<Integer, Integer>> splits = new LinkedList<Entry<Integer, Integer>>();
        splits.add(new SimpleEntry<Integer, Integer>(0, str.length() - 1));
        while (!splits.isEmpty()) {
            Entry<Integer, Integer> span = splits.pollFirst();
            int st = span.getKey(), end = span.getValue();
            String nowSt = str.substring(st, end + 1);
            // A single letter cannot be split further; without this check a letter
            // with costs == 2 would be pushed back forever.
            if (st == end || dict.costs(nowSt) != 2) {
                ret.add(nowSt);
                continue;
            }
            int sp = posMat[st][end];
            if (sp + 1 <= end) splits.addFirst(new SimpleEntry<Integer, Integer>(sp + 1, end));
            if (st <= sp) splits.addFirst(new SimpleEntry<Integer, Integer>(st, sp));
        }
        return ret;
    }
}
Dict.java
public interface Dict {
    /**
     * @param py
     * @return 0 when py is a legal Pinyin;
     *         1 when py is the prefix of a legal Pinyin;
     *         2 when py is not possibly a legal Pinyin
     */
    public int costs(String py);
}
Separator.java
import java.util.List;

public interface Separator {
    public List<String> separate(String str);
}