Applying dynamic programming to split a pinyin sequence into syllables

Source: Internet
Author: User

I recently simulated a Chinese pinyin input method in Java. The reason for the simulation is that this input method only needs to work inside one specific text box (itself written in Java). To recognize a continuous pinyin sequence we use an HMM as the model. The well-known Viterbi algorithm is itself a classic application of dynamic programming, and it is explained so widely on the web that I will not repeat it here. To obtain the observation sequence for the HMM (the first step in modeling), we need to find the most likely syllables for a continuous sequence of letters typed by the user.

The first solution that comes to mind is, of course, forward longest matching. The method is simple, but it has serious flaws. For example, if the user types xianguo, the forward maximum matching algorithm finds that xiang is a valid pinyin and xiangu is not, so it adds a split after xiang. The remaining uo is then a problem, because neither uo nor u'o is legal pinyin.
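The failure is easy to reproduce. The following is a minimal sketch of forward longest matching, assuming a tiny hypothetical dictionary; the `LEGAL` set and the class name `ForwardMatchDemo` are illustrative and not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ForwardMatchDemo {
    // Hypothetical mini-dictionary; a real input method would load the full pinyin table.
    static final Set<String> LEGAL = Set.of("xi", "xian", "xiang", "guo", "o");

    // Greedy forward longest match: at each position take the longest legal syllable,
    // falling back to a single letter when nothing matches.
    static List<String> segment(String input) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int end = input.length();
            // Shrink the candidate until it is legal or only one letter is left.
            while (end > start + 1 && !LEGAL.contains(input.substring(start, end))) {
                end--;
            }
            out.add(input.substring(start, end));
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segment("xianguo")); // prints [xiang, u, o]
    }
}
```

With this dictionary the greedy matcher commits to xiang and is left with u and o, although xian'guo would have used two legal syllables.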

What, then, makes a good separation of an input letter sequence into syllables?

First, there should be as few syllables as possible. The input danteng could of course be split into d'a'n't'e'n'g, which is technically valid, but it really hurts to look at. The user's most likely intention is dan'teng.

Second, the syllables should be as complete as possible. How do we define completeness? A candidate syllable can be in one of 3 states, and we define a function costs over them. When the syllable py is a complete legal syllable, costs(py) = 0; when py is not a complete legal syllable but is a prefix of at least one legal syllable, costs(py) = 1; when py cannot be a prefix of any legal syllable, costs(py) = 2. We use the costs value of a syllable to judge its completeness, and clearly the lower the value the better. For example, for the input gonga, both gong'a and gon'ga satisfy the first requirement. Under the second requirement, however, gon is not a complete legal pinyin, only a prefix of one, while in the first split both parts are complete legal pinyin. So we choose gong'a as the best segmentation.
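The three states can be sketched in a few lines. This is only an illustration under assumptions: the `LEGAL` set is a hypothetical five-entry dictionary, the prefix test is a plain linear scan (a real dictionary would more likely use a trie), and `total` is a helper introduced here to add up the costs of one candidate split.

```java
import java.util.Set;

class CostsDemo {
    // Hypothetical mini-dictionary, for illustration only.
    static final Set<String> LEGAL = Set.of("gong", "ga", "a", "dan", "teng");

    // 0: complete legal syllable, 1: prefix of a legal syllable, 2: neither.
    static int costs(String py) {
        if (LEGAL.contains(py)) return 0;
        for (String s : LEGAL) {
            if (s.startsWith(py)) return 1;
        }
        return 2;
    }

    // Sum of costs over the syllables of one candidate segmentation.
    static int total(String... syllables) {
        int sum = 0;
        for (String s : syllables) sum += costs(s);
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(costs("gong"));       // 0
        System.out.println(costs("gon"));        // 1
        System.out.println(costs("gonga"));      // 2
        System.out.println(total("gong", "a"));  // 0
        System.out.println(total("gon", "ga"));  // 1
    }
}
```

Both splits of gonga have two syllables, but gong'a has the lower total cost, which matches the choice made above.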

In this way, we formally define the problem as:

Input: a pinyin sequence that may contain multiple syllables

Output: a segmentation into syllables that satisfies the conditions above (fewest syllables, most complete syllables)



Suppose we find the first split point of this input sequence in some way. We then have two shorter sub-sequences, and handling them in the same way gives their separations; merging these gives the separation of the whole sequence. This is the basic divide-and-conquer strategy. So how do we find, for a sequence, the split point that satisfies the conditions above? It is natural to try every possible split point, compare the results, and choose the best one as the split point for this sequence. This suggests that dynamic programming can solve the problem. Having found how to divide the problem into sub-problems, we can write a recurrence that scores each sub-problem and serves as the basis for choosing a segmentation later. We define the score of the substring of the input sequence that starts at position i and ends at position j as:

M(i, j) = costs(INPUT(i, j))    if i = j

M(i, j) = min over i <= t < j of ( M(i, t) + M(t + 1, j) + costs(INPUT(i, j)) )    if i != j

For dynamic programming, we have split the problem into sub-problems and found the recurrence, so it seems we are done.
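Written down literally, the recurrence becomes a short recursive function. The sketch below assumes a hypothetical three-entry dictionary and a linear-scan `costs`; the names `RecursiveScoreDemo` and `m` are illustrative only, and no results are cached.

```java
import java.util.Set;

class RecursiveScoreDemo {
    // Hypothetical mini-dictionary, for illustration only.
    static final Set<String> LEGAL = Set.of("gong", "ga", "a");

    static int costs(String py) {
        if (LEGAL.contains(py)) return 0;
        for (String s : LEGAL) {
            if (s.startsWith(py)) return 1;
        }
        return 2;
    }

    // Score M(i, j) of input[i..j] (both ends inclusive), computed by plain recursion.
    static int m(String input, int i, int j) {
        int own = costs(input.substring(i, j + 1));
        if (i == j) return own;
        int best = Integer.MAX_VALUE;
        // Try every split point t with i <= t < j and keep the cheapest.
        for (int t = i; t < j; t++) {
            best = Math.min(best, m(input, i, t) + m(input, t + 1, j) + own);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(m("ga", 0, 1));  // 1
        System.out.println(m("gon", 0, 2)); // 7
    }
}
```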

However, an input sequence of length n can be split into at most n syllables (which needs n-1 separators) and at least 1 syllable, so there are 2^(n-1) possible separations. Enumerating them all is not a polynomial-time solution.
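The count follows from the fact that each of the n-1 gaps between letters either holds a separator or does not. A small sketch that enumerates every separation this way (the class and method names are made up for the example, and it is only meant for short inputs):

```java
import java.util.ArrayList;
import java.util.List;

class EnumerateSplitsDemo {
    // Bit k of mask decides whether a separator is placed after letter k.
    static List<String> allSegmentations(String input) {
        int n = input.length();
        List<String> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << (n - 1)); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int k = 0; k < n; k++) {
                sb.append(input.charAt(k));
                if (k < n - 1 && (mask & (1 << k)) != 0) {
                    sb.append('\'');
                }
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allSegmentations("dan"));           // [dan, d'an, da'n, d'a'n]
        System.out.println(allSegmentations("xuanbu").size()); // 32
    }
}
```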

Dynamic programming also searches the whole space of possibilities, yet it runs in polynomial time. The reason is that it stores the results of sub-problems already solved, so when the same sub-problem comes up again the stored result is used directly instead of being recomputed. Take our input xuanbu: any substring, for example an, is contained in many longer substrings, such as xuan, uan, xuanb, uanb, uanbu, anb, anbu and xuanbu, and those substrings in turn contain one another. So instead of recursion I can iterate: start from the smallest problems, proceed in a sensible order, and compute each new result from the results already computed.
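The saving can be measured on xuanbu. The sketch below is an illustration under assumptions (a hypothetical four-entry dictionary, made-up names such as `MemoDemo` and `compare`): it evaluates the same score once by plain recursion and once with a cache, and counts the work done in each case.

```java
import java.util.Set;

class MemoDemo {
    // Hypothetical mini-dictionary, for illustration only.
    static final Set<String> LEGAL = Set.of("xu", "xuan", "an", "bu");
    static int naiveCalls;
    static int memoEntries;

    static int costs(String py) {
        if (LEGAL.contains(py)) return 0;
        for (String s : LEGAL) {
            if (s.startsWith(py)) return 1;
        }
        return 2;
    }

    // Plain recursion: the same substring is scored again and again.
    static int naive(String s, int i, int j) {
        naiveCalls++;
        int own = costs(s.substring(i, j + 1));
        if (i == j) return own;
        int best = Integer.MAX_VALUE;
        for (int t = i; t < j; t++) {
            best = Math.min(best, naive(s, i, t) + naive(s, t + 1, j) + own);
        }
        return best;
    }

    // Same recurrence, but each (i, j) is computed once and then read from the cache.
    static int memo(String s, int i, int j, Integer[][] cache) {
        if (cache[i][j] != null) return cache[i][j];
        memoEntries++;
        int own = costs(s.substring(i, j + 1));
        int best = own;
        if (i < j) {
            best = Integer.MAX_VALUE;
            for (int t = i; t < j; t++) {
                best = Math.min(best, memo(s, i, t, cache) + memo(s, t + 1, j, cache) + own);
            }
        }
        cache[i][j] = best;
        return best;
    }

    // Returns {naive result, naive calls, memoized result, cache entries computed}.
    static int[] compare(String s) {
        naiveCalls = 0;
        memoEntries = 0;
        int n = s.length();
        int a = naive(s, 0, n - 1);
        int b = memo(s, 0, n - 1, new Integer[n][n]);
        return new int[] {a, naiveCalls, b, memoEntries};
    }

    public static void main(String[] args) {
        int[] r = compare("xuanbu");
        // prints "naive calls: 243, memoized entries: 21"
        System.out.println("naive calls: " + r[1] + ", memoized entries: " + r[3]);
    }
}
```

For a 6-letter input there are only 6 * 7 / 2 = 21 distinct substrings, so the cache never holds more than 21 entries, while the uncached recursion makes 243 calls.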

Computing a substring of length l requires the M values of substrings shorter than l, while substrings of the same length do not depend on one another. So we start with all substrings of length 1 and work upward in turn until the length equals the input sequence length n.

First, we build a matrix M (holding the optimal M value of each substring) and a matrix P of the same size (holding the optimal split position). The row index i is where the substring begins and the column index j is where it ends. The dimension of each matrix equals the length n of the input sequence. For the input sequence xuanbu, the cells are filled in the following order (the numbers give the order of calculation):

    x   u   a   n   b   u
x   1   7  12  16  19  21
u       2   8  13  17  20
a           3   9  14  18
n               4  10  15
b                   5  11
u                       6
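The order in the table is a walk over the diagonals of the upper triangle, one diagonal per substring length. A short sketch that reproduces the numbering for any n (the class and method names are illustrative):

```java
class FillOrderDemo {
    // order[i][j] is the step at which cell (i, j) is computed; cells with i > j stay 0.
    static int[][] fillOrder(int n) {
        int[][] order = new int[n][n];
        int step = 1;
        for (int len = 1; len <= n; len++) {          // substring length
            for (int i = 0; i + len - 1 < n; i++) {   // start position
                int j = i + len - 1;                  // end position
                order[i][j] = step++;
            }
        }
        return order;
    }

    public static void main(String[] args) {
        int[][] order = fillOrder("xuanbu".length());
        for (int[] row : order) {
            StringBuilder sb = new StringBuilder();
            for (int v : row) {
                sb.append(v == 0 ? "   ." : String.format("%4d", v));
            }
            System.out.println(sb);
        }
    }
}
```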

P(i, j) = i    if i = j
P(i, j) = argmin over i <= t < j of ( M(i, t) + M(t + 1, j) + costs(INPUT(i, j)) )    if i != j

We can derive the matrix P and the matrix M based on the above recursive formula for M and P.


So we appear to have all the split positions, but the recurrence only took our second requirement into account. How do we turn the matrices into the final separation? If we simply keep splitting according to P, we end up with every single letter as its own syllable. I handle it this way: first find the best split position P for the whole sequence and split there into two sub-sequences. Any sub-sequence that is a complete legal syllable, or a prefix of one, is not split further. Any sub-sequence that is neither is split again at the position given by the matrix P and then treated the same way. This continues until every piece is a complete legal syllable or a prefix of one. Finally, the pieces are put back in the order of the input sequence.



Source code of the algorithm (Java), also on GitHub:

PySeparator.java

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map.Entry;

public class PySeparator implements Separator {
	Dict dict;

	public PySeparator(Dict d) {
		this.dict = d;
	}

	public ArrayList<String> separate(String str) {
		ArrayList<String> ret = new ArrayList<String>();
		// A legal syllable, or a prefix of one, needs no splitting.
		if (dict.costs(str) != 2) {
			ret.add(str);
			return ret;
		}
		int size = str.length();
		// vMat holds the score M(i, j); posMat holds the split P(i, j),
		// i.e. the index of the letter just before the split line.
		int[][] vMat = new int[size][size], posMat = new int[size][size];
		int row, col;
		// Fill the upper triangle diagonal by diagonal (shortest substrings first).
		for (int i = 0; i < size; i++) {
			int ceil = size - i;
			col = i;
			row = 0;
			for (int j = 0; j < ceil; j++) {
				if (row == col) {
					vMat[row][col] = dict.costs("" + str.charAt(row));
					posMat[row][col] = row;
				} else {
					int min = Integer.MAX_VALUE, pos = row;
					for (int t = row; t < col; t++) {
						int value = vMat[row][t] + vMat[t + 1][col] + dict.costs(str.substring(row, col + 1));
						if (value <= min) {
							min = value;
							pos = t;
						}
					}
					vMat[row][col] = min;
					posMat[row][col] = pos;
				}
				col++;
				row++;
			}
		}
		// Work list of [start, end] spans, processed left to right.
		LinkedList<Entry<Integer, Integer>> splits = new LinkedList<Entry<Integer, Integer>>();
		splits.add(new SimpleEntry<Integer, Integer>(0, str.length() - 1));
		while (!splits.isEmpty()) {
			Entry<Integer, Integer> span = splits.pollFirst();
			int st = span.getKey(), end = span.getValue();
			String nowSt = str.substring(st, end + 1);
			// Keep legal syllables and prefixes. A single letter cannot be split
			// further, so keep it too; otherwise the loop would never terminate.
			if (dict.costs(nowSt) != 2 || st == end) {
				ret.add(nowSt);
				continue;
			}
			int sp = posMat[st][end];
			// Push the right part first so that the left part is handled first.
			if (sp + 1 <= end) splits.addFirst(new SimpleEntry<Integer, Integer>(sp + 1, end));
			if (st <= sp) splits.addFirst(new SimpleEntry<Integer, Integer>(st, sp));
		}
		return ret;
	}
}
Dict.java

public interface Dict {
	/**
	 * @param py a candidate syllable
	 * @return 0 if py is a legal pinyin;
	 *         1 if py is the prefix of a legal pinyin;
	 *         2 if py cannot possibly become a legal pinyin
	 */
	public int costs(String py);
}
Separator.java

import java.util.List;

public interface Separator {
	public List<String> separate(String str);
}






