Preface
This article explains how to combine the N-gram model with Chinese pinyin to correct typos in Chinese text, then introduces the application of minimum edit distance to error correction in Chinese search. Finally, starting from the dependency tree, it explains how to do long-distance error correction (grammar correction), and draws from this approach an idea for synonym clustering that uses the properties of the dependency tree together with the ESA algorithm.
N-gram Model
In Chinese typo detection, we judge whether a sentence is well-formed by computing its probability. Assuming a sentence S = {w1, w2, ..., wn}, the problem can be converted to the following form:

P(S) = P(w1, w2, ..., wn) = P(w1) · P(w2|w1) · ... · P(wn|w1, w2, ..., wn-1)
P(S) is called the language model, i.e. the model used to compute the probability that a sentence is well-formed.
However, using this formula directly runs into problems: the parameter space is too large and the counts are extremely sparse. This is where the N-gram model comes in. It is based on the Markov assumption that the probability of a word depends only on the previous one or few words, which gives:
(1) The appearance of a word depends only on the one word before it, namely the bigram (2-gram):

P(S) ≈ P(w1) · P(w2|w1) · P(w3|w2) · ... · P(wn|wn-1)
(2) The appearance of a word depends only on the two words before it, namely the trigram (3-gram):

P(S) ≈ P(w1) · P(w2|w1) · P(w3|w1, w2) · ... · P(wn|wn-2, wn-1)
The larger N is, the stronger the constraint on the next word, because more context is provided; but the model also becomes more complex and its data-sparsity problems grow, so in practice bigram or trigram is generally used. Here is a simple example that illustrates how the N-gram model is used:
The N-gram model builds the language model with maximum likelihood estimation (MLE), which is the best estimate for the training data. For the bigram case the formula is:

P(wi|wi-1) = count(wi-1, wi) / count(wi-1)
For a data set, assume the unigram counts count(wi) are as follows (8,493 words in total):
and the bigram counts count(wi-1, wi) are as follows:
The bigram probability matrix is calculated as follows:
The probability of the sentence "I want to eat Chinese food" is then:
P(I want to eat Chinese food) = P(I) × P(want|I) × P(to|want) × P(eat|to) × P(Chinese|eat) × P(food|Chinese)
= (2533/8493) × 0.33 × 0.66 × 0.28 × 0.021 × 0.52 ≈ 0.0002
Next we only need to determine a threshold from training data: any sentence whose probability is greater than or equal to the threshold is considered well-formed.
To avoid numerical underflow and improve performance, it is common to take logarithms and replace multiplication with addition, i.e.
log(p1 · p2 · p3 · p4) = log(p1) + log(p2) + log(p3) + log(p4)
Notice that the matrix in the example above contains zeros. For word pairs that never appear in the corpus we cannot simply take their probability to be 0, so we apply Laplace (add-one) smoothing: the zero counts become 1, treating the unseen pairs as merely very rare, which is more reasonable.
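To make the above concrete, here is a minimal bigram sketch in Java; the class name BigramModel, its toy corpus, and its methods are illustrative assumptions, not code from this article. It estimates P(wi|wi-1) by maximum likelihood, applies add-one (Laplace) smoothing over the vocabulary, and scores a sentence in log space as described above.

import java.util.*;

// Minimal bigram language model sketch: MLE counts, add-one smoothing,
// and log-space scoring of a candidate sentence. Corpus and numbers are
// placeholders, not the statistics from the table above.
public class BigramModel {
    private final Map<String, Integer> unigram = new HashMap<>();
    private final Map<String, Integer> bigram = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    public void train(List<String[]> corpus) {
        for (String[] sent : corpus) {
            for (int i = 0; i < sent.length; i++) {
                unigram.merge(sent[i], 1, Integer::sum);
                vocab.add(sent[i]);
                if (i > 0) {
                    bigram.merge(sent[i - 1] + " " + sent[i], 1, Integer::sum);
                }
            }
        }
    }

    // P(w | prev) with add-one (Laplace) smoothing over the vocabulary
    public double prob(String prev, String w) {
        int pair = bigram.getOrDefault(prev + " " + w, 0);
        int prevCount = unigram.getOrDefault(prev, 0);
        return (pair + 1.0) / (prevCount + vocab.size());
    }

    // log P(sentence) = sum of log bigram probabilities
    public double logProb(String[] sent) {
        double lp = 0.0;
        for (int i = 1; i < sent.length; i++) {
            lp += Math.log(prob(sent[i - 1], sent[i]));
        }
        return lp;
    }

    public static void main(String[] args) {
        BigramModel lm = new BigramModel();
        lm.train(Arrays.asList(
                new String[]{"I", "want", "to", "eat", "Chinese", "food"},
                new String[]{"I", "want", "to", "eat", "lunch"}));
        System.out.println(lm.logProb(new String[]{"I", "want", "to", "eat", "Chinese", "food"}));
    }
}

In practice the counts would come from a large corpus, and the acceptance threshold on the log probability would be tuned on held-out data.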
With the example above, we can use the N-gram model for fill-in-the-blank word choice, and of course for error correction as well. Typos in Chinese text are local, so we only need to slide a reasonably sized window over the sentence to check for a typo. Consider the following example:
We can use the N-gram model to detect that the "wear" character (pinyin "chuan") is wrong. We then convert the character to its pinyin "chuan", look up the candidate characters for "chuan" in a dictionary, fill each one in, and re-check with the N-gram model to see which is plausible. This is how the N-gram model combined with Chinese pinyin corrects typos in Chinese text (a rough sketch of this trial-and-fill loop follows the pinyin4j example below). In Java, pinyin conversion can be done with the pinyin4j library.
import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

public class Keyven {
    public static void main(String[] args) {
        HanyuPinyinOutputFormat format = new HanyuPinyinOutputFormat();
        format.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
        String str = "我爱自然语言处理，keyven";
        System.out.println(str);
        String[] pinyin = null;
        for (int i = 0; i < str.length(); ++i) {
            try {
                // null for non-Chinese characters; otherwise all pinyin readings of the character
                pinyin = PinyinHelper.toHanyuPinyinStringArray(str.charAt(i), format);
            } catch (BadHanyuPinyinOutputFormatCombination e) {
                e.printStackTrace();
            }
            if (pinyin == null) {
                System.out.print(str.charAt(i));
            } else {
                if (i != 0) {
                    System.out.print(" ");
                }
                System.out.print(pinyin[0]);
            }
        }
    }
}
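Building on the pinyin conversion, the following is a rough sketch (not this article's actual implementation) of the trial-and-fill correction loop. The pinyinDict map and the ngramScore function are placeholder assumptions standing in for a real pinyin-to-character dictionary and the trained N-gram model.

import java.util.*;

// Sketch of the pinyin-based correction loop: replace the suspicious
// character with every candidate that shares its pinyin and keep the
// sentence the language model scores highest.
public class PinyinCorrector {
    private final Map<String, List<Character>> pinyinDict;  // e.g. "chuan" -> candidate characters

    public PinyinCorrector(Map<String, List<Character>> pinyinDict) {
        this.pinyinDict = pinyinDict;
    }

    public String correct(String sentence, int pos, String pinyin,
                          java.util.function.ToDoubleFunction<String> ngramScore) {
        String best = sentence;
        double bestScore = ngramScore.applyAsDouble(sentence);
        for (char cand : pinyinDict.getOrDefault(pinyin, Collections.emptyList())) {
            String trial = sentence.substring(0, pos) + cand + sentence.substring(pos + 1);
            double score = ngramScore.applyAsDouble(trial);
            if (score > bestScore) {
                bestScore = score;
                best = trial;
            }
        }
        return best;
    }
}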
Minimum edit distance
Of course, in real life there are also cases where the pinyin is not wrong but the wrong word was chosen, or where the N-gram check looks reasonable but the word simply does not exist, for example:
This is where minimum edit distance comes in. For such hot searches we only need to keep the top-N queries, use edit distance to compute the similarity between the input and each of them, and offer the most similar candidate (a sketch of this lookup follows the edit-distance code below).
Edit distance, also known as Levenshtein distance, is the minimum number of edit operations required to turn one string into another. The permitted operations are replacing one character with another, inserting a character, and deleting a character. For example, to turn kitten into sitting:
sitten (k → s)
sittin (e → i)
sitting (insert g)
The Russian scientist Vladimir Levenshtein introduced the concept in 1965. It is computed with a classic dynamic programming (DP) algorithm; similar problems appear on POJ and in ACM algorithm books. The main idea is as follows:
First define a function edit(i, j), which denotes the edit distance between the length-i prefix of the first string and the length-j prefix of the second string. We then have the following dynamic programming recurrence:
if (i == 0 and j == 0), edit(i, j) = 0
if (i == 0 and j > 0), edit(i, j) = j
if (i > 0 and j == 0), edit(i, j) = i
if (i ≥ 1 and j ≥ 1), edit(i, j) = min{ edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + f(i, j) },
where f(i, j) = 1 if the i-th character of the first string does not equal the j-th character of the second string, and f(i, j) = 0 otherwise.
#include <cstdio>
#include <cstring>
#include <algorithm>
using namespace std;

int main() {
    /* test pairs: agtctgacgc / agtaagtaggc, sailn / failing */
    char astr[100], bstr[100];
    int dist[100][100];
    memset(dist, 0, sizeof(dist));
    scanf("%99s %99s", astr, bstr);
    int alen = strlen(astr);
    int blen = strlen(bstr);
    // boundary cases: transforming to or from the empty prefix
    for (int i = 0; i <= alen; i++) dist[i][0] = i;
    for (int j = 0; j <= blen; j++) dist[0][j] = j;
    for (int i = 1; i <= alen; i++) {
        for (int j = 1; j <= blen; j++) {
            int d = (astr[i - 1] != bstr[j - 1]) ? 1 : 0;      // substitution cost
            dist[i][j] = min(min(dist[i - 1][j] + 1,           // delete
                                 dist[i][j - 1] + 1),          // insert
                             dist[i - 1][j - 1] + d);          // replace or match
        }
    }
    // print the full DP matrix, then the edit distance
    for (int i = 0; i <= alen; i++) {
        for (int j = 0; j <= blen; j++) printf("%d ", dist[i][j]);
        printf("\n");
    }
    printf("%d\n", dist[alen][blen]);
    return 0;
}
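For the hot-search scenario described earlier, here is a minimal Java sketch of how the edit distance could be applied; the hot-query list and the suggest method are illustrative assumptions, since in a real system the top-N list would come from query logs and ties or very large distances would be handled more carefully.

import java.util.*;

// Sketch: correct a search query against a hypothetical list of top-N hot
// queries by picking the one with the smallest edit distance.
public class QueryCorrector {
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    static String suggest(String query, List<String> hotQueries) {
        String best = query;
        int bestDist = Integer.MAX_VALUE;
        for (String hot : hotQueries) {
            int dist = editDistance(query, hot);
            if (dist < bestDist) {
                bestDist = dist;
                best = hot;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> hot = Arrays.asList("failing", "sailing", "mailing");
        System.out.println(suggest("sailn", hot));   // prints "sailing" (distance 2)
    }
}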
Chinese grammar correction
I previously took part in the Chinese Grammatical Error Diagnosis (CGED) shared task (an ACL-IJCNLP 2015 workshop), where I was responsible for the Selection part. Here are the official examples (Redundant, Missing, Selection, and Disorder correspond to the four types of grammatical error):
During the contest I wanted to use a dependency tree to solve the Selection problem (collocation errors). Collocation is not so much a grammatical category as a semantic concept: for instance, whether the measure word in a phrase like "a movie" is wrong is judged from the noun "movie" it attaches to, and in a sentence like "Mr. Wu is good at repairing bicycles", whether "is" is misused is judged from the words it links ("good" here is verbal, so how could an "is + noun" structure be used?). But at the time everything was hectic and the prospects were uncertain, so I did not implement it. Later, searching online, I found the paper "A Chinese Automatic Error-Checking Method Based on N-gram and Dependency Analysis". I remembered having seen it two years earlier, when I did not yet understand dependency trees and so ignored the second half of the paper; now that I do understand them, what I write here also has some theoretical support. I had not expected the idea to turn out so well ^_^.
The collocation between two words reflects the strength of the semantic association between them, and an edge of the dependency tree embodies exactly this semantic relevance. If a sentence contains a Selection error, the dependency tree built from it will contain an implausible edge, so we can use these edges to judge whether a grammatical error exists. In the paper above the authors call this long-distance Chinese error correction, whereas N-gram is short-distance Chinese error correction.
As for how to build a domain knowledge base from existing knowledge, we can run a parser over a correct corpus and count the dependency edges of those grammatically correct sentences (a sketch follows below). The training set provided by the CGED contest was a bit strange, which is one reason the contest result was not ideal and the dependency-tree idea was not pursued. I later searched the internet for a few test samples (courseware PPTs from a linguistics course) to look at how a dependency tree can also be used for synonym clustering. Using the dependency tree we can detect Selection errors, but we also want to correct them; the natural correction algorithm is to substitute synonyms, since Selection errors are usually misused near-synonyms. I once tried the HIT-IRLab Tongyici Cilin (Extended) for this, but the effect was not very good, which is where the later idea of synonym clustering came from.
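As a sketch of the long-distance check, assume dependency edges have already been extracted by a parser (for example the Stanford Parser mentioned in the references) as (head word, relation, dependent word) triples; the class below and its counting scheme are illustrative, not the method from the cited paper.

import java.util.*;

// Sketch of the long-distance check: count dependency edges over a
// grammatically correct corpus, then flag edges of a new sentence that
// were never (or very rarely) seen as possible Selection errors.
public class DependencyChecker {
    private final Map<String, Integer> edgeCounts = new HashMap<>();

    private static String key(String head, String rel, String dep) {
        return head + "\t" + rel + "\t" + dep;
    }

    // Build the knowledge base from parsed, correct sentences.
    public void addCorrectEdge(String head, String rel, String dep) {
        edgeCounts.merge(key(head, rel, dep), 1, Integer::sum);
    }

    // An edge rarely or never seen in the correct corpus is suspicious.
    public boolean isSuspicious(String head, String rel, String dep, int minCount) {
        return edgeCounts.getOrDefault(key(head, rel, dep), 0) < minCount;
    }
}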
Dependency tree synonym clustering
I had read synonym-clustering papers before; one that left a deep impression is "Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis", i.e. the ESA (Explicit Semantic Analysis) algorithm. The main idea of ESA is to treat each wiki entry as a topic concept, filter the words in the entry's explanatory text using TF-IDF (term frequency-inverse document frequency), and then build an inverted index (word → topics). This yields topic vectors for words, which can be used to compute semantic similarity and thereby cluster synonyms.
That kind of work was a bit beyond what I could complete on my own. Later, while looking at the Selection parallel corpus, I noticed something equally interesting: the dependency edges highlighted in the examples. On a whim it occurred to me that these edges themselves could serve as topics: build an inverted index over them to form topic vectors, which gives a large number of rich raw features, then find a feature-selection algorithm to filter them, and finally perform the synonym clustering (a sketch of the topic-vector idea follows below)...
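Here is a minimal sketch of that topic-vector idea, under the assumption that each dependency edge (or wiki concept, in the original ESA setting) is encoded as a topic string; the raw counts used as weights are a simplification, and TF-IDF-style weighting plus a feature-selection step would replace them in practice.

import java.util.*;

// Sketch: inverted index word -> topic weights, with cosine similarity
// between the resulting topic vectors for synonym clustering.
public class TopicVectors {
    private final Map<String, Map<String, Double>> wordToTopics = new HashMap<>();

    // Record that a word co-occurs with a topic (e.g. a dependency edge).
    public void observe(String word, String topic) {
        wordToTopics.computeIfAbsent(word, k -> new HashMap<>())
                    .merge(topic, 1.0, Double::sum);
    }

    // Cosine similarity between the topic vectors of two words.
    public double cosine(String w1, String w2) {
        Map<String, Double> v1 = wordToTopics.getOrDefault(w1, Collections.emptyMap());
        Map<String, Double> v2 = wordToTopics.getOrDefault(w2, Collections.emptyMap());
        double dot = 0, n1 = 0, n2 = 0;
        for (Map.Entry<String, Double> e : v1.entrySet()) {
            dot += e.getValue() * v2.getOrDefault(e.getKey(), 0.0);
            n1 += e.getValue() * e.getValue();
        }
        for (double x : v2.values()) n2 += x * x;
        return (n1 == 0 || n2 == 0) ? 0 : dot / Math.sqrt(n1 * n2);
    }
}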
References
"A Chinese Automatic Error-Checking Method Based on N-gram and Dependency Analysis" (Ma Jinshan, Liu, Li)
"Language Modeling", lecture 4 of Stanford University's Natural Language Processing course
Natural language processing: making the input method smarter
Analysis of a Baidu written-test question: spelling correction
Analyzing Chinese grammar with the Stanford Parser
Stanford Parser
Using pinyin4j
Chapter 5: N-gram language models
Edit distance and the edit distance algorithm
pinyin4j-2.5.0.jar free download, latest version
Computing semantic similarity with Wikipedia ("multimedia paper reading")
Chinese Grammatical Error Diagnosis (CGED) shared task (ACL-IJCNLP 2015 Workshop)