Pangu Word Segmentation: Chinese Name Recognition
Author: eaglet
In KTDictSeg, eaglet tried to identify Chinese (Han) personal names using rules and statistics, but the results were unsatisfactory. In Pangu word segmentation, eaglet uses a new algorithm to recognize Chinese names, and it works much better than the rule-based and statistical methods. The rest of this article describes how it recognizes Chinese names.
To recognize Chinese names well, the sentence to be segmented is processed in two steps: preprocessing and ambiguity elimination.
Preprocessing
The first step in recognizing Chinese names is preprocessing, which identifies every possible Chinese name in the sentence. The search first matches a surname and, once a surname is found, matches the characters that follow against common single-character and two-character given names. The Pangu word segmentation dictionary directory contains three files, chssinglename.txt, chsdoublename1.txt, and chsdoublename2.txt, which list common single-character names, common first characters of two-character names, and common last characters of two-character names, respectively. The common names in these three files are all we need to complete preprocessing.
For example, take the phrase "Zhang Sanfeng and Li Shimin" (张三丰和李世民).
Here "Zhang" (张) and "Li" (李) are surnames. "San" (三) is both a common single-character name and a common first character of two-character names. "Shi" (世) is a common first character of two-character names, and "Min" (民) and "Feng" (丰) are common last characters of two-character names.
The preprocessing result is:
Zhang San
Zhang Sanfeng
Li Shimin
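The preprocessing step can be sketched roughly as follows. This is only an illustration in Python, not Pangu's actual code: the function name find_candidate_names, the tiny character sets, and the surname list are assumptions made for the example (the article only names the three given-name files), and only single-character surnames are handled.

```python
# Hypothetical stand-ins for the real lists, which are far larger.
SURNAMES = {"张", "李"}       # assumed surname list (not described in the article)
SINGLE_NAMES = {"三"}         # chssinglename.txt: single-character given names
DOUBLE_FIRST = {"三", "世"}   # chsdoublename1.txt: first character of two-character names
DOUBLE_LAST = {"丰", "民"}    # chsdoublename2.txt: last character of two-character names


def find_candidate_names(sentence):
    """Return (start, name) pairs for every possible name in the sentence."""
    candidates = []
    for i, ch in enumerate(sentence):
        if ch not in SURNAMES:
            continue
        # Surname followed by a single-character given name.
        if i + 1 < len(sentence) and sentence[i + 1] in SINGLE_NAMES:
            candidates.append((i, sentence[i:i + 2]))
        # Surname followed by a two-character given name.
        if (i + 2 < len(sentence)
                and sentence[i + 1] in DOUBLE_FIRST
                and sentence[i + 2] in DOUBLE_LAST):
            candidates.append((i, sentence[i:i + 3]))
    return candidates


print(find_candidate_names("张三丰和李世民"))
# → [(0, '张三'), (0, '张三丰'), (4, '李世民')]
```

Note that the sketch deliberately keeps both "张三" and "张三丰": preprocessing only collects candidates, and choosing between them is left to the next step.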
Eliminate ambiguity
Because Chinese is complex, ambiguity still exists after preprocessing. For example:
A sentence whose subject is "Zhang San" (张三) and the sentence "Li San bought a triangular table" (李三买了一张三角桌子) both contain the character sequence "张三". In the first sentence it is a person's name, but in the second it is not: there "张" is the measure word in "一张" and "三" belongs to "三角" (triangle).
Therefore, to improve the recognition rate of Chinese names, eliminating ambiguity is a key step.
eaglet's solution is to collect all the candidate names found during preprocessing together with all the dictionary words that match in the sentence, and then find the combination of words with the smallest gap (the number of characters not covered by any word) and, among those, the fewest words.
Let's take the sentence "Li San bought a triangular table" (李三买了一张三角桌子) as an example.
Matching against the dictionary yields the following words:
三 (three), 买 (buy), 了 (past-tense particle), 一张 (one, with measure word), 三角 (triangle), 桌子 (table)
Adding the names Li San (李三) and Zhang San (张三) found during preprocessing gives the final list of candidate words:
李三, 三, 买, 了, 一张, 张三, 三角, 桌子
Now we can combine these words. The rule is that a combination cannot contain words that overlap. For example, "一张" and "张三" share the character "张", so they cannot appear in the same combination.
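The no-overlap rule can be expressed as a small check. The sketch below is an assumption about how one might write it in Python, not Pangu's code: each word is represented as a hypothetical (start, text) pair, where start is the word's position in the sentence 李三买了一张三角桌子.

```python
from itertools import combinations


def overlaps(a, b):
    """a and b are (start, text) spans; True if they share a character position."""
    a_start, a_end = a[0], a[0] + len(a[1])
    b_start, b_end = b[0], b[0] + len(b[1])
    return a_start < b_end and b_start < a_end


def is_valid_combination(spans):
    """A combination is valid only if no two of its words overlap."""
    return not any(overlaps(a, b) for a, b in combinations(spans, 2))


# Positions in "李三买了一张三角桌子": 一 is at 4, 张 at 5, 三 at 6, 角 at 7.
print(is_valid_combination([(4, "一张"), (6, "三角")]))  # → True
print(is_valid_combination([(4, "一张"), (5, "张三")]))  # → False
```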
After building the combinations, rank them by the rule above (smallest gap first, then fewest words) to find the best match.
We can see that the best combination is
李三 / 买 / 了 / 一张 / 三角 / 桌子 (Li San / bought / [particle] / one / triangle / table)
There are no uncovered characters between the words in this combination, so the gap is 0, and the number of words is 6.
If "Zhang San" (张三) is treated as a word, the combination becomes
李三 / 买 / 了 / [一] / 张三 / [角] / 桌子
This combination has gaps, because the "一" before "张三" and the "角" after it are not in the dictionary. The gap is therefore 1 + 1 = 2, which is larger than the gap of the best combination above, so we do not take this combination.
Let's go one step further. If we add the words "一" (one) and "角" (angle) to the dictionary, will the segmentation go wrong? The answer is no.
After adding "一" and "角", a combination that treats "张三" as one word becomes
李三 / 买 / 了 / 一 / 张三 / 角 / 桌子
Now the gap is 0, but the number of words is 7, which is more than the 6 words of the best combination above, so this combination is still rejected.
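The whole selection can be sketched in a few lines of Python. This is an illustrative reconstruction of the rule described above (smallest gap, then fewest words), not Pangu's implementation: the function best_combination, the recursive search, and the small dictionary are all assumptions made for the example.

```python
from functools import lru_cache


def best_combination(sentence, candidates):
    """Return (gap, word_count, words) for the best non-overlapping combination.

    candidates: dictionary hits plus candidate names (non-empty strings).
    Combinations are ranked by smallest gap, then by fewest words;
    any remaining tie is broken arbitrarily by tuple comparison.
    """
    vocabulary = set(candidates)

    @lru_cache(maxsize=None)
    def solve(i):
        if i == len(sentence):
            return (0, 0, ())
        # Option 1: leave sentence[i] uncovered, which adds 1 to the gap.
        gap, count, words = solve(i + 1)
        best = (gap + 1, count, words)
        # Option 2: take any candidate word that starts at position i.
        for word in vocabulary:
            if sentence.startswith(word, i):
                gap, count, words = solve(i + len(word))
                best = min(best, (gap, count + 1, (word,) + words))
        return best

    return solve(0)


sentence = "李三买了一张三角桌子"
names = {"李三", "张三"}                                   # from preprocessing
dictionary = {"三", "买", "了", "一张", "三角", "桌子"}    # hypothetical dictionary hits

print(best_combination(sentence, dictionary | names))
# → (0, 6, ('李三', '买', '了', '一张', '三角', '桌子'))

# With "一" and "角" added, 李三/买/了/一/张三/角/桌子 becomes possible
# (gap 0, 7 words), but the 6-word combination still wins.
print(best_combination(sentence, dictionary | names | {"一", "角"}))
# → (0, 6, ('李三', '买', '了', '一张', '三角', '桌子'))
```

Because a word starting at position i always ends before the next word begins, the search never produces overlapping words, so the no-overlap rule is satisfied by construction.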
Summary
Described this way, the algorithm is actually very simple, but it took eaglet more than a year to arrive at it. It is a big step forward in recognizing Chinese personal names, but it is not a cure-all. First, we cannot put every name into the name dictionaries, so some uncommon names cannot be recognized. Second, given names that appear without a surname cannot be identified at present. Also, because two-character names are matched through separate lists of first and last characters, more name combinations are covered, but some character pairs that are not names will inevitably be combined as well.
However, technology keeps advancing. eaglet is neither a Chinese language expert nor a specialist in Chinese language research, and offers this only as a starting point. If you have a better method, please share it. If you think the current algorithm still has room for improvement, you are welcome to discuss it.