Pangu Word Segmentation: Introduction to the Chinese Name Recognition Algorithm

Author: eaglet

In KTDictSeg, eaglet tried to recognize Chinese (Han) personal names with a combination of rules and statistics, but the results were unsatisfactory. In Pangu word segmentation, eaglet uses a new algorithm to recognize Chinese names, and it works much better than the rules-plus-statistics approach. The rest of this article describes how it recognizes Chinese names.

To recognize Chinese names well, the sentence to be segmented is processed in two steps: preprocessing and ambiguity elimination.

Preprocessing

The first step in recognizing Chinese names is preprocessing. Preprocessing finds every character sequence in the sentence that could be a Chinese name. The search first matches a surname, and then matches the characters that follow against common single-character and two-character given names. The Pangu word segmentation dictionary directory contains three files, chssinglename.txt, chsdoublename1.txt, and chsdoublename2.txt, which list single-character given names, first characters of two-character given names, and last characters of two-character given names, respectively. Using the common given-name characters in these three files, we can complete the preprocessing step.

For example, take "Zhang Sanfeng and Li Shimin" (张三丰和李世民).

Here "Zhang" (张) and "Li" (李) are surnames; "San" (三) is both a common single-character name and a common first character of a two-character name; "Shi" (世) is a common first character of a two-character name; and "Feng" (丰) and "Min" (民) are common last characters of two-character names.

The preprocessing result is

Zhang San

Zhang Sanfeng

Li Shimin
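
The preprocessing step can be sketched in Python as follows. This is a minimal illustration, not Pangu's actual implementation: the function name find_name_candidates and the small in-memory sets are hypothetical stand-ins for a surname list and the contents of the three name files, and the sketch assumes single-character surnames only.

```python
def find_name_candidates(text, surnames, single, double_first, double_last):
    """Return (start_index, name) for every substring that could be a name.

    All four character sets are illustrative assumptions; a real system
    would load them from dictionary files.
    """
    candidates = []
    for i, ch in enumerate(text):
        if ch not in surnames:
            continue
        # Surname followed by a common single-character given name.
        if i + 1 < len(text) and text[i + 1] in single:
            candidates.append((i, text[i:i + 2]))
        # Surname followed by a two-character given name whose first and
        # last characters are both common in those positions.
        if (i + 2 < len(text)
                and text[i + 1] in double_first
                and text[i + 2] in double_last):
            candidates.append((i, text[i:i + 3]))
    return candidates


# Character sets taken from the example above.
surnames = {"张", "李"}
single = {"三"}
double_first = {"三", "世"}
double_last = {"丰", "民"}

for start, name in find_name_candidates(
        "张三丰和李世民", surnames, single, double_first, double_last):
    print(start, name)
```

With the example sets, the sketch reports 张三 (Zhang San), 张三丰 (Zhang Sanfeng), and 李世民 (Li Shimin), which matches the preprocessing result listed above.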

 

Ambiguity elimination

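Because Chinese is complex, ambiguity remains after preprocessing. For example:

The two sentences "Zhang San" (张三) and "Li San bought a triangular table" (李三买了一张三角桌子) both contain the character sequence "Zhang San" (张三). In the second sentence it arises because the measure word "zhang" (张) is immediately followed by the "san" (三) of "sanjiao" (三角, triangle). In the first sentence Zhang San is a person's name; in the second it is not.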

Therefore, to improve the recognition rate of Chinese names, eliminating ambiguity is a key step.

The solution eaglet proposes is to output all the candidate names found in the preprocessing step together with all the words matched in the dictionary, and then find the word combination with the smallest gap and, among those, the fewest words.

Let's take the sentence "Li San bought a triangular table" (李三买了一张三角桌子) as an example.

Matching against the dictionary yields the following words:

mai (买, buy), le (了), yi zhang (一张, "one" plus a measure word), sanjiao (三角, triangle), zhuozi (桌子, table)

Adding the names Li San (李三) and Zhang San (张三) found during preprocessing, we end up with the following words:

Li San (李三), mai (买), le (了), yi zhang (一张), Zhang San (张三), sanjiao (三角), zhuozi (桌子)

Now we can combine these words. The combination rule is that a group of words must not contain overlapping words. For example, "yi zhang" (一张) and "Zhang San" (张三) share the character "zhang" (张), so they cannot appear in the same group.
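
The no-overlap rule can be checked with a few lines of Python. This is a hedged sketch: representing each word as a (start, length) span of character positions in the sentence 李三买了一张三角桌子 is an assumption made for illustration, and the helper names spans_overlap and is_valid_group are hypothetical.

```python
def spans_overlap(a, b):
    """True if two (start, length) spans share at least one character."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def is_valid_group(spans):
    """True if no two spans in the group overlap.

    After sorting by start position it is enough to compare neighbours,
    provided every span has a positive length.
    """
    ordered = sorted(spans)
    return all(not spans_overlap(x, y) for x, y in zip(ordered, ordered[1:]))


# Positions in 李三买了一张三角桌子: 一张 starts at 4, 张三 starts at 5.
yi_zhang = (4, 2)
zhang_san = (5, 2)
san_jiao = (6, 2)

print(spans_overlap(yi_zhang, zhang_san))   # the two words share 张
print(is_valid_group([yi_zhang, san_jiao]))
```

Here 一张 and 张三 are reported as overlapping, while 一张 and 三角 can sit in the same group.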

After forming the combinations, sort them by the criteria described above (smallest gap first, then fewest words) to find the best-matching combination.

We can see that the best-matching combination is

Li San/mai/le/yi zhang/sanjiao/zhuozi/ (李三/买/了/一张/三角/桌子/)

In this combination there is no gap between the words, that is, the gap is 0, and the number of words is 6.

If "Zhang San" is taken as a word, the combination is

Li San/mai/le/Zhang San/zhuozi/ (李三/买/了/张三/桌子/)

This combination has gaps, because the "yi" (一) in front of "Zhang" and the "jiao" (角) after "San" are not in the dictionary. The gap is therefore 1 + 1 = 2, which is larger than the gap of the best combination above, so we do not take this combination.

Let's go one step further. If we add the single characters "yi" (一) and "jiao" (角) to the dictionary, will the sentence be segmented incorrectly? The answer is no.

After adding "yi" and "jiao", if "Zhang San" is to be split out as a word, the combination is

Li San/mai/le/yi/Zhang San/jiao/zhuozi/ (李三/买/了/一/张三/角/桌子/)

This time the gap is 0, but the number of words is 7, which is more than in the best combination above, so this combination is discarded.
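
The selection rule (smallest gap first, then fewest words) can be sketched in Python. This is an illustration under stated assumptions, not Pangu's code: instead of enumerating and sorting every combination, it uses dynamic programming over character positions, which ranks combinations by the same two criteria. The function name best_combination and the word lists are hypothetical, with the lists chosen to mirror the example sentence.

```python
def best_combination(text, words):
    """Choose non-overlapping words with the smallest gap (uncovered
    characters), breaking ties by the fewest words.

    Returns (gap, word_count, chosen_words).
    """
    n = len(text)
    # Index every occurrence of every candidate word by start position.
    starts = {}
    for w in set(words):
        pos = text.find(w)
        while pos != -1:
            starts.setdefault(pos, []).append(w)
            pos = text.find(w, pos + 1)
    # best[i] holds the best result for the suffix text[i:].
    best = [None] * (n + 1)
    best[n] = (0, 0, [])
    for i in range(n - 1, -1, -1):
        gap, count, chosen = best[i + 1]
        options = [(gap + 1, count, chosen)]  # leave text[i] uncovered
        for w in starts.get(i, []):
            g, c, rest = best[i + len(w)]
            options.append((g, c + 1, [w] + rest))
        best[i] = min(options, key=lambda o: (o[0], o[1]))
    return best[0]


text = "李三买了一张三角桌子"
names = ["李三", "张三"]                      # from preprocessing
dictionary = ["买", "了", "一张", "三角", "桌子"]  # assumed dictionary hits

gap, count, chosen = best_combination(text, dictionary + names)
print(gap, count, "/".join(chosen))  # 0 6 李三/买/了/一张/三角/桌子

# Adding 一 and 角 to the dictionary does not change the winner.
gap, count, chosen = best_combination(text, dictionary + ["一", "角"] + names)
print(gap, count, "/".join(chosen))  # 0 6 李三/买/了/一张/三角/桌子
```

In both runs the combination containing 张三 loses: first on gap (2 versus 0), and after 一 and 角 are added, on word count (7 versus 6).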

Summary 

The Chinese name recognition algorithm in Pangu word segmentation is actually very simple, but it took eaglet more than a year to work it out. The algorithm is a big step forward in recognizing Chinese names, but it is not a cure-all. First, the name dictionaries cannot contain every name, so some uncommon names cannot be recognized. Second, names that appear without a surname cannot currently be recognized: because two-character names are stored as separate first-character and last-character lists, which increases the number of possible combinations, matching given names alone would inevitably treat some non-name text as names.

However, technology keeps advancing. eaglet is neither a Chinese language expert nor a specialist in Chinese language research, and offers this article as a starting point. If you have a better method, please share it; if you think there is still room for improvement in the current algorithm, you are welcome to discuss it.
