Comparison of three Chinese word segmentation algorithms

Last Update:2017-02-27 Source: Internet

Author: User

Chinese word segmentation methods currently fall into three categories: (1) segmentation based on string matching, (2) segmentation based on understanding, and (3) segmentation based on statistics. So far there is no proof that any one method is more accurate than the others; each has its own strengths and its own serious weaknesses. A simple comparison is shown in the following table:

Comparison of the advantages and disadvantages of the word segmentation methods

Segmentation method	Based on string matching	Based on understanding	Based on statistics
Ambiguity recognition	Poor	Strong	Strong
New word recognition	Poor	Strong	Strong
Dictionary required	Yes	No	No
Corpus required	No	No	Yes
Rule base required	No	Yes	No
Algorithmic complexity	Simple	Very complex	Moderate
Technology maturity	Mature	Immature	Mature
Implementation difficulty	Easy	Hard	Moderate
Segmentation accuracy	Fair	Accurate	Fairly accurate
Segmentation speed	Fast	Slow	Moderate

(1) Ambiguity recognition

Ambiguity arises when a string can be segmented in more than one way and the computer has difficulty deciding which segmentation gives the correct word sequence. For example, the string rendered here as "surface" can be split as "surface/" or as "Table/surface", and the computer cannot tell which split is the correct one.

Segmentation based on string matching: the text is only compared against an electronic dictionary, so the algorithm cannot resolve ambiguity;

Segmentation based on understanding: the algorithm interprets the meaning of the string, so it has a strong ability to resolve ambiguity;

Segmentation based on statistics: the word sequence is derived from how often characters occur consecutively. The algorithm can often choose the correct segmentation, but it can also judge wrongly.
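
To make the ambiguity problem concrete, here is a minimal Python sketch of forward and backward maximum matching, two common dictionary-based strategies. The toy dictionary, the sample sentence and the function names are illustrative assumptions and do not come from the article. When the two directions disagree, the string is ambiguous, and a dictionary alone cannot decide which result is right.

```python
def forward_max_match(text, dictionary, max_len=3):
    """Greedy left-to-right matching; falls back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=3):
    """Greedy right-to-left matching; falls back to a single character."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


# Hypothetical toy dictionary and sentence, chosen to trigger a disagreement.
toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"

forward = forward_max_match(sentence, toy_dictionary)
backward = backward_max_match(sentence, toy_dictionary)
print(forward)
print(backward)
print("ambiguous:", forward != backward)
```

In this toy case the forward pass greedily takes the longer word at the start, while the backward pass does not, so the same dictionary yields two different word sequences.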

(2) Recognition of new words

New word recognition, also known as unregistered word recognition, refers to correctly identifying words that do not appear in the dictionary. Personal names, organization names, place names, titles and the like change constantly, so a dictionary often cannot include all of them. In addition, popular expressions from the Internet are a common source of unregistered words; "soy sauce", for example, recently appeared online and quickly became popular, and thus became a new word. A large number of studies have shown that new word recognition is an important factor in the accuracy of Chinese word segmentation.

Segmentation based on string matching: it cannot correctly identify unregistered words, because the algorithm only compares the text with the words in the dictionary;

Segmentation based on understanding: the algorithm interprets the meaning of the string, so it has a strong ability to recognize new words;

Segmentation based on statistics: this algorithm has a strong ability to recognize the second type of unregistered word (popular expressions), because such a string occurs many times and is therefore treated as a new word. The first type of unregistered word follows certain regular patterns, such as a personal name (surname + given name, for example Li Shenli) or an institution (prefix + appellation, for example Hope Group). These patterns have to be recognized through rules of some kind and are difficult to recognize by statistical methods.
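
The frequency argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus, the dictionary, the threshold `min_count` and the function name `candidate_new_words` are all hypothetical, and real systems use more refined statistics than raw counts.

```python
from collections import Counter


def candidate_new_words(corpus, dictionary, min_count=3, max_len=3):
    """Count character n-grams (length 2..max_len) and keep frequent ones
    that are not already in the dictionary."""
    counts = Counter()
    for sentence in corpus:
        for n in range(2, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return {
        gram: count
        for gram, count in counts.items()
        if count >= min_count and gram not in dictionary
    }


# Hypothetical toy corpus in which one undictionaried string keeps recurring.
toy_corpus = ["我去打酱油", "他也去打酱油", "大家都在打酱油", "我去上班"]
toy_dictionary = {"我", "去", "他", "也", "大家", "都", "在", "上班"}

candidates = candidate_new_words(toy_corpus, toy_dictionary)
print(candidates)
```

Note that raw counting also surfaces fragments of the recurring string, which is one reason a purely statistical approach can still judge wrongly.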

(3) Need for a dictionary

Segmentation based on string matching: the basic idea is to compare the text against an electronic dictionary, so the dictionary is necessary. The larger the dictionary, the higher the segmentation accuracy, because a larger dictionary leaves fewer unregistered words, which greatly reduces the errors they cause;

Segmentation based on understanding: the algorithm interprets the meaning of the string, so it does not need an electronic dictionary;

Segmentation based on statistics: the final result is obtained from statistics alone, so an electronic dictionary is not necessary.
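
As a small illustration of why dictionary size matters for the matching approach, the following Python sketch lists the unregistered words of a hand-segmented sentence against two dictionaries. The sentence, both dictionaries and the helper name `unregistered_words` are hypothetical.

```python
def unregistered_words(reference_words, dictionary):
    """Return the reference words that the dictionary does not contain."""
    return [word for word in reference_words if word not in dictionary]


reference = ["我", "去", "打酱油"]  # hand-segmented toy sentence (assumption)
small_dictionary = {"我", "去"}
large_dictionary = small_dictionary | {"打酱油"}

for name, dictionary in [("small", small_dictionary), ("large", large_dictionary)]:
    missing = unregistered_words(reference, dictionary)
    print(name, "unregistered:", missing, "rate:", len(missing) / len(reference))
```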

(4) Need for a corpus

Segmentation based on string matching: the segmentation process only compares the text with an existing electronic dictionary, so it does not need a corpus;

Segmentation based on understanding: the algorithm interprets the meaning of the string, so it does not need a corpus;

Segmentation based on statistics: a corpus is needed for statistical training, so the corpus is necessary, and a good corpus is what guarantees segmentation accuracy.
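
The dependence on a training corpus can be sketched as follows. The article does not say which statistical model it has in mind, so this example assumes a simple unigram word-frequency model with add-one smoothing and a dynamic-programming search; the toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(segmented_corpus):
    """Count word frequencies in a corpus that is already segmented."""
    counts = Counter(word for sentence in segmented_corpus for word in sentence)
    return counts, sum(counts.values())


def segment(text, counts, total, max_len=4):
    """Pick the segmentation with the highest unigram log-probability."""
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            count = counts.get(word, 0)
            if count == 0 and len(word) > 1:
                continue  # unseen multi-character strings are not candidates
            # Add-one smoothing keeps unseen single characters possible.
            prob = (count + 1) / (total + len(counts) + 1)
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, start)
    words, end = [], n
    while end > 0:  # walk back through the stored start indices
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical pre-segmented training corpus.
toy_corpus = [
    ["研究", "生命", "的", "起源"],
    ["研究", "生命", "科学"],
    ["研究生", "的", "生活"],
    ["生命", "的", "意义"],
]
counts, total = train_unigram(toy_corpus)
result = segment("研究生命的起源", counts, total)
print(result)
```

Changing the word counts in the training corpus can change the chosen segmentation, which is why the quality of the corpus matters so much for this class of methods.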

(5) Need for a rule base

Segmentation based on string matching: the segmentation process only compares the text with an existing electronic dictionary, so no rule base is needed;

Segmentation based on understanding: rules are the basis of computer understanding, so an accurate and complete rule base is a prerequisite for this algorithm;

Segmentation based on statistics: the algorithm is trained on corpus statistics, so a rule base is not necessary.

(6) Algorithm complexity

Segmentation based on string matching: it involves only string comparison operations, so the algorithm is simple;

Segmentation based on understanding: it has to handle a wide variety of rules, so the algorithm is very complex; in fact, no mature algorithm of this kind exists so far;

Segmentation based on statistics: it requires training on a corpus. The algorithm is also fairly complex, but it is already in common use, so it is more complex than the first method and simpler than the second. Practical word segmentation systems today generally adopt this kind of algorithm.

(7) Maturity of technology

Segmentation based on string matching: it is the earliest and most mature class of algorithms;

Segmentation based on understanding: it is the least mature class of algorithms, and so far no mature algorithm exists;

Segmentation based on statistics: there are many mature algorithms, which can basically meet the needs of practical applications.

So, in terms of technology maturity, from most to least mature: segmentation based on string matching, segmentation based on statistics, segmentation based on understanding.

(8) Implementation complexity

For the same reasons as above, implementation difficulty from hardest to easiest is: segmentation based on understanding, segmentation based on statistics, segmentation based on string matching.

(9) Word segmentation accuracy

So far there is no definitive conclusion. In theory, segmentation based on understanding has the highest accuracy and could reach 100%. Segmentation based on string matching and segmentation based on statistics are both "shallow understanding" methods that do not involve real understanding of meaning, so they can make mistakes and are unlikely to reach 100% accuracy.

(10) Segmentation speed

Segmentation based on string matching: the algorithm is simple and cheap to run, so segmentation is fast. For this reason it is often used as a preprocessing step for the other two algorithms, to produce a coarse segmentation of the string;

Segmentation based on understanding: this algorithm often has to work through a huge rule base, so it is the slowest;

Segmentation based on statistics: this algorithm only compares the text against precomputed statistical results, so its speed is moderate.

Therefore, in general, segmentation speed from fastest to slowest is: segmentation based on string matching, segmentation based on statistics, segmentation based on understanding.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
