Comparison of three Chinese word segmentation algorithms

Source: Internet
Author: User
Tags comparison

Chinese word segmentation methods currently fall into three classes: (1) segmentation based on string matching, (2) segmentation based on understanding, and (3) segmentation based on statistics. No one has yet shown that any one of them is the most accurate; each has its own strengths and its own serious weaknesses. The table below gives a simple comparison:

Comparison of the advantages and disadvantages of the segmentation methods

Criterion                  String matching   Understanding   Statistics
Ambiguity recognition      Poor              Strong          Strong
New word recognition       Poor              Strong          Strong
Dictionary required        Yes               No              No
Corpus required            No                No              Yes
Rule base required         No                Yes             No
Algorithm complexity       Simple            Complex         Moderate
Technical maturity         Mature            Immature        Mature
Implementation difficulty  Easy              Hard            Moderate
Segmentation accuracy      Moderate          Accurate        Fairly accurate
Segmentation speed         Fast              Slow            Moderate

(1) Ambiguity recognition

Ambiguity recognition concerns strings that can be segmented in more than one way, where it is hard for the computer to decide which segmentation is correct. For example, "表面的" ("of the surface") can be split as "表面/的" or as "表/面的", and the computer cannot tell which split is right.

String-matching segmentation: it only compares the text against an electronic dictionary, so it cannot resolve ambiguity;

Understanding-based segmentation: it works out the meaning of the string, so it has a strong ability to resolve ambiguity;

Statistics-based segmentation: it derives the word sequence from how often characters occur together, so it can often choose the correct segmentation, though it can also choose wrongly.
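
To make the ambiguity concrete, here is a minimal Python sketch of two common string-matching strategies, forward and backward maximum matching, run over a toy dictionary. The dictionary contents, the function names, and the choice of 表面的 as the ambiguous string are assumptions made for illustration; they are not taken from any particular segmenter.

```python
# Toy dictionary (hypothetical); a real one would hold many thousands of words.
TOY_DICT = {"表面", "面的", "表", "的"}


def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


print(forward_max_match("表面的", TOY_DICT))   # ['表面', '的']
print(backward_max_match("表面的", TOY_DICT))  # ['表', '面的']
```

The two scans disagree on the same string, and a pure dictionary lookup has no information with which to choose between them. That is the sense in which string matching cannot resolve ambiguity.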

(2) Recognition of new words

New word recognition, also called unregistered (out-of-vocabulary) word recognition, means correctly identifying words that do not appear in the dictionary. Personal names, organization names, place names, and titles vary endlessly, and a dictionary can rarely include them all. Internet slang is a second common source of unregistered words: "soy sauce" (打酱油), for example, appeared online, spread quickly, and so became a new word. Many studies have shown that new word recognition is an important factor in the accuracy of Chinese word segmentation.

String-matching segmentation: it cannot correctly identify unregistered words, because it only compares the text against the words in the dictionary;

Understanding-based segmentation: it works out the meaning of the string, so it has a strong ability to identify new words;

Statistics-based segmentation: it is strong on the second kind of unregistered word (popular new expressions), because a string that occurs often enough is treated as a new word. The first kind follows regular patterns: a personal name is surname + given name, such as Li Shenli, and an institution is prefix + designation, such as Hope Group. Such words therefore need to be recognized with rules, and are hard to recognize by statistics alone.
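
The frequency idea can be sketched as follows: count every character n-gram in raw text and keep those that occur often but are missing from the dictionary. This is a deliberately naive illustration, not a production new-word detector. The function name, the threshold, and the toy sentences (built around 打酱油, assumed here to be the slang term the text refers to) are all hypothetical.

```python
from collections import Counter


def frequent_unknown_ngrams(sentences, dictionary, n=3, min_count=3):
    """Return n-grams seen at least min_count times that are not in the dictionary."""
    counts = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return {
        gram: count
        for gram, count in counts.items()
        if count >= min_count and gram not in dictionary
    }


# Unsegmented toy text (hypothetical); the dictionary lacks the new expression.
raw_text = ["我是来打酱油的", "他只是打酱油", "打酱油路过"]
known_words = {"酱油", "路过", "只是"}

print(frequent_unknown_ngrams(raw_text, known_words))  # {'打酱油': 3}
```

A string that is frequent enough surfaces as a candidate new word. A personal name that appears once or twice never reaches the threshold, which is why the patterned kind of unregistered word is hard to catch this way.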

(3) Dictionary requirement

String-matching segmentation: the basic idea is to compare the text against an electronic dictionary, so the dictionary is required. The larger the dictionary, the higher the segmentation accuracy, because a larger dictionary leaves fewer unregistered words and so greatly reduces the errors they cause;

Understanding-based segmentation: it works out the meaning of the string, so it does not need an electronic dictionary;

Statistics-based segmentation: it reaches its result from statistics alone, so an electronic dictionary is not required.

(4) Corpus requirement

String-matching segmentation: the segmentation process only compares the text against an existing electronic dictionary, so it does not need a corpus;

Understanding-based segmentation: it works out the meaning of the string, so it does not need a corpus;

Statistics-based segmentation: it needs a corpus for statistical training, so a corpus is required, and a good corpus is what guarantees segmentation accuracy.
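
As a rough illustration of why the corpus matters, the sketch below trains a unigram word model from a tiny hand-segmented corpus and then picks the segmentation with the highest total log-probability by dynamic programming. The unigram model is only one simple statistical approach, chosen here for brevity. The corpus, the function names, and the penalty for unseen characters are assumptions, not details from the article.

```python
import math
from collections import Counter


def train_unigram(segmented_sentences):
    """Estimate log P(word) from a corpus that is already split into words."""
    counts = Counter(w for sentence in segmented_sentences for w in sentence)
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}


def segment(text, logp, max_len=4, unknown_logp=-20.0):
    """Return the split of text that maximizes the sum of word log-probabilities."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp:
                score = logp[piece]
            elif len(piece) == 1:
                score = unknown_logp  # assumed penalty for an unseen character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical hand-segmented training corpus.
corpus = [
    ["表面", "的", "灰尘"],
    ["表面", "的", "颜色"],
    ["坐", "面的", "回家"],
    ["表", "走", "得", "很", "准"],
]
model = train_unigram(corpus)
print(segment("表面的", model))  # ['表面', '的']
```

Both candidate splits are made of words seen in training, and the model prefers 表面/的 only because those words are more frequent in this corpus. A corpus with different counts could flip the choice, which is why the quality of the corpus decides the quality of the result.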

(5) Rule base requirement

String-matching segmentation: the segmentation process only compares the text against an existing electronic dictionary, so no rule base is needed;

Understanding-based segmentation: rules are the basis of machine understanding, so an accurate and complete rule base is a precondition for this algorithm;

Statistics-based segmentation: it is trained on corpus statistics, so a rule base is not required.
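
As a small illustration of what one entry in a rule base might look like, the sketch below encodes the "surname + given name" pattern mentioned earlier. It is a toy under stated assumptions: the surname list is a three-character placeholder, the function name is hypothetical, and 李胜利 is an assumed rendering of the example name in the text.

```python
# Hypothetical, tiny surname list; a real rule base would be far larger.
SURNAMES = {"李", "王", "张"}


def name_candidates(text, max_given_len=2):
    """Return (start, candidate) pairs matching: surname + 1..max_given_len characters."""
    found = []
    for i, ch in enumerate(text):
        if ch in SURNAMES:
            for k in range(1, max_given_len + 1):
                if i + 1 + k <= len(text):
                    found.append((i, text[i:i + 1 + k]))
    return found


print(name_candidates("李胜利来了"))  # [(0, '李胜'), (0, '李胜利')]
```

Even this single rule proposes two candidates for one name, so further rules or context are needed to choose between them. That gives a sense of why a complete and accurate rule base is hard to build.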

(6) Algorithm complexity

String-matching segmentation: it only performs string comparisons, so the algorithm is simple;

Understanding-based segmentation: it has to handle a wide variety of rules, so the algorithm is very complex; in fact, no mature algorithm of this kind exists so far;

Statistics-based segmentation: it needs corpus training, and although the algorithm is also fairly complex, it is already in common use. Its complexity is higher than that of the first method and lower than that of the second. Practical word segmentation systems today mostly adopt this kind of algorithm.

(7) Technical maturity

String-matching segmentation: it is the earliest and most mature class of algorithm;

Understanding-based segmentation: it is the least mature class of algorithm; so far no mature algorithm exists;

Statistics-based segmentation: several mature algorithms exist, and they can largely meet the needs of practical applications.

So, in order of technical maturity: string-matching segmentation > statistics-based segmentation > understanding-based segmentation.

(8) Implementation difficulty

For the same reasons as above, implementation difficulty ranks, from hardest to easiest: understanding-based segmentation > statistics-based segmentation > string-matching segmentation.

(9) Word segmentation accuracy

No firm conclusion has been reached so far. In theory, understanding-based segmentation has the highest accuracy and could reach 100%. String-matching and statistics-based segmentation are both "shallow understanding" methods that do not grasp the real meaning of the text, so they can make mistakes and are unlikely to reach 100% accuracy.

(10) Segmentation speed

String-matching segmentation: the algorithm is simple and cheap to run, so segmentation is fast. For this reason it is often used as a preprocessing step for the other two algorithms, producing a coarse first segmentation of the string;

Understanding-based segmentation: it often has to work through a huge rule base, so it is the slowest;

Statistics-based segmentation: at segmentation time it only compares the text against the trained statistics, so its speed is moderate.

So, in general, segmentation speed ranks from fastest to slowest: string-matching segmentation > statistics-based segmentation > understanding-based segmentation.
