Chinese word segmentation methods currently fall into three families: (1) segmentation based on string matching, (2) segmentation based on understanding, and (3) segmentation based on statistics. So far nobody has shown that one family is consistently more accurate than the others; each has its own strengths and its own serious weaknesses. The table below gives a simple comparison:
Comparison of the advantages and disadvantages of the word segmentation methods
| Word segmentation method | Based on string matching | Based on understanding | Based on statistics |
| --- | --- | --- | --- |
| Ambiguity recognition | Poor | Strong | Strong |
| New word recognition | Poor | Strong | Strong |
| Dictionary required | Yes | No | No |
| Corpus required | No | No | Yes |
| Rule base required | No | Yes | No |
| Algorithmic complexity | Simple | Very complex | Moderate |
| Technical maturity | Mature | Immature | Mature |
| Implementation difficulty | Easy | Hard | Moderate |
| Segmentation accuracy | Moderate | Accurate | Fairly accurate |
| Segmentation speed | Fast | Slow | Moderate |
(1) Ambiguity recognition
Ambiguity arises when a string can be segmented in more than one way and the computer has difficulty deciding which segmentation is the correct word sequence. For example, "表面的" can be segmented as "表面/的" or as "表/面的", and the computer cannot tell on its own which one is correct.
Segmentation algorithm based on string matching: it only compares the text against an electronic dictionary, so it cannot resolve ambiguity.
Segmentation algorithm based on understanding: it works from the meaning of the string, so it has a strong ability to resolve ambiguity.
Segmentation algorithm based on statistics: it derives the word sequence from how often characters occur next to each other, so it can often choose the correct segmentation, but it can also choose wrongly.
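The limitation of dictionary matching can be made concrete with a minimal sketch. It assumes the ambiguous string is "表面的" and uses a hypothetical four-entry dictionary; the function names are illustrative and not taken from any particular library. Scanning left to right and right to left with the same dictionary yields two different segmentations, and nothing in the dictionary says which one is right.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


# Hypothetical toy dictionary, chosen only to trigger the ambiguity.
toy_dict = {"表面", "面的", "表", "的"}
text = "表面的"

fwd = forward_max_match(text, toy_dict)
bwd = backward_max_match(text, toy_dict)
print(fwd)  # ['表面', '的']
print(bwd)  # ['表', '面的']
print("directions disagree:", fwd != bwd)
```

A disagreement between the two directions can flag that an ambiguity exists, but choosing between the candidates still needs information from outside the dictionary.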
(2) Recognition of new words
New word recognition, also called unregistered word recognition, means correctly identifying words that do not appear in the dictionary. Personal names, organization names, addresses, titles, and so on vary endlessly, and a dictionary can rarely include them all. Popular expressions from the Internet are another common source of unregistered words, such as "soy sauce", which recently appeared online, quickly became popular, and so became a new word. Many studies have shown that new word recognition is an important factor in the accuracy of Chinese word segmentation.
Segmentation algorithm based on string matching: it cannot correctly identify unregistered words, because it only compares the text against the words in the dictionary.
Segmentation algorithm based on understanding: it works from the meaning of the string, so it has a strong ability to identify new words.
Segmentation algorithm based on statistics: it recognizes the second type of unregistered word (newly popular expressions) well, because a string that occurs often enough is treated as a new word. The first type (names of people and organizations) follows certain patterns, such as personal name = surname + given name (for example, Li Shenli) or organization = prefix + designation (for example, Hope Group). These patterns have to be captured by rules of some kind and are difficult to recognize by statistical methods alone.
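As a rough illustration of how frequency alone can surface a popular new expression, the sketch below counts fixed-length character sequences in a corpus and keeps those that occur often but are missing from the dictionary. The corpus, the dictionary, the threshold, and the function name `candidate_new_words` are all invented for this example; real systems use more refined measures than a raw count.

```python
from collections import Counter


def candidate_new_words(sentences, dictionary, n=3, min_count=3):
    """Return n-character strings that are frequent but not in the dictionary."""
    counts = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return {
        gram: count
        for gram, count in counts.items()
        if count >= min_count and gram not in dictionary
    }


# Invented toy corpus and dictionary (assumptions for illustration only).
corpus = ["我去打酱油", "他也打酱油", "打酱油的人很多"]
dictionary = {"我", "去", "他", "也", "的", "人", "很多", "酱油"}

candidates = candidate_new_words(corpus, dictionary)
print(candidates)  # {'打酱油': 3}
```

A personal name that appears only once in the corpus would never reach the threshold, which is one way to see why the pattern-based type of unregistered word is hard for purely statistical methods.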
(3) Dictionary required
Segmentation algorithm based on string matching: its basic idea is comparison against an electronic dictionary, so a dictionary is essential. The larger the dictionary, the higher the segmentation accuracy, because a larger dictionary leaves fewer unregistered words, which greatly reduces the errors caused by failing to recognize them.
Segmentation algorithm based on understanding: it works from the meaning of the string, so it does not need an electronic dictionary.
Segmentation algorithm based on statistics: it obtains its result from statistics alone, so an electronic dictionary is not required.
(4) Corpus required
Segmentation algorithm based on string matching: the segmentation process only compares the text against an existing electronic dictionary, so it does not need a corpus.
Segmentation algorithm based on understanding: it works from the meaning of the string, so it does not need a corpus.
Segmentation algorithm based on statistics: it needs a corpus for statistical training, so a corpus is essential, and a good corpus is what guarantees segmentation accuracy.
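To show what "statistical training on a corpus" can look like in its simplest form, here is a sketch of a unigram model: word probabilities are estimated from a small pre-segmented corpus, and dynamic programming then picks the split with the highest probability product. This is one possible statistical approach, not necessarily the one the text has in mind; the corpus, the `unknown_prob` fallback, and all names are assumptions made for the example.

```python
import math
from collections import Counter


def train_unigram(segmented_corpus):
    """Estimate word probabilities from sentences already split into words."""
    counts = Counter(word for sentence in segmented_corpus for word in sentence)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def segment(text, probs, max_len=4, unknown_prob=1e-8):
    """Return the split of text with the highest product of word probabilities."""
    # best[i] holds (log-probability, start of last word) for text[:i].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in probs:
                p = probs[word]
            elif len(word) == 1:
                p = unknown_prob  # unseen single character: tiny fallback
            else:
                continue  # unseen multi-character strings are not proposed
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start positions to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Invented, hand-segmented toy corpus (an assumption for illustration).
training = [
    ["表面", "的", "颜色"],
    ["桌子", "的", "表面"],
    ["表", "坏", "了"],
    ["面的", "来", "了"],
]
probs = train_unigram(training)
print(segment("表面的", probs))  # ['表面', '的']
```

With this corpus the model prefers "表面/的" because both words are more frequent than "表" and "面的"; a different corpus could tip the choice the other way, which is why corpus quality matters so much.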
(5) Rule base required
Segmentation algorithm based on string matching: the segmentation process only compares the text against an existing electronic dictionary, so no rule base is needed.
Segmentation algorithm based on understanding: rules are the basis of the computer's understanding, so an accurate and complete rule base is a prerequisite for this algorithm.
Segmentation algorithm based on statistics: it is trained on corpus statistics, so a rule base is not required.
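For a sense of what a single entry in such a rule base might look like, the sketch below encodes the "surname + given name" pattern mentioned earlier for personal names. The three-entry surname list is hypothetical, the function name is illustrative, and the example name is written as 李胜利, which is only an assumed rendering of the "Li Shenli" in the text.

```python
# Hypothetical, deliberately tiny surname list (assumption for illustration).
SURNAMES = {"李", "王", "张"}


def find_name_candidates(text, surnames=SURNAMES, given_lengths=(1, 2)):
    """Apply the rule: a surname followed by a 1- or 2-character given name."""
    candidates = []
    for i, ch in enumerate(text):
        if ch in surnames:
            for n in given_lengths:
                if i + 1 + n <= len(text):
                    candidates.append((i, text[i:i + 1 + n]))
    return candidates


names = find_name_candidates("李胜利来了")
print(names)  # [(0, '李胜'), (0, '李胜利')]
```

The rule proposes both a two-character and a three-character candidate and cannot choose between them by itself, which hints at why a rule base has to be both accurate and complete before it is useful.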
(6) Algorithm complexity
Segmentation algorithm based on string matching: it involves only string comparison, so the algorithm is simple.
Segmentation algorithm based on understanding: it has to handle a wide variety of rules, so the algorithm is very complex; in fact, no mature algorithm of this kind exists so far.
Segmentation algorithm based on statistics: it requires training on a corpus. The algorithms are fairly complex but already in common use, so this approach is more complex than the first and simpler than the second. Practical word segmentation systems today adopt this kind of algorithm.
(7) Maturity of technology
Segmentation algorithm based on string matching: it is the earliest and most mature kind of algorithm.
Segmentation algorithm based on understanding: it is the least mature class of algorithms; so far no mature algorithm exists.
Segmentation algorithm based on statistics: there are many mature algorithms, which can largely meet the needs of practical applications.
So, in order of technical maturity: string matching > statistics > understanding.
(8) Implementation complexity
For the same reasons as above, in order of implementation complexity: understanding > statistics > string matching.
(9) Word segmentation accuracy
So far there is no definitive conclusion. In theory, segmentation based on understanding has the highest accuracy and could reach 100%. Segmentation based on string matching and segmentation based on statistics are both "shallow understanding" methods: they do not involve real understanding of meaning, so they can make mistakes and are unlikely to reach 100% accuracy.
(10) Segmentation speed
Segmentation algorithm based on string matching: the algorithm is simple and its operations are cheap, so it is fast. For this reason it is often used as a preprocessing step for the other two algorithms, to produce a coarse segmentation of the string.
Segmentation algorithm based on understanding: it often has to work through a huge rule base, so it is the slowest.
Segmentation algorithm based on statistics: it only compares the text against statistical results, so its speed is moderate.
Therefore, from fastest to slowest, segmentation speed generally ranks as: string matching > statistics > understanding.