Comparison and ideas of common algorithms in Word Segmentation

Last Update:2018-12-07 Source: Internet

Author: User

Compared with comprehension-based and statistics-based word segmentation algorithms, algorithms based on text matching are more widely used. A text matching algorithm, also called a "mechanical word segmentation algorithm", matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a fixed strategy. If a string is found in the dictionary, the match succeeds and a word is identified. By scanning direction, text matching segmentation is divided into forward matching and reverse matching. By which length is tried first, it is divided into maximum (longest) matching and minimum (shortest) matching. By whether it is combined with part-of-speech tagging, it is divided into pure segmentation methods and integrated methods that combine segmentation with tagging.

Several common mechanical word segmentation methods are as follows:

1) Forward maximum matching (from left to right)

2) Reverse maximum matching (from right to left)

3) Minimum segmentation (each sentence is cut into the smallest possible number of words)

Other word segmentation algorithms combine the methods above. For example, the forward and reverse maximum matching methods can be combined into a bidirectional matching method. Because a single Chinese character can be a word on its own, forward minimum matching and reverse minimum matching are rarely used. This article focuses on the forward and reverse maximum matching methods.

The accuracy of a mechanical word segmentation algorithm depends on the matching strategy and on the completeness of the dictionary. In this article, we assume that the dictionary is sufficiently large and contains the words needed.

Generally, the segmentation accuracy of reverse matching is slightly higher than that of forward matching, and it produces fewer ambiguities. Statistical results show that the error rate of forward matching is 1/169 and the error rate of reverse matching is 1/245. However, this accuracy is still far from meeting practical needs. Real word segmentation systems use mechanical word segmentation as a preliminary segmentation step and then use other linguistic information to further improve accuracy.

Let's take a look at two Chinese sentences:

1) 长春市长春节讲话 (Spring Festival speech by the mayor of Changchun)

2) 长春市长春药店 (Changchun Pharmacy, Changchun City)

Assume that the lexicon contains the following words: "长春" (Changchun), "长春市" (Changchun City), "市长" (mayor), "春节" (Spring Festival), "讲话" (speech), "春药" (literally "spring medicine"), "药店" (pharmacy), "春药店" ("Spring Pharmacy"), and so on.

The result obtained by forward maximum matching is:

长春市/长春/节/讲话 (Changchun City/Changchun/festival/speech: four pieces, of which "节" is not matched in the lexicon, and the meaning is wrong)

长春市/长春/药店 (Changchun City/Changchun/pharmacy: three words, all matched, correct meaning)

The result obtained by reverse maximum matching is:

长春/市长/春节/讲话 (Changchun/mayor/Spring Festival/speech: four words, all matched, correct meaning)

长春/市长/春药店 (Changchun/mayor/Spring Pharmacy: three words, all matched, but the meaning is wrong)

These results show the strengths and weaknesses of the forward and reverse maximum matching methods: each segments some Chinese strings correctly and fails on others.
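The four results above can be reproduced with a minimal Python sketch. It assumes the two sentences are written 长春市长春节讲话 and 长春市长春药店 and that the lexicon holds the eight words listed above in Chinese. The function names and the single-character fallback for unmatched text are illustrative choices, not something the article specifies.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Scan left to right, always taking the longest lexicon entry."""
    max_len = max_len or max(map(len, lexicon))
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character is the fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=None):
    """Scan right to left, always taking the longest lexicon entry."""
    max_len = max_len or max(map(len, lexicon))
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()  # pieces were collected from the end of the sentence
    return words


# Lexicon assumed from the example in the text.
LEXICON = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}

for sentence in ("长春市长春节讲话", "长春市长春药店"):
    print("forward:", "/".join(forward_max_match(sentence, LEXICON)))
    print("reverse:", "/".join(reverse_max_match(sentence, LEXICON)))
```

Both functions cap the candidate length at the longest lexicon entry (three characters here), so each step tries at most three substrings.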

Can we combine these two matching methods to get the strengths of both? The answer is yes.

First, we segment the same sentence with both the forward and the reverse maximum matching methods and then compare the results. For example, forward maximum matching leaves an unmatched piece in the first sentence, so the reverse maximum matching result is taken.
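This comparison step could be sketched as follows. The candidate segmentations are passed in as plain lists, the helper names are hypothetical, and returning `None` on a tie is an assumption made here to signal that a further criterion is needed.

```python
def unmatched_count(words, lexicon):
    """Number of pieces that are not lexicon entries."""
    return sum(1 for word in words if word not in lexicon)


def pick_by_coverage(forward, reverse, lexicon):
    """Return the candidate with fewer unmatched pieces, or None on a tie."""
    f_miss = unmatched_count(forward, lexicon)
    r_miss = unmatched_count(reverse, lexicon)
    if f_miss == r_miss:
        return None  # coverage alone cannot decide between the two
    return forward if f_miss < r_miss else reverse


# Lexicon and candidates assumed from the example in the text.
LEXICON = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}
forward = ["长春市", "长春", "节", "讲话"]   # "节" is not in the lexicon
reverse = ["长春", "市长", "春节", "讲话"]

print("/".join(pick_by_coverage(forward, reverse, LEXICON)))
# prints 长春/市长/春节/讲话
```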

Second, we can introduce word frequency: each word is assigned a frequency value based on how often it appears in Chinese text. Segmenting the second sentence with both methods gives two fully matched results, but the word "Spring Pharmacy" produced by reverse maximum matching has a much lower frequency than the other words. We can therefore treat the reverse result as unlikely and take the forward maximum matching result instead.
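One simple way to turn this idea into a score is to sum the log frequencies of the words in each candidate and keep the higher total. The sketch below does that. The frequency values are invented purely for illustration, and the scoring rule is an assumption, since the article does not say how frequencies should be combined.

```python
import math

# Hypothetical relative frequencies, invented for illustration only.
WORD_FREQ = {
    "长春": 5e-5,
    "长春市": 2e-5,
    "市长": 4e-5,
    "药店": 3e-5,
    "春药店": 1e-8,  # assumed to be far rarer than the other words
}
UNSEEN_FREQ = 1e-10  # assumed floor for words missing from the table


def frequency_score(words, freq=WORD_FREQ):
    """Sum of log frequencies; a higher score means a more plausible split."""
    return sum(math.log(freq.get(word, UNSEEN_FREQ)) for word in words)


def pick_by_frequency(candidates, freq=WORD_FREQ):
    """Return the candidate segmentation with the highest frequency score."""
    return max(candidates, key=lambda words: frequency_score(words, freq))


forward = ["长春市", "长春", "药店"]
reverse = ["长春", "市长", "春药店"]
print("/".join(pick_by_frequency([forward, reverse])))
# prints 长春市/长春/药店
```

Summing logs instead of multiplying raw frequencies avoids underflow on long sentences. Because both candidates here contain three words, the comparison is not biased by length.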

Of course, other methods (such as the scanning mark method and the part-of-speech check method) can also be combined with these two matching methods to achieve more accurate word segmentation.

This article is an English version of an article originally published in Chinese on aliyun.com.
