Common word segmentation algorithms: a comparison and some ideas

Last Update:2017-01-19 Source: Internet

Author: User

Compared with understanding-based and statistics-based segmentation, segmentation based on text matching is the most widely applicable. It is also called "mechanical segmentation": following some strategy, the Chinese character string to be analyzed is matched against the entries of a "sufficiently large" machine dictionary, and whenever a substring is found in the dictionary the match succeeds and a word is identified. Text-matching methods can be classified in three ways. By scanning direction, they are either forward matching or reverse matching. By which length is tried first, they are either maximum (longest) matching or minimum (shortest) matching. By whether segmentation is combined with part-of-speech (POS) tagging, they are either pure segmentation methods or integrated methods that segment and tag together.

Several commonly used mechanical segmentation methods are as follows:

1) Forward maximum matching (scanning from left to right)

2) Reverse maximum matching (scanning from right to left)

3) Minimum segmentation (cut each sentence into the smallest possible number of words)
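
Minimum segmentation can be sketched with dynamic programming over prefix lengths. The following is a minimal sketch rather than a reference implementation: the function name `fewest_words`, the use of a plain Python set as the dictionary, the single-character fallback for unknown text, and the sample Chinese string are all assumptions made for illustration.

```python
def fewest_words(text, dictionary):
    """Split text into as few tokens as possible.

    A token is either a dictionary word or, as a fallback, a single
    character, so every input can be segmented.
    """
    n = len(text)
    max_len = max(map(len, dictionary), default=1)
    inf = float("inf")
    best = [0] + [inf] * n   # best[i]: fewest tokens covering text[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last token in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if (piece in dictionary or i - j == 1) and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = j
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"长春", "长春市", "市长", "春药", "药店", "春药店"}
result = fewest_words("长春市长春药店", TOY_DICTIONARY)
print(len(result), "/".join(result))
```

On this input more than one three-word split exists, so the fewest-words criterion alone does not decide between them; the sketch simply keeps the first one it finds.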

These methods can also be combined with one another: for example, forward maximum matching and reverse maximum matching together form a bidirectional matching method. Because of the characteristics of Chinese words, forward minimum matching and reverse minimum matching are seldom used. This article concentrates on forward maximum matching and reverse maximum matching.

The accuracy of mechanical segmentation depends on both the algorithm itself and the completeness of the dictionary. This article assumes that the dictionary is large enough to contain every word needed.

Generally speaking, reverse matching segments slightly more accurately than forward matching and produces fewer ambiguities. Reported statistics give an error rate of 1/169 (about 0.59%) for forward maximum matching used alone, and 1/245 (about 0.41%) for reverse maximum matching used alone. Even so, this accuracy falls well short of practical needs. Real segmentation systems use mechanical segmentation as a first pass and then draw on various other kinds of linguistic information to improve accuracy further.

Consider two sentences in Chinese:

1) 长春市长春节讲话 (Changchun Mayor's Spring Festival speech)

2) 长春市长春药店 (Changchun City Changchun Pharmacy)

Assume the dictionary contains the words 长春 (Changchun), 长春市 (Changchun City), 市长 (mayor), 春节 (Spring Festival), 讲话 (speech), 春药 (aphrodisiac), 药店 (pharmacy), 春药店 ("Spring Pharmacy", literally an aphrodisiac shop), and so on.

The results obtained with forward maximum matching are:

长春市/长春/节/讲话 (4 words; 节 matches no dictionary entry; semantically wrong)

长春市/长春/药店 (3 words, all matched; semantically correct)

The results obtained with reverse maximum matching are:

长春/市长/春节/讲话 (4 words, all matched; semantically correct)

长春/市长/春药店 (3 words, all matched; semantically wrong)
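
The four results above can be reproduced with a short sketch of the two scanning strategies. The function names, the single-character fallback for unmatched text, and the Chinese strings (reconstructed here from the English glosses of the two sentences and the dictionary) are assumptions for illustration, not the code of any particular segmenter.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Scan left to right, always taking the longest dictionary word."""
    max_len = max_len or max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink it; a single character
        # is always accepted so the scan cannot get stuck.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, dictionary, max_len=None):
    """Scan right to left, always taking the longest dictionary word."""
    max_len = max_len or max(map(len, dictionary))
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]  # tokens were collected from the end backwards


# Assumed dictionary, matching the word list given in the text.
DICTIONARY = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}

for sentence in ("长春市长春节讲话", "长春市长春药店"):
    print("forward:", "/".join(forward_max_match(sentence, DICTIONARY)))
    print("reverse:", "/".join(reverse_max_match(sentence, DICTIONARY)))
# forward: 长春市/长春/节/讲话
# reverse: 长春/市长/春节/讲话
# forward: 长春市/长春/药店
# reverse: 长春/市长/春药店
```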

This shows the strengths and weaknesses of forward and reverse maximum matching: each segments some Chinese text correctly, and each fails on other text.

Can the two matching methods be combined so that each covers the other's weaknesses? The answer is yes.

First, segment the text with both forward maximum matching and reverse maximum matching, then compare the results. For the first sentence (the mayor's Spring Festival speech), the forward result contains a character that matches no dictionary word, so the reverse result is chosen.
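
This first rule can be sketched as a comparison of how many tokens in each candidate are missing from the dictionary. The helper names `unmatched` and `pick_by_coverage`, the choice to return `None` on a tie, and the hard-coded token lists are assumptions for illustration.

```python
def unmatched(tokens, dictionary):
    """Return the tokens that are not dictionary words."""
    return [t for t in tokens if t not in dictionary]


def pick_by_coverage(forward, reverse, dictionary):
    """Prefer the candidate with fewer unmatched tokens.

    Returns None when both candidates are equally well covered, which
    signals that another criterion is needed.
    """
    f = len(unmatched(forward, dictionary))
    r = len(unmatched(reverse, dictionary))
    if f == r:
        return None
    return forward if f < r else reverse


# Assumed dictionary and candidate segmentations, for illustration only.
DICTIONARY = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}
forward = ["长春市", "长春", "节", "讲话"]   # 节 is not in the dictionary
reverse = ["长春", "市长", "春节", "讲话"]   # every token is matched

chosen = pick_by_coverage(forward, reverse, DICTIONARY)
print("/".join(chosen))
```

Returning `None` on a tie is a design choice of this sketch: it leaves room for a second criterion when both candidates are fully matched.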

Second, introduce word frequency: each word is assigned a frequency value according to the probability that it appears in Chinese. For the second sentence (the Changchun pharmacy), both methods produce a fully matched segmentation, but the reverse result contains the word "Spring Pharmacy" (春药店), whose frequency is far lower than that of the other words. That result can therefore be regarded as atypical, and the forward result is taken instead.
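
One way to sketch this second rule is to compare the rarest word in each candidate and keep the candidate whose rarest word is more frequent. The frequency values below are invented purely for illustration and do not come from any corpus; the function names and the tie-breaking in favor of the forward result are likewise assumptions.

```python
def rarest_frequency(tokens, freq):
    """Frequency of the least frequent token; unknown tokens count as 0."""
    return min(freq.get(t, 0.0) for t in tokens)


def pick_by_frequency(forward, reverse, freq):
    """Keep the candidate whose rarest word is more frequent.

    Ties go to the forward result; that is an arbitrary choice here.
    """
    if rarest_frequency(forward, freq) >= rarest_frequency(reverse, freq):
        return forward
    return reverse


# Made-up relative frequencies, NOT measured from real text.
FREQ = {
    "长春市": 5e-5,
    "长春": 8e-5,
    "药店": 6e-5,
    "市长": 9e-5,
    "春药店": 1e-8,
}

forward = ["长春市", "长春", "药店"]
reverse = ["长春", "市长", "春药店"]

chosen = pick_by_frequency(forward, reverse, FREQ)
print("/".join(chosen))
```

A production system would more likely score whole candidates, for example by summing log probabilities, but the rarest-word comparison is the closest match to the reasoning given in the text.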

Other methods, such as mark scanning and POS checking, can also be combined with the two matching methods to make segmentation more accurate.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
