A Comparison of Common Word Segmentation Algorithms, with Some Proposals


Segmentation algorithms based on text matching are more widely used than understanding-based and statistics-based segmentation algorithms. A text-matching algorithm, also called a "mechanical segmentation algorithm", matches the Chinese character string under analysis against the entries of a "sufficiently large" machine dictionary according to some strategy. If a substring is found in the dictionary, the match succeeds and a word is identified. Text-matching methods can be classified in three ways. By scanning direction, they are divided into forward matching and reverse matching. By which length is given priority, they are divided into maximum (longest) matching and minimum (shortest) matching. By whether they are combined with part-of-speech (POS) tagging, they are divided into pure segmentation methods and integrated methods that segment and tag together.

Several commonly used mechanical segmentation methods are as follows:

1) Forward maximum matching (scanning from left to right).

2) Reverse maximum matching (scanning from right to left).

3) Minimum segmentation (cutting each sentence into the smallest possible number of words).
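The third method in the list is the least self-explanatory, so here is a minimal sketch of one way to realize it with dynamic programming. The function name `fewest_words`, the rule that an unknown single character counts as one piece, and the sample dictionary (the article's example words in their presumed original Chinese) are all assumptions made for illustration, not something the article specifies.

```python
def fewest_words(text, dictionary):
    """Split text into as few pieces as possible.

    A piece is either a dictionary word or a single character
    (the single-character fallback is an assumption of this sketch).
    """
    n = len(text)
    max_len = max(map(len, dictionary), default=1)
    # best[i] holds (piece count, pieces) for the prefix text[:i].
    best = [(0, [])] + [None] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in dictionary or end - start == 1:
                count, pieces = best[start]
                if best[end] is None or count + 1 < best[end][0]:
                    best[end] = (count + 1, pieces + [piece])
    return best[n][1]


# Hypothetical dictionary built from the article's example words.
DICTIONARY = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}

result = fewest_words("长春市长春药店", DICTIONARY)
print(len(result), "/".join(result))
```

Note that several segmentations can tie for the smallest word count, so this criterion alone does not always pick a unique answer.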

Another approach is to combine the methods above into a single segmentation algorithm. For example, forward maximum matching and reverse maximum matching can be combined to form a bidirectional matching method. Because of the characteristics of Chinese word formation, forward minimum matching and reverse minimum matching are seldom used. This article focuses on forward maximum matching and reverse maximum matching.

The accuracy of mechanical segmentation depends on both the accuracy of the algorithm and the completeness of the dictionary. This article assumes that the dictionary is large enough to contain every word that is needed.

Generally speaking, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities. Reported statistics give an error rate of 1/169 (about 0.59%) for forward maximum matching used alone and 1/245 (about 0.41%) for reverse maximum matching used alone. This precision is still far from meeting practical needs. Real segmentation systems use mechanical segmentation as a first pass and then draw on various other kinds of linguistic information to improve accuracy further.

Let's take a look at two sentences in Chinese:

1) 长春市长春节讲话 (the Changchun mayor's Spring Festival speech)

2) 长春市长春药店 (Changchun City's Changchun Pharmacy)

Suppose the dictionary contains the following words: "长春" (Changchun), "长春市" (Changchun City), "市长" (mayor), "春节" (Spring Festival), "讲话" (speech), "春药" (aphrodisiac), "药店" (pharmacy), "春药店" (literally "spring pharmacy"), and so on.

The results obtained with forward maximum matching are:

长春市/长春/节/讲话 (Changchun City / Changchun / festival / speech: 4 words, of which "节" has no match in the dictionary; semantically wrong)

长春市/长春/药店 (Changchun City / Changchun / pharmacy: 3 words, all matched; semantically correct)
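The forward results above can be reproduced with a short sketch. It assumes the example sentences and dictionary in their presumed original Chinese form, takes the longest dictionary word as the window size, and falls back to a single character when nothing matches. The name `forward_max_match` and these choices are illustrative, not taken from the article.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation, longest dictionary match first."""
    if max_len is None:
        max_len = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical dictionary built from the article's example words.
DICTIONARY = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}

print("/".join(forward_max_match("长春市长春节讲话", DICTIONARY)))  # 长春市/长春/节/讲话
print("/".join(forward_max_match("长春市长春药店", DICTIONARY)))    # 长春市/长春/药店
```

Because the scan commits to "长春市" at the start of the first sentence, the later "节" is left stranded as an unmatched single character.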

The results obtained with reverse maximum matching are:

长春/市长/春节/讲话 (Changchun / mayor / Spring Festival / speech: 4 words, all matched; semantically correct)

长春/市长/春药店 (Changchun / mayor / "spring pharmacy": 3 words, all matched; semantically wrong)
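The reverse results can be reproduced in the same way by scanning from the end of the string. As before, this is only a sketch: the name `reverse_max_match`, the single-character fallback, and the Chinese sample data (the presumed originals of the article's examples) are assumptions.

```python
def reverse_max_match(text, dictionary, max_len=None):
    """Greedy right-to-left segmentation, longest dictionary match first."""
    if max_len is None:
        max_len = max(map(len, dictionary), default=1)
    words, end = [], len(text)
    while end > 0:
        # Try the longest window ending at `end`, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or piece in dictionary:
                words.append(piece)
                end -= size
                break
    words.reverse()  # pieces were collected from right to left
    return words


# Hypothetical dictionary built from the article's example words.
DICTIONARY = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}

print("/".join(reverse_max_match("长春市长春节讲话", DICTIONARY)))  # 长春/市长/春节/讲话
print("/".join(reverse_max_match("长春市长春药店", DICTIONARY)))    # 长春/市长/春药店
```

In the second sentence the three-character entry "春药店" is found first at the right edge, which is what pulls the rest of the segmentation in the wrong direction.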

From this we can see the strengths and weaknesses of forward maximum matching and reverse maximum matching: each segments some Chinese sentences correctly, and each gets others wrong.

Is it possible to combine the two matching methods and get the best of both? The answer is yes.

First, segment the text with forward maximum matching and with reverse maximum matching, then compare the two results. For the first sentence (the Changchun mayor's Spring Festival speech), forward maximum matching leaves one word unmatched, so the reverse maximum matching result is chosen.
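This first comparison rule can be sketched as follows. The helper names (`max_match`, `unmatched`, `choose_by_coverage`) are hypothetical, the sample data is the presumed Chinese original of the article's examples, and returning `None` when the rule cannot decide is a design choice of this sketch rather than something the article prescribes.

```python
def max_match(text, dictionary, reverse=False):
    """Maximum matching in either direction, with a single-character fallback."""
    max_len = max(map(len, dictionary), default=1)
    words = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            piece = text[-size:] if reverse else text[:size]
            if size == 1 or piece in dictionary:
                break
        words.append(piece)
        text = text[:-size] if reverse else text[size:]
    return words[::-1] if reverse else words


def unmatched(words, dictionary):
    """Count the pieces that are not dictionary entries."""
    return sum(1 for w in words if w not in dictionary)


def choose_by_coverage(text, dictionary):
    """Prefer the direction with fewer unmatched pieces; None means undecided."""
    forward = max_match(text, dictionary)
    backward = max_match(text, dictionary, reverse=True)
    if forward == backward:
        return forward
    f_miss = unmatched(forward, dictionary)
    b_miss = unmatched(backward, dictionary)
    if f_miss == b_miss:
        return None  # this rule cannot separate the two candidates
    return forward if f_miss < b_miss else backward


# Hypothetical dictionary built from the article's example words.
DICTIONARY = {"长春", "长春市", "市长", "春节", "讲话", "春药", "药店", "春药店"}

print(choose_by_coverage("长春市长春节讲话", DICTIONARY))  # reverse result wins
print(choose_by_coverage("长春市长春药店", DICTIONARY))    # None: both fully matched
```

The second sentence shows the limit of this rule: both directions match every word, so a further criterion is needed.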

Second, introduce the concept of word frequency: each word is assigned a frequency value according to how likely it is to appear in Chinese text. For the second sentence (the Changchun pharmacy), both methods match every word, but the word "spring pharmacy" (春药店) produced by reverse maximum matching has a much lower frequency than the other words. We can therefore treat the reverse result as implausible and take the forward maximum matching result.
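One way to turn this frequency idea into a rule is to score each candidate by its rarest word and keep the candidate whose rarest word is most common. That scoring rule is an assumption of this sketch (the article only says the rare word makes the result implausible), and the frequency values below are invented purely for illustration.

```python
def rarest_word_freq(words, freq):
    """Score a segmentation by the frequency of its least frequent word."""
    return min((freq.get(w, 0) for w in words), default=0)


def choose_by_frequency(candidates, freq):
    """Keep the candidate whose rarest word is the most frequent."""
    return max(candidates, key=lambda words: rarest_word_freq(words, freq))


# Invented frequency values; a real system would estimate them from a corpus.
FREQ = {"长春市": 50, "长春": 80, "药店": 60, "市长": 70, "春药店": 1}

forward = ["长春市", "长春", "药店"]   # forward maximum matching result
backward = ["长春", "市长", "春药店"]  # reverse maximum matching result

print("/".join(choose_by_frequency([forward, backward], FREQ)))  # 长春市/长春/药店
```

Other scoring rules, such as multiplying the word probabilities, would also fit the description; the minimum is used here only because it is the simplest to follow.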

Of course, other techniques (such as the scanning-mark method and POS checking) can also be combined with the two matching methods to obtain more accurate segmentation.
