Common Word Segmentation Algorithms: Comparison and Ideas

Source: Internet
Author: User

Compared with comprehension-based and statistics-based word segmentation algorithms, algorithms based on text matching are more common. A text matching algorithm, also called a "mechanical word segmentation algorithm", matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a fixed strategy. If a string is found in the dictionary, the match succeeds and a word is identified. By scanning direction, text matching segmentation is divided into forward matching and reverse matching. By which length is tried first, it is divided into maximum (longest) matching and minimum (shortest) matching. By whether it is combined with part-of-speech tagging, it is divided into pure word segmentation and integrated methods that combine segmentation with tagging.

Several common mechanical word segmentation methods are as follows:

1) Forward maximum matching (scanning from left to right)

2) Reverse maximum matching (scanning from right to left)

3) Minimum segmentation (cutting each sentence into the smallest possible number of words)

Other word segmentation algorithms combine the methods above. For example, forward and reverse maximum matching can be combined into a bidirectional matching method. Because a single Chinese character can be a word by itself, forward minimum matching and reverse minimum matching are rarely used. This article focuses on the forward and reverse maximum matching methods.
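The minimum segmentation method from the list above can be sketched with a small dynamic program. This is only an illustrative Python sketch, not code from the article: the function name `min_segmentation`, the Chinese test sentence, and the `LEXICON` set are assumptions (the strings are reconstructed from the translated example used later in the text). Any character that is not covered by a lexicon word is allowed as a one-character piece, and ties between equally short segmentations are broken arbitrarily.

```python
def min_segmentation(sentence, lexicon):
    """Split sentence into the fewest pieces; each piece is a lexicon
    word or, as a fallback, a single character."""
    n = len(sentence)
    max_len = max(map(len, lexicon), default=1)
    # best[i]: fewest pieces needed to cover sentence[:i]
    # back[i]: start index of the last piece in that best cover
    best = [0] + [n + 1] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = sentence[start:end]
            if piece in lexicon or end - start == 1:
                if best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    back[end] = start
    # Follow the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(sentence[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Assumed lexicon, reconstructed from the translated example.
LEXICON = {"长春", "长春市", "市长", "春节", "讲话", "药店", "春药店"}

words = min_segmentation("长春市长春节讲话", LEXICON)
print(len(words), "/".join(words))
```

With this lexicon no segmentation of the sentence uses fewer than four pieces, so the sketch reports four; which four-piece split it returns depends on the tie-breaking order.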

The accuracy of a mechanical word segmentation algorithm depends on the matching algorithm itself and on the completeness of the dictionary. In this article, we assume that the dictionary is sufficiently large and contains the words needed.

Generally, the segmentation accuracy of reverse matching is slightly higher than that of forward matching, and it produces fewer ambiguities. Reported statistics put the error rate of forward matching at 1/169 (about 0.59%) and that of reverse matching at 1/245 (about 0.41%). However, this accuracy is still far from meeting practical needs. Real word segmentation systems use mechanical segmentation only as a preliminary segmentation step and then use other linguistic information to improve accuracy further.

Let's take a look at two Chinese sentences:

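1) Spring Festival speech by the mayor of Changchun (长春市长春节讲话)

2) Changchun City Changchun Pharmacy (长春市长春药店)

Both sentences begin with the characters 长春市长, which can be read either as "Changchun City" (长春市) followed by another word, or as "Changchun" (长春) followed by "mayor" (市长). This overlap is the source of the ambiguity.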

Assume that the lexicon contains the following words: "Changchun", "Changchun City", "mayor", "Spring Festival", "speech", "Spring Festival medicine", "pharmacy", "Spring Pharmacy", and so on.

The results obtained with forward maximum matching are:

Changchun City / Changchun / Festival / speech (长春市 / 长春 / 节 / 讲话): four pieces, where the single character "Festival" (节) is left unmatched and the meaning is incorrect

Changchun City / Changchun / pharmacy (长春市 / 长春 / 药店): three words, all matched, semantically correct

The results obtained with reverse maximum matching are:

Changchun / mayor / Spring Festival / speech (长春 / 市长 / 春节 / 讲话): four words, all matched, semantically correct

Changchun / mayor / Spring Pharmacy (长春 / 市长 / 春药店): three words, all matched, but semantically incorrect
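The four results above can be reproduced with a short sketch. It is a minimal Python illustration, not the article's own implementation. The Chinese strings and the `LEXICON` set are assumptions reconstructed from the translated example (the entry "Spring Festival medicine" is left out), and the function names are hypothetical. A character that matches no lexicon entry is emitted as a one-character piece.

```python
def forward_max_match(sentence, lexicon):
    """Greedy longest-match segmentation, scanning left to right."""
    max_len = max(map(len, lexicon), default=1)
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a lexicon hit;
        # a single character is accepted as a last resort.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(sentence, lexicon):
    """Greedy longest-match segmentation, scanning right to left."""
    max_len = max(map(len, lexicon), default=1)
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # pieces were collected from the end backwards


# Assumed lexicon, reconstructed from the translated example.
LEXICON = {"长春", "长春市", "市长", "春节", "讲话", "药店", "春药店"}

for sentence in ("长春市长春节讲话", "长春市长春药店"):
    print("forward:", "/".join(forward_max_match(sentence, LEXICON)))
    print("reverse:", "/".join(reverse_max_match(sentence, LEXICON)))
# forward: 长春市/长春/节/讲话
# reverse: 长春/市长/春节/讲话
# forward: 长春市/长春/药店
# reverse: 长春/市长/春药店
```

Checking candidates from longest to shortest is what makes this "maximum" matching; the only difference between the two functions is which end of the sentence the window starts from.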

These examples show the strengths and weaknesses of forward and reverse maximum matching: each method segments some Chinese strings correctly and fails on others.

Can we combine the two matching methods to get the best of both? The answer is yes.

First, we segment the same sentence with both forward and reverse maximum matching and then compare the results. For example, for "Spring Festival speech by the mayor of Changchun", forward maximum matching leaves a fragment that cannot be matched, so the reverse maximum matching result is used.
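This comparison step can be sketched as follows. The sketch is a hedged Python illustration under stated assumptions: the helper names `max_match`, `unmatched`, and `bidirectional_match` are hypothetical, the Chinese strings and `LEXICON` are reconstructed from the translated example, and the tie rule (fall back to the reverse result when both passes leave the same number of unmatched pieces) is an assumption based on the text's remark that reverse matching is usually slightly more accurate.

```python
def max_match(sentence, lexicon, reverse=False):
    """Greedy maximum matching; scans right to left when reverse=True."""
    max_len = max(map(len, lexicon), default=1)
    words, rest = [], sentence
    while rest:
        for size in range(min(max_len, len(rest)), 0, -1):
            piece = rest[-size:] if reverse else rest[:size]
            # A single character is accepted even if it is not a word.
            if size == 1 or piece in lexicon:
                words.append(piece)
                rest = rest[:-size] if reverse else rest[size:]
                break
    return words[::-1] if reverse else words


def unmatched(words, lexicon):
    """Count pieces that are not lexicon entries."""
    return sum(1 for w in words if w not in lexicon)


def bidirectional_match(sentence, lexicon):
    fwd = max_match(sentence, lexicon)
    rev = max_match(sentence, lexicon, reverse=True)
    # Prefer the pass that leaves fewer out-of-lexicon pieces.
    # On a tie, fall back to the reverse result (assumption).
    return fwd if unmatched(fwd, lexicon) < unmatched(rev, lexicon) else rev


# Assumed lexicon, reconstructed from the translated example.
LEXICON = {"长春", "长春市", "市长", "春节", "讲话", "药店", "春药店"}

print("/".join(bidirectional_match("长春市长春节讲话", LEXICON)))
# 长春/市长/春节/讲话
```

For the pharmacy sentence both passes match every piece, so this rule alone cannot tell the correct result from the incorrect one; a further criterion is needed for such ties.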

Second, we can introduce word frequency: each word is assigned a frequency value based on how likely it is to appear in Chinese text. When "Changchun City Changchun Pharmacy" is segmented with both methods, the word "Spring Pharmacy" produced by reverse maximum matching has a much lower frequency than the other words. We can therefore treat the reverse result as unlikely and adopt the forward maximum matching result instead.
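One possible way to apply word frequency is to score each candidate segmentation and keep the higher-scoring one. The sketch below is an assumption-laden Python illustration: the frequency values in `FREQ` are invented for the example and are not real corpus statistics, the names `log_score` and `pick_by_frequency` are hypothetical, and summing log frequencies is just one reasonable scoring choice, not a method stated in the text. The two candidate lists are the forward and reverse results for the pharmacy sentence, written with Chinese strings reconstructed from the translated example.

```python
import math

# Hypothetical relative frequencies, invented for illustration only.
FREQ = {
    "长春市": 5e-5,
    "长春": 1e-4,
    "市长": 2e-4,
    "药店": 1e-4,
    "春药店": 1e-8,  # assumed to be a very rare word
}


def log_score(words, freq, floor=1e-12):
    """Sum of log frequencies; unseen words get a tiny floor value."""
    return sum(math.log(freq.get(w, floor)) for w in words)


def pick_by_frequency(candidates, freq):
    """Return the candidate segmentation with the highest score."""
    return max(candidates, key=lambda words: log_score(words, freq))


forward = ["长春市", "长春", "药店"]   # forward maximum matching result
reverse = ["长春", "市长", "春药店"]   # reverse maximum matching result

best = pick_by_frequency([forward, reverse], FREQ)
print("/".join(best))
# 长春市/长春/药店
```

Because one very rare word drags the whole sum down, the reverse candidate loses even though its other two words are common.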

Of course, other methods (such as the scanning mark method and the part-of-speech checking method) can also be combined with these two matching methods to achieve more accurate word segmentation.
