Chinese word segmentation (statistical language model)

Source: Internet
Author: User

The appropriate granularity of Chinese word segmentation depends on the application. In machine translation the granularity should be coarse: "Peking University" (北京大学) should not be split into two words. In speech recognition, by contrast, "Peking University" is generally split into two words. Different applications should therefore use different word segmentation systems.

Word segmentation with a statistical language model can be summarized in a few formulas.
Suppose a sentence S can be segmented in several ways. For simplicity, assume there are three:
A1, A2, A3, ..., AK
B1, B2, B3, ..., BM
C1, C2, C3, ..., CN

Here A1, A2, B1, B2, C1, C2, and so on are all Chinese words, and K, M, and N are the numbers of words in each segmentation. The best segmentation is the one under which the resulting word sequence has the highest probability. That is, if A1, A2, ..., AK is the best segmentation, then (P denotes probability):
P(A1, A2, A3, ..., AK) > P(B1, B2, B3, ..., BM), and
P(A1, A2, A3, ..., AK) > P(C1, C2, C3, ..., CN)
Therefore, we only need to use the statistical language model to compute the probability of the sentence under each segmentation and pick the one with the highest probability to find the best segmentation.
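The comparison above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: it uses a unigram model (each word scored independently) rather than a full n-gram model, and the probabilities in UNIGRAM, the UNKNOWN floor value, and the example sentence are all invented for the example, not taken from any real corpus.

```python
import math

# Hypothetical unigram probabilities, invented purely for illustration.
UNIGRAM = {
    "研究": 0.002,
    "研究生": 0.0005,
    "生命": 0.001,
    "命": 0.0002,
    "起源": 0.0004,
}
UNKNOWN = 1e-8  # assumed floor probability for words missing from the model


def log_prob(words):
    """Log-probability of a word sequence under the toy unigram model."""
    return sum(math.log(UNIGRAM.get(w, UNKNOWN)) for w in words)


def best_segmentation(candidates):
    """Return the candidate segmentation with the highest probability."""
    return max(candidates, key=log_prob)


# Two candidate segmentations of the same sentence, "研究生命起源".
candidates = [
    ["研究", "生命", "起源"],
    ["研究生", "命", "起源"],
]
best = best_segmentation(candidates)
print(" / ".join(best))
```

Log-probabilities are summed instead of multiplying raw probabilities so that long sentences do not underflow to zero.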

There is, of course, an implementation trick. Enumerating every possible segmentation and computing the sentence probability under each one is computationally very expensive. Instead, we can treat the task as a dynamic programming problem and use the Viterbi algorithm to find the best segmentation quickly.
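A minimal sketch of the dynamic programming idea follows. It is a Viterbi-style search simplified to a unigram model, so the state is just the position in the sentence; a bigram or higher-order model would need a richer state. The dictionary UNIGRAM, the constants UNKNOWN_CHAR and MAX_WORD_LEN, and the function name segment are assumptions made for this example.

```python
import math

# Hypothetical unigram probabilities, invented purely for illustration.
UNIGRAM = {
    "研究": 0.002,
    "研究生": 0.0005,
    "生命": 0.001,
    "命": 0.0002,
    "起源": 0.0004,
}
UNKNOWN_CHAR = 1e-8  # assumed floor probability for an unseen single character
MAX_WORD_LEN = 4     # assumed upper bound on word length


def segment(sentence, model=UNIGRAM, max_len=MAX_WORD_LEN):
    """Return the highest-probability segmentation of `sentence`."""
    n = len(sentence)
    # best[i] holds the best log-probability of any segmentation of sentence[:i].
    best = [0.0] + [-math.inf] * n
    # back[i] holds the start index of the last word in that best segmentation.
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in model:
                p = model[word]
            elif len(word) == 1:
                p = UNKNOWN_CHAR  # fall back to a single unknown character
            else:
                continue  # multi-character strings outside the model are skipped
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sentence to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]


print(" / ".join(segment("研究生命起源")))
```

Each end position is visited once and looks back at most MAX_WORD_LEN characters, so the work grows linearly with sentence length instead of exponentially with the number of possible segmentations.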

You might not expect it, but Chinese word segmentation methods are also applied to English, mainly in handwriting recognition, because the spaces between words are often unclear in handwritten text. The Chinese word segmentation method can help identify English word boundaries. In fact, many mathematical methods for language processing are general and independent of any specific language. At Google, when we design language processing algorithms, we always consider whether they can be applied easily to various natural languages. This is how we can effectively support search in hundreds of languages.

Recommended reading on Chinese word segmentation:

1. Liang Nanyuan
Automatic Word Segmentation System for Written Chinese
http://www.touchwrite.com/demo/LiangNanyuan-JCIP-1987.pdf

2. Guo Jin
Some New Results on Statistical Language Models and Chinese Phonetic-to-Character Conversion
http://www.touchwrite.com/demo/GuoJin-JCIP-1993.pdf

3. Guo Jin
Critical Tokenization and Its Properties
http://acl.ldc.upenn.edu/J/J97/J97-4004.pdf

4. Sun Maosong
Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=980775

Reference: http://www.kuqin.com/math/20071204/2776.html
