Posted by: Wu Jun, Google researcher
Chinese Word Segmentation
----- An application of statistical language models in Chinese processing
Last time we talked about using statistical language models for language processing. Because those models are built on words, for languages written without spaces between words, such as Chinese, Japanese, and Korean, word segmentation must come first. For example, the sentence "Chinese aerospace officials were invited to the United States to meet with officials of the Space Administration" needs to be split into a string of words:
China / aerospace / officials / invited / to / United States / with / Space / Administration / officials / meet
The most straightforward approach to word segmentation is dictionary lookup. This method was first proposed by Professor Liang Nanyuan of the Beijing University of Aeronautics and Astronautics.
The dictionary-lookup method works as follows: scan a sentence from left to right, marking off a word whenever a dictionary entry is encountered; when a compound word is possible (such as "Shanghai University"), take the longest match; and when an unknown string is encountered, split it into single characters. With that, a simple segmenter is complete, and it handles the example sentence above perfectly well. In the 1980s, Dr. Wang Xiaolong of the Harbin Institute of Technology developed this into the theory of minimal word count: a sentence should be segmented into the word string containing the fewest words.

An obvious weakness of this method is that it is helpless in the face of segmentation ambiguity, i.e., a string that admits two readings. For example, the correct segmentation of the phrase "developing countries" (发展中国家) is "发展 / 中 / 国家" (developing / countries), but left-to-right dictionary lookup splits it into "发展 / 中国 / 家" (develop / China / home), which is clearly wrong. Moreover, the longest match is not always correct: the right segmentation of "Shanghai University Town Bookstore" (上海大学城书店) is "上海 / 大学城 / 书店" (Shanghai / university town / bookstore), not "上海大学 / 城 / 书店" (Shanghai University / town / bookstore).
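The left-to-right longest-match procedure can be sketched in a few lines of Python. This is a minimal illustration with a toy lexicon; the lexicon and function name are mine, not from any particular system:

```python
# A minimal sketch of dictionary-based forward maximum matching.
# Real systems use lexicons with tens of thousands of entries;
# this toy lexicon is chosen to reproduce the ambiguity in the text.

def max_match(sentence, lexicon, max_word_len=4):
    """Scan left to right, always taking the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until one matches;
        # a single character is always accepted as a fallback.
        for j in range(min(len(sentence), i + max_word_len), i, -1):
            word = sentence[i:j]
            if j == i + 1 or word in lexicon:
                tokens.append(word)
                i = j
                break
    return tokens

# The "developing countries" ambiguity (发展中国家) from the text:
lexicon = {"发展", "中国", "国家", "家"}
print(max_match("发展中国家", lexicon))  # -> ['发展', '中国', '家']
```

With this greedy rule the segmenter reproduces exactly the "develop / China / home" mistake described above: it commits to "中国" before it can see that "国家" would have been the better continuation.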
Before the 1990s, many scholars at home and abroad tried to resolve segmentation ambiguity with grammar rules, without much success. Around 1990, Dr. Guo Jin of Tsinghua University used the statistical language model to solve the ambiguity problem, reducing the error rate of Chinese word segmentation by an order of magnitude.
The statistical word segmentation method can be summarized with a few formulas:
Suppose a sentence S can be segmented in several ways. For simplicity, assume there are three:
A1, A2, A3,..., Ak,
B1, B2, B3,..., Bm
C1, C2, C3,..., Cn
where A1, A2, ..., B1, B2, ..., C1, C2, and so on are all Chinese words. The best segmentation is the one under which the segmented sentence has the highest probability. That is, if A1, A2, ..., Ak is the best segmentation, then (P denotes probability):
P(A1, A2, A3, ..., Ak) > P(B1, B2, B3, ..., Bm), and
P(A1, A2, A3, ..., Ak) > P(C1, C2, C3, ..., Cn)
Therefore, as long as we use the statistical language model discussed last time to compute the probability of the sentence under each segmentation and pick the segmentation with the highest probability, we have found the best segmentation.
Of course, there is an implementation subtlety. Enumerating every possible segmentation and computing the sentence probability for each would be far too expensive. Instead, the problem can be treated as dynamic programming and solved quickly with the Viterbi algorithm to find the optimal segmentation.
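The dynamic-programming idea can be sketched as follows. This is a toy illustration, not Dr. Guo's actual system: it scores each segmentation as the product of made-up unigram word probabilities and recovers the best split with backpointers, Viterbi-style:

```python
# A minimal sketch of the statistical approach: among all segmentations,
# pick the one with the highest probability, here approximated by a
# unigram model P(W1,...,Wk) ~ P(W1)...P(Wk). best[i] holds the highest
# log-probability of any segmentation of sentence[:i], so each position
# is computed once instead of enumerating all segmentations.

import math

def best_segmentation(sentence, log_prob, max_word_len=4):
    n = len(sentence)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)              # backpointer: start of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            if word in log_prob and best[j] + log_prob[word] > best[i]:
                best[i] = best[j] + log_prob[word]
                back[i] = j
    # Walk the backpointers from the end to recover the word sequence.
    tokens, i = [], n
    while i > 0:
        tokens.append(sentence[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Toy probabilities under which "发展中 / 国家" beats "发展 / 中国 / 家":
log_prob = {w: math.log(p) for w, p in
            {"发展": 0.01, "中国": 0.02, "家": 0.001,
             "发展中": 0.005, "国家": 0.01}.items()}
print(best_segmentation("发展中国家", log_prob))  # -> ['发展中', '国家']
```

Unlike the greedy dictionary method, this search considers every split reachable through the lexicon, so a locally tempting match like "中国" loses to the globally more probable reading.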
After Dr. Guo Jin of Tsinghua University, many scholars at home and abroad used statistical methods to further improve Chinese word segmentation. Particularly worth mentioning are Professor Sun Maosong of Tsinghua University and Professor Dekai Wu of the Hong Kong University of Science and Technology.
It should be pointed out that people do not agree on exactly what counts as a word. For example, some consider "Peking University" one word, while others think it should be split into two. A compromise is to find the nested structure of compound words during segmentation: in the example above, the sentence is first segmented with "Peking University" treated as a single four-character word, and that word is then further segmented into "Beijing" and "university". This approach was first published by Guo Jin in the journal Computational Linguistics and was later adopted by many systems.
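One way to sketch that compromise (a toy illustration; the lexicons, example sentence, and function names are mine): segment coarsely first, then re-segment each compound word with a finer lexicon, so both granularities are kept:

```python
# A minimal sketch of the nested-structure compromise: coarse
# segmentation keeps compounds like 北京大学 (Peking University) whole,
# then each compound is re-segmented with a finer lexicon.
# Both lexicons here are toy illustrations.

def max_match(text, lexicon, max_len=4):
    """Greedy forward longest-match, used here as a simple segmenter."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if j == i + 1 or text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def nested_segment(sentence, coarse_lex, fine_lex):
    """Return (compound, [subwords]) pairs for each coarse token."""
    out = []
    for word in max_match(sentence, coarse_lex):
        subs = max_match(word, fine_lex) if len(word) > 2 else [word]
        out.append((word, subs))
    return out

coarse = {"北京大学", "生物系"}
fine = {"北京", "大学", "生物", "系"}
print(nested_segment("北京大学生物系", coarse, fine))
# -> [('北京大学', ['北京', '大学']), ('生物系', ['生物', '系'])]
```

Keeping both levels lets a downstream application choose its granularity: machine translation can use the compound, while speech recognition can use the subwords.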
In general, the right granularity of Chinese word segmentation depends on the application. In machine translation, larger granularity works better, and "Peking University" should not be split into two words; in speech recognition, "Peking University" is generally split in two. Different applications therefore call for different segmenters. Dr. Ge Xianping and Dr. Zhu An of Google designed and implemented Google's own word segmentation system for search.
Perhaps surprisingly, Chinese word segmentation methods have also been applied to English processing, mainly in handwriting recognition: in handwriting, the spaces between words are often unclear, and Chinese segmentation methods can help identify English word boundaries. In fact, many of the mathematical methods of language processing are largely independent of any specific language. At Google, when we design language processing algorithms, we always consider whether they can be easily applied to many natural languages; that is how we can effectively support search in hundreds of languages.
Readers interested in Chinese word segmentation can read the following documents:
1. Liang Nanyuan
An Automatic Word Segmentation System for Written Chinese
http://www.touchwrite.com/demo/LiangNanyuan-JCIP-1987.pdf
2. Guo Jin
Some New Results on Statistical Language Models and Chinese Phonetic-to-Character Conversion
http://www.touchwrite.com/demo/GuoJin-JCIP-1993.pdf
3. Guo Jin
Critical Tokenization and its Properties
http://acl.ldc.upenn.edu/J/J97/J97-4004.pdf
4. Sun Maosong
Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=980775