Search Engine Knowledge: Chinese Word Segmentation Technology

Source: Internet
Author: User


Chinese word segmentation is the mechanical decomposition of a sentence or phrase into words, following everyday reading habits. English takes the word as its unit, and words are separated by spaces. Chinese takes the character as its unit, and all the characters in a sentence run together to express a meaning. For example, segmenting the sentence "I like search engines" gives: I | like | search engines. Dividing a sequence of Chinese characters into meaningful words is Chinese word segmentation, which some people also call word cutting.

Every Chinese character can be used directly as a word, and there are no delimiters between words, which makes the language highly variable. That variability makes expression flexible, but it is a very difficult problem for search engines to solve. In Chinese word segmentation there are three difficult types of ambiguity.

1. Intersection type ambiguity

Suppose "ABC" is a string of three Chinese characters A, B, and C. If both "AB" and "BC" are words, then the computer can segment "ABC" as "AB/C" or as "A/BC". This kind of segmentation ambiguity is called intersection type ambiguity.

2. Combination type ambiguity

If "AB" is a word and "ABC" is also a word, then the string can either be kept whole as "ABC" or be cut after "AB". The resulting segmentation ambiguity is called combination type ambiguity.

3. Mixed type ambiguity

Mixed type ambiguity is segmentation ambiguity in which the intersection type and the combination type occur together.
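To make the ambiguity types concrete, here is a minimal Python sketch that lists every way a string can be cut using a dictionary. The function name all_segmentations and the two toy dictionaries are assumptions made for illustration, and the Latin letters stand in for Chinese characters as in the "ABC" example.

```python
def all_segmentations(text, dictionary):
    """Return every split of `text` into dictionary words or single characters."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        # A single character is always allowed; longer pieces must be dictionary words.
        if end == 1 or piece in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([piece] + rest)
    return results


# Hypothetical toy dictionaries; letters stand in for Chinese characters.
intersection_dict = {"AB", "BC"}   # "AB" and "BC" overlap on "B"
combination_dict = {"AB", "ABC"}   # "AB" is contained in "ABC"

intersection = all_segmentations("ABC", intersection_dict)
combination = all_segmentations("ABC", combination_dict)

for cut in intersection:
    print("intersection:", "/".join(cut))
for cut in combination:
    print("combination:", "/".join(cut))
```

With the first dictionary both "AB/C" and "A/BC" appear among the candidates, and with the second both "ABC" and "AB/C" appear. A real segmenter then has to choose one of these candidates.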

At present, these problems are handled mainly with dictionaries and statistics.

First, dictionary segmentation. Dictionaries are generally stored as a prefix tree or a suffix tree. How is a prefix tree used? The sentence is scanned once from left to right: words found in the dictionary are recognized, where several dictionary words start at the same position the longest match is taken, and any string that is not recognized is split into single characters. That completes a simple segmentation. With a suffix tree, the scan runs from right to left.
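The left-to-right and right-to-left scans described above can be sketched in Python roughly as follows. This is a minimal illustration, not a production segmenter: the Trie class, the function names, and the toy dictionary are all assumptions, and the right-to-left scan is simulated by reversing the text and the dictionary entries rather than by building a separate suffix tree.

```python
class Trie:
    """A small prefix tree built from nested dicts."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        # Multi-character marker key, so it cannot clash with a single character.
        node["$end"] = True

    def longest_match(self, text, start):
        """Length of the longest dictionary word starting at text[start], or 0."""
        node, best = self.root, 0
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$end" in node:
                best = i - start + 1
        return best


def forward_max_match(text, words):
    """Scan left to right, always taking the longest dictionary word."""
    trie = Trie(words)
    out, i = [], 0
    while i < len(text):
        n = trie.longest_match(text, i) or 1  # unknown character: emit it alone
        out.append(text[i:i + n])
        i += n
    return out


def backward_max_match(text, words):
    """Scan right to left by running the forward scan on reversed input."""
    reversed_cut = forward_max_match(text[::-1], [w[::-1] for w in words])
    return [w[::-1] for w in reversed(reversed_cut)]


words = {"AB", "BC"}  # hypothetical dictionary; letters stand in for characters
print(forward_max_match("ABC", words))   # ['AB', 'C']
print(backward_max_match("ABC", words))  # ['A', 'BC']
```

On this toy input the two scan directions disagree, which is one simple way to detect that a string contains an intersection type ambiguity.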

Next, statistical methods. Dictionary segmentation solves many segmentation problems, but it is challenged by the many new words that keep appearing. Statistical segmentation is based on probability theory and information theory. The basic principle is to look for characters that often appear together: characters that are almost always adjacent are likely to form a word. This requires analyzing a large amount of text. Even now, Chinese word segmentation is still evolving, and no single segmentation method completely solves every problem.
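As a rough illustration of the statistical idea, the Python sketch below scores each pair of adjacent characters in a corpus. It uses pointwise mutual information (PMI) as the association measure, which is one common choice and an assumption here, since the article does not name a specific formula. The function name pair_scores and the four-sentence toy corpus are also hypothetical, and a real system would need a far larger corpus.

```python
import math
from collections import Counter


def pair_scores(corpus):
    """Score each adjacent character pair by pointwise mutual information."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)                      # single-character counts
        pairs.update(zip(sentence, sentence[1:]))   # adjacent-pair counts
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scores = {}
    for (a, b), count in pairs.items():
        p_ab = count / n_pairs
        p_a, p_b = chars[a] / n_chars, chars[b] / n_chars
        # High PMI: the pair occurs together more often than chance predicts.
        scores[a + b] = math.log(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy corpus; letters stand in for Chinese characters.
corpus = ["ABCD", "ABDC", "CABD", "DCAB"]
scores = pair_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this corpus "A" is always followed by "B", so "AB" receives the highest score and is the strongest candidate for a word, without any dictionary being consulted.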

Readers who are interested in Chinese word segmentation can read the following papers:

1. Liang Nanyuan

An automatic word segmentation system for written Chinese

http://www.touchwrite.com/demo/LiangNanyuan-JCIP-1987.pdf

2. Guo Jin

Statistical language models and some new results on Chinese phonetic-to-character conversion

http://www.touchwrite.com/demo/GuoJin-JCIP-1993.pdf

3. Guo Jin

Critical Tokenization and its Properties

http://acl.ldc.upenn.edu/J/J97/J97-4004.pdf

4. Sun Maosong

Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data

http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=980775

This article was first published by Qining Network Marketing Planning (www.qi-ning.com). When reprinting, please include the author information. Thank you!

Qining MSN: i@qining.org
