Three principles of Baidu's Chinese word segmentation you may not know

Source: Internet
Author: User

Baidu's Chinese word segmentation algorithm is the algorithm the search engine uses to better identify what users are looking for and to return the information they need quickly.

Search engines have to process page data on the quadrillion scale per unit of time, so they maintain a Chinese lexicon (word list). Baidu, for example, currently has about 90,000 Chinese words; with this lexicon the search engine can analyze the Chinese text of a page and classify it accordingly.

Baidu's word segmentation basically uses three methods:

1, Understanding-based: naive ("fool-style") matching. Baidu does not segment a query of three Chinese characters or fewer, for example a search for "University Hall" (three characters in Chinese).

  

2, Statistics-based: Baidu marks a word in red in its results because that word is generally treated as a keyword. When you search for "learning", Baidu accepts "learning" as the keyword and marks it in red. This is Baidu's statistics-based segmentation method.

  

3, String-matching-based (Baidu's segmentation method: maximum matching)

Maximum vs. minimum matching: maximum matching keeps extending the match as far as it will go; minimum matching stops as soon as a word is matched and then starts a new match from the next character. Take the Baidu search "Hunan Hall Roof" as an example. We treat Baidu's word segmentation algorithm as a black box: we feed in some keywords and infer the algorithm from Baidu's output.

Forward vs. reverse matching: forward matching scans from front to back; reverse matching scans from back to front. For "Hunan Hall Roof", the forward split is: Hunan University Hall Roof. For a second example (rendered in this translation as "Daiju Narita Earth Method"), the forward split is: Daiju Narita Earth Method, and the reverse split is: Method Earth Daiju Narita. In this split, "Earth" is not a word.
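The short-query rule from item 1 and the matching strategies from item 3 can be illustrated with a minimal sketch. This is not Baidu's implementation, which is not public. The function names, the tiny LEXICON, and the sample query are all assumptions chosen for illustration (the query is a common textbook example, not one from this article). Any single character that is not in the lexicon is emitted as its own token so that the scan always advances.

```python
def forward_max_match(text, lexicon):
    """Scan front to back, taking the longest lexicon word at each position."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, lexicon):
    """Scan back to front, taking the longest lexicon word ending at each position."""
    max_len = max(map(len, lexicon))
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]  # tokens were collected from the end


def forward_min_match(text, lexicon):
    """Scan front to back, stopping at the first (shortest) lexicon word found."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for size in range(1, min(max_len, len(text) - i) + 1):
            if text[i:i + size] in lexicon:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No lexicon word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


def segment_query(query, lexicon, keep_whole_up_to=3):
    """Leave queries of three characters or fewer unsegmented (item 1)."""
    if len(query) <= keep_whole_up_to:
        return [query]
    return forward_max_match(query, lexicon)


# Hypothetical lexicon and query, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源", "的"}
QUERY = "研究生命的起源"

print(forward_max_match(QUERY, LEXICON))  # ['研究生', '命', '的', '起源']
print(reverse_max_match(QUERY, LEXICON))  # ['研究', '生命', '的', '起源']
print(forward_min_match(QUERY, LEXICON))  # ['研究', '生命', '的', '起源']
print(segment_query("生命的", LEXICON))    # ['生命的']
```

On this input the forward and reverse scans disagree, and the forward scan leaves behind a stray single character. That is the same kind of mismatch the article describes, where one split direction produces a piece that is not a word.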

  

In addition, there is one more segmentation principle: Baidu has a proprietary lexicon of terms that are never split, such as notable figures (e.g. Mao Zedong), celebrities (e.g. Andy Lau), and heavily searched phrases (e.g. "buying tickets is difficult").
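One way to picture such a "never split" lexicon is as a pass that runs before ordinary segmentation. The sketch below is an assumption about how this could work, not a description of Baidu's system. The names segment_with_protected and PROTECTED and the sample query are hypothetical, and the fallback segmenter simply splits into single characters to keep the example short.

```python
def segment_with_protected(text, protected, segmenter):
    """Emit protected terms as single tokens; segment everything else normally."""
    # Try longer protected terms first so the longest one wins.
    terms = sorted(protected, key=len, reverse=True)
    tokens, buffer, i = [], "", 0
    while i < len(text):
        hit = next((t for t in terms if text.startswith(t, i)), None)
        if hit:
            if buffer:
                # Flush the unprotected run through the ordinary segmenter.
                tokens.extend(segmenter(buffer))
                buffer = ""
            tokens.append(hit)
            i += len(hit)
        else:
            buffer += text[i]
            i += 1
    if buffer:
        tokens.extend(segmenter(buffer))
    return tokens


# Hypothetical protected lexicon: the two names the article mentions.
PROTECTED = {"毛泽东", "刘德华"}

# list() stands in for a real segmenter and splits into single characters.
print(segment_with_protected("刘德华演唱会", PROTECTED, list))
# ['刘德华', '演', '唱', '会']
```

The name stays intact even though the stand-in segmenter would otherwise break it into three characters.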

Of course, these are only some of the principles of Chinese word segmentation, and not all of them are necessarily correct. Baidu cannot disclose its algorithm; if this trade secret were known, there would be more than one Baidu by now.


