Analysis of Baidu's Word Segmentation Algorithm, Part 2

Source: Internet
Author: User
Keywords Algorithm BIS


Spell checking: spelling-error hints (and the pinyin hint feature)


  


Spelling-error hinting is a standard search engine feature: when a user submits a query, the engine checks whether it contains a spelling error. For Chinese users, such errors are usually introduced by the input method. Below we analyze how Baidu implements this feature.


Our analysis of the spell-checking system focuses on two questions:


(1) How does the system decide that a user's input may be a misspelled query?


(2) Once a query is judged to be a likely error, how does the system suggest the correct word?


  


So how does Baidu do it? I believe Baidu decides whether input is wrong by looking it up in a dictionary: if the dictionary does not contain the word, the input is probably an error, and the error-hint function is triggered. This is easy to verify. For an ordinary word, Baidu generally shows no error hint; but if you deliberately enter a made-up word that no dictionary could contain, Baidu will generally suggest a corrected search term.


How does Baidu suggest the correct word? Evidently through pinyin. For example, when I enter the query "system only", Baidu suggests "sanctions", "quality materials" and "paper", all of which are homophones of the query. So Baidu must maintain a homophone dictionary that records which words share a pronunciation. It might contain an entry such as "zhi cai → sanctions, quality materials, paper". There must also be a pinyin annotation program. The basic process is now visible: the user enters "system only"; the dictionary lookup finds no such word; the pinyin annotation program labels "system only" with the pinyin "zhi cai"; the homophone dictionary is searched and yields "sanctions, quality materials, paper"; and these are shown to the user as possible correct spellings.
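The lookup described above can be sketched in a few lines of Python. This is only an illustration of the inferred process, not Baidu's code: the tables `CHAR_PINYIN`, `WORD_DICT` and `HOMOPHONES` and the functions `to_pinyin` and `suggest` are hypothetical, and the Chinese strings are stand-ins chosen to share the pinyin "zhi cai".

```python
# Toy data, for illustration only. The Chinese strings are stand-ins chosen
# to match the pinyin "zhi cai"; none of this is taken from Baidu.
CHAR_PINYIN = {"制": "zhi", "才": "cai", "裁": "cai",
               "质": "zhi", "材": "cai", "纸": "zhi"}
WORD_DICT = {"制裁", "质材", "纸材"}
HOMOPHONES = {"zhi cai": ["制裁", "质材", "纸材"]}


def to_pinyin(text):
    """Label each character with one toneless syllable ("?" if unknown)."""
    return " ".join(CHAR_PINYIN.get(ch, "?") for ch in text)


def suggest(query):
    """Return homophone suggestions, or [] if the query is a known word."""
    if query in WORD_DICT:  # an ordinary word: no hint
        return []
    return HOMOPHONES.get(to_pinyin(query), [])


print(suggest("制才"))  # not a dictionary word, so its homophones are listed
print(suggest("制裁"))  # a dictionary word, so no hint is produced
```

A query with an unknown character is labelled "?" and therefore matches no entry, which corresponds to showing no hint.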


The whole process looks simple, but a few small questions remain. For instance, are all homophones in the dictionary shown to the user? If a pinyin string has 10 homophones, are all of them output? Baidu does not output every homophone; it selects a few according to some criterion. How can we show this? Take the homophones of the pinyin "liu li". A pinyin input method offers four words: "Ryurei", "displaced", "glaze" and "fluent". Now see how many Baidu returns. We enter "Li" as the query, deliberately choosing a word that the dictionary does not contain, so that Baidu's spell check is triggered. Baidu suggests three: "coloured glaze", "Liu Li" and "Liu". What does this show? Not all homophones are output; only a selection is. What is the selection criterion? My guess is that Baidu gathers statistics from its user query logs and outputs the homophones that users search for most often. If so, the example shows that users search for "coloured glaze" most often, followed by "Liu Li", then "Liu"; it seems people like to search for their own names or the names of people they know.
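The guessed selection rule (rank homophones by how often users searched for them, then keep the top few) might look like the sketch below. The function `top_homophones`, the cutoff `k=3` and the log contents are assumptions made for illustration; the placeholder words "A" to "D" stand for four homophones of one pinyin string.

```python
from collections import Counter


def top_homophones(candidates, query_log, k=3):
    """Keep the k candidates that occur most often in the query log."""
    counts = Counter(query_log)
    # sorted() is stable, so ties keep their original dictionary order.
    ranked = sorted(candidates, key=lambda word: counts[word], reverse=True)
    return ranked[:k]


# Invented log: "C" was searched three times, "A" twice, "D" once, "B" never.
log = ["C", "C", "C", "A", "A", "D"]
print(top_homophones(["A", "B", "C", "D"], log))
```

With this log the never-searched homophone "B" is dropped, which matches the observation that only some homophones are shown.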


Another small question: the homophone dictionary contains two-character and three-character words, but does it contain entries of four characters or more? Does it contain single characters? The single-character case is easy to answer without testing: certainly not. If you enter a single character, who can say whether it is wrong? Any Chinese character can be found in the dictionary, so there is no basis for judging. Two-character words are included, as the examples above show. Three-character words are included too: for the query "Midtown Medicine", Baidu suggests "proprietary Chinese medicine"; changing the query to "Heavy city medicine" still yields "proprietary Chinese medicine", and changing it again to "heavy city to" yields the same suggestion. What about four characters?


Baidu gives a hint in this case too. Some examples:


Input: "Zhinghua Smoke"; hint: "Jinghua Cloud"


Input: "Static words smoke"; hint: "Jinghua Cloud"


Input: "Static words Yan faint"; hint: "Jinghua Cloud"


Are even longer strings hinted? Yes. For example, I enter "The Fallen flower World has the Wind army". What does this query mean? Anyone who has read classical poetry will probably recognize it. Baidu suggests "petal season again every June". What does this show? The dictionary contains homophone information for entries of different lengths. It also shows that the core of Baidu's Chinese processing technology is that dictionary, and that it is very large.


However, if the query consists of two or more substrings, Baidu's error-hint function stops working. For example, for the query "mourning body" Baidu suggests "Altivo kicked", but for "I mourn" there is no error hint.


There is also a more important question: how are polyphones (characters with more than one reading) handled? Baidu is rather lazy here: it gives polyphones no special treatment. Let us look at one of Baidu's pinyin annotation errors. Before examining the error, first see how Baidu hints for a polyphone. We enter a query ending in a polyphone, so that its pinyin has two readings, "ju zhang" and "ju chang"; Baidu suggests "director" and "theatre". Evidently, for a polyphone the hints cover each reading.


Now the erroneous case. We enter the query "drama often", and Baidu suggests "theatre" and "director". The hint "theatre" is easy to explain, since it is a homophone, but why is "director" suggested? This shows that Baidu's homophone dictionary contains errors: the entry "ju chang" includes the wrong homophone "director". It further suggests that the homophone dictionary is generated automatically, without manual proofreading. It also suggests that, when generating the dictionary, Baidu does not annotate running text with pinyin and then extract words together with their readings; instead it annotates the entries of a word dictionary directly, so errors caused by polyphones cannot be detected. Had the readings been taken from annotated text, such an easily spotted error would probably not appear.


Of course, there is another explanation: "director" may be a deliberate suggestion of a possibly correct word, because southern speakers do not clearly distinguish sounds such as "en" and "ch" (the front and back nasals). Is that so? Is this a Baidu error or an advanced Baidu algorithm? Let us keep testing.


Consider the word "grow up". We deliberately mistype it as "stolen goods big". If Baidu took the front and back nasal problem into account, it should suggest "grow up", but its hint is "hidden big". What does that mean? Baidu does not consider the front and back nasal problem, so the earlier hint was a system error. Next we take the query "reward" and deliberately mistype it as "Hang Mulberry": no error hint appears, confirming that this case is not handled. That covers the front nasal; what about the back nasal? We take the query "often" and deliberately change it to the back-nasal form "warp"; Baidu suggests "through repentance", so the back nasal is not considered either. We can therefore be fairly sure that the hint was caused by a Baidu system error.


From the deductions above we can draw the following conclusions. Baidu runs the pinyin annotation program over every entry of its word dictionary and builds the homophone dictionary from the result. The two dictionaries are therefore the same size, and the homophone dictionary grows as the word dictionary grows. Polyphones get no special treatment during annotation: a word containing a polyphone is annotated with every combination of readings, and all of them go into the homophone dictionary. Such a dictionary obviously contains many errors.
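To see how annotating every combination of readings produces wrong homophones, consider the following sketch. The table `READINGS` and the function `all_readings` are hypothetical, and the two Chinese words are stand-ins for the "theatre" and "director" examples; the point is only that a dictionary-level annotator cannot tell which reading of a polyphone a word actually uses.

```python
from itertools import product

# Toy reading table (toneless pinyin). "长" is a polyphone: zhang / chang.
READINGS = {"局": ["ju"], "剧": ["ju"], "场": ["chang"], "长": ["zhang", "chang"]}


def all_readings(word):
    """Every pinyin string obtained by combining each character's readings."""
    per_char = [READINGS[ch] for ch in word]
    return [" ".join(combo) for combo in product(*per_char)]


print(all_readings("剧场"))  # a single reading
print(all_readings("局长"))  # two readings, one of which is spurious
```

Because "ju chang" is generated for both words, a lookup of "ju chang" returns both, which mirrors the unexpected "director" hint.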


A final question: does Baidu spell-check English? Let us try. The query "Chinese" returns many results; it is a pleasant surprise that Baidu, which focuses on Chinese search, can also search English. Change the query to "Chine" and there is a bigger surprise: Baidu shows the hint "Eat to hold". We have inadvertently triggered Baidu's pinyin search function. Do pinyin search and Chinese spell checking share the same dictionary? Let us experiment. Searching for "Rongji" gives the hint "Ficus solvent volume"; switching to the Chinese query "Jong" gives the same hint, "Ficus solvent volume". So the two appear to use the same homophone dictionary. In other words, Baidu's Chinese error correction and pinyin retrieval use the same mechanism; Chinese error correction merely adds a pinyin annotation step. Is this the legendary pinyin hint function, "in fact a very powerful pinyin input method"?


Finally, let us summarize Baidu's spell-checking system:


Background job:


(1) As the previous article described, Baidu's word segmentation uses at least two dictionaries: a common dictionary and a special dictionary (proper names and so on). Baidu's pinyin annotation program scans every entry in all the dictionaries in turn and annotates it with pinyin. For a polyphone, every reading is annotated; for example, "grow up" is annotated as two entries, "zhang da" and "chang da".


(2) From the annotated entries, build the homophone dictionary. For "grow up" above there are two entries: "zhang da → grow up" and "chang da → grow up".


(3) Use frequency information from the user query log to give each Chinese entry a weight.


(4) The homophone dictionary is now complete. As the word dictionary gradually expands, the homophone dictionary expands in step.
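The background job can be sketched as a single pass over the dictionaries. Everything named here is an assumption for illustration: `build_homophone_dict`, the `readings` table, the query counts used as weights, and the two one-word "dictionaries".

```python
from collections import defaultdict
from itertools import product


def build_homophone_dict(dictionaries, readings, query_counts):
    """Map each pinyin string to its words, highest query count first."""
    index = defaultdict(list)
    for dictionary in dictionaries:              # step (1): scan every dictionary
        for word in dictionary:
            per_char = [readings[ch] for ch in word]
            for combo in product(*per_char):     # every reading of a polyphone
                key = " ".join(combo)
                if word not in index[key]:       # a word may sit in both dictionaries
                    index[key].append(word)      # step (2): add to the homophone entry
    return {                                     # step (3): order by weight
        key: sorted(words, key=lambda w: query_counts.get(w, 0), reverse=True)
        for key, words in index.items()
    }


readings = {"长": ["zhang", "chang"], "大": ["da"], "张": ["zhang"]}
common, special = ["长大"], ["张大"]  # toy common and special dictionaries
print(build_homophone_dict([common, special], readings, {"长大": 10, "张大": 2}))
```

Re-running the function whenever the word dictionaries grow corresponds to step (4).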


  


Spell check:


(1) The user enters a query. If it consists of multiple substrings, no spell check is performed.


(2) Look the query up in the segmentation dictionary. If an entry for the word is found, no spell check is performed.


(3) If the dictionary does not contain the query, start the spell check: first annotate the user input with the pinyin annotation program.


(4) Look the annotated pinyin up in the homophone dictionary. If it is not found, give no hint.


(5) If an entry is found, output the highest-weighted results in order.
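Steps (1) to (5) translate into a short control flow. This is a sketch under stated assumptions: the function `spell_check` and its parameters are hypothetical, "multiple substrings" is simplified to whitespace-separated pieces (the article does not say how substrings are detected), and `limit=3` is a guess based on the three hints seen earlier.

```python
def spell_check(query, word_dict, homophones, annotate, split=str.split, limit=3):
    """Follow steps (1)-(5); an empty list means "show no hint"."""
    if len(split(query)) > 1:            # (1) several substrings: skip the check
        return []
    if query in word_dict:               # (2) known word: skip the check
        return []
    pinyin = annotate(query)             # (3) label the query with pinyin
    candidates = homophones.get(pinyin)  # (4) look the pinyin up
    if not candidates:
        return []
    return candidates[:limit]            # (5) entries are stored by descending weight


# Toy data: stand-in strings, not Baidu's dictionaries.
word_dict = {"制裁"}
homophones = {"zhi cai": ["制裁", "质材", "纸材", "置财"]}
annotate = {"制才": "zhi cai"}.get       # hypothetical one-entry annotator

print(spell_check("制才", word_dict, homophones, annotate))
print(spell_check("制裁", word_dict, homophones, annotate))
```

Passing `annotate` and `split` as parameters keeps the flow independent of any particular pinyin annotator or segmenter.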


  


Pinyin hint:


(1) Look the pinyin entered by the user up in the homophone dictionary. If it is not found, give no hint.


(2) If an entry is found, output the highest-weighted results in order.
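The observation that the pinyin hint and Chinese error correction share one mechanism can be illustrated as two entry points over a single lookup. The names `lookup`, `pinyin_hint` and `chinese_hint`, the space-free key format and the toy data are all assumptions; the word-dictionary check that precedes Chinese correction is omitted for brevity.

```python
def lookup(pinyin, homophones, limit=3):
    """Shared step: map a pinyin key to its highest-weighted words."""
    key = pinyin.replace(" ", "").lower()  # keys are stored lowercase, without spaces
    return homophones.get(key, [])[:limit]


def pinyin_hint(query, homophones):
    """Pinyin hint: the query is already pinyin."""
    return lookup(query, homophones)


def chinese_hint(query, homophones, annotate):
    """Chinese correction: annotate first, then reuse the same lookup."""
    return lookup(annotate(query), homophones)


# Toy data: stand-in words sharing the pinyin "rong ji".
HOMOPHONES = {"rongji": ["溶剂", "容积"]}
CHAR_PINYIN = {"容": "rong", "机": "ji"}


def annotate(text):
    """Hypothetical annotator covering only the characters in CHAR_PINYIN."""
    return " ".join(CHAR_PINYIN[ch] for ch in text)


print(pinyin_hint("Rongji", HOMOPHONES))
print(chinese_hint("容机", HOMOPHONES, annotate))
```

Both calls return the same list, which is the behaviour observed when a pinyin query and a Chinese query with the same pronunciation produced identical hints.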