Analysis of Baidu's Word Segmentation Algorithm, Part Two (repost)

Source: Internet
Author: User
Spelling error hints (and the pinyin hint feature)

Spelling error hints are a standard search engine feature: when a user submits a query, the search engine checks whether it contains a spelling mistake. For Chinese users, such mistakes are usually caused by the input method. Below we analyze how Baidu implements this feature.

In analyzing the spell check system, we focus on the following questions:

(1) How does the system decide that a user's query is likely to be mistyped?

(2) Once a query is judged likely to be wrong, how does the system suggest the correct word?

So how does Baidu do it? I think Baidu's criterion for judging whether the input is wrong is a dictionary lookup: if the dictionary does not contain the query word, the input is probably a mistake, and the error hint function is triggered. This is easy to verify: for a normal word Baidu shows no error hint, whereas if you deliberately enter a so-called word that no dictionary could contain, Baidu will generally suggest corrected search terms.

So how does Baidu suggest the correct word? Obviously through pinyin. For example, when I enter the query "system only", Baidu offers the suggestions "sanctions, quality materials, paper materials", all of which are homophones of the query. So Baidu must maintain a homophone dictionary that stores homophone information, with entries such as "zhi cai → sanctions, quality materials, paper materials", together with a pinyin annotation program. The basic process is now visible: the user enters "system only"; the dictionary lookup finds no such word; the pinyin annotation program labels "system only" as "zhi cai"; the homophone dictionary is then searched and returns "sanctions, quality materials, paper materials"; and these are shown to the user as possible correct spellings.
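The basic process just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three tables are toy data invented for the example (the Chinese strings are stand-ins that all read "zhi cai"), tones are ignored, and the names `WORD_DICT`, `CHAR_PINYIN`, `HOMOPHONES`, `annotate` and `suggest` are hypothetical. Nothing here is Baidu's actual code or data.

```python
# Toy data, assumed for illustration only; not Baidu's real dictionaries.
WORD_DICT = {"制裁", "质材", "纸材"}

# Hypothetical per-character pinyin table (tones ignored, one reading each).
CHAR_PINYIN = {"制": "zhi", "质": "zhi", "纸": "zhi",
               "裁": "cai", "材": "cai", "才": "cai"}

# Homophone dictionary: pinyin key -> words that share that pronunciation.
HOMOPHONES = {"zhi cai": ["制裁", "质材", "纸材"]}


def annotate(text):
    """Label each character with pinyin; return None if a character is unknown."""
    syllables = [CHAR_PINYIN.get(ch) for ch in text]
    if None in syllables:
        return None
    return " ".join(syllables)


def suggest(query):
    """Return homophone suggestions, or [] when no hint should be shown."""
    if query in WORD_DICT:          # known word: assume the input is correct
        return []
    key = annotate(query)           # unknown word: label it with pinyin
    if key is None:
        return []
    return HOMOPHONES.get(key, [])  # look the pinyin up in the homophone dictionary


print(suggest("制才"))  # ['制裁', '质材', '纸材']
print(suggest("制裁"))  # [] because the word is already in the dictionary
```

Here the dictionary lookup alone decides whether a hint is attempted, which matches the guess made in the text; a real system may well apply further signals.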

The whole process looks simple, but a few small questions remain. Are all homophones in the dictionary shown to the user? If a pinyin string has 10 homophones, are all of them output? Baidu does not output every homophone; it selects a few according to some criterion. How can we show this? Consider the homophones of the pinyin "liu li". The Ziguang input method offers four words: "flowing, displaced, glaze, fluent". Now see how many Baidu returns. We enter "Li" as the query, deliberately choosing a word the dictionary does not contain, so Baidu's spell check kicks in and Baidu hints: "Coloured glaze Liu Liliu". What does this show? That not all homophones are output, only a selection. What is the selection criterion? My guess is that Baidu compiles statistics from the user query log and outputs the homophones that users query most often. If so, the example shows that users search for "coloured glaze" more often than the others, followed by "Liu Li", and then "Liuli"; it seems everyone likes to look up their own name or the names of people they know.
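The guessed selection rule, "output the homophones that users query most often", might look like the sketch below. The query counts are invented numbers, the Chinese words are illustrative stand-ins that all read "liu li", and `QUERY_COUNTS`, `HOMOPHONES` and `top_suggestions` are hypothetical names; the cut-off of three hints is also an assumption.

```python
import heapq

# Hypothetical query-log counts; the numbers are invented for illustration.
QUERY_COUNTS = {"琉璃": 9000, "刘丽": 7000, "刘莉": 5000, "流利": 3000, "流离": 1000}

# Toy homophone dictionary entry: every word below reads "liu li" (tones ignored).
HOMOPHONES = {"liu li": ["流利", "流离", "琉璃", "刘丽", "刘莉"]}


def top_suggestions(pinyin_key, k=3):
    """Return at most k homophones, most frequently queried first."""
    candidates = HOMOPHONES.get(pinyin_key, [])
    return heapq.nlargest(k, candidates, key=lambda w: QUERY_COUNTS.get(w, 0))


print(top_suggestions("liu li"))  # ['琉璃', '刘丽', '刘莉']
```

With these made-up counts the two rarest homophones are dropped, which is the kind of behaviour observed in the experiment above.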

Another small question: the homophone dictionary contains 2-character and 3-character words, but does it contain 4-character and longer entries? Does it contain single characters? The single-character case is easy to answer without testing: certainly not, because if you enter a single character, who can tell whether it is wrong? Any Chinese character can be found in the dictionary, so there is no basis for judgment. Two-character words are included, as the examples above show. Three-character words are included too: for the query "Midtown Medicine" Baidu hints "Proprietary Chinese Medicine"; change the query to "Heavy city medicine" and the hint is still "proprietary Chinese Medicine"; change it again to "heavy city to" and Baidu still hints "proprietary Chinese medicine". What about 4-character words?

Baidu still gives a hint. Some examples:

Input: Zhinghua Smoke tips Jinghua Cloud

Input: Silent smoke tip of the Jinghua cloud

Input: Static words Yan faint hint of the Jinghua smoke

Are even longer entries hinted as well? Yes. For example, I enter "The Fallen flower World has the Wind army"; anyone who has read classical poetry can guess what this query means. Baidu hints "petal season again every June". What does this show? It shows that the homophone dictionary holds entries of many different lengths, and also that the core of Baidu's Chinese processing technology is that dictionary, which is very large.

However, if the user's query is made up of two or more substrings, Baidu's error hint function goes on strike. For example, for the query "mourning body" Baidu hints "Altivo kicked", but if you enter "I mourn" there is no error hint.

There is a more important question: how are polyphonic characters (characters with more than one reading) handled? Baidu is rather lazy here and does not disambiguate them at all. Before looking at one of Baidu's pinyin annotation errors, let us see how Baidu hints for a polyphonic character. We enter the query "", and Baidu hints "Director of the Theatre"; "the length" has two pinyin readings, "ju zhang/ju chang", so when a character is polyphonic, the words for every reading are hinted. Now the erroneous case: we enter the query "drama often", and Baidu hints "Theater director". The hint "theater" is easy to explain, since it is a homophone, but why is "director" suggested? It shows that Baidu's homophone dictionary contains errors: the entry "ju chang" includes the wrong homophone "director". What does this mistake tell us? First, that Baidu's homophone dictionary is generated automatically, without manual proofreading. Second, that Baidu does not generate it by annotating running text with pinyin and then extracting words with their readings; instead it annotates the entries of a word dictionary directly, so errors caused by polyphonic characters cannot be detected. Had running text been annotated, such an easily spotted annotation error would probably not appear. Of course there is another possible explanation: "director" may be a deliberate suggestion, offered because southern speakers do not clearly distinguish sounds such as "en" and "ch" or front and back nasals. Is that so? Is this a Baidu mistake or an advanced Baidu algorithm? Let us keep testing.

Consider the word "grow up", deliberately mistyped as "stolen goods big". If Baidu took the front and back nasal problem into account, it should hint "grow up", but Baidu's hint is "hidden big". What does that mean? That Baidu does not consider the front and back nasal problem, and the earlier case is a system error. We take the query "reward" and deliberately mistype it as "suspended Mulberry": there is no error hint, which confirms that this case is not considered. Front nasals are not considered; are back nasals? We take ": often" and deliberately change it to the back nasal form "warp"; Baidu hints "through repentance", so back nasals are not considered either. We can therefore be fairly sure that the cause is an error in Baidu's system.

From the deductions above we can conclude that Baidu runs a pinyin annotation program over every entry in its word dictionary and builds the homophone dictionary from the result. The two dictionaries are therefore equally large, and the homophone dictionary grows as the word dictionary grows. Polyphonic characters are not disambiguated during annotation: a word containing one is annotated with every combination of readings, and the homophone dictionary is built from those. Such a dictionary obviously contains many mistakes.
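To make the conclusion concrete, here is a minimal sketch of how annotating dictionary entries without context can produce a wrong homophone. It assumes a hypothetical reading table `READINGS` in which one character has two readings (tones ignored), and a two-word toy dictionary; `build_homophone_dict` is an invented name, and the method is a reconstruction of the guess in the text, not a confirmed description of Baidu's system.

```python
from collections import defaultdict
from itertools import product

# Hypothetical reading table; "长" is given two readings, tones ignored.
READINGS = {"剧": ["ju"], "局": ["ju"], "场": ["chang"], "长": ["zhang", "chang"]}


def build_homophone_dict(words):
    """Annotate every word with all combinations of its characters' readings.

    No context is used, so a polyphonic character contributes every reading.
    Raises KeyError if a character is missing from READINGS.
    """
    index = defaultdict(list)
    for word in words:
        for combo in product(*(READINGS[ch] for ch in word)):
            index[" ".join(combo)].append(word)
    return dict(index)


index = build_homophone_dict(["剧场", "局长"])
print(index["ju chang"])  # ['剧场', '局长']  (the second word is filed under a wrong reading)
print(index["ju zhang"])  # ['局长']
```

Because each word is expanded with `itertools.product`, a word that is actually read "ju zhang" also lands under "ju chang", which is the same kind of spurious entry observed in the "theater"/"director" experiment.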

One last question: does Baidu check spelling in English? Let us try. The query "Chinese" returns plenty of results; Baidu, which focuses on Chinese search, can also search English, a pleasant surprise. Change the query to "Chine": will Baidu surprise us further with the hint "Chinese"? Baidu's hint is "Eat to hold it"; we have inadvertently triggered Baidu's pinyin search function. Do pinyin search and Chinese spell check use the same dictionary? Let us experiment: search for "Rongji" and Baidu hints "Ficus solvent volume"; switch to the Chinese query "tolerance Machine" and Baidu again hints "Ficus solvent volume". So they appear to use the same homophone dictionary. In other words, Baidu's Chinese error correction and pinyin retrieval share one mechanism; Chinese error correction simply adds a pinyin annotation step. Is this the pinyin hint function behind the legendary claim that Baidu is "in fact a very powerful pinyin input method"?
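The shared mechanism can be sketched as a single lookup with two entry points. This is an assumption-laden illustration: the tables are toy data (the Chinese words are stand-ins that all read "rong ji"), keys are stored without spaces so that raw pinyin input matches directly, and `to_key` and `hint` are hypothetical names. The dictionary-membership check for Chinese queries is left out to keep the focus on the shared lookup.

```python
# Toy tables, assumed for illustration only (tones ignored).
CHAR_PINYIN = {"容": "rong", "溶": "rong", "榕": "rong",
               "积": "ji", "剂": "ji", "基": "ji", "机": "ji"}

# Keys are stored without spaces so that typed pinyin can be used as-is.
HOMOPHONES = {"rongji": ["榕基", "溶剂", "容积"]}


def to_key(query):
    """Turn a query into a pinyin key; return None if it cannot be annotated."""
    q = query.strip().lower()
    if q.isascii() and q.isalpha():
        return q                      # already pinyin: no annotation step needed
    try:
        return "".join(CHAR_PINYIN[ch] for ch in q)   # Chinese: annotate first
    except KeyError:
        return None


def hint(query):
    """Look up homophones for either a pinyin query or a Chinese query."""
    key = to_key(query)
    return HOMOPHONES.get(key, []) if key else []


print(hint("Rongji"))                 # ['榕基', '溶剂', '容积']
print(hint("容机") == hint("Rongji"))  # True
```

The only difference between the two paths is the annotation step inside `to_key`, which mirrors the observation that Chinese error correction is pinyin retrieval plus one extra step.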

Finally, let us summarize Baidu's spell check system:

Background job:

(1) As the previous article said, Baidu's word segmentation uses at least two dictionaries: a general dictionary and a special dictionary (proper names and so on). Baidu's pinyin annotation program scans every dictionary in turn and annotates each entry with pinyin. For a polyphonic character, every reading is annotated; for example, "grow up" is labeled with two entries, "zhang da/chang da".

(2) A homophone dictionary is built from the annotated entries. For "grow up" above there are two entries: zhang da → grow up, chang da → grow up.

(3) Frequency information from the user query log is used to give each Chinese entry a weight.

(4) The homophone dictionary is now complete. As the word dictionary gradually expands, the homophone dictionary is expanded in step.

Spell check:

(1) The user enters a query. If it consists of multiple substrings, no spell check is done.

(2) The query is looked up in the word segmentation dictionary. If the entry is found, no spell check is done.

(3) If the dictionary does not contain the query, the spell check system starts: the pinyin annotation program first annotates the user's input with pinyin.

(4) The annotated pinyin is looked up in the homophone dictionary. If it is not found, no hint is given.

(5) If the entry is found, the results with the largest weights are output in order.
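The five spell check steps above can be put together as one small class. This is a sketch under several assumptions: "multiple substrings" is modelled as whitespace-separated parts, which is only one possible reading of the behaviour observed earlier; the weights are invented; and `SpellChecker`, `check` and `max_hints` are hypothetical names rather than anything known about Baidu's implementation.

```python
from dataclasses import dataclass


@dataclass
class SpellChecker:
    word_dict: set        # words known to the segmentation dictionary
    char_pinyin: dict     # character -> pinyin (toy table, tones ignored)
    homophones: dict      # pinyin key -> {word: weight}
    max_hints: int = 3    # assumed cut-off for how many hints are shown

    def check(self, query):
        # (1) Multiple substrings (assumed: whitespace-separated) are not checked.
        if len(query.split()) != 1:
            return []
        # (2) A query found in the dictionary is treated as correct.
        if query in self.word_dict:
            return []
        # (3) Annotate the unknown query with pinyin.
        try:
            key = " ".join(self.char_pinyin[ch] for ch in query)
        except KeyError:
            return []
        # (4) Look the pinyin up; no entry means no hint.
        entries = self.homophones.get(key)
        if not entries:
            return []
        # (5) Output the entries with the largest weights first.
        ranked = sorted(entries, key=entries.get, reverse=True)
        return ranked[: self.max_hints]


checker = SpellChecker(
    word_dict={"制裁", "质材", "纸材"},
    char_pinyin={"制": "zhi", "质": "zhi", "纸": "zhi",
                 "裁": "cai", "材": "cai", "才": "cai"},
    homophones={"zhi cai": {"制裁": 50, "质材": 5, "纸材": 8}},  # invented weights
)
print(checker.check("制才"))       # ['制裁', '纸材', '质材']
print(checker.check("制才 制才"))  # [] because the query has two parts
```

Each early return corresponds to one of the "no spell check" or "no hint" branches in the list, so the order of the checks follows the summary directly.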

  

Pinyin hint:

(1) The pinyin entered by the user is looked up in the homophone dictionary. If it is not found, no hint is given.

(2) If the entry is found, the results with the largest weights are output in order.
