Analysis of Baidu's word segmentation algorithm, part 1: query processing and word segmentation technology

With the rise of the search economy, people have begun to pay closer attention to the performance, technology, and daily traffic of the major search engines around the world. Enterprises decide whether to place ads based on a search engine's popularity and daily traffic; ordinary users choose their favorite engine based on its performance and technology; and technicians take representative search engines as objects of study. The rise of the search economy has once again shown how large the business opportunities of the network are. Without search, the internet would be nothing but empty, messy data and a great many gold deposits waiting to be dug up.
But how do you design an efficient search engine? We can explore how to design a practical one by examining the techniques Baidu uses. A search engine involves many technical points, such as query processing, ranking algorithms, page crawling, caching, and anti-spam. As commercial companies, search providers such as Baidu and Google do not publish these details. We can therefore treat an existing search engine as a black box: submit inputs, observe the outputs, and infer the unknown technical details inside.

Query processing and word segmentation are essential work for a Chinese search engine, and Baidu, as a typical Chinese search engine, has long emphasized that its "Chinese processing" gives it key technology and advantages that other search engines lack. So let's look at what this so-called core technology actually is.

We cover this in two parts: query processing and Chinese word segmentation.

I. Query processing

When a user submits a query, a search engine generally does some processing on it before retrieving the relevant information from its index. So what does Baidu do after receiving a query?

1. Suppose the user submits more than one query string, for example "information retrieval theory tools". The search engine first splits the query on separators such as spaces and punctuation into several subqueries; the query above is resolved into three substrings: <information retrieval, theory, tools>. This much is simple, so let's move on.

2. Suppose the submitted query contains repeated content. How does the search engine handle it? For the query "theory tools theory", Baidu treats the repeated string as a single occurrence, so the query is equivalent to "theory tools". Google apparently does not merge repeats, and instead increases the weight of the repeated substring. How do we reach this conclusion? Submit "theory tools" to Baidu: it returns 341,000 documents, and we take a rough look at the first page of results. Then submit "theory tools theory": the number of returned documents is the same. That alone does not prove much, so look at the ordering of the first page. It is completely unchanged, whereas Google's ordering changes somewhat. This shows that Baidu merges repeated query strings into one and basically ignores the order in which the strings appear (Google does take the order into account).

3. Suppose a Chinese query contains English words. How does the search engine handle them? For the query "movie BT download" (Chinese text with the English string "BT" embedded), Baidu keeps the English string as a single unit and uses it as a break point that cuts the Chinese apart, so the query becomes <movie, BT, download>. The English string is kept whole whether it is a dictionary word or a run of random characters; to see this, try the query "movie dfdfdf download". Numbers in a query are handled the same way.

So far everything is simple and clear. To summarize how Baidu processes a user query: first, split the query on separator characters; next, check for repeated strings and, if there are any, discard the extras and keep only one; finally, check for English or numbers and, if present, keep each English or numeric run whole and cut the Chinese before and after it.

What comes next? The question of word segmentation.


II. Chinese word segmentation

First, consider when Baidu segments at all. Does it cut every Chinese string it receives? No. Baidu's segmenter only cuts a string that meets certain conditions; it is not in the business of selling saw blades. So which strings qualify? Simply put, a string of three or fewer Chinese characters is left intact, while a string of four or more Chinese characters sets the segmenter to work.

How can we show this? Submit "movie download" (four Chinese characters) to Baidu and look at the terms highlighted in red in the results: it is easy to see that the query has been cut into the two words <movie, download>, so the segmenter has run. Strings longer than four characters are cut up all the more readily. Now take a three-character string. Submit the query rendered here as "of course choose", a deliberately odd string chosen to see whether it is cut into <of course, choose>. It returns 365 related pages; on the last page the highlighted keywords still appear as one continuous string, which suggests no segmentation, but this is not conclusive. Submitting the manually split form of the same query (with a space between the two parts) returns 1,090,000 results, so we can be fairly sure the original was not segmented. Another explanation is possible: three-character strings are segmented and the result is then treated as a phrase query, which would look much the same as no segmentation. But I am inclined to conclude that Baidu does not segment strings of three characters or fewer. As Occam's razor has it, "entities should not be multiplied beyond necessity", so why do the extra work?
If there is no segmentation, a related problem arises: how are unsegmented strings retrieved from the index? This is an indexing question. I think Baidu probably maintains two indexing mechanisms, one indexed by word and one indexed by n-gram; the specifics of indexing will be discussed in detail later.

Now let's look at which word segmentation algorithm Baidu uses. Word segmentation algorithms are fairly mature and range from simple to complex: forward maximum matching, reverse maximum matching, bidirectional maximum matching, language-model methods, shortest-path algorithms, and so on. Interested readers can search for more; we will not go into them here. One thing to remember: the quality of a segmentation system comes down to two points. One is its ability to resolve ambiguity; the other is its ability to recognize words missing from the dictionary (out-of-vocabulary words), such as personal names, place names, and organization names.

So what is Baidu's method? My judgment is that it uses bidirectional maximum matching. Let's work through how to infer this step by step. We start from one assumption: Baidu will not use a more complex algorithm, because of speed.
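For reference, forward and reverse maximum matching, two of the algorithms named above, can be sketched as follows. This is a minimal illustration, not Baidu's code: the function names, the `max_len` window, and the toy dictionary are assumptions, and the sample string is a common textbook example rather than one of the queries used in this article.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                result.append(piece)
                i += size
                break
    return result


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                result.append(piece)
                j -= size
                break
    result.reverse()  # pieces were collected from the end of the string
    return result


# Toy dictionary (assumption) chosen so that the two directions disagree.
TOY_DICT = {"研究", "研究生", "生命", "起源"}

print(forward_max_match("研究生命起源", TOY_DICT))  # ['研究生', '命', '起源']
print(reverse_max_match("研究生命起源", TOY_DICT))  # ['研究', '生命', '起源']
```

Bidirectional maximum matching runs both scans and needs a rule for choosing between them when they disagree, which is what the experiments that follow try to uncover.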
We submit the query "毛泽东北京华烟云" (the characters of "Mao Zedong", "Beijing", and the novel title "Jinghua Yanyun" run together, with "Northeast", 东北, straddling the first two). It is a nonsensical query, but a purposeful one: I want to see how Baidu's segmenter disambiguates and whether it recognizes out-of-vocabulary words. With forward maximum matching the output should be "毛泽东/北京/华/烟云" (Mao Zedong / Beijing / Hua / smoke and clouds); with reverse maximum matching it should be "毛/泽/东北/京华烟云" (Mao / Ze / Northeast / Jinghua Yanyun). Baidu's actual result is "毛泽东/北/京华烟云" (Mao Zedong / north / Jinghua Yanyun), a rather strange output that matches neither expectation.

Still, it tells us something: Baidu's segmenter recognizes personal names, and it also recognizes "京华烟云", which means it has some ability to recognize out-of-vocabulary words. We can hypothesize that segmentation proceeds in two stages. In the first stage a special dictionary is consulted, containing personal names, some place names, and new words missing from ordinary dictionaries; this parses out "毛泽东" first and leaves the string "北京华烟云". The split "北/京华烟云" can then be read as the result of reverse maximum matching on the remainder. This basically makes sense.

To test the hypothesis, we submit the query "发毛泽东北". We expect one of two results: plain maximum matching would give <发毛, 泽, 东北> (creepy / Ze / Northeast), while the hypothesis above would give <发, 毛泽东, 北> (hair / Mao Zedong / north). Baidu's output is the second, so we can be fairly sure that Baidu's segmenter uses at least two dictionaries: a general dictionary and a special dictionary (personal names and the like).
Continuing the test, we submit the query "古巴比伦理" (the characters of "Cuba", "ancient Babylon", and "ethics" overlapping). With forward maximum matching the result should be <古巴比伦, 理> (ancient Babylon / reason); with reverse maximum matching it should be <古巴, 比, 伦理> (Cuba / than / ethics). Baidu's actual result is <古巴比伦, 理>, so from this example it appears to use forward maximum matching, and some other examples point the same way. But wait: consider the query "北京华烟云". Forward maximum matching predicts <北京, 华, 烟云> (Beijing / Hua / smoke and clouds), while reverse maximum matching predicts <北, 京华烟云> (north / Jinghua Yanyun). Baidu outputs the latter, which suggests reverse maximum matching.

From this we can guess that Baidu uses a bidirectional maximum matching algorithm. If the forward and reverse results agree, things are easy: output the result directly. But what if forward matching gives one result and reverse matching gives another? Judging from the two examples above, Baidu then takes the shortest path, meaning the fewer segments the better: between <古巴, 比, 伦理> and <古巴比伦, 理> it chooses the latter, and between <北京, 华, 烟云> and <北, 京华烟云> it also chooses the latter. Similar examples can basically be explained the same way.

One question remains: if the forward and reverse results differ but have the same number of segments, which one is output, the forward or the reverse result? Let's look at one more example.
Submit the query "distant ancient Babylon". Baidu cuts it into three segments, the last of which is "Babylon", so the dictionary contains "Babylon"; whether it also contains "ancient Babylon" is uncertain, and at this point we cannot tell whether the result comes from forward or reverse matching. Changing the query slightly yields "distant / ancient Babylon", which means the dictionary does contain "ancient Babylon" and indicates that the result is a forward maximum match. Why, then, is the query not cut the reverse way, into "distant / ancient / ancient Babylon"? Baidu's likely choice in this situation is the segmentation with fewer single-character words.

Of course, one can go on to ask: what if the number of single-character words is also the same? A final example: for the query "王强大小", Baidu produces "王/强大/小" (Wang / strong / small), which is the forward result; reverse matching would give "王/强/大小" (Wang / Qiang / size). So when the results are ambiguous and contain the same number of single-character words, the forward result is chosen.

By now this may be getting a little dizzying, so let's summarize Baidu's segmentation algorithm, keeping in mind that part of it is still speculation. First, look up a special dictionary (personal names, some place names, and so on) and cut out the proper names. Apply a bidirectional segmentation strategy to the remainder. If the forward and reverse results are the same, there is no ambiguity and the result is output directly. If they differ, output the shortest-path result, the one with fewer segments; if the segment counts are equal, choose the result with fewer single-character words; and if those are equal too, choose the forward result.
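The speculated procedure summarized above can be sketched in Python. Everything here is an assumption made for illustration: the function names, the `max_len` window, the two toy dictionaries, and the way proper names are cut out (a greedy left-to-right lookup of multi-character entries). The Chinese query strings are reconstructions of the examples discussed in this article.

```python
def max_match(text, words, max_len=4, reverse=False):
    """Greedy longest-match segmentation, left-to-right or right-to-left."""
    out = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            piece = text[-size:] if reverse else text[:size]
            if size == 1 or piece in words:  # single characters always match
                out.append(piece)
                text = text[:-size] if reverse else text[size:]
                break
    return out[::-1] if reverse else out


def bidirectional(text, words, max_len=4):
    """Pick between forward and reverse results using the inferred tie-breaks."""
    fwd = max_match(text, words, max_len)
    rev = max_match(text, words, max_len, reverse=True)
    if fwd == rev:
        return fwd

    def cost(seg):
        # Fewer segments first, then fewer single-character words.
        return (len(seg), sum(len(w) == 1 for w in seg))

    # min() keeps the first minimal item, so the forward result wins a full tie.
    return min([fwd, rev], key=cost)


def segment(text, special, general, max_len=4):
    """Cut out special-dictionary names first, then segment what lies between."""
    out, buf, i = [], "", 0
    while i < len(text):
        # Only multi-character special entries are considered here.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in special:
                if buf:
                    out.extend(bidirectional(buf, general, max_len))
                    buf = ""
                out.append(text[i:i + size])
                i += size
                break
        else:
            buf += text[i]
            i += 1
    if buf:
        out.extend(bidirectional(buf, general, max_len))
    return out


# Toy dictionaries (assumptions), just large enough for the article's examples.
SPECIAL = {"毛泽东"}
GENERAL = {"北京", "东北", "京华烟云", "烟云", "古巴", "巴比伦",
           "古巴比伦", "伦理", "强大", "大小"}

print(segment("毛泽东北京华烟云", SPECIAL, GENERAL))  # ['毛泽东', '北', '京华烟云']
print(segment("古巴比伦理", SPECIAL, GENERAL))        # ['古巴比伦', '理']
print(segment("王强大小", SPECIAL, GENERAL))          # ['王', '强大', '小']
```

With these toy dictionaries the sketch reproduces the outputs reported for Baidu, which shows only that the inferred rules are consistent with those observations, not that Baidu works this way.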
Baidu has long promoted its advantage in Chinese processing. From the analysis above, its segmentation algorithm is nothing special and its disambiguation is not ideal; even if Baidu uses something more complex than the algorithm described here, it is hard to call that an advantage. If Baidu has an advantage, it is its very large special dictionary, which records personal names (for example, Dae Jang Geum), forms of address (such as "the Old Lady"), and place names (such as the UAE). Baidu presumably uses relatively recent named entity recognition algorithms from the academic literature to keep identifying new entries from a corpus and gradually expand this dictionary. If that is the advantage, then how long it can last is an obvious question.
