An Analysis of Baidu's Word Segmentation Algorithm

Source: Internet
Author: User

Query processing and word segmentation technology





With the rise of the search economy, people have begun to pay closer attention to the performance, technology, and daily traffic of the world's major search engines. Businesses decide whether to place ads based on a search engine's popularity and daily traffic; ordinary users pick their favorite engine based on its performance and technology; and technicians take representative search engines as objects of study. The rise of the search engine economy has once again demonstrated the huge business opportunities the network holds. Without search, the Internet would be nothing but empty, disordered data and a great deal of gold waiting to be dug up.





But how do you design an efficient search engine? We can explore how to design a practical one by examining the techniques Baidu uses. A search engine involves many technical components, such as query processing, ranking algorithms, page crawling algorithms, caching mechanisms, and spam handling. Commercial search engine providers such as Baidu and Google do not make these technical details public. We can, however, treat an existing search engine as a black box: by submitting inputs and examining the outputs, we can roughly infer the unknown technical details inside.





Query processing and word segmentation are essential tasks for a Chinese search engine, and Baidu, a typical Chinese search engine, has long stressed that its "Chinese processing" gives it key technology and advantages that other search engines lack. So let's take a look at what this so-called core technology actually is.





We cover this in two parts: query processing and Chinese word segmentation.





Query Processing





When a user submits a query, a search engine generally processes it first and then retrieves the relevant information from its index. So what does Baidu do after receiving a user's query?





1. Suppose the user submits several query strings at once, for example "information-retrieval theory tools" (信息检索 理论 工具). The search engine first splits the query on separators such as spaces and punctuation into sub-queries, so the query above is parsed into three substrings: <information retrieval, theory, tools>. That much is simple, so let's move on.





2. Suppose the submitted query contains duplicates. How does the search engine handle them? Take the query "theory tools theory" (理论 工具 理论): Baidu treats the repeated string as a single occurrence, so the query is equivalent to "theory tools", whereas Google apparently does not merge duplicates and instead increases the weight of the repeated substring. How do we reach this conclusion? Submit "theory tools" to Baidu: it returns 341,000 documents. Take a rough look at the first page of results. Then submit "theory tools theory": the number of returned documents is unchanged. That alone does not prove much, but look at the ordering of the first page: it is exactly the same, whereas Google's ordering changes. This shows that Baidu collapses duplicate query strings into one and largely ignores the order in which the strings appear (Google does take the order into account).





3. Suppose the Chinese query submitted contains English words, how do the search engines deal with them? For example, the query "film BT download", Baidu's method is the Chinese character string in English as a whole reservation, and as a breakpoint will be Chinese cut apart, so that the above query cut for < film, BT, download, Whether the middle of English is a dictionary can be found in the word or random characters, will be treated as a whole. As for why, you use the query "movie dfdfdf download" to see the results. Of course, if the query contains numbers, it does the same.





So far everything is simple and clear. To summarize how Baidu processes a user query: first, split the query on separators; next, check for duplicate strings and, if there are any, discard the extras and keep only one; finally, check for English letters or digits and, if present, keep each English or numeric run whole and split off the Chinese before and after it.





What comes next? Now word segmentation has to be considered.





Chinese Word Segmentation





First, let's talk about when, or under what conditions, Baidu segments a string. Does Baidu cut every Chinese string it receives? No. A string has to meet certain conditions before Baidu's segmenter deigns to cut it; not just any string qualifies. Do you think Baidu sells saw blades?





So what kind of string qualifies for cutting? Simply put, a string containing three or fewer Chinese characters is left untouched; only when the string is four or more Chinese characters long does Baidu's segmenter go to work and dismember it.
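The length condition can be written as a small check. This is only a sketch: the name `needs_segmentation` is hypothetical, the character range covers just the basic CJK Unified Ideographs block, and the threshold of four is an assumption inferred from the four-character "movie download" example (assumed here to be 电影下载), which Baidu does segment.

```python
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def needs_segmentation(chunk: str, min_len: int = 4) -> bool:
    """Return True if the chunk has enough Chinese characters to be segmented."""
    return len(CJK_CHAR.findall(chunk)) >= min_len


print(needs_segmentation("电影下载"))  # True: four characters
print(needs_segmentation("当然择"))    # False: three characters
```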





How can we verify this? Submit "movie download" (电影下载) to Baidu and look at the terms highlighted in red in the results: it is easy to see that the query has been cut into two words, <movie, download> (<电影, 下载>), so the segmenter has kicked in. For strings longer than four characters the segmenter is even less polite and will not rest until the string is chopped to pieces.

Now look at the three-character case. Submit the query "of-course choose" (当然择). The query looks like neither fish nor fowl because I picked it to see whether the string would be cut into <of course, choose> (<当然, 择>). It returns 365 pages; on the last page, the terms highlighted in red are still contiguous occurrences of the whole string, so it appears that no segmentation took place. To be sure, submit the manually split query (当然 择, with a space): it returns 1,090,000 results, so we can be fairly confident the three-character string was not segmented. Another explanation is possible: three-character strings are also segmented, and the result is then run as a phrase query, which would look much the same as no segmentation. But I lean toward the view that Baidu does not segment strings of three characters or fewer. As the saying goes, "entities should not be multiplied beyond necessity", so why do the extra work?

If there is no segmentation, another question follows: how is an unsegmented string retrieved from the index? This is an indexing issue. I suspect Baidu maintains two indexes, one keyed by word and one keyed by n-gram; the details of indexing will be discussed later.
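To make the n-gram guess concrete, here is a minimal sketch of how a character n-gram index could retrieve a short string that was never segmented. Everything in it is hypothetical: the function names, the choice of bigrams, the in-memory dictionary used as an index, and the toy documents. The article only speculates that such an index exists.

```python
from collections import defaultdict


def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_index(docs: dict[int, str], n: int = 2) -> dict[str, set[int]]:
    """Map each character n-gram to the ids of the documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def lookup(index: dict[str, set[int]], docs: dict[int, str],
           query: str, n: int = 2) -> set[int]:
    """Find documents that contain the unsegmented query as a contiguous string."""
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    # Candidates must contain every n-gram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all n-grams does not guarantee adjacency, so confirm it directly.
    return {doc_id for doc_id in candidates if query in docs[doc_id]}


docs = {1: "当然择优录取", 2: "当然可以选择", 3: "然择当然"}
index = build_ngram_index(docs)
print(lookup(index, docs, "当然择"))  # {1}
```

Document 3 contains both bigrams of the query but not the query itself, which is why the final containment check is needed.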





Next, let's look at which segmentation algorithm Baidu uses. Word segmentation algorithms are by now considered fairly mature and range from simple to complex: forward maximum matching, reverse maximum matching, bidirectional maximum matching, language-model methods, shortest-path algorithms, and so on. Interested readers can search Google to learn more; we will not go into them here. One point to remember: a segmentation system is judged mainly on two things, its ability to resolve ambiguity and its ability to recognize words that are not in the dictionary, such as names of people, places, and organizations.
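For readers unfamiliar with the two basic algorithms named above, here is a minimal sketch of forward and reverse maximum matching. The toy dictionary, the `max_len` limit of four characters, and the fallback to single characters are assumptions made for illustration, and the sample string 古巴比伦理 is assumed to be the Chinese form of the "Cuban than ethics" query examined later.

```python
def forward_max_match(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical toy dictionary.
vocab = {"古巴", "古巴比伦", "巴比伦", "伦理"}
print(forward_max_match("古巴比伦理", vocab))   # ['古巴比伦', '理']
print(backward_max_match("古巴比伦理", vocab))  # ['古巴', '比', '伦理']
```

The two directions disagree on this string, which is exactly the kind of ambiguity the probing queries below are designed to expose.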





So which method does Baidu use? My judgment is the bidirectional maximum matching algorithm. As for how I reach that conclusion, let's take it one step at a time. We first assume that, for speed reasons, Baidu does not use a more complex algorithm.





We submit the query "毛泽东北京华烟云" (Mao Zedong, Beijing, and "Jinghua Yanyun" run together), another nonsensical query, but one with a purpose: I want to see how Baidu's segmenter resolves ambiguity and whether it can recognize out-of-dictionary words. With forward maximum matching the output should be "毛泽东/北京/华/烟云" (Mao Zedong / Beijing / Hua / smoke); with reverse maximum matching it should be "毛/泽/东北/京华烟云" (Mao / Ze / Northeast / Jinghua Yanyun). Baidu's actual result is "毛泽东/北/京华烟云" (Mao Zedong / North / Jinghua Yanyun), a strange output that differs from both expectations.

Still, we can learn something from it. Baidu's segmenter recognizes personal names, and it also recognizes "京华烟云", which suggests it can identify words that are not in an ordinary dictionary. We can hypothesize that segmentation proceeds in two stages. In the first stage a special dictionary is consulted, containing personal names, some place names, and new words missing from ordinary dictionaries; this parses out "毛泽东" first and leaves the string "北京华烟云". The segmentation "北/京华烟云" of that remainder can then be seen as the result of reverse maximum matching. This basically makes sense.

To test the hypothesis, submit the query "毛毛泽东北". We expect one of two segmentations: the forward maximum matching result <毛毛, 泽, 东北>, or the result predicted by the hypothesis above, <毛, 毛泽东, 北>. Baidu in fact outputs the second. So we can be fairly sure that Baidu's segmenter uses at least two dictionaries, a general one and a special one (personal names and so on), and that the special dictionary is applied first, with the remaining fragments handed to the general dictionary.
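The two-stage hypothesis can be sketched as follows. This is an illustration of the guess made above, not a known Baidu procedure: the function names, the leftmost-longest scan for special-dictionary entries, the two toy dictionaries, and the Chinese strings 毛泽东北京华烟云 and 毛毛泽东北 (assumed originals of the "Mao Zedong Beijing China Smoke" and "Hair Mao Zedong North" queries) are all assumptions.

```python
def backward_max_match(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Reverse maximum matching with single characters as a fallback."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def two_stage_segment(text: str, special: set[str], general: set[str],
                      max_len: int = 4) -> list[str]:
    """Stage 1: cut out special-dictionary entries. Stage 2: segment the rest."""
    tokens: list[str] = []
    pending = ""  # text between special entries, not yet segmented
    i = 0
    while i < len(text):
        # Longest special-dictionary entry starting at position i, if any.
        hit = next((text[i:i + n]
                    for n in range(min(max_len, len(text) - i), 0, -1)
                    if text[i:i + n] in special), None)
        if hit:
            if pending:
                tokens += backward_max_match(pending, general, max_len)
                pending = ""
            tokens.append(hit)
            i += len(hit)
        else:
            pending += text[i]
            i += 1
    if pending:
        tokens += backward_max_match(pending, general, max_len)
    return tokens


# Hypothetical toy dictionaries.
special = {"毛泽东"}
general = {"北京", "东北", "京华烟云", "烟云", "毛毛"}
print(two_stage_segment("毛泽东北京华烟云", special, general))  # ['毛泽东', '北', '京华烟云']
print(two_stage_segment("毛毛泽东北", special, general))        # ['毛', '毛泽东', '北']
```

With these toy dictionaries the sketch reproduces both segmentations reported for Baidu, which is consistent with the hypothesis but of course does not prove it.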





Let's keep testing. Submit the query "古巴比伦理", which can be read either as "Cuba / than / ethics" or as "ancient Babylon" followed by one leftover character. Forward maximum matching should give <古巴比伦, 理> (ancient Babylon, li); reverse maximum matching should give <古巴, 比, 伦理> (Cuba, than, ethics). Baidu's actual result is <古巴比伦, 理>, so from this example it seems that forward maximum matching is used, and some other examples point the same way.

But wait: consider the query "北京华烟云". The forward maximum matching result would be <北京, 华, 烟云> (Beijing, Hua, smoke), while the reverse result would be <北, 京华烟云> (North, Jinghua Yanyun). Baidu outputs the latter, which points to reverse maximum matching.

From this we can guess that Baidu uses bidirectional maximum matching. If the forward and reverse results agree, all is well and the result is output directly. But what if they disagree? Judging from the two examples above, Baidu then takes the shortest path, that is, the fewer fragments the better: between <古巴, 比, 伦理> and <古巴比伦, 理> it chooses the latter, and between <北京, 华, 烟云> and <北, 京华烟云> it also chooses the latter. Other similar examples can largely be explained the same way.





A question remains, however: if the forward and reverse segmentations disagree and the path lengths are equal, which result is output, forward or reverse? Consider another example. The query "遥远古古巴比伦" is segmented by Baidu into <遥远, 古古, 巴比伦> (distant, gu-gu, Babylon). This shows that the dictionary contains "巴比伦" (Babylon), but whether it contains "古巴比伦" (ancient Babylon) is unclear, so we cannot yet tell whether this is a forward or a reverse result. Change the query to "遥远古巴比伦" and it is segmented as "遥远/古巴比伦" (distant / ancient Babylon), which shows that the dictionary does contain "古巴比伦"; this in turn indicates that the segmentation of "遥远古古巴比伦" is a forward maximum matching result. Then why is "遥远古古巴比伦" not segmented in reverse as "遥/远古/古巴比伦" (yao / remote antiquity / ancient Babylon)? Baidu's likely rule is that in this case it selects the segmentation with fewer single-character words.





Of course, we can press further: what if the number of single-character words is also the same? One last example: the query "王强大小" is cut by Baidu into "王/强大/小" (Wang / strong / small), which is the forward result; the reverse result would be "王/强/大小" (Wang / Qiang / size). This shows that when there is ambiguity and the single-character counts are equal, the forward segmentation is selected.
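The tie-breaking rules inferred from these probes can be collected into one selection function. This is a sketch of the article's guess, with a hypothetical function name; it takes the forward and reverse segmentations as given lists rather than computing them, and the Chinese token lists are the assumed originals of the examples discussed above.

```python
def choose_segmentation(forward: list[str], backward: list[str]) -> list[str]:
    """Pick one of two candidate segmentations using the inferred rules."""
    # No ambiguity: both directions agree.
    if forward == backward:
        return forward

    # Rule 1 (shortest path): prefer the result with fewer fragments.
    if len(forward) != len(backward):
        return min(forward, backward, key=len)

    # Rule 2: prefer the result with fewer single-character words.
    singles_fwd = sum(len(tok) == 1 for tok in forward)
    singles_bwd = sum(len(tok) == 1 for tok in backward)
    if singles_fwd != singles_bwd:
        return forward if singles_fwd < singles_bwd else backward

    # Rule 3: still tied, so fall back to the forward result.
    return forward


# Fewer fragments wins.
print(choose_segmentation(["古巴比伦", "理"], ["古巴", "比", "伦理"]))
# ['古巴比伦', '理']
print(choose_segmentation(["北京", "华", "烟云"], ["北", "京华烟云"]))
# ['北', '京华烟云']

# Same fragment count: fewer single characters wins.
print(choose_segmentation(["遥远", "古古", "巴比伦"], ["遥", "远古", "古巴比伦"]))
# ['遥远', '古古', '巴比伦']

# Same fragment count and same single-character count: forward wins.
print(choose_segmentation(["王", "强大", "小"], ["王", "强", "大小"]))
# ['王', '强大', '小']
```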





OK, by now you may be a little dizzy, so let's finally summarize Baidu's segmentation algorithm. Some guesswork remains in it, of course. The algorithm is as follows:





First, look up the special dictionary (personal names, some place names, and so on) and cut out the proper names. Then apply the bidirectional segmentation strategy to the remaining parts. If the two segmentations are the same, there is no ambiguity and the result is output directly. If they differ, output the shortest-path result. If the path lengths are equal, select the segmentation with fewer single-character words. If the single-character counts are also equal, select the forward segmentation.





Baidu has long promoted its advantage in Chinese processing. Judging from the analysis above, its segmentation algorithm is nothing special and its disambiguation is not ideal; even if Baidu uses an algorithm somewhat more complex than the one described here, it is hard to call that an advantage. If Baidu has an advantage, it is only its very large special dictionary, which records personal names (for example, "the Big one"), forms of address (such as "the Old Lady"), some place names (such as the UAE), and so on. Presumably Baidu uses relatively recent named entity recognition algorithms from the academic literature to keep identifying new entries in its corpus and gradually expand this special dictionary. If that is the advantage, how long it can last is an obvious question.

