Baidu word segmentation algorithm in detail (page 1 of 2)

Source: Internet
Author: User
This article examines Baidu's query preprocessing stage and its Chinese word segmentation by inductively analyzing search results and comparing them with general word segmentation algorithms. In short, it is relatively easy to follow if you have some understanding of data structures and algorithms. My personal feeling is that forward maximum matching alone is not accurate enough. Words in both the special dictionary and the ordinary dictionary carry different weights, which should be related to search frequency. On that basis, when several special-dictionary words appear in one query, a bidirectional maximum matching algorithm is needed to decide which proper noun should be cut out first. Of course, this is only a personal guess and remains to be confirmed.

Understanding word segmentation is of great significance for SEO work: it lets you analyze keywords and plan a keyword deployment strategy from a scientific point of view. If the forward maximum matching assumption is correct, it follows that the weights of the words produced by segmentation are ordered in the forward direction.

Another thing I would like to understand: between the special dictionary and the ordinary dictionary, which carries the higher weight?

The original article is reproduced below:
Query processing and word segmentation technology
With the rise of the search economy, people have begun to pay closer attention to the performance, technology, and daily traffic of the world's major search engines. Enterprises decide whether to place advertisements based on a search engine's popularity and daily traffic; ordinary users choose their preferred engine based on its performance and technology; and technical people take representative search engines as objects of study. The rise of the search economy has proved once again how large a business opportunity the network contains. Without search, the internet would be nothing but empty, messy data, with a great many gold deposits still waiting to be dug up.
But how do you design an efficient search engine? We can use Baidu's technology as a model for exploring how a practical search engine is designed. A search engine involves many technical points, such as query processing, ranking algorithms, page crawling algorithms, caching mechanisms, and anti-spam measures. Commercial search engine providers such as Baidu and Google do not make these technical details public. We can, however, treat an existing search engine as a black box: submit input to it and infer the unknown technical details inside from its output.
Query processing and word segmentation are essential work for any Chinese search engine, and Baidu, as a typical Chinese search engine, has long stressed that its "Chinese processing" gives it key technology and advantages that other search engines lack. So let's take a look at what this so-called core technology actually is.
The discussion is divided into two parts: query processing and Chinese word segmentation.
First, query processing
After a user submits a query, a search engine generally does some processing on it and then retrieves the relevant information from its index database. So what does Baidu do after receiving a user query?
1. Suppose the user submits more than one query string, such as "Information retrieval theory tool". The search engine first splits the query on separators such as spaces and punctuation into several subqueries; the query above is split into three substrings. This much is simple, so let's move on.
2. Suppose the query contains duplicate content. How does the search engine handle it? For the query "theory tool theory", Baidu treats the repeated string as occurring only once, so the query is equivalent to "theory tool". Google apparently does not merge the duplicates but instead increases the weight of the repeated substring. How do we reach this conclusion? Submit "theory tool" to Baidu: it returns 341,000 documents. Take a rough look at the first page of results.
Next, submit the query "theory tool theory" and look at the results again. The number of returned documents is the same. That alone does not prove much, so look at the ordering of the first page of results. It is completely unchanged, whereas Google's ordering changes somewhat. This suggests that Baidu merges duplicate query strings into one and largely ignores the order in which the strings occur (Google does take this order into account).
3. Suppose a Chinese query contains English words. How does the search engine handle it? For the query "film bt download", Baidu keeps the English run inside the Chinese string as a single unit and uses it as a break point, cutting the Chinese text before and after it apart. Whether the English in the middle is a dictionary word or a random string of characters, it is treated as a whole. To see why, try the query "movie dfdfdf download" and look at the results. The same holds if the query contains numbers.
So far everything is simple and clear. How does Baidu process a user query? In summary: first split the query on separator symbols; then check for duplicate strings and, if there are any, discard the redundant copies and keep only one; then check for English or numbers and, if present, keep each English or numeric run whole and cut the Chinese text before and after it.
What comes next? It is time to consider word segmentation.
Second, Chinese word segmentation
First, the timing or conditions under which Baidu segments: does Baidu cut every Chinese string it receives? Not at all. A string has to meet certain conditions before it earns the honor of being cut by Baidu's segmenter; you can't just cut any string that comes along. Do you think Baidu sells saw blades?
So what kind of string qualifies for cutting? Put simply, a string of 3 or fewer Chinese characters is left untouched; once the string reaches 4 or more Chinese characters, Baidu's segmenter gets going and dismembers it.
How can we prove it? Submit "movie downloads" to Baidu and look at the words highlighted in red in the results. It is not hard to see that the query has been cut into two words, which shows the segmenter has started working. For strings longer than 4 characters the segmenter is even less polite and gleefully chops them into pieces. Now take the three-character case and submit the query "Of course optional". The query looks like nonsense, but that is because I want to see whether the string gets cut. Baidu returns 365 related pages; turning to the last page, the highlighted keywords still appear only as the contiguous string "Of course optional", so it seems no segmentation took place. To be sure, submit the manually split query "of course choice": it returns 1,090,000 results, so we can be fairly certain the original query was not segmented. Another possible explanation is that three-character strings are segmented too and the pieces are then run as a phrase query, which would look the same as no segmentation.
But I am inclined to conclude that Baidu does not segment strings of 3 characters or fewer. After all, as the saying goes, "entities should not be multiplied beyond necessity", so why do useless work? If there is no segmentation, a follow-up question arises: how are these unsegmented strings retrieved from the index? This touches on indexing. I think Baidu probably uses two indexing mechanisms, one indexed by word and one indexed by n-gram. The specifics of indexing will be discussed in detail later.
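The n-gram index is only the author's guess, so the following is merely a minimal sketch of what a character n-gram inverted index looks like in general, not a description of Baidu's index. The function names and the toy documents are hypothetical.

```python
from collections import defaultdict

def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_ngram_index(docs, n):
    """Map each character n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index

# Toy documents (assumed data, for illustration only).
DOCS = {1: "abcde", 2: "xbcdy", 3: "bc de"}
INDEX = build_ngram_index(DOCS, n=3)

# A three-character query can be looked up directly, with no segmentation step.
print(sorted(INDEX["bcd"]))  # [1, 2]
```

With such an index, a short unsegmented query matches only documents where its characters occur contiguously, which is consistent with the behavior observed for the three-character query.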
Now let's look at which word segmentation algorithm Baidu uses. Word segmentation algorithms are by now fairly mature and range from simple to complex: forward maximum matching, reverse maximum matching, bidirectional maximum matching, language model methods, shortest path algorithms, and so on. Interested readers can search for them on Google; I won't go into them here. One point to remember, though: judging a segmentation system comes down to two things, its ability to resolve ambiguity and its ability to recognize words that are not in the dictionary (unregistered words), such as names of people, places, and organizations.
So which method does Baidu use? My judgment is that it uses bidirectional maximum matching. As for how to deduce this, let's go step by step. The first assumption here is that Baidu does not use a more complex algorithm, for reasons of speed.
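For readers who have not met the two basic algorithms, here is a minimal Python sketch of forward and reverse maximum matching. The toy dictionary and the test string 研究生命起源 are a common textbook illustration and are not taken from this article; `max_len` (the longest dictionary word considered) is an assumed parameter.

```python
def forward_max_match(text, words, max_len=4):
    """Scan left to right, always taking the longest dictionary word that fits."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens

def reverse_max_match(text, words, max_len=4):
    """Scan right to left, always taking the longest dictionary word that fits."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in words:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the string
    return tokens

WORDS = {"研究", "研究生", "生命", "起源"}  # toy dictionary (assumption)
print(forward_max_match("研究生命起源", WORDS))  # ['研究生', '命', '起源']
print(reverse_max_match("研究生命起源", WORDS))  # ['研究', '生命', '起源']
```

The two directions disagree on this string, which is exactly the kind of ambiguity a bidirectional strategy has to resolve.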
We submit the query "Mao Zedong Beijing China Smoke" (毛泽东北京华烟). It is another nonsensical query, but nonsensical for a reason: I want to see how Baidu's segmenter resolves ambiguity and whether it can recognize words that are not in the dictionary. With forward maximum matching the output should be "Mao Zedong/Beijing/China/Smoke" (毛泽东/北京/华/烟); with reverse maximum matching it should be "Mao/ze/Northeast/Jinghua Smoke" (毛/泽/东北/京华烟). Baidu's actual result is "Mao Zedong/North/Jinghua Smoke" (毛泽东/北/京华烟), a strange output that differs from both expectations. Still, it tells us the following: Baidu's segmenter can recognize personal names, and it can also recognize "Jinghua smoke", which means it has some ability to recognize unregistered words.
We can therefore suppose that segmentation proceeds in two stages. In the first stage a special dictionary is consulted, containing names of people, some place names, and new words that an ordinary dictionary lacks. This parses out "Mao Zedong" first and leaves the string "Beijing China Smoke" (北京华烟), and "North/Jinghua Smoke" can be seen as the result of reverse maximum matching on that remainder. That basically makes sense.
To test this, we submit the query "Hair Mao Zedong North". We expect one of two segmentations: one from forward maximum matching, the other from the hypothesis above. Baidu's output is in fact the second, so we can basically conclude that Baidu's segmenter uses at least two dictionaries, a general dictionary and a special dictionary (names and so on), and that the special dictionary is applied first, after which the remaining fragments are handed to the general dictionary for cutting.
Let's continue testing. Submit the query "Cuban than ethics". Forward maximum matching and reverse maximum matching give different results for this query, and Baidu's actual output is the forward one, so from this example Baidu seems to use forward maximum matching. There are other examples that also point to forward maximum matching. But wait: look at the query "Beijing Hua Yun". Here too the forward and reverse results differ, and Baidu's output is the reverse one, which suggests reverse maximum matching. From this we can guess that Baidu uses bidirectional maximum matching. If the forward and reverse results agree, all is well and the result can be output directly. But if they disagree, with forward matching giving one result and reverse matching another, what should be done?
Judging from the two examples above, in this case Baidu applies the shortest path method, that is, the fewer fragments the better: in each of the two comparisons, the segmentation with fewer fragments was the one chosen. Other similar examples can basically be explained the same way.
But a question remains: if the forward and reverse segmentations disagree and their path lengths are the same, what then? Is the forward or the reverse result output?
Let's look at one more example. Submit the query "Remote Ancient Babylon". From the way Baidu splits this query, the dictionary contains "Babylon", but whether it contains "ancient Babylon" is uncertain, and at this point we cannot tell whether the result comes from forward or reverse segmentation. Change the query to "Distant Ancient Babylon" and it is cut into "distant/Ancient Babylon", which means the dictionary does contain "Ancient Babylon" and shows that the segmentation of the first query is the result of forward maximum matching. So why is the first query not cut in the reverse direction into "remote/Ancient/Ancient Babylon"? Baidu's likely rule in this situation is to choose the segmentation with fewer single-character words.
Of course, you can keep asking: what if the number of single-character words is also the same? One last example: for the query "Wang Qiang:", Baidu's output is "King/strong/small", which is the forward segmentation; the reverse segmentation would be "king/strong/Size". This indicates that when there is ambiguity and the single-character counts are equal, the forward result is chosen.
OK, by this point you may be a little dizzy, so let's summarize Baidu's segmentation algorithm. Part of it is still guesswork. The algorithm is as follows:
First look up the special dictionary (names of people, some place names, and so on) and cut out the proper names. Apply the bidirectional segmentation strategy to what remains. If the two directions give the same result, there is no ambiguity and the result is output directly. If they differ, output the shortest path result, that is, the one with fewer fragments. If the lengths are equal, choose the result with fewer single-character words. If that count is also equal, choose the forward result.
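The summarized procedure can be written down as a short Python sketch. It follows the guessed rules literally and is not Baidu's implementation; the function names, the `max_len` limit, and both toy dictionaries are assumptions, and the test string 毛泽东北京华烟 is the Chinese form of the "Mao Zedong Beijing China Smoke" query.

```python
def forward_mm(text, words, max_len=4):
    """Forward maximum matching; unknown characters become single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens

def reverse_mm(text, words, max_len=4):
    """Reverse maximum matching, done as forward matching on the reversed string."""
    flipped = forward_mm(text[::-1], {w[::-1] for w in words}, max_len)
    return [t[::-1] for t in reversed(flipped)]

def pick(forward, backward):
    """Tie-breaking: identical -> fewer fragments -> fewer single characters -> forward."""
    if forward == backward:
        return forward
    if len(forward) != len(backward):
        return min(forward, backward, key=len)
    singles_f = sum(len(t) == 1 for t in forward)
    singles_b = sum(len(t) == 1 for t in backward)
    return backward if singles_b < singles_f else forward

def segment(text, special, general, max_len=4):
    """Cut out special-dictionary names first, then segment the leftovers both ways."""
    result, residue, i = [], "", 0

    def flush():
        nonlocal residue
        if residue:
            result.extend(pick(forward_mm(residue, general, max_len),
                               reverse_mm(residue, general, max_len)))
            residue = ""

    while i < len(text):
        sizes = range(min(max_len, len(text) - i), 0, -1)
        name = next((text[i:i + n] for n in sizes if text[i:i + n] in special), None)
        if name:
            flush()
            result.append(name)
            i += len(name)
        else:
            residue += text[i]
            i += 1
    flush()
    return result

SPECIAL = {"毛泽东"}                 # toy special dictionary (assumption)
GENERAL = {"北京", "东北", "京华烟"}  # toy general dictionary (assumption)
print(segment("毛泽东北京华烟", SPECIAL, GENERAL))  # ['毛泽东', '北', '京华烟']
```

With these toy dictionaries the sketch reproduces the "Mao Zedong/North/Jinghua Smoke" output reported above: the name is removed first, and the remainder is resolved in favor of the reverse result because it has fewer fragments.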
Baidu has long promoted its advantage in Chinese processing. Judging from the above, its segmentation algorithm is nothing special and its disambiguation is far from ideal. Even if Baidu uses something more complex than the algorithm above, it is hard to call that an advantage. If Baidu has an advantage at all, it is its very large special dictionary, which records personal names (for example, "long today"), titles (such as "Old Lady"), and some place names (such as the United Arab Emirates). Baidu presumably uses a fairly recent named entity recognition algorithm published by the academic community to keep identifying new words from a corpus and gradually expand this dictionary. If that is the advantage, the obvious question is how long it can be maintained.
Spell checking: spelling error hints (and the pinyin hint feature)
  
Spelling error hinting is a feature that search engines provide: when a user submits a query, the engine checks whether the input contains a spelling error. For Chinese users, such errors are generally caused by the input method. Let's analyze how Baidu implements this feature.
In analyzing the spell checking system we focus on the following questions:
(1) How does the system decide that the user's input is likely a mistaken query?
(2) If the input is judged to be a likely mistake, how does it suggest the correct word?
  
So how does Baidu do it? I think Baidu's criterion for a mistaken input is a dictionary lookup: if the dictionary does not contain the word, the input is probably a mistake, and the error hint feature is triggered. This is easy to check: with a normal word Baidu shows no error hint, but if you deliberately enter a so-called word that no dictionary could contain, Baidu will generally suggest a corrected search term.
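The detection step guessed at here amounts to a membership test against a dictionary. A minimal sketch, with a hypothetical function name and a toy English lexicon standing in for the real dictionary:

```python
def needs_spelling_hint(query, lexicon):
    """Flag a query as a likely typo when it is not a known dictionary word."""
    return query not in lexicon

LEXICON = {"search", "engine", "segmentation"}  # toy lexicon (assumption)

print(needs_spelling_hint("search", LEXICON))  # False: known word, no hint
print(needs_spelling_hint("serach", LEXICON))  # True: unknown word, hint triggered
```

This covers only question (1); how a correction is chosen is a separate step.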
So how does Baidu suggest the correct word? Evidently through pinyin, as with my input query