Baidu Word Segmentation Algorithm Analysis (1): Query Processing and Word Segmentation Technology

Source: Internet
Author: User

With the rise of the search economy, people have begun to pay closer attention to the performance, technology, and daily traffic of search engines around the world. Enterprises decide whether to advertise on a search engine based on its popularity and daily traffic. Ordinary Internet users pick a favorite search engine based on its performance and technology. Technicians take representative search engines as objects of study. The rise of the search economy proves once again how large the business opportunities of the Internet are. Without search, the network would be nothing but empty, disordered data: a gold mine that could only be worked by laborious digging.
But how do you design an efficient search engine? We can explore how to build a practical one by examining the techniques Baidu uses. A search engine involves many technical components: query processing, ranking algorithms, page crawling algorithms, caching, anti-spam, and so on. As commercial search providers, Baidu and Google do not publish these details. We can, however, treat an existing search engine as a black box: submit inputs to it, observe the outputs it returns, and roughly infer the unknown technical details inside.

Query processing and word segmentation are essential tasks for a Chinese search engine. Baidu, as a typical Chinese search engine, has always stressed that it has key technologies and advantages that other search engines lack. So let's look at what core technologies Baidu actually uses.

Query Processing/Chinese word segmentation.

I. Query Processing

A user submits a query to the search engine. The engine generally preprocesses the query and then retrieves the relevant information from its index. So what does Baidu do after receiving a user query?

1. Suppose the user submits more than one query string, for example "信息检索 理论 工具" (information retrieval, theory, tool). The search engine splits the input into sub-queries at delimiters such as spaces and punctuation marks, so the query above is parsed into three substrings: <信息检索, 理论, 工具>. This step is simple, so let's move on.

2. Suppose the submitted query contains repeated content. What does the search engine do? Take the query "理论 工具 理论" (theory, tool, theory). Baidu counts a repeated string only once, that is, it processes the query as the equivalent "理论 工具" (theory, tool). Google apparently does not merge duplicates; instead it raises the weight of the repeated substring. How do we reach this conclusion? First submit "理论 工具" to Baidu: it returns 341,000 documents. Note the results on the first page. Then submit "理论 工具 理论": the number of returned documents is unchanged. That alone does not prove much, so look at the ranking on the first page: the order is exactly the same, whereas on Google the order changes somewhat. This shows that Baidu merges repeated query strings into one, and that it does not take the order of the strings into account (Google does).

3. Assume that the submitted Chinese query contains English words. How does the search engine handle this? For example, to query "BT download of a movie", Baidu's method is to keep the English in the Chinese string as a whole and use this as the breakpoint to split the Chinese characters separately, in this way, the above query is switched to <movie, BT, download>, regardless of whether the English in the middle is a word in the dictionary or a random character, will be treated as a whole. as for why, you can use "dfdfdf Download Movie" to check the results. of course, this is also true if the query contains numbers.

So far everything is simple and clear. To summarize how Baidu processes a user query: first, split the query at delimiters; next, check for repeated strings and, if there are any, discard the duplicates and keep one copy; then check for English letters or digits and, if any are present, keep each such run as a whole and split off the Chinese text before and after it.
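The three steps summarized above can be sketched in a few lines of Python. This is only an illustration of the inferred behavior, not Baidu's code: the function name `preprocess_query`, the delimiter set, and the choice to treat ASCII letters and digits as the "English or numbers" runs are all assumptions.

```python
import re

# Runs of ASCII letters/digits are kept whole and act as break points
# inside a Chinese string (assumption: ASCII-only matching is enough here).
_ASCII_RUN = re.compile(r"[A-Za-z0-9]+")
# Delimiters: whitespace plus a few ASCII and full-width punctuation marks
# (an assumed set; the real delimiter list is unknown).
_DELIMS = re.compile(r"[\s,.;:!?,。;:!?、]+")


def preprocess_query(query: str) -> list[str]:
    """Split on delimiters, drop repeated substrings, isolate ASCII runs."""
    # Step 1: split the raw query at delimiters.
    parts = [p for p in _DELIMS.split(query) if p]

    # Step 2: keep only the first occurrence of each repeated substring.
    seen = set()
    unique = []
    for p in parts:
        if p not in seen:
            seen.add(p)
            unique.append(p)

    # Step 3: cut each substring at ASCII runs, keeping the runs themselves.
    chunks = []
    for p in unique:
        pos = 0
        for m in _ASCII_RUN.finditer(p):
            if m.start() > pos:
                chunks.append(p[pos:m.start()])
            chunks.append(m.group())
            pos = m.end()
        if pos < len(p):
            chunks.append(p[pos:])
    return chunks
```

Each Chinese chunk that comes out of this step would then be handed to the word segmenter discussed next.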

What happens next? That is a question of word segmentation.

II. Chinese Word Segmentation

First, let's talk about when, or under what conditions, Baidu segments a string. Does Baidu cut up every Chinese string it receives? No. A string has to qualify before Baidu's segmenter will cut it; Baidu is not in the business of selling saw blades. So which strings qualify? Put simply, a string of three Chinese characters or fewer is left as it is. Once a string has four or more Chinese characters, Baidu's segmenter steps in and dismembers it.

How can we show this? Submit "电影下载" (movie download) to Baidu and look at which terms are highlighted in red in the results. It is easy to see that the query has been cut into two words, <电影, 下载>, so the segmenter has gone to work. For strings longer than four Chinese characters the segmenter is even less polite: it always cuts them up.

Now look at the three-character case. We submit the query "当然择". It looks like a strange query, because I chose it hoping to see it segmented into <当然, 择>. Baidu returns 365 related pages. Paging to the end, every red highlight is the consecutive string "当然择", so it seems no split took place. That is not conclusive yet, so submit the manually split query "当然 择": it returns 1,090,000 documents. We can now be fairly sure that no segmentation was performed. There is another possible explanation: the three characters are split first, and the pieces are then queried as a phrase, which would look much the same as no split at all. I nevertheless lean toward the judgment that Baidu does not segment strings of three characters or fewer. As Occam's razor says, entities should not be multiplied beyond necessity, so why do useless work?
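The length condition described above amounts to a simple check before segmentation. The sketch below assumes that "Chinese character" means a code point in the basic CJK Unified Ideographs range and that the threshold is exactly four; both function names are made up for illustration.

```python
def count_cjk(text: str) -> int:
    """Count characters in the basic CJK Unified Ideographs block (assumed range)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def should_segment(text: str, threshold: int = 4) -> bool:
    """Return True when the string is long enough to be handed to the segmenter."""
    return count_cjk(text) >= threshold
```

Under this rule "电影下载" (four characters) is segmented, while "当然择" (three characters) is looked up as a whole.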
If such strings are not segmented, a question arises: how are unsegmented strings looked up in the index? This touches on indexing. I think Baidu probably uses two index mechanisms, one keyed on words and one keyed on N-grams. The details of indexing will be discussed later.

Next, which word segmentation algorithm does Baidu use? Segmentation algorithms are fairly mature by now and range from simple to complex: forward maximum matching, reverse maximum matching, bidirectional maximum matching, language model methods, shortest path algorithms, and so on. If you are interested, you can search for them; I will not go through them here. Keep in mind that two things determine the quality of a segmentation system: its ability to resolve ambiguity, and its ability to recognize out-of-vocabulary words (words not registered in the dictionary), such as person names, place names, and organization names.

So which method does Baidu use? My judgment is bidirectional maximum matching. Let's work through the reasoning step by step, assuming first that Baidu avoids complicated algorithms for speed reasons. We submit the query "毛泽东北京华烟云", a run of characters in which "毛泽东" (Mao Zedong), "东北" (Northeast), "北京" (Beijing), and "京华烟云" overlap. It is another odd-looking query, but there is a reason for it: I want to see how Baidu's segmenter resolves ambiguity and whether it can recognize out-of-vocabulary words. With forward maximum matching the output should be "毛泽东/北京/华/烟云"; with reverse maximum matching it should be "毛/泽/东北/京华烟云". Baidu's actual result is "毛泽东/北/京华烟云", a strange output that differs from both expectations.
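To make the two expected outputs concrete, here is a minimal sketch of forward and reverse maximum matching. It assumes the test query reads "毛泽东北京华烟云" in the original Chinese, and it uses a toy dictionary (`TOY_DICT`) and a maximum word length of four characters; neither reflects Baidu's real dictionary or parameters.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right matching: take the longest dictionary word each time."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:
                result.append(word)
                i += size
                break
    return result


def reverse_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left matching: take the longest dictionary word each time."""
    result = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            word = text[j - size:j]
            if size == 1 or word in dictionary:
                result.append(word)
                j -= size
                break
    result.reverse()  # words were collected from the end of the string
    return result


# Toy dictionary, chosen only so that the example from the text can be replayed.
TOY_DICT = {"毛泽东", "北京", "东北", "烟云", "京华烟云"}

print("/".join(forward_max_match("毛泽东北京华烟云", TOY_DICT)))
print("/".join(reverse_max_match("毛泽东北京华烟云", TOY_DICT)))
```

With this toy dictionary the forward pass yields 毛泽东/北京/华/烟云 and the reverse pass yields 毛/泽/东北/京华烟云, the two expectations stated above. Neither equals what Baidu returned, which is what motivates the two-dictionary hypothesis that follows.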
Still, we learn something from it: Baidu's segmenter recognizes the person name "毛泽东", and it also recognizes "京华烟云", which shows that it can handle out-of-vocabulary words. We can hypothesize that segmentation happens in two stages. In the first stage a special dictionary is consulted; it contains person names, some place names, and new words that ordinary dictionaries lack. "毛泽东" is cut out first, leaving the string "北京华烟云", and "北/京华烟云" can be read as the result of reverse maximum matching on that remainder. This basically makes sense.

To test the hypothesis, we submit the query "发毛泽东北". Two results are possible: forward maximum matching gives <发毛, 泽, 东北>, while the hypothesis above gives <发, 毛泽东, 北>. Baidu outputs the second. This basically confirms that Baidu uses at least two dictionaries, an ordinary dictionary and a specialized dictionary (person names and the like), and that the specialized dictionary is applied first, with the remaining pieces handed to the ordinary dictionary for segmentation.

Continuing the test, submit the query "古巴比伦理". Forward maximum matching should give <古巴比伦, 理>, and reverse maximum matching should give <古巴, 比, 伦理>. Baidu's result is <古巴比伦, 理>. From this example it looks as if forward maximum matching is used, and some other examples point the same way. But wait. Consider the query "北京华烟云".
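The first stage of this hypothesis, cutting proper names out before ordinary segmentation, can be sketched as follows. The function name `cut_proper_names`, the leftmost-longest matching rule, and the one-entry name set are assumptions made for illustration; the test string is assumed to be "发毛泽东北" in the original Chinese.

```python
def cut_proper_names(text, names):
    """Split text into (piece, is_name) pairs, cutting out known names first.

    Names are matched leftmost-first, longest-first (an assumed policy).
    Pieces with is_name=False still need ordinary segmentation afterwards.
    """
    pieces = []
    by_length = sorted(names, key=len, reverse=True)
    start = 0  # start of the current stretch that is not a name
    i = 0
    while i < len(text):
        hit = next((n for n in by_length if n and text.startswith(n, i)), None)
        if hit:
            if start < i:
                pieces.append((text[start:i], False))
            pieces.append((hit, True))
            i += len(hit)
            start = i
        else:
            i += 1
    if start < len(text):
        pieces.append((text[start:], False))
    return pieces


NAMES = {"毛泽东"}  # toy specialized dictionary
print(cut_proper_names("发毛泽东北", NAMES))
```

For "发毛泽东北" the name is cut out and the single characters on either side remain, which matches the <发, 毛泽东, 北> result reported above rather than the plain forward-matching result.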
Forward maximum matching gives <北京, 华, 烟云>, and reverse maximum matching gives <北, 京华烟云>. Baidu outputs the latter, which suggests that reverse maximum matching may be in use. From this we can guess that Baidu uses bidirectional maximum matching: if the forward and reverse results agree, the result is output directly. But what if they disagree? Judging from the two examples above, Baidu then takes the shortest path, that is, the result with fewer segments. Between <古巴, 比, 伦理> and <古巴比伦, 理> it picks the latter, and between <北京, 华, 烟云> and <北, 京华烟云> it also picks the latter. Other similar examples give the same picture.

But a question remains: what if forward and reverse segmentation disagree and the number of segments is the same? Is the forward or the reverse result output? Let's look at an example. Submit the query "遥远古古巴比伦". Baidu splits it into <遥远, 古古, 巴比伦>, which shows that the dictionary contains "巴比伦", but whether it contains "古巴比伦" is still uncertain, and at this point we cannot tell whether this is a forward or a reverse result. Change the query to "遥远古巴比伦": it is split into "遥远/古巴比伦", which shows that the dictionary does contain "古巴比伦", and which indicates that the split of "遥远古古巴比伦" is the forward maximum matching result. So why is "遥远古古巴比伦" not split in the reverse direction, into a result ending in "古巴比伦"? Baidu's rule in this case may be to choose the result with fewer single-character words.
Of course, one can keep asking: what if the number of single-character words is also the same? One last example: "王强大小". Baidu splits it into "王/强大/小", which is the forward result; the reverse result would be "王/强/大小". So when the two results are ambiguous and tie on single-character words, the forward result is chosen.

OK, your head may be spinning by now. To wrap up, here is Baidu's word segmentation algorithm as I infer it; parts of it are still guesswork. First, consult the specialized dictionary (person names, some place names) and cut out the proper names. Segment the remaining parts in both directions. If the two results agree, there is no ambiguity, and the result is output directly. If they disagree, output the shortest-path result, the one with fewer segments. If the number of segments is the same, choose the result with fewer single-character words. If that is also the same, choose the forward result.

Baidu has long promoted its advantage in Chinese processing. Judging from the above, its segmentation algorithm is nothing special and its disambiguation is not ideal. Even if Baidu actually uses something more complex than the algorithm described here, that can hardly be called an advantage. If Baidu has an advantage, it is only its large specialized dictionary, which records person names (such as 大长今, Dae Jang Geum), forms of address (such as 老太太, "old lady"), and some place names (such as 阿联酋, the UAE). My guess is that Baidu uses a fairly recent named entity recognition algorithm from the academic literature to keep discovering out-of-vocabulary words in its corpus and gradually expand this specialized dictionary. If that counts as an advantage, it is easy to see how long it can be maintained.
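The inferred algorithm as a whole can be put together in one short sketch: cut out proper names, run maximum matching in both directions on what remains, and break ties by segment count, then by single-character count, then in favor of the forward result. Everything here is a guess made runnable. The toy dictionaries `NAMES` and `WORDS`, the four-character maximum word length, and the function names are assumptions, and the sketch does not try to reproduce every Baidu output quoted in the text.

```python
def max_match(text, words, max_len=4, reverse=False):
    """Greedy maximum matching, left-to-right or right-to-left."""
    out = []
    lo, hi = 0, len(text)
    while lo < hi:
        for size in range(min(max_len, hi - lo), 0, -1):
            word = text[hi - size:hi] if reverse else text[lo:lo + size]
            if size == 1 or word in words:
                out.append(word)
                if reverse:
                    hi -= size
                else:
                    lo += size
                break
    return out[::-1] if reverse else out


def pick(forward, backward):
    """Tie-breaking as inferred in the text."""
    if forward == backward:
        return forward
    if len(forward) != len(backward):
        return min(forward, backward, key=len)  # fewer segments wins
    singles_f = sum(len(w) == 1 for w in forward)
    singles_b = sum(len(w) == 1 for w in backward)
    # Fewer single-character words wins; a full tie goes to the forward result.
    return backward if singles_b < singles_f else forward


def segment(text, names, words, max_len=4):
    """Stage 1: cut out proper names. Stage 2: bidirectional matching on the rest."""
    result = []
    ordered = sorted(names, key=len, reverse=True)

    def flush(chunk):
        if chunk:
            result.extend(pick(max_match(chunk, words, max_len),
                               max_match(chunk, words, max_len, reverse=True)))

    start = i = 0
    while i < len(text):
        hit = next((n for n in ordered if n and text.startswith(n, i)), None)
        if hit:
            flush(text[start:i])
            result.append(hit)
            i += len(hit)
            start = i
        else:
            i += 1
    flush(text[start:])
    return result


NAMES = {"毛泽东"}  # toy specialized dictionary
WORDS = {"北京", "东北", "烟云", "京华烟云", "古巴", "古巴比伦",
         "巴比伦", "伦理", "强大", "大小", "发毛"}  # toy ordinary dictionary

for query in ["毛泽东北京华烟云", "发毛泽东北", "古巴比伦理", "王强大小"]:
    print(query, "->", "/".join(segment(query, NAMES, WORDS)))
```

With these toy dictionaries the four queries come out as 毛泽东/北/京华烟云, 发/毛泽东/北, 古巴比伦/理, and 王/强大/小, the same results the text reports for Baidu.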
