Baidu word segmentation algorithm details page 1/2

Source: Internet
Author: User

This article combines inductive analysis of search results with an analysis of word segmentation algorithms to describe and summarize the query processing and Chinese word segmentation techniques used in Baidu's preprocessing phase. It is easier to follow if you have some understanding of data structures and algorithms. Personally, I do not think the conclusion that forward maximum matching is used is accurate. Words in both the specialized dictionaries and the general dictionary probably carry different weights, and those weights should be related to search frequency. On that basis, when several specialized-dictionary words appear together, a bidirectional maximum matching algorithm is needed to decide which proper name to split off first. Of course, this is a personal conjecture and needs further study.

Understanding word segmentation technology is of great significance to SEO work: it lets you analyze keywords from a scientific perspective and propose keyword deployment strategies. If the forward maximum matching conclusion is correct, we can basically conclude that the weights of the segments produced by word segmentation follow forward (left-to-right) order.

What I would like to know is: which has the higher weight, the specialized dictionary or the common dictionary?

The original text is reproduced as follows:
Query Processing and Word Segmentation
With the rise of the search economy, people have begun to pay more attention to the performance, technology, and daily traffic of search engines around the world. An enterprise decides whether to advertise based on a search engine's popularity and daily traffic. An ordinary Internet user chooses a search engine based on its performance and technology. A technician can take representative search engines as objects of study. The rise of the search engine economy proves once again how large the business opportunities of the Internet are. Without search, the network would be nothing but empty, messy data and a large number of gold mines waiting to be laboriously dug out.
But how do you design an efficient search engine? We can discuss how to design a practical one by looking at the technical means Baidu has adopted. Search engines involve many technical points, such as query processing, ranking algorithms, page crawling algorithms, cache mechanisms, and anti-spam. Commercial search engine providers such as Baidu and Google do not make these technical details public. We can, however, treat an existing search engine as a black box: by submitting inputs and examining the outputs it returns, we can roughly infer the unknown technical details inside.
Query processing and word segmentation are essential tasks for a Chinese search engine. Baidu, as a typical Chinese search engine, has always stressed that in "Chinese processing" it has key technologies and advantages that other search engines do not possess. So let's take a look at what core technologies Baidu actually uses.
Query Processing/Chinese word segmentation.
I. Query Processing
The user submits a query to the search engine. The search engine generally performs some processing after receiving the user's query, and then extracts the relevant information from the index database. So what does Baidu do after receiving user queries?
1. Assume that the user has submitted more than one query term, for example "information retrieval theory tools". The search engine splits the query string into several subquery strings based on delimiters such as spaces and punctuation marks. The query above, for example, is parsed into three substrings. This is simple, so let's move on.
2. Assume that the submitted query contains duplicate content. What does the search engine do? For example, for the query "theory tools theory", Baidu counts the repeated string only once, processing it as the equivalent "theory tools", while Google apparently does not merge; instead, it increases the weight of the repeated substring. How can we draw this conclusion? Submit "theory tools" to Baidu: it returns 341,000 documents. Take note of the results on the first page.
OK, let's continue and submit the query "theory tools theory". The number of returned documents is still the same. Of course, that alone does not prove much, so look at the ordering of the results on the first page. See it? The order is completely unchanged, whereas Google's order changes somewhat. This shows that Baidu merges repeated query terms into one and, in addition, does not take the order in which the terms appear into account (Google does consider this order).
3. Assume that the submitted Chinese query contains English words. How does the search engine handle this? For example, for the query "BT download movie", Baidu keeps the English inside the Chinese string as a whole and uses it as a breakpoint that splits the Chinese on either side into separate parts. Whether the English in the middle is a dictionary word or a random string of characters, it is treated as a whole; to see this, query "dfdfdf Download Movie" and look at the results. The same holds if the query contains numbers.
So far, everything has been simple and clear. To sum up how Baidu processes a user query: first, split the query on delimiters; then check for repeated strings and, if there are any, discard the redundant copies and keep only one; then check for English or numbers and, if there are any, keep them as a whole and cut apart the Chinese before and after them.
What comes next? Now word segmentation has to be considered.
II. Chinese Word Segmentation
First, let's talk about when, or under what conditions, Baidu segments a string. Does Baidu cut up every Chinese string it is given? No. A string has to meet certain conditions before it earns the honor of being cut by Baidu's word segmentation program. Cut every string that comes along? Do you think Baidu sells saw blades?
So what kind of string meets the conditions for cutting? Put simply, if the string contains three Chinese characters or fewer, it is left unchanged. Only when the string is four Chinese characters or longer does Baidu's word segmentation program step in and dismember it.
How can we prove it? Submit the query "Movie Download" to Baidu and look at which parts of the returned results are highlighted in red. The query has been cut into two words, which shows that the segmentation program has kicked in. For strings longer than four Chinese characters the program is even less polite and will not rest until the string is chopped to pieces. Now the three-character case. Submit the query "of course select". The query looks odd, but that is deliberate: we want to see whether this string gets split. Baidu returns 365 pages of results, and turning to the last page we find that the red-highlighted keyword is always the consecutive string "of course select". It seems that there is no splitting, but we cannot be sure yet, so submit the manually split query "of course select" (with a space between the two parts) and look at the result. This returns 1,090,000 documents, a very different result from the unsplit query, so we can basically conclude that no splitting was performed. Of course, there is another explanation: the three characters are split first and the split result is then treated as a phrase query, which would look much the same as no splitting.
However, I tend to judge that Baidu does not split strings of three characters or fewer. Didn't Occam say "do not multiply entities beyond necessity"? Why do useless work? But if there is no splitting, a problem arises: how are unsplit strings retrieved from the index database? This involves the indexing question. I think Baidu probably maintains two sets of indexes, one built on words and one built on N-grams. The details of indexing will be discussed later.
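To make the idea of an N-gram index concrete, here is a minimal sketch of a character-bigram inverted index that could serve short, unsplit query strings. It is an assumption-laden toy: the function names, the choice of n = 2, and the sample documents (plain ASCII strings standing in for text) are all invented for illustration, and nothing here is known about how Baidu actually builds its indexes.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_index(docs, n=2):
    """Map each character n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def candidate_docs(index, query, n=2):
    """Documents containing every n-gram of the query.

    These are only candidates: a real system would still have to verify
    that the n-grams occur contiguously in the document.
    """
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


docs = {1: "abcde", 2: "xbcdy", 3: "abxcd"}  # hypothetical documents
index = build_ngram_index(docs)
print(candidate_docs(index, "bcd"))
```

With such an index a three-character query can be answered without any dictionary lookup, which is consistent with the observation that short strings are not split.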
Next, let's look at which word segmentation algorithm Baidu adopts. Word segmentation algorithms are by now quite mature and range from simple to complex: forward maximum matching, reverse maximum matching, bidirectional maximum matching, language-model methods, shortest-path algorithms, and so on. If you are interested, you can search Google to learn more; I will not discuss them here. But remember that two things determine whether a word segmentation system is good: its ability to resolve ambiguity, and its recognition of unregistered words (words not in the dictionary), such as personal names, place names, and organization names.
So what method does Baidu use? My judgment is that it uses a bidirectional maximum matching algorithm. Let's go step by step through how this is inferred. The starting assumption, of course, is that Baidu will not adopt a very complicated algorithm, because of speed.
We submit the query "Mao Zedong Beijing Hua Yanyun" (typed as one unbroken string), another odd-looking query, though the reason for it is easy to see: I want to find out how Baidu's word segmentation resolves ambiguity and whether it can recognize words that are not registered in the dictionary. With forward maximum matching the output should be "Mao Zedong / Beijing / Hua / Yanyun"; with reverse maximum matching it should be "Mao / Ze / Northeast / Jinghua Yanyun", where "Northeast" is formed by the last character of "Mao Zedong" plus the first character of "Beijing", and "Jinghua Yanyun" by the second character of "Beijing" plus "Hua Yanyun". Baidu's actual result is "Mao Zedong / North / Jinghua Yanyun", a strange output quite different from what we expected. Still, it tells us two things: Baidu's segmentation can recognize personal names, and it can recognize "Jinghua Yanyun", which suggests that it has some ability to recognize unregistered words. We can hypothesize that segmentation happens in two stages. In the first stage a specialized dictionary is consulted, one that contains personal names, some place names, and new words that are absent from the common dictionary. "Mao Zedong" is cut out first in this way, leaving the string "Beijing Hua Yanyun", which is then split as "North / Jinghua Yanyun". This basically makes sense. To test it, we submit the query "Mao Zedong North". We expect one of two results: the forward maximum matching result, or the result predicted by the hypothesis above. Baidu in fact outputs the second. This basically confirms that Baidu's segmentation uses at least two dictionaries, a common dictionary and a specialized dictionary (for personal names and the like), and that the specialized dictionary is applied first, after which the remaining parts are handed to the common dictionary for splitting.
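As a rough illustration of the two baseline algorithms compared above, here is a minimal sketch of forward and reverse maximum matching over a single dictionary. Several things are assumptions made for the example: each pinyin syllable stands in for one Chinese character, words are written as concatenated syllables, and the toy dictionary and `max_len` value are invented so that the sketch reproduces the two expected outputs quoted in the text.

```python
def forward_max_match(chars, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(chars):
        for size in range(min(max_len, len(chars) - i), 0, -1):
            candidate = "".join(chars[i:i + size])
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def reverse_max_match(chars, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(chars)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = "".join(chars[j - size:j])
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


# Toy dictionary (assumption): one syllable = one character.
TOY_DICT = {"maozedong", "dongbei", "beijing", "jinghuayanyun", "yanyun"}
query = "mao ze dong bei jing hua yan yun".split()

print(forward_max_match(query, TOY_DICT))  # Mao Zedong / Beijing / Hua / Yanyun
print(reverse_max_match(query, TOY_DICT))  # Mao / Ze / Northeast / Jinghua Yanyun
```

Neither output matches what Baidu actually returned, which is what motivates the two-dictionary hypothesis.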
Continue testing with the query "Cuba ratio ethics". The expected forward maximum matching result and the expected reverse maximum matching result differ, and Baidu's actual output is the forward one, so from this example it seems that forward maximum matching is used. Some other examples point the same way. But not so fast. Look at the query "Beijing Hua Yanyun": here too the forward and reverse results differ, and this time Baidu outputs the reverse one, which suggests that reverse maximum matching may be used. From this we can guess that Baidu adopts a bidirectional maximum matching algorithm. If the forward and reverse results agree, the answer is simply output directly. But if they disagree, with one result from forward matching and another from reverse matching, how is the conflict resolved?
In the two examples above, Baidu behaves as if it applies a shortest-path rule: the fewer segments a split has, the better. In each case, of the forward and reverse candidates, the one with fewer segments is the one Baidu output. Other similar examples are consistent with this.
But a question remains: what if the forward and reverse results disagree and their shortest-path lengths are the same? Is the forward or the reverse result output?
Let's look at an example. Submit the query "distant ancient Babylon". Baidu's split shows that the dictionary contains "Babylon", but whether it contains "Ancient Babylon" remains uncertain, and at this point we cannot tell whether the output came from forward or reverse segmentation. A second, closely related query is split into "distant / ancient Babylon", which shows that the dictionary does contain "Ancient Babylon" and that the first output was the result of forward maximum matching. So why was the first query not split the way reverse matching would split it? Probably because in this case Baidu chooses the result with fewer single-character words.
Of course, you can keep asking: what if the number of single-character words is also the same after splitting? One last example: the query "Wang Qiang size". Baidu splits it into "Wang / strong / small", which is the forward result; reverse segmentation would give "Wang / Qiang / size". This indicates that when the split is ambiguous and the number of single-character words is also the same, the forward result is chosen.
OK, your head may be spinning by now. To finish, let's summarize Baidu's word segmentation algorithm. Of course, some guesswork remains in it. The algorithm is as follows:
First, look up the specialized dictionary (personal names, some place names, and so on) and cut out the proper names. The remaining parts are segmented bidirectionally. If the forward and reverse results are the same, there is no ambiguity and the result is output directly. If they differ, the shortest-path result (the one with fewer segments) is output. If the lengths are equal, the result with fewer single-character words is chosen. If the single-character counts are also equal, the forward result is chosen.
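The inferred procedure can be sketched end to end as follows. This is a guess at the behavior described, not Baidu's implementation. The assumptions are: pinyin syllables stand in for Chinese characters, both dictionaries are tiny invented sets, proper names are located by a simple longest-match scan (the text does not say how), and the second tie-break is read as "fewer single-character words".

```python
def max_match(chars, dictionary, reverse=False, max_len=4):
    """Greedy maximum matching; scans right to left when reverse=True."""
    seq = chars[::-1] if reverse else list(chars)
    words, i = [], 0
    while i < len(seq):
        for size in range(min(max_len, len(seq) - i), 0, -1):
            piece = tuple(seq[i:i + size])
            if reverse:
                piece = piece[::-1]  # restore reading order
            if size == 1 or "".join(piece) in dictionary:
                words.append(piece)
                i += size
                break
    return words[::-1] if reverse else words


def single_count(words):
    """Number of single-character words in a segmentation."""
    return sum(1 for w in words if len(w) == 1)


def choose(fwd, rev):
    """Tie-break rules as inferred in the text."""
    if fwd == rev:
        return fwd                      # no ambiguity
    if len(fwd) != len(rev):            # shortest path: fewer segments
        return fwd if len(fwd) < len(rev) else rev
    if single_count(fwd) != single_count(rev):
        return fwd if single_count(fwd) < single_count(rev) else rev
    return fwd                          # final fallback: forward result


def segment(chars, special, common, max_len=4):
    result, run, i = [], [], 0

    def flush():
        # Stage 2: bidirectional matching on text between proper names.
        if run:
            result.extend(choose(max_match(run, common, False, max_len),
                                 max_match(run, common, True, max_len)))
            run.clear()

    while i < len(chars):
        # Stage 1: cut out proper names from the specialized dictionary.
        for size in range(min(max_len, len(chars) - i), 0, -1):
            if "".join(chars[i:i + size]) in special:
                flush()
                result.append(tuple(chars[i:i + size]))
                i += size
                break
        else:
            run.append(chars[i])
            i += 1
    flush()
    return ["".join(w) for w in result]


SPECIAL = {"maozedong"}                                   # toy name dictionary
COMMON = {"dongbei", "beijing", "jinghuayanyun", "yanyun",
          "guba", "gubabilun", "lunli", "qiangda", "daxiao"}

print(segment("mao ze dong bei jing hua yan yun".split(), SPECIAL, COMMON))
print(segment("gu ba bi lun li".split(), SPECIAL, COMMON))
print(segment("wang qiang da xiao".split(), SPECIAL, COMMON))
```

Under these toy dictionaries the three calls exercise, in turn, the specialized-dictionary stage followed by the fewer-segments rule, the fewer-segments rule on its own, and the final fallback to the forward result.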
Baidu has long promoted its advantage in "Chinese processing". Judging from the above, there is nothing special about its word segmentation algorithm, and its disambiguation is not ideal. Even if Baidu actually uses an algorithm more complex than the one described above, that can hardly be called an advantage. If Baidu has any advantage, it is only the large specialized dictionary, which registers personal names (such as the elders and the present), titles (such as the old lady), and some place names (such as the UAE). I estimate that Baidu uses a relatively recent named entity recognition algorithm published by the academic community to keep identifying unregistered words from a corpus and gradually expand this specialized dictionary. If that counts as an advantage, it is easy to see how long it can be maintained.
Spelling check error prompt (and Pinyin prompt function)

The spelling check error prompt is a function provided by the search engine. The user submits a query, and the search engine checks whether what the user typed is misspelled. For Chinese users, such errors are generally caused by the input method. Let's analyze how Baidu implements this function.
In analyzing the spelling check system, we focus on the following issues:
(1) How does the system determine that a user's input may be a wrong query?
(2) How does it suggest the correct word for an incorrect query?

So how does Baidu do it? To decide whether user input is correct, I think Baidu looks the query up in a dictionary. If the dictionary does not contain the word, the input may be wrong, and the error prompt function is triggered. This is a reasonable judgment, because for a normal word Baidu generally shows no error prompt, whereas if you deliberately enter a made-up word that is not in the dictionary, Baidu will usually suggest a correct search term.
So how does Baidu suggest the correct word? Evidently it works through pinyin. For example, I enter the query
