Baidu's keyword algorithm relies on word segmentation to analyze keywords. Understanding it lets us choose keywords more accurately and makes our SEO work more effective.
Understanding word segmentation technology matters a great deal for SEO work: it lets you analyze keywords on a principled basis and plan a keyword deployment strategy. If Baidu does use the forward maximum matching algorithm, it follows that the segments produced from a query are weighted in forward (left-to-right) order.
How do you design an efficient search engine? We can approach the question by looking at the techniques Baidu uses. A search engine involves many technical components, such as query processing, ranking algorithms, page crawling algorithms, caching mechanisms, and anti-spam measures. Commercial search providers such as Baidu and Google do not make these technical details public. We can, however, treat an existing search engine as a black box: by submitting inputs and examining the outputs it returns, we can roughly infer the unknown technical details inside.
This article describes and summarizes the query processing and Chinese word segmentation techniques in Baidu's preprocessing stage, based on an inductive analysis of search results and on common word segmentation algorithms. It will be easier to follow if you have some background in data structures and algorithms. In my view, concluding that Baidu uses plain forward maximum matching is not accurate enough: specialized dictionaries and general dictionaries carry different weights, and those weights are presumably related to search frequency. On that basis, when words from several specialized dictionaries appear in one query, a bidirectional maximum matching algorithm would be needed to decide which specialized word to split off first. Of course, this is only my conjecture and needs further study.
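To make the two algorithms named above concrete, here is a minimal Python sketch of forward, backward, and bidirectional (two-way) maximum matching. It is an illustration under stated assumptions, not Baidu's implementation: the function names, the `max_len` limit, the tiny demo dictionary, and the tie-breaking rule (fewer tokens, then fewer single-character tokens, then the backward result) are all assumptions chosen for the example, and the dictionary weighting conjectured above is not modeled.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_max_match(text, dictionary, max_len=4):
    """Run both scans and keep the result that looks more plausible."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        # Assumed heuristic: fewer tokens means longer, more likely words.
        return fwd if len(fwd) < len(bwd) else bwd
    fwd_singles = sum(1 for t in fwd if len(t) == 1)
    bwd_singles = sum(1 for t in bwd if len(t) == 1)
    # Assumed heuristic: fewer leftover single characters wins; ties go backward.
    return fwd if fwd_singles < bwd_singles else bwd


# Hypothetical four-word dictionary, for illustration only.
DEMO_DICT = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", DEMO_DICT))        # ['研究生', '命', '起源']
print(bidirectional_max_match("研究生命起源", DEMO_DICT))  # ['研究', '生命', '起源']
```

The demo string shows why the forward scan alone can go wrong: it greedily takes the three-character word first and strands a single character, while the two-way comparison picks the split with no leftover characters.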
Query Processing and Word Segmentation
With the rise of the search economy, people have begun to pay more attention to the performance, technology, and daily traffic of search engines around the world. An enterprise decides whether to place advertisements based on a search engine's popularity and daily traffic. An ordinary Internet user picks a search engine based on its performance and technology. A technician can take representative search engines as objects of study. The rise of the search economy proves once again how large the business opportunities of the Internet are. Without search, the network would be nothing but an empty, disorderly mass of data, full of gold mines still waiting to be laboriously dug out.
Query processing and word segmentation are essential tasks for a Chinese search engine. Baidu, as a typical Chinese search engine, has always stressed that in "Chinese processing" it has key technologies and advantages that other search engines lack. So let's take a look at which core technologies Baidu uses.
Query Processing/Chinese word segmentation.
I. Query Processing
A user submits a query to the search engine. The engine generally preprocesses the query after receiving it and only then retrieves the relevant information from its index database. So what does Baidu do after receiving a user query?
1. Assume that the submitted Chinese query contains English words. How does the search engine handle this? For example, for the query "BT download movie", Baidu keeps the English inside the Chinese string intact as a single unit and uses it as a breakpoint, splitting the Chinese text around it into separate parts. The query above is segmented this way, and whether the English in the middle is a dictionary word or a random string of characters, it is treated as a single unit. To check this, query "dfdfdf Download Movie" and look at the results. The same holds when the query contains numbers.
2. Assume that the user has submitted more than one query string, for example, "information retrieval theory tools". The search engine splits the query into several subquery strings on delimiters such as spaces and punctuation marks. The query above, for example, is parsed into three substrings. This principle is simple, so let's move on.
3. Assume that the submitted query contains duplicate content. What should the search engine do? For example, for the query "theory tool theory", Baidu counts the repeated string only once, that is, it processes the query as the equivalent "theory tool". Google apparently does not merge the duplicates and instead increases the weight of the repeated substring. How can we draw this conclusion? First we submit "theory tool" to Baidu, which returns 341,000 documents, and note the results on the first page.
Then we submit the query "theory tool theory". The number of returned documents is still the same. On its own, this does not prove much, so let's look at the order of the results on the first page. The order is completely unchanged, while Google's order changes somewhat. This suggests that Baidu merges repeated query strings into one and, in addition, does not consider the order in which the strings appear (Google does consider this order).