What matters most in a search engine? Some people say the accuracy of its query results, others say the richness of those results, but in fact neither of these is the most critical part of a search engine. For a search engine, the most critical thing is query time. Imagine querying a keyword in the Baidu interface and having to wait five minutes for your results to come back; the inevitable outcome is that you would quickly abandon Baidu.
To satisfy these demanding speed requirements (commercial search engines now measure query time in microseconds), search engines use caching to support queries. In other words, the results we see are not computed at query time; they have already been cached on the server. So what is the overall workflow of a search engine? We can understand it as a three-stage process.
This article is only a general walkthrough of these three working stages; some of them will be explained separately and in detail in other articles.
I. Web page collection.
Web page collection is what we usually describe as spiders crawling web pages. For spiders (which Google calls robots), the pages they are interested in fall into three categories:
1. New pages the spider has never crawled.
2. Pages the spider has crawled before, but whose content has since changed.
3. Pages the spider has crawled before, but which have since been deleted.
How to discover and crawl these three kinds of pages efficiently is the original intent and purpose of spider design. This raises a question: where does a spider's crawl start?
Any webmaster whose site has not been severely penalized can find the industrious spider visiting the site through the server logs in the site backend. But have you ever wondered, from the point of view of the program's author, how the spider actually gets there? Opinions on this differ. One view holds that spiders crawl outward from seed sites (also called high-weight sites), level by level from high weight to low. Another view holds that there is no obvious order within the URL set a spider crawls; instead, the search engine calculates, from the update patterns of your site's content, the best time to crawl your site, and then crawls it accordingly.
In fact, the crawl starting point will certainly differ between search engines. For Baidu, Mr.zhao leans toward the latter view. In the article "A method for the link-completion mechanism of index pages" (address: http://stblog.baidu-tech.com/?p=2057) published on Baidu's official blog, it is stated clearly that "the spider will try to detect the publishing cycle of a web page and check it at a reasonable frequency." From this we can infer that in Baidu's index library, a suitable crawl time and a series of parameters are calculated for each URL set, and the corresponding sites are then crawled.
Here I want to point out that, for Baidu, the site: value is not the number of your pages the spider has crawled. For example, the number returned by site:www.seozhao.com does not always equal Baidu's indexed-page count; to check how many pages Baidu has actually indexed, you should use the index-count query in Baidu Webmaster Tools. So what does site: actually measure? I will explain that in a future article.
So how do spiders find new links? The answer is: through hyperlinks. We can think of the entire Internet as one set A. Starting from an initial URL set, the spider follows the hyperlinks on each page and continuously discovers new ones. In this process, every newly found URL is compared against the URLs already in set A: if the new URL is not in A, it is added; if it already exists in A, it is discarded. A spider's strategy for traversing a site comes in two kinds: depth-first and breadth-first. For a commercial search engine like Baidu, however, the traversal strategy probably follows more complex rules, such as the weight of the domain itself, or the distribution of Baidu's own server matrix.
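To make the traversal concrete, here is a minimal Python sketch of a breadth-first crawl over a link graph, under stated assumptions: fetch_links is a hypothetical helper that downloads a page and returns the links on it, and a real spider would add robots.txt checks, politeness delays, and revisit scheduling.

```python
from collections import deque
from urllib.parse import urljoin

def crawl_bfs(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first traversal of the link graph starting from seed URLs."""
    seen = set(seed_urls)        # the set "A" of already-known URLs
    frontier = deque(seed_urls)  # FIFO queue => breadth-first order
    crawled = []

    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):      # hypothetical page fetcher
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:       # new URL: add it to the set
                seen.add(absolute)
                frontier.append(absolute)
            # URLs already in the set are discarded, exactly as above
    return crawled
```

Replacing the queue with a stack (popping from the end) turns the same loop into a depth-first traversal.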
II. Preprocessing.
Preprocessing is the most complex part of a search engine, and most ranking algorithms take effect during preprocessing. In the preprocessing stage, the search engine mainly processes the data in the following steps:
1. Extract keywords.
The page a spider crawls has the same source code as what we see in the browser: the code is usually messy, and much of it is irrelevant to the page's main content. The search engine therefore needs to do three things: ① Denoise the code. Remove all markup from the page, leaving only the text. ② Remove non-body keywords. For example, words in the navigation bar and in other common regions shared by different pages. ③ Remove stop words. Stop words are words that carry no specific meaning on their own, such as "of", "the", and so on.
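As a toy illustration of these three steps in Python (the stop-word list and the boilerplate_words argument are invented placeholders; detecting shared regions such as navigation bars is a much harder problem in practice):

```python
import re

# a tiny illustrative stop-word list; real engines use far larger ones
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}

def extract_keywords(html, boilerplate_words=frozenset()):
    """Step 1: strip markup; step 2: drop shared-region words; step 3: drop stop words."""
    text = re.sub(r"<(script|style).*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)         # remove all remaining tags
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words
            if w not in boilerplate_words        # e.g. navigation-bar terms
            and w not in STOP_WORDS]
```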
When the search engine has obtained the key text of a page, it uses its own word-segmentation system to split the text into a list of segmented words, then stores that list in its database in one-to-one correspondence with the page's URL. Let me illustrate.
Suppose the URL of a page the spider crawls is http://www.seozhao.com/2.html, and after the operations above the search engine extracts from this page the keyword set P, where P consists of the keywords p1, p2, ......, pn. Then in Baidu's database, the URL and the keyword set correspond one to one, as shown in the figure below.
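In code, this correspondence is simply a forward index from URL to keyword list. The sketch below uses a naive whitespace segment function as a stand-in for the engine's real word-segmentation system (Chinese text would need a proper segmenter):

```python
# forward index: URL -> segmented keyword list P
forward_index: dict[str, list[str]] = {}

def segment(text: str) -> list[str]:
    return text.split()  # naive stand-in for a real segmentation system

def index_page(url: str, text: str) -> None:
    forward_index[url] = segment(text)

index_page("http://www.seozhao.com/2.html", "seo spider crawl ranking")
# forward_index now maps the URL to its keyword set P = [p1, p2, ..., pn]
```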
2. Eliminate duplicate and reprinted pages.
Every search engine has its own algorithm for identifying duplicate pages, but Mr.zhao believes that if the algorithm is understood as consisting of 100 elements, then 80 of those elements are probably identical across all the search engines. The other 20 are strategies each engine sets up specifically, depending on its attitude toward SEO. This article only outlines the general workflow of search engines, so I will not go further into the specific mathematical models.
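For a flavor of what one of those shared elements might look like: near-duplicate detection is commonly built on shingling plus Jaccard similarity. This is a textbook sketch, not anything specific to Baidu; production engines use scalable variants such as SimHash or MinHash.

```python
def shingles(words, k=4):
    """All k-word windows of a document; near-duplicates share most of them."""
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def is_near_duplicate(words_a, words_b, threshold=0.9):
    return jaccard(shingles(words_a), shingles(words_b)) >= threshold
```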
3. Important information analysis.
During code denoising, the search engine does not simply throw the markup away. It makes full use of the page's code (such as H tags and strong tags), keyword density, and the anchor text of internal links to analyze which phrases on the page are the most important.
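One simple way to picture this analysis is tag-weighted term scoring. The weights below are invented purely for illustration; the factors Baidu actually uses (and their values) are not public.

```python
# purely illustrative weights for markup context
TAG_WEIGHTS = {"h1": 5.0, "h2": 3.0, "strong": 2.0, "a": 1.5, "body": 1.0}

def score_terms(tagged_words):
    """tagged_words: (word, tag) pairs from the parsed page.

    A word in an <h1> counts for more than the same word in body text.
    """
    scores = {}
    for word, tag in tagged_words:
        scores[word] = scores.get(word, 0.0) + TAG_WEIGHTS.get(tag, 1.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score_terms([("seo", "h1"), ("seo", "body"), ("blog", "strong")]))
# [('seo', 6.0), ('blog', 2.0)]
```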
4. Analyze web page importance.
Through the anchor text of the external links pointing at a page, the engine determines how much weight is passed to the page, producing a weight value; combined with the "important information analysis" above, this establishes a ranking factor for each keyword in the page's keyword set P.
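Link-passed weight of this kind is usually computed with something in the PageRank family. The following is a textbook sketch under that assumption, not Baidu's actual formula; anchor text would additionally influence which keywords the passed weight supports.

```python
def link_weight(links, damping=0.85, iterations=20):
    """PageRank-style weights over a link graph {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue                  # dangling page: passes nothing here
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                if target in new:         # ignore links leaving the graph
                    new[target] += share
        rank = new
    return rank

print(link_weight({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```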
5. Inverted file.
As stated above, a user's query results are not computed on the spot; they are already roughly ranked in the search engine's cache. Of course, the search engine is no prophet: it cannot know in advance which keywords users will query. But it can build a keyword thesaurus, and when it processes a user's query request, it segments the request according to that thesaurus. This way, before any query happens, the search engine has already computed the URL ranking for each keyword in the thesaurus, which greatly reduces the time needed to process a query.
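This precomputed keyword-to-ranked-URLs structure is exactly the inverted file of this section's title. A minimal sketch, reusing the invented scores from the earlier examples:

```python
from collections import defaultdict

# inverted index: keyword -> list of (url, precomputed score), best first
inverted_index = defaultdict(list)

def add_to_index(url, keyword_scores):
    """keyword_scores: {keyword: ranking factor} produced by preprocessing."""
    for keyword, score in keyword_scores.items():
        postings = inverted_index[keyword]
        postings.append((url, score))
        postings.sort(key=lambda entry: entry[1], reverse=True)

add_to_index("http://www.seozhao.com/2.html", {"seo": 6.0, "blog": 2.0})
print(inverted_index["seo"][:10])  # the precomputed top results for "seo"
```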
In short, the search engine uses a controller to direct the spider's crawling, saves the resulting URL set into the original-page database, and then uses an indexer to build the correspondence between each keyword and its URLs, saving that into the index database.
Let's take an example to illustrate.
If the page http://www.seozhao.com/2.html is segmented into P={p1,p2,p3,......,pn}, it is reflected in the index database as shown in the figure below.
The figure above is simplified to make it easier to understand. The index database is actually the database with the highest performance requirements in the entire search engine, and because every ranking factor feeds into its algorithms, I believe the real index database is a more complex index table built from multidimensional arrays. But the general role it plays is the same as shown above.
III. Query service.
The query service, as the name implies, handles the query requests users make in the search interface. The search engine builds a retriever and then processes each request in three steps.
1. Segment the query according to the query style and keywords.
First, the user's search phrase is segmented into a keyword sequence, which we will temporarily denote Q; the user's search phrase is thus cut into Q={q1,q2,q3,......,qn}.
Then, based on how the user phrased the query (for example, whether all the words were run together or separated by spaces) and on the part of speech of each keyword in Q, the engine determines how much importance each word should carry in the display of the query results.
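A small sketch of this step, under loud assumptions: the whitespace split and the length-based importance heuristic below are placeholders, since how Baidu actually weighs query style and part of speech is not public.

```python
def process_query(raw_query):
    """Segment a query and assign each term an illustrative display importance."""
    terms = raw_query.lower().split()            # crude segmentation stand-in
    total = sum(len(t) for t in terms) or 1
    return [(t, len(t) / total) for t in terms]  # (term, importance in [0, 1])

print(process_query("seo spider crawling"))
```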
2. Sort the search results.
We now have the query term set Q, each of whose keywords already has a sorted URL list in the index library; at the same time, the importance of each keyword within the displayed results has been computed from the user's query style and parts of speech. All that is left is to run a comprehensive sorting algorithm, and the search results come out.
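Putting the previous sketches together, the comprehensive sort might be a merge like the one below; the linear blend of per-term scores is an assumption for illustration only, not the engine's real formula.

```python
def rank_results(query_terms, inverted_index):
    """Merge precomputed per-keyword rankings into one overall result list.

    query_terms: [(term, importance)] from query processing.
    inverted_index: term -> [(url, precomputed score)], as built above.
    """
    combined = {}
    for term, importance in query_terms:
        for url, score in inverted_index.get(term, []):
            combined[url] = combined.get(url, 0.0) + importance * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```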
3. Display search results and document summaries.
Once the search results are ready, the search engine displays them in the user interface for the user.
Here, you can think about two questions.
① In the search interface, we often find that the summary Baidu displays is built around the user's search terms. If you look beyond the first page, a few pages further in, you will see some results where the target page itself does not fully contain the search phrase, and only part of the search terms are highlighted in red in the summary. Can we then conclude that, when the search phrase is not fully contained, the words Baidu chooses to show in the results are the words Baidu considers more important? And from these search results, can we glimpse part of Baidu's word-segmentation algorithm?
② Sometimes the search terms appear many times on a page, yet in the summary section of the search results Baidu shows only one part of the page, usually a continuous one. Can we then conclude that, in the summary section, Baidu gives priority to displaying the part of the page it considers most relevant to the search terms? And from that, can we infer something about how Baidu assigns weights to the different parts of a page after denoising?
These two questions I leave for my SEO friends to explore and feel out on their own; Mr.zhao dares not risk leading anyone astray here.
IV. Loopholes in Baidu's current process.
Please forgive me for using "process loophole" to describe this module, but I have to say that in today's world of rampant click bots, I do not think calling it a loophole is an overstatement.
That is: in addition to the three major stages above, Baidu has also built a user-behavior module that influences both the original-page database and the index library. The influence on the original-page database comes through Baidu's snapshot-complaint channel, which mainly handles certain infringing behavior on the Internet; that is understandable. The influence on the index library comes from users' click behavior. That design is understandable in itself, but Baidu's algorithm here is immature, and click fraud has run rampant as a result.
Baidu's user-behavior analysis module is actually quite simple. Apart from the complaint entrances it provides, it collects users' click behavior in the search interface: if the results on a page are read by most users but draw no clicks, and most users instead choose to click through to the second page or even later pages, then Baidu's engineers will learn of this phenomenon and fine-tune the algorithm accordingly. By now, Baidu's algorithm has long since diverged across different industries.
And if a result on the first two pages of a search is clicked by a large number of users, it will typically be raised to first place within 24 hours.
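To show what such a click-based adjustment could look like, here is a deliberately naive sketch; the real logic is not public, and a serious version would have to defend against exactly the click fraud described above.

```python
def apply_click_boost(ranked, clicks, views, boost=0.3):
    """Re-rank results by blending in click-through rate.

    ranked: [(url, score)]; clicks/views: url -> observed counts.
    Naively trusting raw clicks is what makes click fraud profitable.
    """
    adjusted = []
    for url, score in ranked:
        ctr = clicks.get(url, 0) / max(views.get(url, 0), 1)
        adjusted.append((url, score * (1 + boost * ctr)))
    return sorted(adjusted, key=lambda kv: kv[1], reverse=True)
```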
V. General search engine flowchart (with the user behavior analyzer added)
The above is my understanding of the basic workflow and principles of search engines.
Finally, I want to say that the many SEO practitioners among you will have noticed that Baidu, Google, and the other commercial search engines all ask SEOers not to care about the algorithm, not to care about the search engine, but to pay more attention to the user experience. We can understand this through a metaphor: the search engine is a watermelon buyer, and SEOers are watermelon growers. The buyer tells us growers not to care about his criteria for selecting watermelons, but to care a great deal about how to grow a good watermelon; yet when asked what kind of watermelon counts as a good one for him, he covers it over with vague notions. To be sure, this keeps the search engine's results diverse and gives it more to choose from, which maximally protects the commercial search engine itself. But please do not forget: we who grow the watermelons also have to eat.
Mr.zhao has always adhered to white-hat SEO, studying UE (user experience) in depth and building sites that are meaningful to users. But at the same time, I firmly believe that as SEOers we should also keep a timely understanding of the algorithm, so that the sites we build not only suit users' tastes but also get a good showing in the search engines. After all, SEOers are people too, and we also hope to live better.
Going forward, I will analyze each of the search engine's stages step by step in other articles, publishing them in the "Search Engine Principles" column on my blog. I hope they help you.
Source: Mr.zhao's blog http://www.seozhao.com/319.html