How Search Engines Work



The working process of a search engine is extremely complex; here we simply introduce how search engines rank pages. Compared with real search engine technology, this introduction only scratches the surface, but it is sufficient for SEO practitioners.



The search engine's working process can be divided into three phases:



1. Crawling and fetching: search engine spiders visit web pages by following links, obtain the pages' HTML code, and store it in a database.



2. Preprocessing: the indexing program performs text extraction, Chinese word segmentation, and indexing on the fetched page data, so the ranking program can call it.



3. Ranking: after a user enters a keyword, the ranking program calls the index data, calculates relevance, and then generates the search results page in a certain format.



Crawling and Fetching



Crawling and fetching is the first step of search engine work, completing the task of collecting page data.



Spider



The program a search engine uses to crawl and visit pages is called a spider, also known as a robot (bot). A search engine spider visits website pages much like an ordinary user's browser: the spider program issues a page request, the server returns the HTML code, and the spider stores the received code in the raw page database. To improve crawling speed, search engines run multiple spiders that crawl concurrently and in a distributed fashion. When a spider visits any website, it first requests the robots.txt file in the site's root directory. If robots.txt forbids the search engine from fetching certain files or directories, the spider complies with the protocol and does not crawl the forbidden URLs.
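
To make the robots.txt behavior concrete, here is a minimal Python sketch of the check a polite crawler performs before fetching a URL, using the standard library's urllib.robotparser; the site and user-agent names are placeholders, not any real engine's spider.

    from urllib import robotparser

    # Fetch and parse the robots.txt in the site's root directory.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # can_fetch() is False for URLs the file forbids this user agent to crawl.
    if rp.can_fetch("ExampleSpider", "https://example.com/private/page.html"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")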



Following Links



In order to crawl as many pages as possible, search engine spiders follow the links on pages, crawling from one page to the next like a spider moving across a web; this is the origin of the name "search engine spider". The entire internet is made up of websites and pages linked to one another. In theory, a spider could start from any page and, by following links, reach every page on the web. Of course, because site and page link structures are extremely complex, spiders need crawling strategies to traverse all pages on the web.



The simplest crawling strategies come in two kinds: depth-first and breadth-first.



Depth-first means the spider crawls forward along a discovered link until no further links remain, then returns to the first page and crawls forward along another link.



Breadth-first means that when the spider finds multiple links on a page, it does not keep moving forward along a single link, but first crawls all the first-level links on the page, then follows the links found on those pages to crawl to the third level. In theory, given enough time, a spider can crawl the entire internet with either depth-first or breadth-first traversal. In practice, a spider's bandwidth and time are not unlimited, and it cannot crawl every page. In fact, even the largest search engines have crawled and indexed only a small portion of the internet.



Depth-first and breadth-first are usually used in combination, which lets a spider cover as many websites as possible (breadth-first) while also reaching part of each site's inner pages (depth-first).
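
The two traversal orders differ only in which end of the crawl frontier the spider takes the next URL from. Below is a small illustrative sketch, not any engine's real crawler; fetch_links stands in for a function that downloads a page and returns the URLs found on it.

    from collections import deque

    def crawl(seed_urls, fetch_links, max_pages=1000, breadth_first=True):
        """Traverse the link graph starting from the seed URLs."""
        frontier = deque(seed_urls)
        seen = set(seed_urls)
        crawled = []
        while frontier and len(crawled) < max_pages:
            # popleft() = FIFO = breadth-first; pop() = LIFO = depth-first.
            url = frontier.popleft() if breadth_first else frontier.pop()
            crawled.append(url)
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return crawled

The max_pages cap mirrors the point above: real spiders never have unlimited bandwidth and time, so they stop long before exhausting the web.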



Preprocessing



Text Extraction



Today's search engines are still primarily text-based. The HTML code a spider fetches contains, besides the visible text a user sees in the browser, a large amount of HTML markup, JavaScript programs, and other content that cannot be used for ranking. The first thing preprocessing does is remove tags and programs from the HTML file and extract the page text that can be used for ranking. In addition to the visible text, search engines also extract text from certain special code, such as text in meta tags, image alt text, Flash file alternative text, and link anchor text.
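
As a toy illustration of this step, the sketch below strips tags and scripts from HTML while keeping visible text and image alt text, using Python's standard html.parser; a real engine's extractor also handles meta tags, anchor text, Flash alternatives, and much more.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Keep visible text and img alt text; drop scripts and styles."""
        SKIP = {"script", "style"}

        def __init__(self):
            super().__init__()
            self.parts = []
            self._skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self._skip_depth += 1
            elif tag == "img":
                alt = dict(attrs).get("alt")  # image replacement text
                if alt:
                    self.parts.append(alt)

        def handle_endtag(self, tag):
            if tag in self.SKIP and self._skip_depth:
                self._skip_depth -= 1

        def handle_data(self, data):
            if not self._skip_depth and data.strip():
                self.parts.append(data.strip())

    extractor = TextExtractor()
    extractor.feed("<p>Hello <b>world</b><script>var x = 1;</script></p>")
    print(" ".join(extractor.parts))  # Hello world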



Chinese Word Segmentation



Word segmentation is a step unique to Chinese search engines. Search engines store and process pages, and handle user searches, on the basis of words. In English and similar languages, words are separated by spaces, so the indexing program can directly split a sentence into a set of words. In Chinese there are no separators between words; all the characters in a sentence run together. A search engine must first distinguish which characters form a word and which characters are words by themselves. For example, "减肥方法" ("weight loss method") would be segmented into the two words "减肥" ("weight loss") and "方法" ("method").



There are basically two kinds of Chinese word segmentation methods: one based on dictionary matching, the other based on statistics.



Dictionary-based matching means comparing the string of Chinese characters under analysis against entries in a pre-built dictionary; when an entry from the dictionary is found while scanning the string, the match succeeds, or in other words a word is segmented out. By scanning direction, dictionary-based matching can be divided into forward matching and reverse matching; by which match length takes priority, it can be divided into maximum matching and minimum matching. Combining scanning direction and length priority produces different methods, such as forward maximum matching and reverse maximum matching. Dictionary matching is computationally simple, and its accuracy depends largely on how complete and up to date the dictionary is.
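
A minimal sketch of forward maximum matching follows; the four-character example and the tiny dictionary are just for illustration, since real systems use dictionaries with hundreds of thousands of entries.

    def forward_max_match(text, dictionary, max_len=4):
        """At each position take the longest dictionary entry that
        matches; fall back to a single character when nothing matches."""
        words = []
        i = 0
        while i < len(text):
            for length in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in dictionary:
                    words.append(candidate)
                    i += length
                    break
        return words

    print(forward_max_match("减肥方法", {"减肥", "方法"}))  # ['减肥', '方法']

Reverse maximum matching is the same idea scanned from the end of the string, and the two often disagree exactly where the segmentation is ambiguous.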



Statistics-based segmentation means analyzing a large sample of text and calculating the statistical probability of characters appearing next to each other: the more often several characters appear adjacent, the more likely they form a word. The advantage of the statistical method is that it reacts more quickly to new words and also helps resolve ambiguity. The two methods each have advantages and disadvantages, so practical segmentation systems mix both: they are fast and efficient, while also able to recognize new words and resolve ambiguity.



How a search engine segments a page depends on the size and accuracy of its word library and the quality of its segmentation algorithm, not on the page itself, so there is little SEO staff can do about segmentation. The only thing you can do is prompt the search engine, in some form on the page, that certain characters should be treated as one word, especially where ambiguity is possible, such as in the page title, H1 tags, and bolded text. If a page is about kimonos (和服), the word 和服 can be deliberately bolded. If a page is about "makeup and clothing" (化妆和服装), the two characters for "clothing" (服装) can be bolded instead. That way, when the search engine analyzes the page, it knows the bolded characters should be treated as one word.



Stop Words



In both English and Chinese, there are words that appear frequently in page content but have no effect on its meaning, such as auxiliary particles (的, 地, 得), interjections ("ah", "ha"), and connectives ("thus", "with", "but"). These words are called stop words because they contribute little to a page's main meaning. Common English stop words include the, a, an, to, and of. Search engines remove these stop words before indexing a page, which makes the indexed data's topic more prominent and reduces unnecessary computation.
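
Removing stop words is a simple filter over the segmented tokens; the word list below is a tiny illustrative sample, while real engines maintain much larger lists per language.

    STOP_WORDS = {"the", "a", "an", "to", "of"}

    def remove_stop_words(tokens):
        """Drop high-frequency words that add nothing to the topic."""
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words(["the", "history", "of", "search", "engines"]))
    # ['history', 'search', 'engines']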



Eliminating Noise



Most pages also contain content that contributes nothing to the page's topic, such as copyright notices, navigation bars, and advertisements. Take common blog navigation as an example: almost every blog page carries navigation such as article categories and history archives, yet the pages themselves have nothing to do with the words "category" or "history". Returning a blog post when a user searches for "history" or "category", simply because those words appear in its navigation, is meaningless and completely irrelevant.



These blocks are therefore noise, and they can only dilute the page's topic. The search engine needs to identify and remove this noise, and rank without using the noisy content. The basic denoising method divides the page into blocks based on its HTML tags, distinguishing areas such as the header, navigation, body text, footer, and advertisements; blocks repeated in large numbers across a site usually belong to the noise. What remains after denoising is the page's main content.
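
One way to capture the "repeated blocks are noise" heuristic is to count how many pages of a site each text block appears on and drop the near-universal ones. This sketch assumes the pages have already been split into text blocks by HTML structure; the 80% threshold is an arbitrary illustrative choice.

    from collections import Counter

    def remove_noise(pages_blocks, site_share=0.8):
        """pages_blocks: {url: [text of each block on that page]}.
        Blocks appearing on most pages of the site (navigation, footer,
        ads) are treated as noise; each page keeps its main content."""
        counts = Counter(b for blocks in pages_blocks.values() for b in set(blocks))
        threshold = site_share * len(pages_blocks)
        return {url: [b for b in blocks if counts[b] < threshold]
                for url, blocks in pages_blocks.items()}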



Deduplication



The search engine also needs to deduplicate pages. The same article is often repeated on different websites, and at different URLs on the same site, and search engines do not like this duplicated content. When a user searches and the first two pages of results show the same article from different sites, the user experience is poor, even though all the copies are relevant. The search engine wants to return only one copy of the same article, so duplicated content must be identified and discarded before indexing; this process is called deduplication. The basic deduplication method is to compute a fingerprint from the page's characteristic keywords: select the most representative keywords from the page's main content (often the most frequent ones), then compute a digital fingerprint of those keywords.



This keyword selection happens after segmentation, stop word removal, and denoising. Experiments show that selecting about 10 characteristic keywords already achieves high accuracy; choosing more words contributes little further improvement. Understanding the deduplication algorithm, SEO staff should realize that so-called "pseudo-original" tricks, such as sprinkling in the particles 的, 地, 得 or reshuffling paragraph order, cannot escape the search engine's deduplication, because such operations do not change the article's characteristic keywords. Moreover, the engine's deduplication algorithm likely works not only at the page level but also at the paragraph level; mixing different articles or swapping paragraph order cannot turn reprinted and copied content into original content.
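
A much-simplified sketch of keyword fingerprinting is shown below; real engines use more robust schemes (simhash-style fingerprints, paragraph-level signatures), but the core idea is the same. Note that the keywords are sorted before hashing, which is exactly why reshuffling paragraph order does not change the fingerprint.

    import hashlib
    from collections import Counter

    def page_fingerprint(tokens, n=10):
        """Fingerprint a page by its n most frequent keywords, computed
        after segmentation, stop word removal, and denoising."""
        top = sorted(word for word, _ in Counter(tokens).most_common(n))
        return hashlib.md5(" ".join(top).encode("utf-8")).hexdigest()

Two copies of the same article, even with particles sprinkled in or paragraphs reordered, tend to share the same top keywords and therefore the same fingerprint.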



Forward Index



The forward index can also simply be called the index. After text extraction, segmentation, denoising, and deduplication, the search engine has unique, word-based content that reflects the main content of each page. Next, the indexing program extracts the keywords: following the segmentation results, it converts the page into a set of keywords, while recording each keyword's frequency on the page, its number of occurrences, its format (for example, appearing in the title tag, bold text, H tags, or anchor text), and its position (for example, in the first paragraph of the page). In this way, each page is recorded as a set of keywords, together with weight information such as each keyword's frequency, format, and position.
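
In data-structure terms, a forward index maps each document to the keywords it contains, with per-occurrence details. Here is a minimal sketch, assuming the pages have already been tokenized; a real index also records format weights such as title, bold, and H tags.

    def build_forward_index(pages):
        """pages: {doc_id: [token, ...]} after extraction and segmentation.
        Returns {doc_id: {keyword: {"freq": count, "positions": [...]}}}."""
        index = {}
        for doc_id, tokens in pages.items():
            entry = {}
            for pos, word in enumerate(tokens):
                info = entry.setdefault(word, {"freq": 0, "positions": []})
                info["freq"] += 1
                info["positions"].append(pos)
            index[doc_id] = entry
        return index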



Inverted Index



The forward index cannot yet be used directly for ranking. Suppose a user searches for a certain keyword: with only the forward index, the ranking program would have to scan every file in the index library, find the ones containing that keyword, and then calculate relevance. Such computation cannot meet the need to return ranking results in real time. So the search engine rebuilds the forward index into an inverted index, converting the file-to-keyword mapping into a keyword-to-file mapping. In the inverted index the keyword is the primary key, and each keyword corresponds to a series of files in which it appears. When a user searches for a keyword, the ranking program locates the keyword in the inverted index and can immediately find all the files containing it.
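
Building the inverted index is then just flipping the mapping from the forward-index sketch above, so that a query word leads straight to its documents.

    def invert(forward_index):
        """Turn {doc: {keyword: info}} into {keyword: {doc: info}}."""
        inverted = {}
        for doc_id, entry in forward_index.items():
            for word, info in entry.items():
                inverted.setdefault(word, {})[doc_id] = info
        return inverted

With this structure, looking up a keyword costs one dictionary access instead of a scan over every file in the index library.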



Link Relationship Calculation



Link relationship calculation is also an important part of preprocessing. All mainstream search engines now include link-flow information between pages among their ranking factors. After crawling page content, the search engine must calculate in advance which pages each page links to, and which links point to each page. These complex link relationships form the link weight of sites and pages. Google's PR (PageRank) value is the best-known embodiment of such link relationships. Other search engines perform similar calculations, even if they do not call the result PR.
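
The idea behind PR can be shown with a textbook power-iteration sketch; this is a deliberate simplification (it ignores dangling pages, personalization, and the engineering needed at web scale), not Google's actual implementation.

    def pagerank(links, damping=0.85, iterations=30):
        """links: {page: [pages it links to]}. Returns {page: score}."""
        pages = set(links) | {p for outs in links.values() for p in outs}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outs in links.items():
                if outs:
                    share = damping * rank[page] / len(outs)
                    for target in outs:  # each link passes on some weight
                        new_rank[target] += share
            rank = new_rank
        return rank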



Special File Handling



In addition to HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT, and TXT files, and we often see these file types in search results. But current search engines cannot handle non-text content such as images, video, and Flash, nor can they execute scripts and programs. Although search engines have made some progress in recognizing images and extracting text from Flash, they are still far from the goal of directly reading image, video, and Flash content to return results. Rankings for image and video content usually rely on accompanying text; see the Integrated Search section below for details.



Ranking



Once the spiders have crawled the pages and the indexing program has computed the inverted index, the search engine is ready to handle user searches at any time. After a user enters a keyword in the search box, the ranking program calls the index data, computes the ranking, and displays it to the user. The ranking process interacts with the user directly.



Search Term Processing



After receiving the search terms entered by the user, the search engine must process them before the ranking process can begin. Search term processing includes several aspects:



Chinese Word Segmentation



As with page indexing, search terms must also undergo Chinese word segmentation, converting the query string into a word-based keyword combination. The segmentation principle is the same as for pages.



Stop Word Removal



As with indexing, the search engine also removes stop words from the search terms to maximize ranking relevance and efficiency.



Instruction Processing



After segmentation, the search engine's default behavior is to combine the keywords with "AND" logic. In other words, when a user searches for "weight loss method", the program segments it into the two words "weight loss" and "method", and by default the search engine assumes the user wants pages that contain both. A page containing only "weight loss" but not "method", or only "method" but not "weight loss", is considered not to match the search criteria. Of course, this is a greatly simplified account of the principle; in practice we do see search results that contain only part of the keywords. In addition, the query the user enters may contain advanced search operators, such as plus and minus signs, which the search engine must recognize and handle accordingly.



File Matching



After search term processing, the search engine has a word-based set of keywords. File matching then finds the files containing all of these keywords. The inverted index described in the indexing section makes file matching fast.
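
With an inverted index like the one sketched earlier, AND-style file matching reduces to intersecting the posting lists of the query words.

    def match_documents(query_words, inverted_index):
        """Return the documents that contain every query word."""
        posting_sets = [set(inverted_index.get(w, {})) for w in query_words]
        if not posting_sets:
            return set()
        return set.intersection(*posting_sets)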



Initial Subset Selection



After finding the matching files that contain all the keywords, relevance still cannot be computed right away, because the matches often number in the hundreds of thousands or millions, or even tens of millions. Computing relevance for that many files in real time would take too long. In fact, users do not need to see all of those matching pages; most users only look at the first two pages, that is, the top 20 results. The search engine therefore does not need to compute relevance for all the pages, only for the most important subset. Users of search engines will notice that the results usually display at most 100 pages: clicking the "next page" link at the bottom of the results page, a user can view at most page 100, that is, 1,000 search results. Baidu usually returns at most 76 pages of results.



Relevance Calculation



After the initial subset is selected, keyword relevance is computed for the pages in the subset. Computing relevance is the most important step in the ranking process, and relevance calculation is the most interesting part of search engine algorithms. The main factors affecting relevance include the following.



Keyword Commonness



After segmentation yields multiple keywords, the words do not contribute equally to the meaning of the whole search string. The more commonly used a word is, the less it contributes to the meaning of the search term; the less common a word, the more it contributes. For example, suppose the user enters the search term "we Pluto". The word "we" is used extremely often and appears on a huge number of pages; it contributes little to distinguishing the search term "we Pluto" or to its relevance, because far too many pages contain it. The word "Pluto" is much less common and contributes far more to the meaning of "we Pluto"; pages containing "Pluto" are far more relevant to the search term. The most common words of all are stop words, which have no effect on meaning at all.



So the search engine does not treat the keywords in a search string equally, but weights them by how common they are. Common words get low weighting coefficients, and the ranking algorithm pays more attention to uncommon words. Suppose both pages A and B contain the words "we" and "Pluto", but on page A "we" appears in plain text while "Pluto" appears in the title tag, and on page B it is the reverse: "we" appears in the title tag and "Pluto" in plain text. For the search term "we Pluto", page A will then be more relevant.
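
The standard way to formalize commonness is inverse document frequency: the fewer documents a word appears in, the higher its weight. A minimal sketch:

    import math

    def idf(word, inverted_index, total_docs):
        """Rare words get high weights, common words low ones."""
        docs_with_word = len(inverted_index.get(word, {}))
        return math.log((total_docs + 1) / (docs_with_word + 1))

With 1,000 documents, a word found on 900 of them ("we") scores near zero, while a word found on 3 of them ("Pluto") scores around 5.5, so the rare word dominates the relevance calculation.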



Word Frequency and Density



It is generally believed that, in the absence of keyword stuffing, the more often the search terms appear on a page and the higher their density, the more relevant the page is to the search terms. Of course, this is only a rough rule; the reality is not that simple, and other relevance factors come into play. Frequency and density are only part of the picture, and their importance keeps declining.
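
Frequency and commonness combine naturally in a plain TF-IDF score, sketched below; this is illustrative only, since real relevance also weighs position, format, links, and many other signals, and penalizes keyword stuffing.

    import math

    def relevance_score(doc_tokens, query_words, inverted_index, total_docs):
        """Toy TF-IDF: term frequency on the page, weighted by rarity."""
        score = 0.0
        for word in query_words:
            tf = doc_tokens.count(word) / max(len(doc_tokens), 1)
            docs_with_word = len(inverted_index.get(word, {}))
            idf = math.log((total_docs + 1) / (docs_with_word + 1))
            score += tf * idf
        return score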

Keyword Position and Format

As mentioned in the indexing section, the format and position of every keyword occurrence are recorded in the index library. Keywords appearing in more important positions and formats, such as the title tag, bold text, and H1 tags, make the page more relevant to those keywords. This is the part that on-page SEO works on.

Keyword Distance



When the segmented keywords appear as a complete, contiguous match, it indicates the page is most relevant to the search term. For example, when searching for "weight loss method", a page on which the complete phrase "weight loss method" appears contiguously is the most relevant. If "weight loss" and "method" do not appear as a contiguous match but appear close to each other, the search engine also considers the page slightly more relevant.
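
Since the index records keyword positions (see the forward-index sketch), proximity can be estimated from the position lists; a distance of 1 means the two words are adjacent, that is, a complete contiguous match.

    def min_distance(positions_a, positions_b):
        """Smallest gap between any occurrence of word A and word B."""
        return min(abs(a - b) for a in positions_a for b in positions_b)

    print(min_distance([3, 40], [4, 19]))  # 1: adjacent on the page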



Link Analysis and Page Weight



Besides the page itself, the link and weight relationships between pages also affect keyword relevance, the most important being anchor text. The more inbound links a page has that use the search term as anchor text, the more relevant the page is. Link analysis also considers the topic of the linking page itself, the text surrounding the anchor text, and so on.

Ranking Filtering and Adjustment

Once the matching file subset is selected and relevance is calculated, the overall ranking is largely determined. After that, the search engine may apply some filtering algorithms that make minor adjustments to the ranking, the most important being the imposition of penalties. Some pages suspected of cheating would be ranked near the front by the normal weight and relevance calculations, but in this last step the engine's penalty algorithms may move them toward the back. Typical examples are Baidu's 11th-position phenomenon and Google's minus-6, minus-30, and minus-950 penalties.



Ranking Display



After all rankings are determined, the ranking program calls the original pages' title tags, description tags, snapshot dates, and other data to display on the results page. Sometimes the search engine dynamically generates a page summary instead of using the page's own description tag.



Search Cache



A large portion of the keywords users search for are repeated. According to the 80/20 rule, 20% of search terms account for 80% of the total search volume. According to the long tail theory, the most common search terms do not account for more than 80%, but there is usually still a fairly fat head: a small number of search terms account for a large share of all searches. Especially when there is hot news, millions of people may search exactly the same keyword in a day. Reprocessing the ranking for every search would be a great waste.



Search engines therefore put the most common search terms into a cache; those searches are answered directly from the cache, without going through file matching and relevance calculation, which greatly improves ranking efficiency and shortens response time.
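
In Python terms, a search cache behaves like memoization over the query string; the sketch below uses the standard lru_cache, and run_full_ranking is a hypothetical stand-in for the whole pipeline of segmentation, matching, and relevance calculation.

    from functools import lru_cache

    def run_full_ranking(query):
        """Stand-in for the expensive pipeline a cache miss must pay for."""
        return ["result page for: " + query]

    @lru_cache(maxsize=10000)
    def cached_search(query):
        """Repeated hot queries are answered from memory."""
        return run_full_ranking(query)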

Query and Click Logs

The searching user's IP address, search keywords, search time, and which result pages they clicked are all recorded by the search engine in log files. The data in these logs is important for judging the quality of search results, adjusting the search algorithms, and anticipating search trends.

Above we have briefly introduced the search engine's working process. Of course, the actual steps and algorithms of a real search engine are extremely complex; the description here is simple, yet it still involves many technical difficulties. Search engines also continually optimize their algorithms and database formats, and the steps differ between engines. But the basic working principles of all major search engines are broadly the same, and they have not changed substantially in the past few years and are unlikely to in the years ahead.


