How search engine crawlers work

1. Overview of crawler principles and key technologies
A web crawler is a program that automatically extracts web pages and is an important component of a search engine. A traditional crawler starts from the URLs of one or more initial web pages, continuously extracts new URLs from the current page, and puts them into a queue until a stop condition is met. The workflow of a focused crawler is more complex: it uses a web page analysis algorithm to filter out links that are irrelevant to the topic, keeps the useful links, and puts them into the URL queue waiting to be crawled. It then selects the next URL to crawl from the queue according to a certain search policy and repeats the process until a given system condition is reached. In addition, all web pages downloaded by the crawler are stored by the system, then analyzed, filtered, and indexed for later queries and searches. For a focused crawler, the analysis results obtained in this process may also feed back into and guide later crawling.
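As an illustration of this loop, here is a minimal sketch that maintains a URL queue, downloads pages, stores them, and extracts new URLs. It assumes the third-party requests and beautifulsoup4 packages and uses a simple page limit as the stop condition; it is not a production design.

```python
# Minimal crawler loop: seed URLs -> fetch -> store -> extract new URLs -> queue.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)          # URLs waiting to be crawled
    visited = set(seed_urls)          # avoid fetching the same URL twice
    pages = {}                        # downloaded pages kept for later analysis/indexing

    while queue and len(pages) < max_pages:   # stop condition (page limit here)
        url = queue.popleft()                 # next URL from the queue
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        pages[url] = html                     # store the page

        # Extract new URLs from the current page and append them to the queue.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return pages
```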

Compared with a general web crawler, a focused crawler also needs to solve three main problems:
(1) description or definition of the crawl target;
(2) analysis and filtering of web pages or data;
(3) a URL search policy.
The description and definition of the crawl target are the basis for choosing the web page analysis algorithm and the URL search policy. The web page analysis algorithm and the candidate URL ranking algorithm are the key factors that determine the form of service a search engine provides and the behavior of the crawl. These two algorithms are closely related.

2. Description of the crawl target
Descriptions or definitions of crawl targets can be divided into three types: those based on target web page features, those based on target data models, and those based on domain concepts.
For crawlers based on target web page features, the objects crawled, stored, and indexed are generally websites or web pages. Based on how seed samples are obtained, the methods can be divided into:
(1) pre-defined initial seed samples;
(2) a pre-defined web page category directory and seed samples corresponding to the categories, such as the Yahoo! classification structure;
(3) crawl target samples determined by user behavior:
A) samples marked during browsing;
B) access patterns and related samples obtained through user log mining.
Here, the web page features can be content features of the page, link structure features of the page, and so on.

Crawlers based on a target data model aim at the data on web pages: the captured data generally conforms to a certain pattern, or can be converted or mapped to the target data model.
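A small, hypothetical illustration of this idea: the BookRecord schema and the regular expression below are invented for the example; a real crawler would use extractors tailored to the actual pages and target data model.

```python
# Map page fragments that match an expected pattern onto a target data model.
import re
from dataclasses import dataclass

@dataclass
class BookRecord:          # the target data model the captured data must map to
    title: str
    price: float

def extract_records(html: str):
    """Return all fragments of the page that conform to the expected pattern."""
    pattern = re.compile(
        r'<h2 class="title">(?P<title>[^<]+)</h2>\s*'
        r'<span class="price">(?P<price>[\d.]+)</span>'
    )
    return [BookRecord(m["title"].strip(), float(m["price"]))
            for m in pattern.finditer(html)]
```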

Another method of description is to create an ontology or dictionary for the target domain to analyze the importance of different features in a topic from the semantic perspective.

3. Web page search policies
Web page crawling policies can be divided into depth-first, breadth-first, and best-first. Depth-first search may in many cases cause the crawler to become trapped, so the commonly used policies are breadth-first and best-first.
3.1 Breadth-first search policy
During crawling, a breadth-first search policy finishes the current level of the link graph before moving on to the next level. The design and implementation of this algorithm are relatively simple. To cover as many web pages as possible, breadth-first search is generally used. Many studies also apply the breadth-first policy to focused crawling; the basic idea is that web pages within a certain number of links from the initial URL have a high probability of being relevant to the topic. Another approach combines breadth-first search with page filtering: pages are first fetched with the breadth-first policy, and irrelevant pages are then filtered out. The disadvantage of these methods is that as the number of crawled pages grows, a large number of irrelevant pages are downloaded and filtered, and the efficiency of the algorithm drops.
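The following sketch shows breadth-first crawling combined with page filtering. The fetch, extract_links, and is_relevant callables are placeholders supplied by the caller, not part of any particular library.

```python
# Breadth-first crawl up to max_depth link hops, filtering pages after download.
from collections import deque

def bfs_crawl(seed_urls, fetch, extract_links, is_relevant, max_depth=2):
    queue = deque((url, 0) for url in seed_urls)   # (url, link distance from seeds)
    visited = set(seed_urls)
    kept = {}

    while queue:
        url, depth = queue.popleft()               # FIFO queue => breadth-first order
        html = fetch(url)
        if html is None:
            continue
        if is_relevant(html):                      # filter out irrelevant pages afterwards
            kept[url] = html
        if depth < max_depth:                      # pages near the seeds are assumed on-topic
            for link in extract_links(url, html):
                if link not in visited:
                    visited.add(link)
                    queue.append((link, depth + 1))
    return kept
```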

3.2 Best-first search policy
A best-first search policy uses a certain web page analysis algorithm to predict the similarity between candidate URLs and the target pages, or their relevance to the topic, and selects one or more of the best-scored URLs to crawl. It only visits pages that the analysis algorithm predicts to be "useful". One problem is that many relevant pages along the crawl path may be ignored, because best-first search is a local optimization algorithm. Therefore, best-first search should be improved in combination with the specific application so that it can escape local optima. Section 4 discusses this in detail together with web page analysis algorithms. Research shows that such closed-loop adjustment can reduce the number of irrelevant pages by 30% to 90%.
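A sketch of the policy using a priority queue follows; score_url stands in for whichever web page analysis algorithm predicts relevance and, like fetch and extract_links, is assumed to be provided by the caller.

```python
# Best-first crawl: always expand the candidate URL with the highest predicted score.
import heapq

def best_first_crawl(seed_urls, fetch, extract_links, score_url, max_pages=100):
    # heapq is a min-heap, so push negative scores to pop the best URL first.
    frontier = [(-score_url(url), url) for url in seed_urls]
    heapq.heapify(frontier)
    visited = set(seed_urls)
    pages = {}

    while frontier and len(pages) < max_pages:
        _, url = heapq.heappop(frontier)       # the most promising candidate so far
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in visited:
                visited.add(link)
                heapq.heappush(frontier, (-score_url(link), link))
    return pages
```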

4. Web page analysis algorithms

Web page analysis algorithms can be classified into three types: those based on network topology, those based on web page content, and those based on user access behavior.
4.1 Analysis algorithms based on network topology
Based on the links between web pages, known pages or data are used to evaluate objects that have direct or indirect links to them (such as web pages or websites). These algorithms can be further divided into three types: web page granularity, website granularity, and page block granularity.
4.1.1 Web-page-granularity analysis algorithms
PageRank and HITS are the most common link analysis algorithms. Both compute an importance score for each web page through recursive, normalized calculations over the links between pages. Although PageRank accounts for the randomness of user access behavior and the existence of sink pages, it ignores the fact that most users browse with a purpose, that is, the relevance of pages and links to the query topic. To address this problem, the HITS algorithm introduces two key concepts: authority and hub.
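As a rough illustration of the recursive calculation, here is a minimal PageRank power-iteration sketch over a small link graph. The damping factor of 0.85 and the handling of sink pages are simplifying assumptions, not the full algorithm.

```python
# Power iteration for PageRank over a dict-based link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue   # sink page: its outgoing rank mass is simply dropped in this sketch
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] = new_rank.get(target, 0.0) + share
        rank = new_rank
    return rank

# Example: B and C both link to A, A links back to B.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))
```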

A problem with link-based crawling is the "tunnel" phenomenon between topic clusters of relevant pages: many pages along the crawl path that deviate from the topic still point to the target page, but a local evaluation policy cuts off crawling along the current path. Reference [21] proposes a hierarchical context model based on backlinks, which describes the topology of the pages within a certain radius of physical hops around a target page. The target page forms the center, layer 0; the other pages are divided into layers according to the number of physical hops to the target page, and a link from an outer-layer page to an inner-layer page is called a backlink.
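A sketch of how such layers might be built, assuming a precomputed backlinks index that maps each page to the pages linking to it (building that index is outside the scope of this example):

```python
# Group pages into layers by physical hop distance along backlinks from the target.
from collections import deque

def build_context_layers(target, backlinks, radius=3):
    layers = {0: {target}}                        # layer 0 is the target page itself
    seen = {target}
    frontier = deque([(target, 0)])

    while frontier:
        page, hops = frontier.popleft()
        if hops == radius:                        # stop expanding beyond the chosen radius
            continue
        for source in backlinks.get(page, []):    # pages linking *to* `page`
            if source not in seen:
                seen.add(source)
                layers.setdefault(hops + 1, set()).add(source)
                frontier.append((source, hops + 1))
    return layers
```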

4.1.2 Website-granularity analysis algorithms
Resource discovery and management at website granularity are simpler and more effective than at web page granularity. The key to site-granularity crawling lies in how websites are divided and how SiteRank is calculated. SiteRank is computed in a way similar to PageRank, but the links between pages must be abstracted into links between sites to a certain degree, and link weights must be calculated under a certain model.
Websites can be divided by domain name or by IP address. Reference [18] discusses, in a distributed setting, how to divide sites according to the IP addresses of the different hosts and servers under the same domain name, construct a site graph, and evaluate SiteRank with a PageRank-like method. At the same time, a document graph is built from the distribution of documents on each site, and DocRank is obtained from the distributed SiteRank computation. Reference [18] shows that distributed SiteRank computation not only greatly reduces the algorithmic cost on a single machine, but also overcomes the limited network coverage of a single site. An additional advantage is that common PageRank fraud is difficult to apply against SiteRank.
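The following sketch illustrates the abstraction step: page-level links are collapsed into a site graph keyed by host name, on which a PageRank-style computation (for example, the pagerank sketch above) could then be run. Dividing sites by IP address, as discussed in [18], would only change the site_of helper; the helper names here are illustrative.

```python
# Abstract page-level links into a site-level graph for SiteRank-style analysis.
from urllib.parse import urlparse

def site_of(url):
    return urlparse(url).netloc          # divide sites by domain name / host

def build_site_graph(page_links):
    """page_links: dict mapping page URL -> list of linked page URLs."""
    site_graph = {}
    for page, outlinks in page_links.items():
        src = site_of(page)
        targets = site_graph.setdefault(src, set())
        for link in outlinks:
            dst = site_of(link)
            if dst and dst != src:       # keep only cross-site links
                targets.add(dst)
    return {site: sorted(targets) for site, targets in site_graph.items()}
```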
4.1.3 Page-block-granularity analysis algorithms
A page usually contains multiple links pointing to other pages, and only some of these links point to pages related to the topic, or are highly important judging from the link and anchor text. However, PageRank and HITS do not distinguish between these links, so web page analysis is often disturbed by noise links such as advertisements. The basic idea of block-level link analysis is to divide a web page into different page blocks with the VIPS page segmentation algorithm, then build the page-to-block and block-to-page link matrices, denoted X and Z respectively. The block-level PageRank on the page-to-page graph is then W_p = X × Z, and the BlockRank on the block-to-block graph is W_b = Z × X. Block-level versions of PageRank and HITS have been implemented, and experiments show that their efficiency and accuracy are better than those of the traditional algorithms.
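A small numpy illustration of these two products, using invented matrices for 3 pages and 4 blocks; in practice X and Z would come from the VIPS segmentation and the page links rather than hand-written values.

```python
# Block-level graphs: X is page-to-block, Z is block-to-page (toy values).
import numpy as np

X = np.array([            # X[i, j]: importance of block j within page i (rows sum to 1)
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.4, 0.6],
])
Z = np.array([            # Z[j, i]: probability that block j links to page i (rows sum to 1)
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

W_p = X @ Z               # page-to-page graph used for block-level PageRank
W_b = Z @ X               # block-to-block graph used for BlockRank
print(W_p.shape, W_b.shape)   # (3, 3) and (4, 4)
```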
4.2 Analysis algorithms based on web page content
Analysis algorithms based on web page content evaluate pages according to the characteristics of their content (text, data, and other resources). Web page content has evolved from hypertext to dynamic page (or hidden Web) data, and the volume of the latter is roughly 400 to 500 times that of the directly visible page data (PIW, Publicly Indexable Web). At the same time, network resources such as multimedia data and Web services are becoming increasingly abundant. Accordingly, content-based analysis algorithms have developed from the original simple text retrieval methods into comprehensive applications that cover web page data extraction, machine learning, data mining, semantic understanding, and other methods. This section summarizes content-based analysis algorithms according to the form of web page data: the first category is pages with no structure or a simple structure, dominated by text and hyperlinks; the second is pages dynamically generated from structured data sources (such as an RDBMS), whose data cannot be accessed in bulk; the third is data between the first two categories, with good structure, displayed according to a certain pattern or style, and directly accessible.
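For the first category, a minimal content-based relevance score can be as simple as comparing term frequencies against a topic description. The sketch below uses plain cosine similarity over word counts and is only a stand-in for the richer extraction, learning, and semantic methods mentioned above.

```python
# Score page text by cosine similarity of term-frequency vectors against a topic.
import math
import re
from collections import Counter

def term_vector(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def relevance(page_text, topic_description):
    return cosine_similarity(term_vector(page_text), term_vector(topic_description))

print(relevance("web crawlers download pages for search engines",
                "search engine crawler architecture"))
```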
