SEO crawler principles

Source: Internet
Author: User
This is a fairly technical article on the principles and architecture of web crawler programs, and some parts may not be immediately clear. SEO work constantly deals with search engines and their crawler programs, so if you run into something you do not understand and want to know more, it is worth searching for an explanation; it helps in practice (the parts I personally consider most worth attention were originally highlighted in red). The article is fairly long, so I have published it in two parts; it can also be converted into a PDF document for reading (a download link is provided at the end of the second part).
How web crawlers work
1. Overview of crawler principles and key technologies
A web crawler is a program that automatically retrieves web pages. It downloads pages from the Internet on behalf of a search engine and is an essential component of one. A traditional crawler starts from the URLs of one or more seed pages, continuously extracts new URLs from the pages it fetches, and places them in a queue, repeating the process until a stopping condition is met. The workflow of a focused crawler is more complex: it must filter out links unrelated to the topic according to some web page analysis algorithm, keep the useful links, and place them in the URL queue awaiting crawling. It then selects the next URL to crawl from the queue according to a search policy and repeats the process until some system condition is reached. In addition, all pages fetched by the crawler are stored, analyzed, filtered, and indexed by the system for later queries and retrieval; for a focused crawler, the analysis results obtained in this process can also feed back into and guide subsequent crawling.
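As an illustration of the loop described above, here is a minimal sketch of a traditional (non-focused) crawler in Python. It is not from the original article; the helper names, seed URLs, page limit, and the use of the requests and BeautifulSoup libraries are assumptions made for the example.

```python
import urllib.parse
from collections import deque

import requests                   # assumed available; any HTTP client works
from bs4 import BeautifulSoup     # assumed available for link extraction


def extract_links(base_url, html):
    """Return absolute URLs of all <a href> links found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return {urllib.parse.urljoin(base_url, a["href"])
            for a in soup.find_all("a", href=True)}


def crawl(seed_urls, max_pages=100):
    """Traditional crawler loop: seed URLs -> fetch -> extract -> enqueue."""
    frontier = deque(seed_urls)   # URL queue waiting to be crawled
    seen = set(seed_urls)
    pages = {}                    # downloaded pages kept for later indexing

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue              # skip pages that cannot be fetched
        pages[url] = html         # stored for analysis, filtering, indexing
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```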
Compared with a general-purpose web crawler, a focused crawler must also solve three main problems:
Description or definition of the crawl target;
Analysis and filtering of web pages or data;
A search policy for URLs.
The description and definition of the crawl target form the basis for deciding which web page analysis algorithm and URL search policy to use. The web page analysis algorithm and the candidate-URL ranking algorithm are the key factors that determine both the form of service the search engine can provide and the crawler's page-fetching behavior; the two algorithms are closely related.
2. Description of the crawl target
Descriptions of the crawl target fall into three types: those based on the features of the target web pages, those based on a target data model, and those based on domain concepts.
For crawlers based on the features of the target web pages, the objects crawled, stored, and indexed are generally websites or web pages. The ways of obtaining seed samples can be divided into:
a pre-defined initial seed sample;
a pre-defined web page category directory together with seed samples corresponding to each category, such as the Yahoo! classification structure;
samples of the crawl target determined from user behavior, which can further be divided into samples marked while the user browses, and access patterns and related samples obtained by mining user logs.
Here, the web page features may be content features of the page, link structure features of the page, and so on.
Crawlers based on a target data model aim at the data on web pages; the captured data generally conforms to a certain schema, or can be converted or mapped into the target data schema.
Another way of describing the target is to build an ontology or dictionary for the target domain and analyze, from a semantic perspective, how important different features are to a given topic.
3. Web page search policies
Web page crawling policies can be divided into depth-first, breadth-first, and best-first. In many cases, depth-first crawling causes the crawler to become trapped, so the breadth-first and best-first policies are the ones commonly used today.
3.1 Breadth-first search policy
Under the breadth-first search policy, the crawler moves on to the next level of links only after the current level has been crawled completely. The design and implementation of this algorithm are relatively simple, and the breadth-first method is generally used when the goal is to cover as many web pages as possible. Many studies also apply the breadth-first policy to focused crawlers; the basic idea is that pages within a certain number of links of the initial URL have a high probability of being relevant to the topic. Another approach combines breadth-first search with web page filtering: pages are first fetched with the breadth-first policy and irrelevant pages are then filtered out. The drawback of these methods is that as the number of crawled pages grows, large numbers of irrelevant pages are downloaded and filtered, and the efficiency of the algorithm drops.
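A minimal sketch of the breadth-first-plus-filtering idea described above. The is_relevant check (a simple keyword test), the level limit, and the fetch/extract_links helpers (e.g. the ones sketched in section 1) are assumptions for illustration, not part of the original article.

```python
from collections import deque

TOPIC_KEYWORDS = {"crawler", "search engine", "seo"}   # assumed topic terms


def is_relevant(html):
    """Crude page filter: keep a page if it mentions any topic keyword."""
    text = html.lower()
    return any(kw in text for kw in TOPIC_KEYWORDS)


def bfs_focused_crawl(seed_urls, fetch, extract_links, max_level=2):
    """Breadth-first crawl by level, then discard irrelevant pages.

    `fetch(url)` (returning None on failure) and `extract_links(url, html)`
    are assumed helpers supplied by the caller.
    """
    frontier = deque((url, 0) for url in seed_urls)
    seen = set(seed_urls)
    kept = {}

    while frontier:
        url, level = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        if is_relevant(html):          # filtering happens *after* download
            kept[url] = html
        if level < max_level:
            for link in extract_links(url, html):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, level + 1))
    return kept
```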
3.2 Best-first search policy
The best-first search policy uses some web page analysis algorithm to predict how similar a candidate URL is to the target page, or how relevant it is to the topic, and selects one or more of the best-rated URLs to crawl next. It visits only the pages that the analysis algorithm predicts to be "useful". One problem is that many relevant pages along the crawl path may be ignored, because the best-first policy is a locally optimal search algorithm. It therefore needs to be improved in combination with the specific application so that it can escape such local optima; this is discussed in detail together with the web page analysis algorithms in section 4. Research shows that this kind of closed-loop adjustment can reduce the number of irrelevant pages by 30% to 90%.
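A minimal sketch of a best-first frontier built on a priority queue, assuming a score(url, anchor_text) relevance predictor that is not part of the original article; in practice the score would come from one of the web page analysis algorithms in section 4.

```python
import heapq
import itertools


class BestFirstFrontier:
    """URL frontier ordered by predicted relevance (highest score popped first)."""

    def __init__(self, score):
        self._score = score                # callable: (url, anchor_text) -> float
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal scores

    def push(self, url, anchor_text=""):
        s = self._score(url, anchor_text)
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-s, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


# Example: a toy scorer that prefers URLs whose anchor text mentions the topic.
def toy_score(url, anchor_text):
    return anchor_text.lower().count("crawler") + url.lower().count("seo")
```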
4. Web page analysis algorithms
Web page analysis algorithms can be classified into three types: those based on network topology, those based on page content, and those based on user access behavior.
4.1 Network topology-based analysis algorithms
Based on the links between web pages, known pages or data are used to evaluate objects that are directly or indirectly linked to them (such as web pages or websites). These algorithms are further divided into three granularities: web page, website, and web page block.
4.1.1 Web page granularity analysis algorithms
PageRank and HITS are the most common link analysis algorithms. Both perform recursive, normalized calculations over the link relationships between web pages to obtain an importance score for each page.
Although PageRank accounts for the randomness of user browsing behavior and the existence of sink pages, it ignores the fact that most users browse with a purpose, that is, the relevance of pages and links to the query topic. To address this, the HITS algorithm introduces two key concepts: authority pages and hub pages.
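A minimal sketch of both ideas on a toy link graph, using plain power iteration; the damping factor, iteration counts, and the adjacency matrix are assumptions for illustration, not taken from the article.

```python
import numpy as np


def pagerank(adj, d=0.85, iters=50):
    """Power iteration for PageRank on an adjacency matrix adj[i, j] = 1
    if page i links to page j. Dangling (sink) pages jump uniformly."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-stochastic transition matrix; rows with no out-links spread 1/n.
    M = np.where(out_deg[:, None] > 0,
                 adj / np.maximum(out_deg, 1)[:, None],
                 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M.T @ r)
    return r


def hits(adj, iters=50):
    """Iterative HITS: a page's authority sums the hub scores of pages linking
    to it; its hub score sums the authority scores of pages it links to."""
    n = adj.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iters):
        auth = adj.T @ hub
        auth /= np.linalg.norm(auth)
        hub = adj @ auth
        hub /= np.linalg.norm(hub)
    return hub, auth


# Toy 4-page link graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
print(pagerank(A))
print(hits(A))
```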
A problem with link-based crawling is the "tunnel" phenomenon between clusters of topically related pages: many pages on the crawl path that drift away from the topic nevertheless lead to the target page, yet a local evaluation policy cuts off crawling along the current path. Some work has therefore proposed a layered context model (Context Model) based on backlinks, which describes the topology of the pages within a certain number of physical hops of a target page: the target page forms the center, Layer 0, pages are divided into layers according to the number of physical hops needed to reach the target page, and a link from an outer-layer page to an inner-layer page is called a backlink.
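A minimal sketch of how layers in such a context model could be assigned, using a breadth-first pass over reversed links; the in_links representation, the radius limit, and the toy example are assumptions for illustration.

```python
from collections import deque


def context_layers(target, in_links, max_radius=3):
    """Assign each page a layer = number of hops needed to reach `target`.

    `in_links[page]` is assumed to list the pages that link *to* `page`,
    so walking it outward from the target follows links in reverse.
    Layer 0 is the target page itself.
    """
    layer = {target: 0}
    frontier = deque([target])
    while frontier:
        page = frontier.popleft()
        if layer[page] >= max_radius:
            continue
        for predecessor in in_links.get(page, []):
            if predecessor not in layer:
                layer[predecessor] = layer[page] + 1
                frontier.append(predecessor)
    return layer


# Toy example: c -> t, b -> c, a -> b  (a needs 3 hops to reach the target t)
in_links = {"t": ["c"], "c": ["b"], "b": ["a"]}
print(context_layers("t", in_links))   # {'t': 0, 'c': 1, 'b': 2, 'a': 3}
```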
4.1.2 Website granularity analysis algorithms
Resource discovery and management policies at the website granularity are simpler and more effective than those at the web page granularity. The key to crawling at the website granularity lies in how sites are divided and how SiteRank is calculated. SiteRank is computed in a way similar to PageRank, but the links between sites must be abstracted to some degree and the link weights calculated under a specific model.
Sites can be divided either by domain name or by IP address. Some work discusses, in a distributed setting, dividing sites by the IP addresses of the different hosts and servers under the same domain name, constructing a site graph, and evaluating SiteRank with a PageRank-like method. At the same time, a document graph is built from the distribution of files across the sites, and DocRank is obtained in combination with the distributed SiteRank computation. Distributed SiteRank computation not only greatly reduces the algorithmic cost compared with a single machine, but also overcomes the limited network coverage of any single site. An additional advantage is that common PageRank spamming techniques are difficult to use against SiteRank.
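A minimal sketch of the "abstract page links into site links" step, grouping URLs by host name; the grouping rule and the toy data are assumptions for illustration. A SiteRank could then be computed on the resulting site graph with a PageRank-style iteration.

```python
import urllib.parse
from collections import defaultdict


def site_of(url):
    """Map a URL to its site; here simply the host name (one possible division)."""
    return urllib.parse.urlsplit(url).netloc.lower()


def build_site_graph(page_links):
    """Collapse page-level links {url: [urls it links to]} into weighted
    site-level links {(site_a, site_b): count}, dropping within-site links."""
    site_edges = defaultdict(int)
    for src, targets in page_links.items():
        for dst in targets:
            a, b = site_of(src), site_of(dst)
            if a and b and a != b:
                site_edges[(a, b)] += 1
    return dict(site_edges)


page_links = {
    "http://a.example/p1": ["http://b.example/x", "http://a.example/p2"],
    "http://b.example/x": ["http://a.example/p1", "http://c.example/home"],
}
print(build_site_graph(page_links))
# {('a.example', 'b.example'): 1, ('b.example', 'a.example'): 1,
#  ('b.example', 'c.example'): 1}
```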
4.1.3 Web page block granularity analysis algorithms
A page usually contains multiple links to other pages, and only some of those links point to pages related to the topic or marked as highly important by the page's links and anchor text. PageRank and HITS, however, do not distinguish between these links, so page analysis is often disturbed by noise links such as advertisements. The basic idea of block-level link analysis is to segment a page into blocks with the VIPS page segmentation algorithm and then build the page-to-block link matrix X and the block-to-page link matrix Z. The block-level PageRank on the resulting page-to-page graph is then W(p) = X × Z, and the BlockRank on the block-to-block graph is W(b) = Z × X.
Block-level implementations of PageRank and HITS have been reported, and experiments show that both efficiency and accuracy are better than with the traditional algorithms.
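A minimal numeric sketch of the two matrix products, with a made-up example of 2 pages and 3 blocks; the concrete matrix values are assumptions purely for illustration.

```python
import numpy as np

# Page-to-block matrix X: X[p, b] > 0 if page p contains block b
# (rows can be weighted by block importance and normalized).
X = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.3, 0.7]])

# Block-to-page matrix Z: Z[b, p] > 0 if block b contains a link to page p.
Z = np.array([[0.0, 1.0],
              [0.5, 0.5],
              [1.0, 0.0]])

W_p = X @ Z   # page-to-page graph used for block-level PageRank
W_b = Z @ X   # block-to-block graph used for BlockRank

print(W_p)    # shape (2, 2): weighted links between pages
print(W_b)    # shape (3, 3): weighted links between blocks
```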
4.2 Content-based web page analysis algorithms
Content-based analysis algorithms evaluate web pages using the characteristics of their content (text, data, and other resources). Web page content has evolved from being mainly hypertext to being dominated by dynamically generated pages (the so-called Hidden Web), whose data volume is estimated to be 400 to 500 times that of the directly visible page data (PIW, Publicly Indexable Web). At the same time, multimedia data, Web services, and other kinds of network resources are becoming increasingly abundant. Content-based analysis algorithms have therefore developed from the original simple text retrieval methods into comprehensive applications that combine web page data extraction, machine learning, data mining, semantic understanding, and other techniques. Based on the different forms of web page data, this section groups content-based analysis algorithms into three categories:
pages dominated by text and hyperlinks, whether unstructured or with very little structure;
pages generated dynamically from structured data sources (such as an RDBMS), whose data cannot be accessed directly in bulk;
data that lies between the first and second categories: well structured, displayed according to a certain pattern or template, and directly accessible.
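As one example of the simple text-retrieval end of this spectrum, here is a minimal sketch that scores a page's topical relevance with a bag-of-words cosine similarity; the tokenizer and the topic description string are assumptions for illustration, not part of the original article.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (assumed; real systems use richer extraction)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two texts, in [0, 1]."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


topic = "web crawler search engine page ranking"
page_text = "A web crawler downloads pages for a search engine to index."
print(cosine_similarity(topic, page_text))
```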
