Focused Web Crawlers

Some time ago, I was busy preparing for an AI project defense. I wrote a very simple web program, a web crawler, and then prepared an opening report, a defense presentation (PPT), and a design document (Word) in the style of a graduation thesis. I was reluctant at first, but after the project was finished I found that web crawlers are actually quite interesting and closely related to what we are studying now. Below is a brief introduction to web crawler knowledge.


1. Crawler Principles and Key Technologies

A web crawler is a program that automatically extracts web pages. It downloads pages from the Internet for a search engine and is an important component of a search engine. A traditional crawler starts from the URLs of one or more initial web pages, obtains the URLs on those pages, and continuously extracts new URLs from the current page and puts them into a queue until a stop condition is met. The workflow of a focused crawler is more complex: it filters out links unrelated to the topic according to a certain web analysis algorithm, keeps the useful links, and puts them into the URL queue waiting to be crawled. It then selects the next URL to crawl from the queue according to a certain search policy and repeats this process until a certain system condition is reached. In addition, all crawled web pages are stored by the system, then analyzed, filtered, and indexed for subsequent queries and searches. For focused crawlers, the analysis results obtained in this process may also provide feedback and guidance for later crawling.
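To make this concrete, here is a minimal Python sketch of the crawl loop described above. It is only an illustration, not a production crawler: it assumes the third-party requests and beautifulsoup4 packages are installed, and is_relevant is a placeholder for whatever topic filter a focused crawler would actually use.

```python
# Minimal sketch of the traditional crawl loop: seed URLs -> fetch -> extract links -> enqueue.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def is_relevant(url: str, html: str) -> bool:
    """Placeholder topic filter; a focused crawler would score the page here."""
    return True


def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)      # URL queue waiting to be crawled
    seen = set(seed_urls)
    stored = {}                      # url -> html, kept for later analysis and indexing

    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue

        if not is_relevant(url, html):
            continue                 # drop pages unrelated to the topic
        stored[url] = html

        # Extract new URLs from the current page and enqueue the unseen ones.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored
```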

Compared with a general web crawler, a focused crawler also needs to solve three main problems:

Description or definition of the crawling target;

Analysis and filtering of webpages or data;

URL search policy.

The description and definition of the crawling target are the basis for deciding on the web analysis algorithm and the URL search policy. The web analysis algorithm and the candidate URL ranking algorithm are the key factors that determine the form of service provided by the search engine and the crawler's behavior. The two parts are closely related.

2. Crawling Target Description

Existing descriptions of focused crawling targets can be divided into three types: based on target web page features, based on target data models, and based on domain concepts.

For crawlers based on target web page features, the objects crawled, stored, and indexed are generally websites or web pages. The methods for obtaining seed samples can be divided into:

Pre-defined initial seed sample;

A pre-defined web page category directory and the seed samples corresponding to the categories, such as the Yahoo! classification structure;

Target samples determined by user behavior, which can be further divided into: samples marked during user browsing, and access patterns and related samples obtained through user log mining.

Here, the web page features can be the content features of the page, its link structure features, and so on.

Crawlers based on a target data model target the data on web pages. The captured data generally conforms to a certain schema, or can be converted or mapped to the target data model.

Another method of description is to create an ontology or dictionary for the target domain to analyze the importance of different features in a topic from the semantic perspective.

3. Webpage Search Policy

Web page crawling policies can be divided into depth-first, breadth-first, and best-first. In many cases, depth-first may cause the crawler to get trapped. Currently, the common strategies are breadth-first and best-first.

For details, refer to "breadth-first and best-first".
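As an illustration of the difference between the two common strategies, here is a small Python sketch (my own, not taken from any particular crawler framework) of the two frontier data structures: breadth-first uses a FIFO queue, while best-first uses a priority queue ordered by a predicted relevance score.

```python
import heapq
from collections import deque


class BreadthFirstFrontier:
    """FIFO queue: crawl pages in the order their links were discovered."""
    def __init__(self):
        self._q = deque()
    def push(self, url, score=None):
        self._q.append(url)
    def pop(self):
        return self._q.popleft()
    def __bool__(self):
        return bool(self._q)


class BestFirstFrontier:
    """Priority queue: always crawl the candidate with the highest predicted relevance."""
    def __init__(self):
        self._heap = []
        self._counter = 0           # tie-breaker so equal scores stay in insertion order
    def push(self, url, score):
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
    def pop(self):
        return heapq.heappop(self._heap)[2]
    def __bool__(self):
        return bool(self._heap)


frontier = BestFirstFrontier()
frontier.push("http://example.com/a", score=0.9)
frontier.push("http://example.com/b", score=0.4)
print(frontier.pop())   # the higher-scored URL comes out first
```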

4. Webpage Analysis Algorithms

Webpage analysis algorithms can be classified into three types: based on network topology, based on webpage content, and based on user access behavior.

1. Network Topology-Based Analysis Algorithms

Based on the links between web pages, known web pages or data are used to evaluate objects that have direct or indirect link relationships with them (such as web pages or websites). These algorithms can be further divided into three types: web page granularity, website granularity, and web page block granularity.

1.1 Web Page Granularity Analysis Algorithms

PageRank and HITS are the most common link analysis algorithms. Both perform a recursive, normalized calculation over the links between web pages to obtain an importance score for each page. Although the PageRank algorithm takes into account the randomness of user access behavior and the existence of sink pages, it ignores the purposefulness of most users' visits, that is, the relevance of pages and links to the query topic. To address this problem, the HITS algorithm introduces two key concepts: authority and hub.
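To illustrate the recursive, normalized computation mentioned above, here is a minimal PageRank power-iteration sketch; the damping factor of 0.85, the fixed iteration count, and the toy graph are illustrative assumptions, and sink pages are handled by spreading their rank evenly across all pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # sink page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
    print(pagerank(toy_graph))
```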

A problem with link-based crawling is the tunnel phenomenon between topic groups of relevant pages: many pages on the crawling path deviate from the topic yet still lead to the target page, and a local evaluation policy would interrupt crawling on the current path. Some literature proposes a hierarchical context model based on backlinks, which describes the page topology within a certain radius of physical link hops around a target page: the target page is layer 0, pages are layered by the number of physical hops to the target, and a link from an outer-layer page to an inner-layer page is called a backlink.
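As a rough illustration of the layering idea only (not the exact model from the literature), the sketch below assigns each page a layer according to its backlink distance from the target page; the backlinks dictionary is a made-up example.

```python
from collections import deque


def layer_by_backlink_distance(backlinks, target, max_radius=3):
    """Return {page: layer}, where the target page is layer 0.

    backlinks: dict mapping each page to the pages that link to it.
    """
    layers = {target: 0}
    queue = deque([target])
    while queue:
        page = queue.popleft()
        if layers[page] >= max_radius:
            continue
        for referrer in backlinks.get(page, []):
            if referrer not in layers:
                layers[referrer] = layers[page] + 1   # one physical hop further out
                queue.append(referrer)
    return layers


backlinks = {"target": ["a", "b"], "a": ["c"], "b": ["c", "d"]}
print(layer_by_backlink_distance(backlinks, "target"))
```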

1.2 Website Granularity Analysis Algorithms

Resource discovery and management policies at the website granularity are simpler and more effective than those at the web page granularity. The key to website-granularity crawling lies in dividing the sites and calculating SiteRank. SiteRank is computed in a way similar to PageRank, but the links between sites must be abstracted to some degree, and the link weights must be calculated under a certain model.

Websites can be divided by domain name or by IP address. Some literature discusses, in a distributed setting, dividing the IP addresses of the different hosts and servers under the same domain name, constructing a site graph, and evaluating SiteRank with a PageRank-like method. At the same time, a document graph is constructed based on the distribution of files on each site, and DocRank is obtained from the distributed SiteRank computation. Using distributed SiteRank computation not only greatly reduces the algorithmic cost on a single machine, but also overcomes the drawback that an individual site covers only a limited part of the whole network. A side benefit is that common PageRank forgery is difficult to use to cheat SiteRank.
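As an illustration of the abstraction step, here is a small sketch that aggregates page-level links into a weighted site-level graph by domain name. Weighting each site-to-site edge by the number of underlying page links is just one plausible model, not necessarily the one used in the literature; the resulting graph could then be fed to a PageRank-like computation.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_graph(page_links):
    """page_links: iterable of (source_url, target_url) pairs.

    Returns {(source_site, target_site): weight}, ignoring intra-site links.
    """
    weights = defaultdict(int)
    for src, dst in page_links:
        src_site, dst_site = urlparse(src).netloc, urlparse(dst).netloc
        if src_site and dst_site and src_site != dst_site:
            weights[(src_site, dst_site)] += 1   # aggregate page links per site pair
    return dict(weights)


links = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/y"),
    ("http://b.example/x", "http://c.example/z"),
]
print(site_graph(links))
```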

1.3 Web Page Block Granularity Analysis Algorithms

A page usually contains multiple links pointing to other pages, and only some of those links point to pages relevant to the topic, or are highly important according to the page's links and anchor text. However, the PageRank and HITS algorithms do not distinguish between these links, so webpage analysis is often disturbed by noise links such as advertisements. The basic idea of block-level link analysis is to divide a web page into different page blocks using the VIPS page segmentation algorithm, and to build a page-to-block matrix X and a block-to-page matrix Z for these blocks. The block-level PageRank on the page-to-page graph is then W_p = X × Z, and the BlockRank on the block-to-block graph is W_b = Z × X. Block-level versions of the PageRank and HITS algorithms have been implemented, and experiments show that their efficiency and accuracy are better than those of the traditional algorithms.
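For readers who prefer to see the matrices, the following sketch shows the two products on a toy example. It assumes numpy is installed; the X and Z values are made up for illustration and do not come from a real VIPS segmentation.

```python
import numpy as np

# 3 pages, 4 blocks; the numbers below are an illustrative toy example.
X = np.array([          # page-to-block matrix (pages x blocks): which blocks each page contains
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
Z = np.array([          # block-to-page matrix (blocks x pages): which pages each block links to
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])

W_p = X @ Z             # page-to-page graph (3 x 3) used for block-level PageRank
W_b = Z @ X             # block-to-block graph (4 x 4) used for BlockRank

print("page-to-page:\n", W_p)
print("block-to-block:\n", W_b)
```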

2. Analysis Algorithms Based on Webpage Content

Analysis algorithms based on webpage content evaluate web pages according to the characteristics of their content (text, data, and other resources). Webpage content has evolved from hypertext to dynamic page (or hidden Web) data, and the latter's data volume is roughly 400 to 500 times that of the directly visible page data (PIW, Publicly Indexable Web). Meanwhile, multimedia data, Web services, and other kinds of network resources are becoming increasingly abundant. Content-based analysis algorithms have therefore also developed from the original simple text retrieval methods into comprehensive applications covering webpage data extraction, machine learning, data mining, semantic understanding, and other techniques. This section summarizes content-based analysis algorithms according to the different forms of webpage data (a sketch of a simple relevance score follows the list):

Unstructured or simply structured web pages dominated by text and hyperlinks;

Pages dynamically generated from structured data sources (such as an RDBMS), whose data cannot be accessed directly in bulk;

Target data that falls between the first and second types: well structured, displayed according to a certain pattern or template, and directly accessible.
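For the first type of page, the simplest content analysis is a text-retrieval style relevance score. The sketch below compares term-frequency vectors of a page and a topic description using cosine similarity; the naive tokenizer and the example texts are illustrative assumptions, and a real focused crawler would use a more elaborate scoring model.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Very naive tokenizer: lowercase alphabetic terms with their frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


topic = term_vector("web crawler search engine link analysis")
page_text = term_vector("A focused web crawler selects links for a search engine")
print(cosine_similarity(topic, page_text))   # higher score = more relevant page
```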

Postscript: a web crawler is a "robot" used to search for information. A modern search engine can perform this task at a sustained rate that no human can match. If you are interested, you can develop a small crawler of your own and appreciate how powerful it is.
