Crawler Technology: Web Crawlers


A web crawler is a program that automatically extracts Web pages: it downloads pages from the World Wide Web and is an important component of search engines. The following series of articles gives a detailed introduction to crawler technology, in the hope that readers will eventually be able to build the crawlers they want.

Web crawler Technology
With the rapid development of the network, the World Wide Web has become the carrier of a huge amount of information, and extracting and using that information effectively has become a great challenge. Search engines, such as the traditional general-purpose engines AltaVista, Baidu, Yahoo! and Google, serve as tools that help people retrieve information and have become the portals and guides through which users access the World Wide Web. However, these general-purpose search engines also have limitations.

Design and analysis of web crawler in search engine
Here is a brief introduction to how a search engine's crawler is built and some basic points to pay attention to. Simply put, a web crawler is similar to the offline reading tools you may have used. Although it is called "offline", the crawler still has to connect to the network, otherwise there would be nothing to fetch. So where do the two differ?

Graph theory and web crawlers
Discrete mathematics is an important branch of contemporary mathematics and the mathematical foundation of computer science. It includes four branches: mathematical logic, set theory, graph theory, and modern algebra. Mathematical logic is based on Boolean operations, which we have already introduced. Here we introduce the relationship between graph theory and the web crawler, the Internet's automatic download tool. Incidentally, if we use Google Trends to search for the phrase "discrete mathematics", we can find many interesting phenomena.

Search engine technology in PHP
We want to choose a search site whose results are accurate (so that our searches are more meaningful), fast (because analyzing and displaying the search results takes extra time), and concise (easy to parse and strip from the HTML source). Because of the fine qualities of the new-generation search engine Google, we choose it as our example here and look at how PHP can query Google (www.google.com) in the background and display the results in a personalized front end.

Search engine Spider catcher (PHP)
This article shows the PHP code that implements a spider's page capture.

Multi-threaded control of spider/Crawler programs (C # language)
In the "Crawler/Spider Program Production (C # language)" article, has introduced the crawler implementation of the basic methods, it can be said that the crawler has realized the function. It's just that there is an efficiency problem and the download speed may be slow. This is caused by two reasons ...

Alternative Search data methods: web crawler Program

We are all familiar with using various search engines, but there is a more proactive and specialized search technique: the Web crawler.

A Review of Crawler Technology (1)

Introduction

With the rapid development of the network, the World Wide Web has become the carrier of a huge amount of information, and extracting and using that information effectively has become a great challenge. Search engines, such as the traditional general-purpose engines AltaVista, Yahoo! and Google, serve as tools that help people retrieve information and have become the portals and guides through which users access the World Wide Web. However, these general-purpose search engines also have limitations, such as:

(1) Users in different fields and with different backgrounds often have different search purposes and needs, and the results returned by a general-purpose search engine contain a large number of pages the user does not care about.

(2) The goal of a general-purpose search engine is network coverage that is as large as possible, so the contradiction between limited search-engine server resources and unlimited network data resources keeps deepening.

(3) As Web data forms grow richer and network technology keeps developing, large amounts of different data such as pictures, databases, and audio/video multimedia appear. A general-purpose search engine is often powerless against these kinds of information, whose content is dense and has a certain structure, and cannot discover and acquire them well.

(4) Most general-purpose search engines provide keyword-based retrieval and find it difficult to support queries based on semantic information.

To address these problems, focused crawlers that fetch specific Web resources have emerged. A focused crawler is a program that automatically downloads Web pages; guided by an established crawl target, it selectively accesses Web pages and related links to obtain the required information. Unlike a general-purpose web crawler, the focused crawler does not pursue broad coverage; its goal is to fetch pages whose content is related to a particular topic and to prepare data resources for topic-oriented user queries.

1 Working Principle and Key Technologies of the Focused Crawler

A web crawler is a program that automatically extracts Web pages: it downloads pages from the World Wide Web and is an important component of search engines. A traditional crawler starts from the URLs of one or several initial Web pages, obtains the URLs on those initial pages, and, while crawling, continually extracts new URLs from the current page and puts them into the queue, until a certain stop condition of the system is satisfied, as shown in the flowchart of Figure 1(a). The workflow of a focused crawler is more complicated: it must filter out links unrelated to the topic according to some Web page analysis algorithm, keep the useful links, and put them into the queue of URLs waiting to be crawled. It then selects the URL of the next page to crawl from this queue according to a certain search strategy, and repeats the process until a certain condition of the system is reached, as shown in Figure 1(b). In addition, all pages crawled are stored by the system and then analyzed, filtered, and indexed for later query and retrieval; for the focused crawler, the results of this analysis may also provide feedback and guidance for subsequent crawling.
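To make the loop just described concrete, here is a minimal sketch in Python of the traditional crawl cycle from Figure 1(a). The seed URLs, the stop condition (a simple page budget), and the regular-expression link extraction are illustrative assumptions, not details taken from this article.

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seed_urls, max_pages=100):
        """Traditional crawler: start from seed URLs and keep pulling URLs
        from the queue until the stop condition (page budget) is met."""
        queue = deque(seed_urls)      # URLs waiting to be crawled
        seen = set(seed_urls)         # avoid fetching the same URL twice
        pages = {}                    # downloaded pages kept for later indexing

        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue              # skip pages that fail to download
            pages[url] = html
            # extract new URLs from the current page and append them to the queue
            for link in re.findall(r'href="([^"#]+)"', html):
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages

A focused crawler as in Figure 1(b) would add two hooks to this loop: a page-analysis step that discards topic-irrelevant links before they enter the queue, and a search strategy that decides which queued URL to fetch next (see Sections 3 and 4 below).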

Compared with a general-purpose web crawler, a focused crawler must solve three main problems:

(1) description or definition of the crawl target;

(2) analysis and filtering of Web pages or data;

(3) the search strategy for URLs.

The description and definition of the crawl target are the basis for deciding how to design the Web page analysis algorithm and the URL search strategy. The Web page analysis algorithm and the candidate-URL ranking algorithm are the key factors that determine the form of service the search engine provides and the crawler's page-fetching behavior. These two parts are closely related.

2 Crawl Target Description

Existing focused crawlers can be classified into three kinds according to how they describe or define the crawl target: based on features of the target Web pages, based on the target data pattern, and based on domain concepts.

For crawlers whose crawl target is described by the features of target Web pages, the objects crawled, stored, and indexed are generally Web sites or Web pages. According to how seed samples are obtained, they can be divided into:

(1) initial seed samples given in advance;

(2) a pre-given Web page classification directory and seed samples corresponding to its categories, such as the Yahoo! classification structure;

(3) crawl target samples determined by user behavior, which are further divided into:

(a) crawl samples annotated during the user's browsing process;

(b) access patterns and related samples obtained through user-log mining.

Here, the Web page features can be content features of the page, link-structure features of the page, and so on.


A Review of Crawler Technology (2)

Crawlers based on a target data pattern aim at the data on Web pages: the data crawled must conform to a given pattern, or be able to be transformed or mapped into the target data pattern.

Another description method is to build an ontology or dictionary for the target domain, which is used to analyze, from a semantic perspective, the importance of different features within a topic.

3 Web Page Search Strategies

Web page crawl strategies can be divided into three kinds: depth-first, breadth-first, and best-first. Depth-first search in many cases causes the crawler to become trapped, so the commonly used strategies today are breadth-first and best-first.

3.1 Breadth-First Search Strategy

The breadth-first search strategy means that, during crawling, the search at the current level is completed before the search moves on to the next level. The design and implementation of this algorithm are relatively simple. To cover as many pages as possible, the breadth-first method is generally used at present. There are also many studies that apply the breadth-first strategy to focused crawling. The basic idea is that pages within a certain link distance of the initial URL have a high probability of being relevant to the topic. Another approach combines breadth-first search with Web page filtering: pages are first fetched with the breadth-first strategy, and irrelevant pages are then filtered out. The drawback of these methods is that as the number of crawled pages grows, a large number of irrelevant pages are downloaded and filtered, and the efficiency of the algorithm drops.
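The following sketch illustrates the "breadth-first plus filtering" variant described above. The fetch, extract_links, and is_relevant functions, as well as the link-distance bound max_depth, are placeholders assumed for illustration; they are not defined in the article.

    from collections import deque

    def bfs_crawl(seed_urls, fetch, extract_links, is_relevant, max_depth=2):
        """Breadth-first focused crawl: finish each link-distance level before
        moving to the next, then keep only pages judged relevant."""
        queue = deque((url, 0) for url in seed_urls)
        seen = set(seed_urls)
        kept = {}

        while queue:
            url, depth = queue.popleft()       # FIFO order = level by level
            page = fetch(url)
            if page is None:
                continue
            if is_relevant(page):              # breadth-first + post-filtering
                kept[url] = page
            if depth < max_depth:              # link-distance bound from the seeds
                for link in extract_links(url, page):
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, depth + 1))
        return kept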

3.2 Best-First Search Strategy

The best-first search strategy uses some Web page analysis algorithm to predict a candidate URL's similarity to the target page, or its relevance to the topic, and selects the one or several best-rated URLs to crawl. It visits only the pages that the page-analysis algorithm predicts to be "useful". One problem is that many relevant pages on the crawl path may be ignored, because the best-first strategy is a local optimal search algorithm. It therefore needs to be improved in combination with concrete applications so that it can jump out of local optima. This will be discussed concretely together with the Web page analysis algorithms in Section 4. Research shows that such closed-loop adjustment can reduce the number of irrelevant pages by 30% to 90%.
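Below is a minimal sketch of the best-first strategy built around a priority queue. Here predict_score(url, anchor_text) stands for whatever page-analysis heuristic estimates topic relevance; it, together with fetch and extract_links (assumed to return (URL, anchor text) pairs), is an illustrative assumption rather than part of the article.

    import heapq

    def best_first_crawl(seed_urls, fetch, extract_links, predict_score,
                         max_pages=100):
        """Best-first focused crawl: always expand the candidate URL with the
        highest predicted topic relevance. heapq is a min-heap, so scores are
        stored negated."""
        frontier = [(-1.0, url, "") for url in seed_urls]   # seeds get top priority
        heapq.heapify(frontier)
        seen = set(seed_urls)
        crawled = {}

        while frontier and len(crawled) < max_pages:
            _neg_score, url, _anchor = heapq.heappop(frontier)
            page = fetch(url)
            if page is None:
                continue
            crawled[url] = page
            for link, anchor in extract_links(url, page):
                if link not in seen:
                    seen.add(link)
                    score = predict_score(link, anchor)   # predicted relevance
                    heapq.heappush(frontier, (-score, link, anchor))
        return crawled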

4 Web Page Analysis Algorithms

Web page analysis algorithms can be classified into three types: based on network topology, based on Web page content, and based on user access behavior.

4.1 Analysis Algorithms Based on Network Topology

These algorithms use the links between pages to evaluate, through known pages or data, objects (which can be Web pages, Web sites, etc.) that have a direct or indirect link relationship with them. They are further divided into three granularities: Web page, Web site, and page block.

4.1.1 Web-Page-Granularity Analysis Algorithms

PageRank and HITS are the most common link analysis algorithms. Both evaluate the importance of each page through recursive, normalized computation over the links between pages. Although the PageRank algorithm takes into account the randomness of user access behavior and the existence of sink pages, it ignores the relevance of pages and links to the query topic during most users' visits. To address this problem, the HITS algorithm introduced two key concepts: authoritative pages (authorities) and hub pages (hubs).
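As a reference point, here is a minimal sketch of PageRank's power iteration. The damping factor 0.85, the iteration count, and the tiny example graph are conventional or made-up values, not taken from the article; rank from pages with no out-links (the sink pages mentioned above) is redistributed evenly.

    def pagerank(out_links, damping=0.85, iterations=50):
        """Minimal PageRank power iteration.
        out_links maps each page to the list of pages it links to."""
        pages = list(out_links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, targets in out_links.items():
                if targets:
                    share = damping * rank[page] / len(targets)
                    for t in targets:
                        if t in new_rank:
                            new_rank[t] += share
                else:
                    # sink page: spread its rank evenly over all pages
                    for t in pages:
                        new_rank[t] += damping * rank[page] / n
            rank = new_rank
        return rank

    # Tiny made-up link graph: A and B link to each other, C links to A.
    print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))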

A problem with link-based crawling is the "tunnel" phenomenon between related pages: many pages on the crawl path that deviate from the topic nevertheless lead to the target page, and a purely local evaluation strategy interrupts crawling on the current path. In [21], a hierarchical context model based on backlinks is proposed. It describes the topology of the pages within a certain physical hop radius of a target page, with the target page as the center (Layer 0); pages are layered by their physical hop count to the target page, and a link from an outer-layer page to an inner-layer page is called a backlink.

4.1.2 Website-Granularity Analysis Algorithms

Resource discovery and management strategies at the granularity of the Web site are simpler and more effective than those at the granularity of individual pages. The key points of site-granularity crawling are the division of sites and the calculation of a site rank (SiteRank). SiteRank is computed similarly to PageRank, but the links between sites need to be abstracted to some extent, and link weights must be calculated under a certain model.

Sites can be divided in two ways: by domain name or by IP address. The paper [18] discusses dividing the IP addresses of the different hosts and servers under the same domain name, constructing a site graph, and evaluating SiteRank with a PageRank-like method. At the same time, according to the distribution of files on each site, a document graph is constructed, and DocRank is obtained in combination with the distributed SiteRank computation. [18] shows that distributed SiteRank computation not only greatly reduces the cost of the stand-alone algorithm, but also overcomes the drawback that a single site has limited coverage of the whole network. An additional advantage is that common PageRank spamming techniques have difficulty deceiving SiteRank.

4.1.3 Page-Block-Granularity Analysis Algorithms

A page often contains multiple links to other pages, and only some of these links point to topic-related pages or indicate high importance through their anchor text. The PageRank and HITS algorithms, however, do not distinguish between these links, so page analysis is often disturbed by noise links such as advertisements. The basic idea of block-level link analysis is to divide a Web page into different page blocks with the VIPS page segmentation algorithm, and then build the page-to-block and block-to-page link matrices for these blocks, denoted X and Z respectively. The block-level PageRank on the page-to-page graph is then computed from W_p = X × Z, and BlockRank on the block-to-block graph from W_b = Z × X. Block-level PageRank and HITS algorithms have already been implemented, and experiments show that their efficiency and accuracy are better than those of the traditional algorithms.
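Here is a small sketch of the two matrix products described above. The matrix values and the two-page/three-block example are made up for illustration; how X and Z are actually weighted (for example, by block size or link position) depends on the segmentation algorithm and is not specified in the article.

    import numpy as np

    # X: page-to-block matrix, X[p, b] > 0 when block b appears in page p
    #    (weighted here by the block's share of the page, a made-up choice).
    # Z: block-to-page matrix, Z[b, p] > 0 when a link inside block b points to page p.
    X = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.4, 0.6]])   # 2 pages x 3 blocks
    Z = np.array([[0.0, 1.0],
                  [0.5, 0.5],
                  [1.0, 0.0]])        # 3 blocks x 2 pages

    W_p = X @ Z   # page-to-page graph used for block-level PageRank (2 x 2)
    W_b = Z @ X   # block-to-block graph used for BlockRank (3 x 3)
    print(W_p)
    print(W_b)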

4.2 Web Page Analysis Algorithms Based on Web Content

Web page analysis algorithms based on page content evaluate pages using features of the content itself (text, data, and other resources). Page content has developed from the original hypertext to pages dominated by dynamic data (also called the hidden Web), whose volume is estimated at 400 to 500 times that of the directly visible page data (PIW, Publicly Indexable Web). At the same time, network resources such as multimedia data and Web services are becoming increasingly rich and diverse. Content-based page analysis algorithms have therefore developed from the originally rather simple text retrieval methods into comprehensive applications covering Web data extraction, machine learning, data mining, semantic understanding, and other methods. According to the different forms of Web page data, this section divides content-based page analysis into three categories: the first is for unstructured, or very simply structured, pages dominated by text and hyperlinks; the second is for pages generated dynamically from structured data sources (such as an RDBMS), whose data cannot be accessed directly in bulk; the third is for data that lies between the first two categories, has a good structure, follows a certain pattern or style in its display, and can be accessed directly.

4.2.1 Text-Based Web Page Analysis Algorithms

1 Pure-text classification and clustering algorithms

These algorithms largely borrow techniques from text retrieval. Text analysis can classify and cluster Web pages quickly and efficiently, but because it completely ignores the link structure between pages and the structure inside a page, it is rarely used alone.
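As an illustration of a pure-text relevance measure, here is a small sketch that scores a page by the cosine similarity between term-frequency vectors. The topic description and the sample text are made up; a real system would typically use TF-IDF weighting and proper tokenization.

    import math
    import re
    from collections import Counter

    def term_vector(text):
        """Bag-of-words term-frequency vector for a page's visible text."""
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def cosine(v1, v2):
        """Cosine similarity between two term-frequency vectors."""
        dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
        norm = math.sqrt(sum(c * c for c in v1.values())) * \
               math.sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    # Score a page purely by textual similarity to a topic description;
    # link structure and in-page layout are ignored, as noted above.
    topic = term_vector("focused web crawler search engine page ranking")
    page = term_vector("A web crawler downloads pages for a search engine index.")
    print(cosine(topic, page))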
2 Hypertext classification and clustering algorithms
