http://blog.csdn.net/zolalad/article/details/16344661
Hadoop-Based Distributed Web Crawler Technology: Learning Notes
1. The Principle of the Web Crawler
The function of a web crawler system is to download web page data and provide the data source for a search engine system. Many large-scale web search engine systems, such as Google and Baidu, are search engines based on web data collection, which shows how important the web crawler system is to a search engine. Besides the text that users read, a web page also contains hyperlink information. A web crawler system follows the hyperlinks in web pages to reach further pages on the network. Because this collection process resembles a crawler or spider roaming over the web, it is called a web crawler system or web spider system (in English, Spider or Crawler).
2. How a Web Crawler System Works
A web crawler system generally selects, as its seed URL set, the URLs of relatively important sites with a high out-degree (number of links contained in the page). The crawler takes these seeds as its initial URLs and starts fetching data from them. Since web pages contain link information, new URLs are obtained from the URLs of pages already fetched; the link structure between pages can be viewed as a forest in which the page of each seed URL is the root of one tree. In this way, the web crawler system can traverse all web pages using either a breadth-first or a depth-first algorithm. Because a depth-first search may trap the crawler deep inside one site and is not good at reaching page information close to a site's home page, breadth-first search is generally used to collect pages. The crawler first puts the seed URLs into the download queue, then takes a URL from the head of the queue and downloads the corresponding page. After the page content is stored, the link information in the page is parsed to obtain new URLs, which are appended to the download queue. Then the next URL is taken out, its page is downloaded and parsed, and so on, until the crawler has traversed the entire network or some stopping condition is met.
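As a minimal sketch of this queue-driven process (the `downloadPage`, `extractLinks`, and `storePage` helpers below are hypothetical placeholders for the fetch, parse, and storage steps, not code from the original text):

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SimpleBfsCrawler {

    // Hypothetical helpers standing in for the download, parse, and storage modules.
    static String downloadPage(String url) { /* fetch HTML over HTTP */ return ""; }
    static List<String> extractLinks(String html) { /* parse <a href> links */ return List.of(); }
    static void storePage(String url, String html) { /* persist the page content */ }

    public static void crawl(List<String> seeds, int maxPages) {
        Queue<String> toCrawl = new ArrayDeque<>(seeds);   // download queue, seeded with the seed URLs
        Set<String> visited = new HashSet<>(seeds);        // avoids downloading the same URL twice
        int crawled = 0;

        while (!toCrawl.isEmpty() && crawled < maxPages) {
            String url = toCrawl.poll();          // take the URL at the head of the queue
            String html = downloadPage(url);      // download the corresponding page
            storePage(url, html);                 // store the page content
            crawled++;

            for (String link : extractLinks(html)) {   // parse out new URLs
                if (visited.add(link)) {               // only enqueue URLs not seen before
                    toCrawl.add(link);                 // append to the end of the queue
                }
            }
        }
    }
}
```

Because new links are appended to the end of the queue, this loop visits pages in breadth-first order, which is the behavior the text recommends.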
Crawl strategies:
In a crawler system, the queue of URLs to be crawled is an important component. The order of the URLs in this queue is also an important issue, because it decides which page is fetched first and which later. The method used to determine this order is called the crawl strategy. Several common crawl strategies are described below:
1. Depth-first traversal strategy
With the depth-first traversal strategy, the web crawler starts from a start page and follows one chain of links to its end, then returns and enters the next start page, continuing to follow links from there. Taking the example graph of pages A–I from the original figure:
Traversal path: A → F → G → E → H → I → B → C → D
2. Breadth-first traversal strategy
The basic idea of the breadth-first (width-first) traversal strategy is to insert the links found in a newly downloaded page directly at the end of the URL queue to be crawled. In other words, the web crawler first crawls all pages linked from the start page, then selects one of those linked pages and continues to crawl all pages linked from it. Using the same example graph:
Traversal path: A → B → C → D → E → F → G → H → I
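To make the difference between the two orders concrete, here is a small sketch that prints both traversals for a hypothetical link graph of pages numbered 1–7 (this graph is an illustration chosen for this note, not the figure from the original post):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class TraversalOrders {

    // Hypothetical link graph: page -> pages it links to.
    static final Map<String, List<String>> LINKS = Map.of(
            "1", List.of("2", "3", "4"),
            "2", List.of("5"),
            "3", List.of(),
            "4", List.of("6"),
            "5", List.of("7"),
            "6", List.of(),
            "7", List.of());

    static Set<String> bfs(String start) {
        Set<String> order = new LinkedHashSet<>();
        Queue<String> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            String page = queue.poll();
            if (order.add(page)) {
                queue.addAll(LINKS.getOrDefault(page, List.of()));  // new links go to the end of the queue
            }
        }
        return order;
    }

    static Set<String> dfs(String start) {
        Set<String> order = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>(List.of(start));
        while (!stack.isEmpty()) {
            String page = stack.pop();
            if (order.add(page)) {
                List<String> out = LINKS.getOrDefault(page, List.of());
                for (int i = out.size() - 1; i >= 0; i--) {
                    stack.push(out.get(i));            // follow one chain of links before backtracking
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println("BFS: " + bfs("1"));  // 1, 2, 3, 4, 5, 6, 7
        System.out.println("DFS: " + dfs("1"));  // 1, 2, 5, 7, 3, 4, 6
    }
}
```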
3. Backlink-count strategy
The backlink count of a page is the number of other pages that link to it. It indicates the degree to which the page's content is recommended by others. Many search engine crawl systems therefore use this metric to evaluate the importance of a page and thus decide the order in which different pages are crawled.
In a real network environment, because of advertising links and spam links, the backlink count cannot simply be equated with a page's importance. Search engines therefore tend to count only the backlinks they consider reliable.
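A minimal sketch of this strategy, assuming the (trusted) backlink counts have already been computed elsewhere; the `backlinkCount` map is a hypothetical input, not something defined in the original text:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class BacklinkPriorityQueue {

    // Hypothetical, pre-computed number of trusted backlinks per URL.
    private final Map<String, Integer> backlinkCount;

    // URLs with more backlinks are polled first.
    private final PriorityQueue<String> toCrawl;

    public BacklinkPriorityQueue(Map<String, Integer> backlinkCount, List<String> urls) {
        this.backlinkCount = backlinkCount;
        this.toCrawl = new PriorityQueue<>(
                Comparator.comparingInt((String u) -> backlinkCount.getOrDefault(u, 0)).reversed());
        this.toCrawl.addAll(urls);
    }

    public String nextUrl() {
        return toCrawl.poll();   // the URL with the highest backlink count
    }
}
```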
4. Partial PageRank strategy
The Partial PageRank algorithm borrows the idea of the PageRank algorithm: the pages that have already been downloaded, together with the URLs in the queue to be crawled, form a collection of web pages; the PageRank value of each page in this collection is computed; the URLs in the crawl queue are then sorted by PageRank value and crawled in that order.
Recomputing the PageRank values every time a single page is fetched would be too costly, so a compromise is to recompute them once every K pages fetched. This, however, raises a problem: the links extracted from downloaded pages, i.e. the "unknown" pages mentioned earlier, have no PageRank value yet. To solve this, these pages are given a temporary PageRank value: the PageRank contributions passed in through all of the page's in-links are summed up to form the temporary PageRank value of the unknown page, which then takes part in the sorting.
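A sketch of how such a temporary value might be computed for an uncrawled URL, under the simplifying assumption that the PageRank values and out-degrees of the downloaded pages are available; all names here are illustrative:

```java
import java.util.List;
import java.util.Map;

public class PartialPageRank {

    /**
     * Temporary PageRank of an uncrawled URL: the sum of the contributions passed in
     * through its known in-links, i.e. PR(p) / outDegree(p) for each downloaded
     * page p that links to it.
     */
    static double temporaryRank(String url,
                                Map<String, List<String>> inLinks,     // url -> downloaded pages linking to it
                                Map<String, Double> pageRank,          // PageRank of downloaded pages
                                Map<String, Integer> outDegree) {      // out-degree of downloaded pages
        double rank = 0.0;
        for (String p : inLinks.getOrDefault(url, List.of())) {
            rank += pageRank.getOrDefault(p, 0.0) / Math.max(1, outDegree.getOrDefault(p, 1));
        }
        return rank;   // used only to order the crawl queue until the real PageRank is recomputed
    }
}
```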
5. OPIC strategy (Online Page Importance Computation)
This algorithm, too, essentially scores the importance of pages. Before the algorithm starts, every page is given the same initial amount of cash. After a page P is downloaded, P's cash is distributed among all the links extracted from P and P's cash is cleared. All pages in the URL queue to be crawled are then sorted by the amount of cash they hold.
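A minimal sketch of the cash bookkeeping described above (names and structure are illustrative, not the original OPIC implementation):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OpicCash {

    private final Map<String, Double> cash = new HashMap<>();

    /** Before the algorithm starts, every known page gets the same initial amount of cash. */
    void addPage(String url, double initialCash) {
        cash.putIfAbsent(url, initialCash);
    }

    /** After downloading page p, split p's cash among its out-links and clear p's cash. */
    void onPageDownloaded(String p, List<String> outLinks) {
        double share = cash.getOrDefault(p, 0.0) / Math.max(1, outLinks.size());
        for (String link : outLinks) {
            cash.merge(link, share, Double::sum);   // each out-link receives its share of p's cash
        }
        cash.put(p, 0.0);                           // clear p's cash
    }

    /** URLs in the queue to be crawled are ordered by the amount of cash they hold. */
    double cashOf(String url) {
        return cash.getOrDefault(url, 0.0);
    }
}
```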
6. Large-site priority strategy
All pages in the URL queue to be crawled are classified according to the site they belong to, and sites with more pages waiting to be downloaded are downloaded first. This strategy is therefore called the large-site (major station) priority strategy.
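A minimal sketch of this grouping, assuming a URL's site is identified by its host (the `hostOf` helper is illustrative):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LargeSiteFirst {

    // Illustrative helper: derive the site (host) a URL belongs to.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    /** Group pending URLs by site; sites with more pages waiting to download come first. */
    static List<List<String>> sitesByPendingCount(Collection<String> pendingUrls) {
        Map<String, List<String>> bySite = new HashMap<>();
        for (String url : pendingUrls) {
            bySite.computeIfAbsent(hostOf(url), h -> new ArrayList<>()).add(url);
        }
        return bySite.values().stream()
                .sorted(Comparator.comparingInt((List<String> urls) -> urls.size()).reversed())
                .collect(Collectors.toList());
    }
}
```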
3. The Basic Structure of a Web Crawler System
From the basic principles of the web crawler system introduced above, a general crawler system can be divided into six modules, which together form its basic structure:
(1) Configuration module: allows users to configure the crawler system through configuration files, for example the depth (number of layers) of pages to download, the number of threads used for multi-threaded crawling, the time interval between fetching two pages from the same site, and regular expressions restricting which URLs may be crawled.
(2) Visited-URL recognition module: because the URL of a web page may be extracted many times, the crawler must have this module to filter out pages that have already been crawled and prevent the same page from being downloaded repeatedly.
(3) Robots protocol module: when the crawler collects from a website for the first time, it must first fetch robots.txt to learn which directories it is asked not to access. It can also use a page's meta information to determine which pages the server declares must not be indexed or followed, and then access only the pages that may be indexed.
(4) Web capture module: performs the actual fetching of web pages; it establishes a connection to the server via the URL and then retrieves the page content (see the sketch below).
(5) Web page parsing module: extracts links from the downloaded pages and puts the extracted URLs into the download queue.
(6) Web page storage module: stores the downloaded pages, in a certain organization, on the local server or in a distributed file system for subsequent processing by the search engine modules.
The above basic structure is what every web crawler system must have. In practice, different crawler systems combine these modules differently, which leads to different system structures.
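As an illustration of the web capture module (4), here is a minimal fetch sketch using Java's standard HttpURLConnection; the user-agent string and timeout values are arbitrary examples, not settings from the original system:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    /** Establish a connection to the server via the URL and return the page content. */
    public static String fetch(String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", "ExampleCrawler/0.1");  // illustrative value
        conn.setConnectTimeout(5000);    // example timeouts, in milliseconds
        conn.setReadTimeout(10000);

        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return html.toString();
    }
}
```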
4. How a Distributed Web Crawler Works
The sections above describe two issues that must be considered when designing a centralized crawler system: its core working principle and its core basic structure. Both must also be considered in a distributed crawler system, because a distributed web crawler can be regarded as a combination of multiple centralized web crawler systems. Building on the core working principle and core structure of the centralized crawler above, this section describes how a distributed web crawler works.
A distributed crawler system runs on a cluster of machines. Each node of the cluster is a centralized crawler whose working principle is the same as that of a centralized crawler system; these centralized crawlers are controlled by a master node of the distributed system so that they work together. Because a distributed crawler system requires multiple nodes to cooperate, the nodes must communicate with each other to exchange information, so network communication is the key to building a distributed crawler system. And because a distributed crawler system can use many nodes to fetch pages, its efficiency is much higher than that of a centralized crawler system.
There are many possible architectures for distributed crawler systems, and many ways of working and of storing data, but a typical distributed crawler system adopts a master-slave architecture: one master node controls all the slave nodes that perform crawl tasks and is responsible for assigning URLs so that the load of all nodes in the cluster is balanced. As for storage, a popular approach is to save the crawled pages in a distributed file system, which makes managing data across multiple nodes more convenient; typically the distributed file system used is Hadoop's HDFS.
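One common way for the master node to assign URLs while keeping the load balanced is to partition them by host; the sketch below is an assumed illustration of such an assignment, not a description of any specific system:

```java
public class UrlPartitioner {

    private final int numSlaveNodes;

    public UrlPartitioner(int numSlaveNodes) {
        this.numSlaveNodes = numSlaveNodes;
    }

    /**
     * Assign a URL's host to a slave node. Hashing on the host keeps all pages of one
     * site on the same node (useful for politeness and per-site state) while spreading
     * the hosts roughly evenly over the cluster.
     */
    public int nodeFor(String host) {
        return Math.floorMod(host.hashCode(), numSlaveNodes);
    }
}
```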
5. Research Status of Distributed Web Crawlers
At present, the most successful distributed web crawler systems are mainly used inside search engine companies such as Google and other commercially strong companies. However, these companies have not disclosed the technical details of their distributed crawlers, and the application of cloud computing in this area is still in its infancy. Well-known distributed web crawlers include Mercator, UbiCrawler, WebFountain, and the Google Crawler.
6. Basic Architecture of a Web-Based Search Engine System
A complete "distributed information acquisition and retrieval platform" (a web-based search engine system) can be broadly divided into five modules, each of which corresponds to one or more Hadoop MapReduce tasks. The five modules are: the distributed acquisition module (the crawler), the distributed analysis module, the distributed index module, the distributed retrieval module, and the user query module.
First, the distributed acquisition module is responsible for crawling web pages; this part is carried out by several MapReduce processes. The crawled pages are initially preprocessed and stored in the distributed file system (HDFS), forming the raw page repository. Second, the distributed analysis module is responsible for analyzing the pages in the raw page repository, mainly via the word segmenter provided by the text parser; the segmentation results are submitted to the distributed index module, and the analysis module also analyzes the queries submitted by users. Third, the distributed index module is responsible for keyword frequency analysis and for building the inverted index: after keyword analysis produces an index dictionary, the indexer creates the inverted index, and the resulting index library is saved in HDFS. This module, too, consists of several MapReduce processes. In addition, the distributed retrieval module is responsible for looking up the query in the index library and returning the result data set to the user. Finally, the user query module handles the interaction between the user and the search engine: the user submits a query to the distributed retrieval module, and the retrieval module returns the result set to the user according to certain ranking rules.
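As an illustration of the distributed index module, here is a minimal Hadoop MapReduce sketch that builds an inverted index from (URL, tokenized text) pairs; the input format and the whitespace tokenization are simplifying assumptions rather than details given in the original text:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    /** Input: (url, space-separated terms). Output: (term, url) for every term on the page. */
    public static class TermMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text url, Text terms, Context context)
                throws IOException, InterruptedException {
            for (String term : terms.toString().split("\\s+")) {
                if (!term.isEmpty()) {
                    context.write(new Text(term), url);
                }
            }
        }
    }

    /** Output: (term, list of URLs containing the term), i.e. one posting list per term. */
    public static class PostingReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> urls, Context context)
                throws IOException, InterruptedException {
            Set<String> postings = new HashSet<>();
            for (Text url : urls) {
                postings.add(url.toString());
            }
            context.write(term, new Text(String.join(",", postings)));
        }
    }
}
```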
7. Structural Design of the Distributed Crawler System
7.1 Basic Crawler Process Design
The basic flow of the crawler system is described in detail below (a sketch of the controlling loop follows the steps):
(1) First, the seed URL file is selected from the local file system and uploaded to the "in" folder of the Hadoop cluster's distributed file system HDFS; this folder always holds the URLs to be crawled at the current layer. The number of crawled layers is also set to 0.
(2) Check whether the queue to be crawled in the "in" folder is empty. If it is, jump to step (7); otherwise, go to step (3).
(3) Fetch the URLs in the queue in the "in" folder. This crawling is done by the CrawlerDriver module, a Hadoop-based MapReduce process whose implementation is described in detail later in this article. The crawled pages are stored in the "doc" folder on HDFS, which holds the unprocessed pages of every layer.
(4) Parse the crawled pages, extracting the outgoing links from the pages in the "doc" folder. This parsing is done by the ParserDriver module, also a MapReduce process; the actual extraction is done with an HTML parser in the map phase, and the module's MapReduce implementation is described in detail later. The extracted links are saved in the "out" folder on HDFS, which always holds the outgoing links resolved at the current layer.
(5) Optimize the parsed links by filtering out URLs that have already been crawled, i.e. remove already-fetched URLs from the outgoing links in the "out" folder. This optimization is done by the OptimizerDriver module, also a MapReduce process whose Hadoop-based implementation is described in detail later. The filtered, optimized URL set is saved in the "in" folder, waiting for the next round of crawling.
(6) Check whether the number of crawled layers is less than the configured depth. If it is, add 1 to the number of crawled layers and return to step (2); otherwise go to step (7).
(7) Merge and deduplicate: merge the pages crawled at each layer while removing pages that were crawled repeatedly. This work is done by the MergeDriver module, also a MapReduce process developed on Hadoop and described in detail later. After merging, the results are still stored in the "doc" folder on HDFS.
(8) Perform simple preprocessing on the crawled pages, converting the HTML code into XML. This is done by the HtmlToXmlDriver module, also a Hadoop-based MapReduce process. The resulting XML files are stored in the "xml" folder on HDFS.
(9) End.
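The sketch below shows how a controlling loop could tie these steps together; the `JobDriver` interface and the folder names are placeholders that mirror the module and folder names above, not actual code from the original system:

```java
public class CrawlerJobFlow {

    /** Hypothetical drivers, each wrapping one of the Hadoop MapReduce jobs described above. */
    interface JobDriver { void run(String inputDir, String outputDir) throws Exception; }

    public static void crawl(int depth,
                             JobDriver crawlerDriver,    // in  -> doc   (step 3)
                             JobDriver parserDriver,     // doc -> out   (step 4)
                             JobDriver optimizerDriver,  // out -> in    (step 5)
                             JobDriver mergeDriver,      // doc -> doc   (step 7)
                             JobDriver htmlToXmlDriver,  // doc -> xml   (step 8)
                             java.util.function.Supplier<Boolean> inFolderIsEmpty) throws Exception {
        int layer = 0;                                       // step 1: seeds already uploaded to "in"
        while (layer < depth && !inFolderIsEmpty.get()) {    // steps 2 and 6: loop conditions
            crawlerDriver.run("in", "doc");                  // step 3: download the current layer
            parserDriver.run("doc", "out");                  // step 4: extract outgoing links
            optimizerDriver.run("out", "in");                // step 5: drop already-crawled URLs
            layer++;
        }
        mergeDriver.run("doc", "doc");                       // step 7: merge layers, remove duplicates
        htmlToXmlDriver.run("doc", "xml");                   // step 8: convert HTML to XML
    }
}
```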
From the above flow we can see that the whole crawler system is divided into five parts; each part is a separate module that performs its own function, and each module corresponds to a MapReduce process. The functions of the five modules are:
(1) CrawlerDriver module: downloads the queue of URLs to be crawled in parallel, taking the text file in the "in" folder as the seed set of URLs to crawl. In the first round this text file contains the initial seeds given by the user; from the second round onward it contains the outgoing links extracted in the previous round. The module is a Hadoop-based MapReduce process in which the map and reduce phases have different functions: the actual downloading is done in the reduce phase, using multi-threaded downloads implemented with Java network programming. The downloaded pages are saved in the "doc" folder on HDFS.
(2) ParserDriver module: analyzes the downloaded pages in parallel and extracts links. Based on the downloaded pages in the "doc" folder, it extracts the outgoing links of each page. The module is also a Hadoop-based MapReduce process, but it needs only a map phase to achieve its goal. The main work of the map phase is to parse out links with an HTML parser; rules also restrict the types of URLs that are extracted and prevent the extracted links from pointing to other sites. Finally, the links are saved in the "out" folder on HDFS.
(3) OptimizerDriver module: optimizes the outgoing links in parallel, filtering out duplicate links. Based on the extracted links in the "out" folder, it removes the URLs that have already been crawled and leaves the rest for the next layer of processing. Because the relationship between the layers of a site is a graph structure, the work of this module can be understood as finding cycles and filtering out the URLs that would form them. This module is also a Hadoop-based MapReduce process. The optimized URLs are stored in the "in" folder on HDFS.
(4) MergeDriver module: merges the pages crawled at each layer in parallel. It merges the pages from each layer in the "doc" folder and removes pages that may be duplicated across layers (a minimal sketch follows this list). This part is also a Hadoop-based MapReduce process. Finally, the results are still stored in the "doc" folder.
(5) HtmlToXmlDriver module: converts HTML to XML in parallel. Based on the pages crawled into the "doc" folder, it performs the conversion as a preprocessing step, using a DOM tree. It is also a MapReduce process. The converted XML is saved in the "xml" folder on HDFS.
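As a minimal sketch of the MergeDriver idea, assuming the crawled pages are stored as (URL, HTML) pairs: grouping by URL in MapReduce makes deduplication a one-line reduce (an illustration, not the original implementation):

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MergePages {

    /** Identity map: pass every (url, html) pair through so that equal URLs meet in one reduce call. */
    public static class PageMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text url, Text html, Context context)
                throws IOException, InterruptedException {
            context.write(url, html);
        }
    }

    /** All copies of one URL arrive together; keep a single copy of the page. */
    public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text url, Iterable<Text> copies, Context context)
                throws IOException, InterruptedException {
            context.write(url, copies.iterator().next());
        }
    }
}
```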
In this way, the five functional modules make up a Hadoop-based distributed crawler system. CrawlerDriver, ParserDriver, and OptimizerDriver are executed in a loop, starting from the queue of URLs to be crawled, to complete the crawl of each layer; after the loop exits, MergeDriver and HtmlToXmlDriver perform the preprocessing work. The number of loop iterations is controlled by the preset "crawl depth" parameter and by whether the queue to be crawled is empty.
7.2 Framework Design of the Crawler System
The crawler system has four storage structures: the URL library to be crawled, the raw page library, the outgoing-URL library, and the XML library. All four are stored on Hadoop's distributed file system, HDFS. They are described in detail below:
(1) URL library to be crawled: the set of URLs that the current layer needs to crawl. It is actually a text file recording the URLs to be crawled, separated by "\n". Before the first layer is crawled, this text file is the user-submitted URL seed set that serves as the crawler's entry point into the Internet.
(2) Raw page library: stores the raw pages crawled at each layer. The pages here are unprocessed HTML, stored as key/value pairs in which the key is the URL and the value is the HTML page corresponding to that URL.
(3) Outgoing-URL library: stores the links parsed out at each layer, as key/value pairs in which the key is the URL and the value is the set of outgoing links contained in the corresponding page.
(4) XML library: stores the XML obtained by converting the pages crawled at all layers; the conversion is equivalent to preprocessing the HTML. It is stored as key/value pairs in which the key is the URL and the value is the XML of the corresponding page.
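One natural way to realize this (key = URL, value = page) layout on HDFS is a Hadoop SequenceFile; the sketch below shows how the raw page library could be written under that assumption, with an illustrative path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RawPageStore {

    /** Append (url, html) pairs to a SequenceFile under the "doc" folder on HDFS. */
    public static void storeLayer(Iterable<String[]> urlHtmlPairs, String layerFile) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster's HDFS settings
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("doc/" + layerFile)),   // illustrative path
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (String[] pair : urlHtmlPairs) {
                writer.append(new Text(pair[0]), new Text(pair[1]));      // key = URL, value = HTML
            }
        }
    }
}
```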
The five functional modules above perform their different functions in parallel across multiple machines, and these four storage structures hold the results produced by each of them.