Design and Implementation of a ProActive-Based Distributed Parallel Web Spider


 

Summary: Because the Internet holds a massive and rapidly growing volume of information, it is important to increase the speed at which a search engine's information collector, the web spider, gathers and updates data. This article designs and implements a distributed parallel web spider named P-spider using the active object technology, grid parallel computing technology, and automatic deployment mechanism provided by the ProActive distributed parallel computing middleware. Experiments show that the spider is easy to manage and deploy and achieves a higher collection rate than a multi-threaded web spider.

Keywords: web spider, ProActive, parallel, distributed

0 Introduction

A web spider is a program that automatically downloads web pages; it is the component of a search engine responsible for collecting information. A spider traverses the Internet by following the links between web pages and downloads the information stored on the Internet to local storage, which allows the search engine to index the data by category. Because Internet information grows rapidly, a web spider must collect and update pages quickly. Multithreading on a single machine can improve the collection speed to a certain extent, but because the computing resources of a single machine are limited, so is the speedup that multithreading can deliver. A multi-host distributed parallel architecture increases the number of processors and network interfaces, which significantly improves the collection efficiency of the web spider compared with a single-host multi-thread architecture.

In distributed parallel computing, traditional MPI-based technology suffers from poor program portability and complicated configuration. If Java is used directly, there is still a large gap between multi-threaded and distributed Java applications, and it is difficult to reuse multi-threaded code to build distributed applications with, for example, Java RMI or Java IDL: to turn local objects into usable remote objects, programmers must make major changes to existing code, which places a heavy burden on them. ProActive middleware is a Java-based distributed parallel software package with good Java compatibility and object-oriented reusability; it can be used to design and develop distributed parallel programs that avoid these shortcomings. ProActive also provides interfaces to various grid middleware to ease deployment in a grid environment, which makes it especially suitable for developing a distributed parallel web spider.

Using the active object technology, grid parallel computing technology, and automatic deployment mechanism of the ProActive distributed parallel computing middleware, we designed and implemented a ProActive-based distributed parallel web spider named P-spider. Experiments show that the spider is easy to manage and deploy and has a higher collection rate than a multi-threaded web spider.

1 ProActive

ProActive is an open-source Java development kit for parallel, distributed, and concurrent computing, developed by a team led by Professor Denis Caromel at INRIA in France. It provides mobility and security within a unified framework, is part of the ObjectWeb consortium's open-source middleware, and has the following main features:

1.1 Active objects

The active object (AO) is the core of ProActive computing. An active object is composed of a primary object, a thread, and a queue of pending requests; the thread controls the activity of the active object and cooperates with the other active objects that have been deployed. Active objects extend standard objects with three properties: location transparency, activity transparency, and synchronization. Communication between active objects is asynchronous by default.

1.2 Asynchronous calls

ProActive implements asynchronous calls on active objects through future objects. A future object is automatically generated as the placeholder for the returned result when a method is called in ProActive. To synchronize, ProActive adopts a wait-by-necessity strategy: after a future object is created, the caller continues to execute; only when the future object's value is actually accessed does the caller automatically block and wait until the future receives its concrete value. When the value becomes available, the future is updated automatically.
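Wait-by-necessity can be illustrated with a standard-JDK analogy. The sketch below uses Java's CompletableFuture, not the ProActive API (class and method names here are ours): the caller receives a future immediately, keeps running, and blocks only at the moment the concrete value is needed.

```java
import java.util.concurrent.CompletableFuture;

public class WaitByNecessityDemo {
    // Simulates an asynchronous method call that returns a future immediately
    static CompletableFuture<Integer> asyncCall() {
        return CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(50); } catch (InterruptedException e) { }
            return 42;
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> future = asyncCall();
        // The caller has not blocked; it continues while the result is computed
        System.out.println("caller continues without blocking");
        // Blocking happens only here, when the concrete value is accessed:
        // this is the wait-by-necessity moment
        System.out.println("value = " + future.join());
    }
}
```

In ProActive the blocking point is implicit (any direct use of the future's value suspends the caller), whereas plain Java makes it explicit through `join()`.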

1.3 Typed groups

A typed group is a group of active objects of the same type. A group can be called like an ordinary object: typed group communication is built on ProActive's asynchronous remote method calls, so a single call invokes the method on every AO in the group, and if there is a return value, the result is also a group.
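The group-call semantics can be sketched in plain Java (an illustration of the idea, not the ProActive group API; the `Worker` class and its page counts are invented for this example): one logical call fans out to every member, and the return values are gathered into a result group.

```java
import java.util.List;
import java.util.stream.Collectors;

public class TypedGroupDemo {
    // A stand-in for an active object exposing one method
    static class Worker {
        private final int id;
        Worker(int id) { this.id = id; }
        int startWork() { return id * 10; }  // pretend page count
    }

    // A single "group call" dispatches the method to every member and
    // gathers the return values into a result group (here, a list)
    static List<Integer> groupCall(List<Worker> group) {
        return group.stream().map(Worker::startWork).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Worker> group = List.of(new Worker(1), new Worker(2), new Worker(3));
        System.out.println(groupCall(group));  // one call, a group of results
    }
}
```

In ProActive the dispatch is asynchronous and remote; this local, synchronous version only shows the one-call-to-many-members shape.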

1.4 Node deployment

When developing distributed applications, deploying computing nodes is often difficult. The ProActive development kit provides a powerful XML deployment descriptor for conveniently deploying computing nodes. A ProActive deployment file is an XML file consisting of three parts: componentDefinition, deployment, and infrastructure. It provides the mapping between virtual nodes (VirtualNode), Java Virtual Machines (JVMs), and nodes. ProActive obtains node deployment information from the deployment file at run time.

 

 

2 P-spider Design and Implementation

2.1 P-spider system framework

P-spider adopts a distributed parallel design. The system comprises one central node and several computing nodes: the central node runs the coordinator, and each computing node runs multiple crawlers (SpiderWorker). The coordinator is responsible for deploying the whole system and for managing and maintaining the URL queue; each SpiderWorker is responsible for collecting and parsing web pages and reporting the URLs it discovers. The coordinator communicates with the crawlers over a high-speed LAN. The overall system framework is shown in Figure 1.

 

The P-spider coordinator consists of two parts, SpiderCoordinator and SpiderWorkload, and each SpiderWorker is likewise designed as an active object.

SpiderCoordinator is responsible for system deployment and management. It creates the virtual node (VN) defined in the configuration file, creates a Java Virtual Machine (JVM) and a node on each computer, and then remotely creates multiple SpiderWorker active objects on each computing node. It defines all SpiderWorkers as a typed group and calls the group method to start them all; finally, after the collection task completes, it summarizes the statistics returned by each SpiderWorker. SpiderWorkload is responsible for maintaining the URL queue: it receives the URLs reported by SpiderWorkers, removes duplicates, and distributes URLs to the SpiderWorkers for collection. The URL queue uses a hash table to detect duplicates.
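The hash-based duplicate removal in the URL queue can be sketched as follows. This is our illustration of the idea described above, not the paper's code; the class and method names are invented. A HashSet remembers every URL ever reported, so a URL is enqueued at most once.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// A minimal sketch of the SpiderWorkload idea: a URL queue that uses a
// hash set to drop duplicate URLs before distributing them to workers.
public class UrlQueue {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();   // hash-based duplicate filter

    // Accept a URL reported by a worker; enqueue it only if never seen before
    public synchronized boolean report(String url) {
        if (!seen.add(url)) return false;   // duplicate: already queued or collected
        pending.add(url);
        return true;
    }

    // Hand the next URL to a worker, or null when the queue is empty
    public synchronized String dispatch() {
        return pending.poll();
    }
}
```

Note that `seen` only ever grows: a URL stays marked even after it is dispatched, so a page is never collected twice.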

Several SpiderWorker active objects are assigned to each computing node. A SpiderWorker downloads the page at each URL it is given, parses the page's HTML, and extracts the URLs it contains. The extracted links are completed into a predefined unified format (a URL in a page link can take multiple forms: it may be complete, including the protocol, site, and path; it may omit part of that content; or it may be a relative path). These URLs are then filtered, for example by removing URLs that contain "?". Finally, the worker counts the number of pages it has downloaded and reports the newly found URLs.
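The URL-completion and filtering step can be sketched with the standard library's URI resolution (an illustration under our own names, not the paper's implementation): relative links are resolved against the URL of the page they were found on, and URLs with a query string are dropped, as described above.

```java
import java.net.URI;

// A sketch of the SpiderWorker URL-completion step: resolve possibly-relative
// links against the page's base URL and filter out URLs containing "?".
public class UrlFormatter {
    // Returns the completed absolute URL, or null for links the spider should skip
    public static String format(String baseUrl, String link) {
        String absolute = URI.create(baseUrl).resolve(link).toString();
        if (absolute.contains("?")) return null;   // filter dynamic URLs
        return absolute;
    }
}
```

For example, `format("http://example.edu/news/", "a.html")` yields `http://example.edu/news/a.html`, while a link such as `list?page=2` is filtered out.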

Thanks to the properties of active objects, method calls on these objects take the same form as calls on ordinary objects. During a call, there is no need to consider which computer, JVM, or node the object resides on. Once the service policy is set (FIFO by default), there is no need to consider the specific implementation details; because an active object serves requests in order according to that policy, synchronization does not need to be handled explicitly.

2.2 Implementation of the main P-spider classes

The main algorithms of the Spider and SpiderWorker classes are outlined below:

```java
public class Spider {
    public void init() throws Exception {
        // Load the deployment descriptor; Spider.xml is the deployment file name
        ProActiveDescriptor pad =
            ProActive.getProactiveDescriptor("file:./descriptors/Spider.xml");
        // Start the JVMs described in the deployment file and create the nodes
        pad.activateMappings();
        // Turn this Spider object into an active object
        ProActive.turnActive(this);
        // Generate one SpiderWorker active object per parameter in params
        // and deploy them on the nodes specified in nodes; the result is
        // a typed group named spiderWorkGroup
        SpiderWorker spiderWorkGroup = (SpiderWorker) ProActiveGroup.newGroup(
                SpiderWorker.class.getName(), params, nodes);
        // Call startWork() on every active object in the group;
        // the return value pageCount is itself a group
        IntWrapper pageCount = spiderWorkGroup.startWork();
        // Block this thread until every member of the pageCount group has returned
        ProActiveGroup.waitAll(pageCount);
    }
}

public class SpiderWorker {
    public IntWrapper startWork() {
        int count = 0;                                   // number of pages downloaded
        String curUrl;
        // The nodes execute this loop in parallel
        while ((curUrl = dequeue(urlQueue)) != null) {
            Page page = downloadPage(formatUrl(curUrl)); // download the page
            List<String> foundUrls = extractUrls(page);  // URLs contained in the page
            reportUrl(foundUrls);                        // report newly found URLs
            count++;
        }
        // Asynchronous ProActive calls require a reifiable wrapper return type
        return new IntWrapper(count);
    }
}
```

2.3 Deployment of P-spider

In the componentDefinition section of the XML deployment file, P-spider defines the whole system under a VN named spidernode. The deployment section specifies the JVMs mapped to the VN: the central node and each computing node are each mapped to one JVM, and each JVM hosts one node. The infrastructure section sets the parameters of each JVM: the central node's JVM runs locally, while each computing node's JVM runs on a remote machine. A local process is defined for the local JVM with org.objectweb.proactive.core.process.JVMNodeProcess as its execution class, and an SSH remote process is defined for the JVM on each computing node; the remote process references the local process and specifies org.objectweb.proactive.core.process.ssh.SSHProcess as its execution class. The central node thus deploys the remote nodes through SSH.
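An illustrative sketch of what such a descriptor might look like is shown below. The three-section structure, the VN name spidernode, and the two process classes come from the description above; the remaining element names and attributes follow the general shape of ProActive 3.x descriptors from memory and are assumptions, not the paper's actual file (the hostname is a placeholder).

```xml
<ProActiveDescriptor>
  <componentDefinition>
    <virtualNodesDefinition>
      <virtualNode name="spidernode"/>
    </virtualNodesDefinition>
  </componentDefinition>
  <deployment>
    <mapping>
      <map virtualNode="spidernode">
        <jvmSet>
          <vmName value="CentralJvm"/>
          <vmName value="WorkerJvm1"/>
        </jvmSet>
      </map>
    </mapping>
    <jvms>
      <jvm name="CentralJvm">
        <creation><processReference refid="localProcess"/></creation>
      </jvm>
      <jvm name="WorkerJvm1">
        <creation><processReference refid="sshProcess1"/></creation>
      </jvm>
    </jvms>
  </deployment>
  <infrastructure>
    <processes>
      <!-- Local process: runs the central node's JVM on this machine -->
      <processDefinition id="localProcess">
        <jvmProcess class="org.objectweb.proactive.core.process.JVMNodeProcess"/>
      </processDefinition>
      <!-- SSH process: starts a JVM on a remote computing node -->
      <processDefinition id="sshProcess1">
        <sshProcess class="org.objectweb.proactive.core.process.ssh.SSHProcess"
                    hostname="compute-node-1">
          <processReference refid="localProcess"/>
        </sshProcess>
      </processDefinition>
    </processes>
  </infrastructure>
</ProActiveDescriptor>
```

Adding a computing node then amounts to adding one more jvm entry and one more sshProcess definition, which is the scalability property discussed next.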

This deployment approach is simple and flexible: as long as the program on the central node reads the deployment file, deployment of every node completes automatically, with no manual work on the computing nodes. To add or remove computing nodes, only the deployment file needs to be modified, which gives P-spider good scalability.

3 Experiment and Analysis

In its implementation, P-spider reuses Jeff Heaton's Bot package, a multi-threaded web spider for a single-host environment. To verify the effectiveness of the P-spider architecture and the effect of distributed parallelism, we ran a comparative experiment between a single-host multi-threaded spider and P-spider. The experimental environment was as follows: four computers, each with a 2.4 GHz CPU and MB of memory, connected to the Internet through a Mb LAN; the software environment was Red Hat Linux 9, JDK 1.5, and ProActive 3.1.

First, we used one computer to crawl the campus network with different numbers of threads, and computed the download speed and the number of URLs downloaded per second from the collected data. The test program was the one in the Bot package. Each experiment was repeated twice at different times and the results averaged, as shown in Table 1.

Then we used four computers, one as the central node running the coordinator and the other three as computing nodes running the crawlers, to crawl the campus network of Guangxi University, and monitored the system with IC2D, the graphical monitoring tool provided by ProActive; the result is shown in Figure 2. The experiment was repeated twice, at the same times as the single-host experiment, and the results averaged, as shown in Table 2.

 

 

Figure 2 IC2D monitoring of a P-spider run

The experimental results show that a single-host spider cannot significantly improve its collection efficiency by increasing the number of threads. This is because single-host multithreading is not truly parallel: the CPU power, memory, and other system resources of one machine are limited, and synchronized methods and exclusive resources become efficiency bottlenecks. The measured collection efficiency of the distributed parallel P-spider is 2.2 times that of the best single-host case, significantly higher than the multi-threaded spider. There are two main reasons: the increased number of processors and network interfaces, and the ProActive-based architecture, in which the asynchronous call mechanism implemented by active objects and future objects reduces thread waiting to a certain extent and improves P-spider's collection efficiency.

4 Summary

This article introduced the distributed parallel P-spider system that we designed and developed based on the features of the ProActive middleware. ProActive makes the design and development of P-spider simpler, more convenient, and more flexible, greatly reducing the cost of design and development. Experiments show that the ProActive-based P-spider significantly improves the collection efficiency of the web spider, and that its overall architecture is concise and effective.

 
