Java-based distributed Crawler

A distributed web crawler contains multiple crawlers, each of which performs the same tasks as a single crawler: it downloads webpages from the Internet, saves them to local disk, extracts URLs from them, and continues crawling along those URLs. Because parallel crawlers need to split the download work, a crawler may send URLs it extracts to other crawlers. These crawlers may run in the same LAN or in different geographical locations.
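The per-crawler loop described above hinges on extracting URLs from a downloaded page so they can be queued for further crawling. A minimal sketch of that step is below; the class and method names are illustrative, not taken from the project, and a production crawler would use a real HTML parser rather than a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Matches href="http..." attributes; a regex is enough for a sketch,
    // but real crawlers should parse HTML properly.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"(http[^\"]+)\"");

    // Returns every absolute link found in the page source.
    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/a\">A</a>"
                    + "<a href=\"http://example.com/b\">B</a>";
        // Prints [http://example.com/a, http://example.com/b]
        System.out.println(extractUrls(page));
    }
}
```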

Depending on their degree of distribution, distributed crawlers fall into two categories:

1. LAN-based distributed web crawler: All crawlers run in the same LAN and communicate with each other over high-speed network connections. They access the Internet and download webpages through the same network, so all network load is concentrated at the egress of that LAN. Because LAN bandwidth is high, communication between crawlers is efficient; however, the total bandwidth of the network egress is fixed, so the number of crawlers is limited by the LAN's egress bandwidth.

2. WAN-based distributed web crawler: When the crawlers of a parallel crawler run in different geographical or network locations, we call it a distributed crawler. For example, the crawlers may be located in China, Japan, and the United States, each responsible for downloading webpages from its own region; or in CHINANET, CERNET, and CEINET, each downloading webpages from its own network. The advantage of this design is that it spreads network traffic out, reducing the load at any single network egress. However, when crawlers are distributed across geographical or network locations, communication latency becomes a real concern: the bandwidth between crawlers may be limited, since they generally have to communicate over the Internet.
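In both categories, the crawlers have to agree on how to split the download work. One common scheme is to hash each URL's host and assign it to one of N crawler nodes; hashing by host rather than full URL keeps a whole site on one node, which also helps with per-site politeness limits. The sketch below illustrates this idea; the class name is an assumption, not from the project.

```java
import java.net.URI;

public class UrlPartitioner {
    private final int nodeCount;

    public UrlPartitioner(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    // Returns the index of the crawler node responsible for this URL.
    // All URLs on the same host hash to the same node.
    public int nodeFor(String url) {
        String host = URI.create(url).getHost();
        return Math.floorMod(host.hashCode(), nodeCount);
    }

    public static void main(String[] args) {
        UrlPartitioner p = new UrlPartitioner(3);
        // Two pages on the same host always land on the same node.
        System.out.println(p.nodeFor("http://example.com/page1")
                == p.nodeFor("http://example.com/page2")); // true
    }
}
```

A crawler that extracts a URL belonging to another node would forward it there instead of fetching it itself.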

Architecture of large-scale distributed Web Crawler

A distributed web crawler is a very complex system in which many factors must be considered. Performance is one of the most important indicators, and adequate hardware resources are also required.

Architecture

The following is the overall architecture of the project; the first version is built on this design.

The web layer at the top includes the console, basic permissions, and the monitoring display; it can be extended further as needed.

In the core layer, the Controller performs central scheduling: it sends tasks to the workers in the worker queue for crawling. Each node dynamically reports its module status and other information to the monitoring module, which displays it at the presentation layer.
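The Controller/worker-queue scheduling described above can be sketched in a single process with a blocking queue: the controller puts crawl tasks on a shared queue, worker threads take them, and each completed task is reported to a monitoring counter. This is an illustrative stand-in under my own naming, not the project's actual implementation, which distributes these roles across nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class ControllerSketch {
    // Dispatches taskCount crawl tasks to workerCount workers and
    // returns how many were completed (the monitoring stand-in).
    static int runCrawl(int taskCount, int workerCount) {
        BlockingQueue<String> taskQueue = new LinkedBlockingQueue<>();
        AtomicInteger completed = new AtomicInteger();

        Runnable worker = () -> {
            try {
                while (true) {
                    String url = taskQueue.take();
                    if (url.equals("STOP")) break;   // poison pill shuts the worker down
                    // ... fetch and parse url here ...
                    completed.incrementAndGet();     // report to the monitor
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread t = new Thread(worker);
            t.start();
            workers.add(t);
        }
        try {
            // Controller: dispatch tasks, then stop every worker.
            for (int i = 0; i < taskCount; i++) {
                taskQueue.put("http://example.com/" + i);
            }
            for (int i = 0; i < workerCount; i++) {
                taskQueue.put("STOP");
            }
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed=" + runCrawl(10, 2)); // completed=10
    }
}
```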

Project objectives

Zongtui: an open-source take on Toutiao-style content push!

A distributed web crawler built on Hadoop-style thinking.

Fourinone, jeesite, and webmagic have been integrated and further improved. The first-stage goal is a dynamically configurable, designer-based distributed crawler system.

Current Project Status

Current Project Progress:

1. Sourceer, which can access multiple data sources (the interface has been defined and encapsulated as a builder; it can be used as a simple crawler).

2. The web architecture project has been uploaded and tested successfully (permissions, basic framework transformation, and import have been recorded as videos; activiti and cms have been removed).

3. Distributed framework research (the distributed project has been split into packages, some comments added, and single-host crawling tested).

4. Plug-in integration.

5. Various de-duplication methods and algorithms (Bloom filter and fingerprint de-duplication for articles have been implemented, along with simhash and the ansj word-segmentation algorithm).

6. Bayes classifier (single-host text classification tested successfully).
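Of the de-duplication methods listed in item 5, the Bloom filter is the simplest to illustrate: a few hash functions over a bit set, where false positives are possible but false negatives are not, so a URL reported as "new" is definitely new. The sketch below uses two hash functions and my own naming; the project's actual filter lives in zongtui-filter.

```java
import java.util.BitSet;

public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;

    public UrlBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple, cheap hash functions; real filters use more and better ones.
    private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
    private int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 7, size); }

    // Returns true if the URL was definitely not seen before, and records it.
    // May return false for a genuinely new URL (a false positive).
    public boolean addIfNew(String url) {
        boolean possiblySeen = bits.get(h1(url)) && bits.get(h2(url));
        bits.set(h1(url));
        bits.set(h2(url));
        return !possiblySeen;
    }
}
```

A crawler would call addIfNew before enqueuing each extracted URL, skipping any that the filter has (probably) already seen.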

Project addresses:

Distributed crawler: http://git.oschina.net/zongtui/zongtui-webcrawler

De-duplication filter: https://git.oschina.net/zongtui/zongtui-filter

Text classifier: https://git.oschina.net/zongtui/zongtui-classifier

Documentation: https://git.oschina.net/zongtui/zongtui-doc

Project interface:

Start Jetty to run it. The UI skin has not been changed yet.

Summary

The project is still being improved. Comments and feedback are welcome!
