Search Engine Development (II): The Crawler


March 23 (Monday)

Sunny, with a southerly wind

At today's data group meeting it was reported that the crawler has been developed and is now being tried against the target sites. Because the websites we collect from are fixed and the crawl depth is only 3, the crawler does not need to be as powerful as a general-purpose crawler such as Heritrix.

The crawler is built mainly on two Java libraries, HttpClient and HtmlParser, and its architecture draws on the design outlined below.

I. Architecture Diagram

This search crawler framework mainly targets e-commerce websites, handling data crawling, analysis, storage, and indexing.

Crawler: responsible for crawling, parsing, and processing Web page content

Database: stores Web page information

Index: Fingerprint information for Web content

Task Queue: List of pages to crawl

Visited table: List of pages that have been crawled

Crawler monitoring platform: a Web platform used to start and stop the crawler and to manage the crawler, the task queue, and the visited table.

II. Crawler

1. Process

1) The Scheduler starts the crawler, and the TaskMaster initializes the TaskQueue

2) Worker threads take tasks from the TaskQueue

3) Each worker thread calls the Fetcher to crawl the Web page described by the Task

4) The worker thread passes the crawled page to the Parser for parsing

5) The parsed data is sent to the Handler for processing, which extracts the page's links and processes the page content

6) The VisitedTableManager checks whether each link extracted by the UrlExtractor has already been crawled; if not, the link is submitted to the TaskQueue
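
As a rough illustration of this flow, here is a minimal, self-contained sketch of the crawl loop. Plain Java collections stand in for the TaskQueue and the visited table, and fetch() and extractLinks() are placeholders for the Fetcher and Parser components described below; none of the names are the project's actual code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the crawl loop above. Plain collections stand in for the
// TaskQueue and visited table; fetch() and extractLinks() are placeholders for
// the Fetcher and Parser/UrlExtractor components described later.
public class CrawlLoopSketch {

    private static final int MAX_DEPTH = 3;  // crawl depth mentioned above

    private static class Task {
        final String url;
        final int depth;
        Task(String url, int depth) { this.url = url; this.depth = depth; }
    }

    private final Deque<Task> taskQueue = new ArrayDeque<>();  // task queue
    private final Set<String> visited = new HashSet<>();       // visited table

    public void crawl(String seedUrl) {
        taskQueue.add(new Task(seedUrl, 0));                   // 1) initialize the queue
        visited.add(seedUrl);
        Task task;
        while ((task = taskQueue.poll()) != null) {            // 2) take a task
            String html = fetch(task.url);                     // 3) fetch the page
            if (html == null) continue;                        //    fetch failed
            List<String> links = extractLinks(html);           // 4)+5) parse and handle
            if (task.depth >= MAX_DEPTH) continue;
            for (String link : links) {                        // 6) submit unseen URLs only
                if (visited.add(link)) {
                    taskQueue.add(new Task(link, task.depth + 1));
                }
            }
        }
    }

    // Placeholders for the real Fetcher and Parser described below.
    private String fetch(String url) { return null; }
    private List<String> extractLinks(String html) { return java.util.Collections.emptyList(); }
}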



2. Scheduler

The Scheduler is responsible for starting the crawler, calling the TaskMaster to initialize the TaskQueue, and creating a monitor thread that controls when the program exits.
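
A minimal sketch of what such a Scheduler could look like; the monitor thread and the commented-out initialization calls are illustrative assumptions, not the project's actual API.

import java.util.concurrent.CountDownLatch;

// Hypothetical Scheduler sketch: initialize the task queue via the TaskMaster,
// start the workers, and run a monitor thread that controls program exit.
public class Scheduler {
    private final CountDownLatch finished = new CountDownLatch(1);

    public void start() {
        // taskMaster.initQueue();   // 1) initialize the task queue (assumed call)
        // workerPool.start();       // 2) start the worker threads (assumed call)
        Thread monitor = new Thread(() -> {
            try {
                finished.await();    // 3) wait until the crawl is reported finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.exit(0);          // the monitor thread controls program exit
        }, "crawler-monitor");
        monitor.start();
    }

    public void signalFinished() { finished.countDown(); }
}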

3. Task Master

The Task Master manages the task queue and abstracts away the queue's underlying implementation (a sketch of such an abstraction follows the list below).

At this stage we use MySQL as the task queue implementation; Redis is available as a replacement.

Process flow for the Task Master:

1) The Task Master initializes the task queue; how it is initialized depends on the configuration. For an incremental crawl, the list is initialized from the specified URLs. For a full crawl, only the home pages of one or several e-commerce sites are pre-loaded.

2) The Task Master creates a monitor thread that controls when the whole program exits.

3) The Task Master dispatches tasks; if the task queue is persisted, it is responsible for loading tasks from the task queue server. Prefetching needs to be considered.

4) The Task Master is also responsible for validating tasks; the crawler monitoring platform can mark some tasks in the queue as invalid.
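
To make the abstraction concrete, here is a hypothetical task-queue interface with a simple in-memory implementation; a MySQL- or Redis-backed class could implement the same interface. The interface and its method names are assumptions, not the project's code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the task-queue abstraction the Task Master manages.
interface TaskQueue {
    void submit(String url);      // add a URL to crawl
    String poll();                // take the next URL, or null if the queue is empty
    void markFailed(String url);  // let the monitoring platform invalidate a task
}

// Simple in-memory implementation; a MySQL- or Redis-backed version would
// persist the queue and could prefetch a batch of tasks at a time.
class InMemoryTaskQueue implements TaskQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    public InMemoryTaskQueue(List<String> seedUrls) {
        queue.addAll(seedUrls);   // e.g. the home pages of the target sites
    }

    @Override public synchronized void submit(String url) { queue.addLast(url); }
    @Override public synchronized String poll() { return queue.pollFirst(); }
    @Override public synchronized void markFailed(String url) { queue.remove(url); }
}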

4. Workers

Workers form a thread pool in which each thread executes the entire crawl process. Multiple thread pools could be used to split the flow into asynchronous stages and increase thread utilization.
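
A minimal sketch of setting up such a worker pool with a standard ExecutorService; the pool size of 8 is an arbitrary assumption.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical worker-pool setup: each worker runs the whole crawl flow
// against a shared task queue, as in the loop sketched earlier.
public class WorkerPool {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            workers.submit(() -> {
                // take task -> fetch -> parse -> handle -> submit new links,
                // repeating until the shared task queue is drained
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }
}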

5. Fetcher

The Fetcher is responsible for directly fetching the Web pages of the e-commerce sites. It is implemented with HttpClient; HttpCore 4 supports NIO, so a non-blocking implementation is also possible.

The Fetcher can be configured not to save the HTML files.
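
A minimal Fetcher sketch based on Apache HttpClient 4's blocking API; the client setup and error handling here are assumptions, not the project's actual implementation.

import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Sketch of a Fetcher built on Apache HttpClient 4 (blocking I/O); an
// NIO-based variant on HttpCore 4 would follow the same contract.
public class SimpleFetcher {
    private final CloseableHttpClient client = HttpClients.createDefault();

    public String fetch(String url) {
        HttpGet get = new HttpGet(url);
        try (CloseableHttpResponse response = client.execute(get)) {
            if (response.getStatusLine().getStatusCode() != 200) {
                return null;  // treat non-200 responses as a failed fetch
            }
            return EntityUtils.toString(response.getEntity());
        } catch (IOException e) {
            return null;      // network error; a real crawler would retry or log
        }
    }
}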

6. Parser

The Parser parses the pages that the Fetcher retrieves. Ordinary Web pages are usually not well formed (unlike XHTML, which is strictly formed), so XML class libraries cannot process them directly. We need a more tolerant HTML parser that can repair these malformed pages; we are using HtmlParser.
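
For illustration, a small link-extraction sketch using the org.htmlparser library; the exact API may differ between library versions, so treat the calls below as an assumption rather than a reference.

import java.util.ArrayList;
import java.util.List;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Sketch of link extraction with HtmlParser, which tolerates the malformed
// HTML that strict XML parsers reject.
public class LinkExtractor {
    public List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        try {
            Parser parser = Parser.createParser(html, "UTF-8");
            NodeList anchors = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
            for (int i = 0; i < anchors.size(); i++) {
                links.add(((LinkTag) anchors.elementAt(i)).getLink());
            }
        } catch (ParserException e) {
            // the page could not be parsed; return whatever was collected so far
        }
        return links;
    }
}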

7. Handler

The Handler processes the content that the Parser produces, in one of two styles:

Callback (visitor): for SAX-style event processing, the Handler is adapted into a SAX ContentHandler and handed to the parser, and the handling context is returned together with the parsing result.

Active: the entire HTML page is first parsed into a DOM structure, and the Handler then selects and processes the content it wants. This is easy to use, since XPath, NodeFilter, and the like are available, but it consumes more memory.

ContentHandler: it also contains a ContentFilter component for filtering content.

The UrlExtractor is responsible for extracting conforming URLs from the Web page, building each URL into a Task, and submitting it to the Task Queue.

8. VisitedTableManager

The VisitedTableManager manages the URLs that have already been visited. It exposes a unified interface and abstracts away the underlying implementation. If a URL has already been crawled, it is not added to the TaskQueue again.
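
A minimal sketch of such a unified interface with an in-memory implementation; the interface itself is an assumption, and a database-backed class could implement it equally well (see the MySQL sketch in section IV).

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical visited-table abstraction: a unified interface plus a simple
// in-memory implementation backed by a concurrent set.
interface VisitedTable {
    boolean isVisited(String url);
    void markVisited(String url);
}

class InMemoryVisitedTable implements VisitedTable {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    @Override public boolean isVisited(String url) { return visited.contains(url); }
    @Override public void markVisited(String url) { visited.add(url); }
}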

III. Task Queue

The task queue stores the tasks that need to be crawled. Tasks are related to one another, and we can save and manage these relationships, which are also the relationships between the URLs. Saving them helps the backend build a Web graph and analyze the data.

In a distributed crawler cluster, the task queue needs a centralized server to store it. Some lightweight databases, or NoSQL stores that support lists, can be used. The options are:

1) Store it in MySQL.

2) Store it in Redis (a sketch follows below).
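
To illustrate option 2, a minimal sketch of a Redis list used as a centralized task queue via the Jedis client; the key name "crawler:taskqueue" and the connection details are assumptions.

import redis.clients.jedis.Jedis;

// Sketch of a Redis list serving as the centralized task queue.
public class RedisTaskQueue {
    private static final String KEY = "crawler:taskqueue";
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void submit(String url) {
        jedis.rpush(KEY, url);   // append the task to the tail of the list
    }

    public String poll() {
        return jedis.lpop(KEY);  // take the next task from the head, null if empty
    }
}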

IV. Visited Table

The visited table stores the pages that have already been crawled. It needs to be rebuilt for each crawl.

For the current amount of data, MySQL is used.
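
A minimal JDBC sketch of how a MySQL-backed visited table could be checked and updated; the table name and schema (visited with a unique url column) are assumed, not the project's actual design.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical JDBC-based visited table backed by MySQL.
public class MySqlVisitedTable {
    private final Connection conn;

    public MySqlVisitedTable(Connection conn) { this.conn = conn; }

    public boolean isVisited(String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("SELECT 1 FROM visited WHERE url = ?")) {
            ps.setString(1, url);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();   // a row exists if the URL was already crawled
            }
        }
    }

    public void markVisited(String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("INSERT IGNORE INTO visited (url) VALUES (?)")) {
            ps.setString(1, url);
            ps.executeUpdate();     // IGNORE avoids duplicate-key errors on re-insert
        }
    }
}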

V. Concluding remarks

The crawler's development was handled by someone else; I did not participate in the details, so the specifics are not very clear to me. However, I later heard that the crawler frequently went offline: it was often blocked by the sites, and that problem has not been resolved.

If you are experienced with crawler technology, your guidance would be welcome.

Today is a rare sunny day. Outside the window stand two trees whose names I do not know, their leaves already green. Their vigorous trunks reach straight toward the clouds, making the distant sky seem so near, and so blue.
