Search Engine Development (II): The Crawler


March 23 (Monday)

Sunny, with a southerly wind

At today's data group meeting it was reported that the crawler has been developed and is now being tried against the target sites. Because the websites we collect from are fixed and the crawl depth is only 3, the crawler does not need to be as powerful as a general-purpose crawler such as Heritrix.

The crawler is built mainly on two Java libraries, HttpClient and HtmlParser, and its architecture draws on the design outlined below.

I. Architecture Diagram

This search crawler framework mainly targets e-commerce websites, handling data crawling, analysis, storage, and indexing.

Crawler: responsible for crawling, parsing, and processing Web page content

Database: stores Web page information

Index: Fingerprint information for Web content

Task Queue: List of pages to crawl

Visited table: List of pages that have been crawled

Crawler monitoring platform: a Web platform used to start and stop the crawler and to manage the crawler, the task queue, and the visited table.

II. Crawler

1. Process

1) The Scheduler starts the crawler, and the TaskMaster initializes the TaskQueue

2) Worker threads take tasks from the TaskQueue

3) Each worker thread calls the Fetcher to crawl the Web page described by the Task

4) The worker thread passes the crawled page to the Parser for parsing

5) The parsed data is sent to the Handler for processing, which extracts the page's links and processes the page content

6) The VisitedTableManager checks whether each link extracted by the UrlExtractor has already been crawled; if not, the link is submitted to the TaskQueue
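
As a rough illustration of this flow, here is a minimal, self-contained sketch of the crawl loop. Plain Java collections stand in for the TaskQueue and the visited table, and fetch() and extractLinks() are placeholders for the Fetcher and Parser components described below; none of the names are the project's actual code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the crawl loop above. Plain collections stand in for the
// TaskQueue and visited table; fetch() and extractLinks() are placeholders for
// the Fetcher and Parser/UrlExtractor components described later.
public class CrawlLoopSketch {

    private static final int MAX_DEPTH = 3;  // crawl depth mentioned above

    private static class Task {
        final String url;
        final int depth;
        Task(String url, int depth) { this.url = url; this.depth = depth; }
    }

    private final Deque<Task> taskQueue = new ArrayDeque<>();  // task queue
    private final Set<String> visited = new HashSet<>();       // visited table

    public void crawl(String seedUrl) {
        taskQueue.add(new Task(seedUrl, 0));                   // 1) initialize the queue
        visited.add(seedUrl);
        Task task;
        while ((task = taskQueue.poll()) != null) {            // 2) take a task
            String html = fetch(task.url);                     // 3) fetch the page
            if (html == null) continue;                        //    fetch failed
            List<String> links = extractLinks(html);           // 4)+5) parse and handle
            if (task.depth >= MAX_DEPTH) continue;
            for (String link : links) {                        // 6) submit unseen URLs only
                if (visited.add(link)) {
                    taskQueue.add(new Task(link, task.depth + 1));
                }
            }
        }
    }

    // Placeholders for the real Fetcher and Parser described below.
    private String fetch(String url) { return null; }
    private List<String> extractLinks(String html) { return java.util.Collections.emptyList(); }
}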



2. Scheduler

The Scheduler is responsible for starting the crawler, calling the TaskMaster to initialize the TaskQueue, and creating a monitor thread that controls when the program exits.
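
A minimal sketch of what such a Scheduler could look like; the monitor thread and the commented-out initialization calls are illustrative assumptions, not the project's actual API.

import java.util.concurrent.CountDownLatch;

// Hypothetical Scheduler sketch: initialize the task queue via the TaskMaster,
// start the workers, and run a monitor thread that controls program exit.
public class Scheduler {
    private final CountDownLatch finished = new CountDownLatch(1);

    public void start() {
        // taskMaster.initQueue();   // 1) initialize the task queue (assumed call)
        // workerPool.start();       // 2) start the worker threads (assumed call)
        Thread monitor = new Thread(() -> {
            try {
                finished.await();    // 3) wait until the crawl is reported finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.exit(0);          // the monitor thread controls program exit
        }, "crawler-monitor");
        monitor.start();
    }

    public void signalFinished() { finished.countDown(); }
}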

3. Task Master

The Task Master manages the task queue and abstracts away the queue's underlying implementation (a sketch of such an abstraction follows the list below).

At this stage we use MySQL as the task queue implementation; Redis is available as a replacement.

Process flow for the Task Master:

1) The Task Master initializes the task queue; how it is initialized depends on the configuration. For an incremental crawl, the list is initialized from the specified URLs. For a full crawl, only the home pages of one or several e-commerce sites are pre-loaded.

2) The Task Master creates a monitor thread that controls when the whole program exits.

3) The Task Master dispatches tasks; if the task queue is persisted, it is responsible for loading tasks from the task queue server. Prefetching needs to be considered.

4) The Task Master is also responsible for validating tasks; the crawler monitoring platform can mark some tasks in the queue as invalid.
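
To make the abstraction concrete, here is a hypothetical task-queue interface with a simple in-memory implementation; a MySQL- or Redis-backed class could implement the same interface. The interface and its method names are assumptions, not the project's code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the task-queue abstraction the Task Master manages.
interface TaskQueue {
    void submit(String url);      // add a URL to crawl
    String poll();                // take the next URL, or null if the queue is empty
    void markFailed(String url);  // let the monitoring platform invalidate a task
}

// Simple in-memory implementation; a MySQL- or Redis-backed version would
// persist the queue and could prefetch a batch of tasks at a time.
class InMemoryTaskQueue implements TaskQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    public InMemoryTaskQueue(List<String> seedUrls) {
        queue.addAll(seedUrls);   // e.g. the home pages of the target sites
    }

    @Override public synchronized void submit(String url) { queue.addLast(url); }
    @Override public synchronized String poll() { return queue.pollFirst(); }
    @Override public synchronized void markFailed(String url) { queue.remove(url); }
}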

4. Workers

Workers form a thread pool in which each thread executes the entire crawl process. Multiple thread pools could be used to split the flow into asynchronous stages and increase thread utilization.
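
A minimal sketch of setting up such a worker pool with a standard ExecutorService; the pool size of 8 is an arbitrary assumption.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical worker-pool setup: each worker runs the whole crawl flow
// against a shared task queue, as in the loop sketched earlier.
public class WorkerPool {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            workers.submit(() -> {
                // take task -> fetch -> parse -> handle -> submit new links,
                // repeating until the shared task queue is drained
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }
}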

5. Fetcher

The Fetcher is responsible for directly fetching the Web pages of the e-commerce sites. It is implemented with HttpClient; HttpCore 4 supports NIO, so a non-blocking implementation is also possible.

The Fetcher can be configured not to save the HTML files.
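
A minimal Fetcher sketch based on Apache HttpClient 4's blocking API; the client setup and error handling here are assumptions, not the project's actual implementation.

import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Sketch of a Fetcher built on Apache HttpClient 4 (blocking I/O); an
// NIO-based variant on HttpCore 4 would follow the same contract.
public class SimpleFetcher {
    private final CloseableHttpClient client = HttpClients.createDefault();

    public String fetch(String url) {
        HttpGet get = new HttpGet(url);
        try (CloseableHttpResponse response = client.execute(get)) {
            if (response.getStatusLine().getStatusCode() != 200) {
                return null;  // treat non-200 responses as a failed fetch
            }
            return EntityUtils.toString(response.getEntity());
        } catch (IOException e) {
            return null;      // network error; a real crawler would retry or log
        }
    }
}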

6. Parser

The Parser parses the pages that the Fetcher retrieves. Ordinary Web pages are usually not well formed (unlike XHTML, which is strictly formed), so XML class libraries cannot process them directly. We need a more tolerant HTML parser that can repair these malformed pages; we are using HtmlParser.
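
For illustration, a small link-extraction sketch using the org.htmlparser library; the exact API may differ between library versions, so treat the calls below as an assumption rather than a reference.

import java.util.ArrayList;
import java.util.List;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Sketch of link extraction with HtmlParser, which tolerates the malformed
// HTML that strict XML parsers reject.
public class LinkExtractor {
    public List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        try {
            Parser parser = Parser.createParser(html, "UTF-8");
            NodeList anchors = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
            for (int i = 0; i < anchors.size(); i++) {
                links.add(((LinkTag) anchors.elementAt(i)).getLink());
            }
        } catch (ParserException e) {
            // the page could not be parsed; return whatever was collected so far
        }
        return links;
    }
}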

7. Handler

The Handler processes the content that the Parser produces, in one of two styles:

Callback (visitor): for SAX-style event processing, the Handler is adapted into a SAX ContentHandler and handed to the parser, and the handling context is returned together with the parsing result.

Active: the entire HTML page is first parsed into a DOM structure, and the Handler then selects and processes the content it wants. This is easy to use, since XPath, NodeFilter, and the like are available, but it consumes more memory.

ContentHandler: it also contains a ContentFilter component for filtering content.

The UrlExtractor is responsible for extracting conforming URLs from the Web page, building each URL into a Task, and submitting it to the Task Queue.

8. VisitedTableManager

The VisitedTableManager manages the URLs that have already been visited. It exposes a unified interface and abstracts away the underlying implementation. If a URL has already been crawled, it is not added to the TaskQueue again.
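
A minimal sketch of such a unified interface with an in-memory implementation; the interface itself is an assumption, and a database-backed class could implement it equally well (see the MySQL sketch in section IV).

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical visited-table abstraction: a unified interface plus a simple
// in-memory implementation backed by a concurrent set.
interface VisitedTable {
    boolean isVisited(String url);
    void markVisited(String url);
}

class InMemoryVisitedTable implements VisitedTable {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    @Override public boolean isVisited(String url) { return visited.contains(url); }
    @Override public void markVisited(String url) { visited.add(url); }
}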

III. Task Queue

The task queue stores the tasks that need to be crawled. Tasks are related to one another, and we can save and manage these relationships, which are also the relationships between the URLs. Saving them helps the backend build a Web graph and analyze the data.

In a distributed crawler cluster, the task queue needs a centralized server to store it. Some lightweight databases, or NoSQL stores that support lists, can be used. The options are:

1) Store it in MySQL.

2) Store it in Redis (a sketch follows below).
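
To illustrate option 2, a minimal sketch of a Redis list used as a centralized task queue via the Jedis client; the key name "crawler:taskqueue" and the connection details are assumptions.

import redis.clients.jedis.Jedis;

// Sketch of a Redis list serving as the centralized task queue.
public class RedisTaskQueue {
    private static final String KEY = "crawler:taskqueue";
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void submit(String url) {
        jedis.rpush(KEY, url);   // append the task to the tail of the list
    }

    public String poll() {
        return jedis.lpop(KEY);  // take the next task from the head, null if empty
    }
}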

IV. Visited Table

The visited table stores the pages that have already been crawled. It needs to be rebuilt for each crawl.

For the current amount of data, MySQL is used.
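
A minimal JDBC sketch of how a MySQL-backed visited table could be checked and updated; the table name and schema (visited with a unique url column) are assumed, not the project's actual design.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical JDBC-based visited table backed by MySQL.
public class MySqlVisitedTable {
    private final Connection conn;

    public MySqlVisitedTable(Connection conn) { this.conn = conn; }

    public boolean isVisited(String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("SELECT 1 FROM visited WHERE url = ?")) {
            ps.setString(1, url);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();   // a row exists if the URL was already crawled
            }
        }
    }

    public void markVisited(String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("INSERT IGNORE INTO visited (url) VALUES (?)")) {
            ps.setString(1, url);
            ps.executeUpdate();     // IGNORE avoids duplicate-key errors on re-insert
        }
    }
}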

V. Concluding remarks

The crawler's development was handled by someone else; I did not participate in the details, so the specifics are not very clear to me. However, I later heard that the crawler frequently went offline: it was often blocked by the sites, and that problem has not been resolved.

If you are experienced with crawler technology, your guidance would be welcome.

Today is a rare sunny day. Outside the window stand two trees whose names I do not know, their leaves already green. Their vigorous trunks reach straight toward the clouds, making the distant sky seem so near, and so blue.
