Web Crawler (1): overview


Recently I started work on a search engine project. It targets the Linux platform and is written in pure C. The project mainly draws on the following books:

    1. Entering Search Engines, Liang Bin, Publishing House of Electronics Industry
    2. Search Engines: Principles, Practice, and Applications, Lu Liang and Zhang Bowen, Publishing House of Electronics Industry
    3. Search Engines: Principles, Technology and Systems, Li Xiaoming, Yan Hongfei, and Wang Jimin, Science Press

These three books overlap considerably and lean heavily toward theory, but domestic reference material on the subject is scarce, so if you can, I suggest reading all of them.

Development Plan

The first stage of the project is to crawl and keep up to date the pages within a given scope: all websites under Haida or CERNET, amounting to tens of millions of records.

This plan is divided into three steps:

    1. Single-threaded targeted crawling
    2. Multi-threaded crawling
    3. Distributed crawling

At present, Step 1 (single-threaded targeted crawling) is complete.
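The single-threaded crawl boils down to one loop: take the next URL, fetch the page, extract its links, and queue the new ones. Below is a minimal, self-contained sketch of that loop in C; the fixed-size queue, the fetch stub, and all names (`enqueue`, `dequeue`, `fetch_and_extract`, `crawl`) are illustrative stand-ins, not the project's actual code.

```c
#include <string.h>

#define MAX_URLS 64
#define URL_LEN  256

/* A fixed-size FIFO of pending URLs -- a hypothetical stand-in for the
 * Store module's "next unanalyzed page" interface. */
static char queue[MAX_URLS][URL_LEN];
static int q_head = 0, q_tail = 0;

static int enqueue(const char *url)
{
    if (q_tail >= MAX_URLS)
        return -1;               /* queue full */
    strncpy(queue[q_tail], url, URL_LEN - 1);
    queue[q_tail][URL_LEN - 1] = '\0';
    q_tail++;
    return 0;
}

static const char *dequeue(void)
{
    if (q_head >= q_tail)
        return NULL;             /* nothing left to crawl */
    return queue[q_head++];
}

/* Stub standing in for the HTTP and Page_parse modules: pretend every
 * page links to one fixed URL.  A real implementation would fetch the
 * page over the network and parse the HTML. */
static int fetch_and_extract(const char *url, char *out_link, size_t len)
{
    (void)url;
    strncpy(out_link, "http://example.edu/next", len - 1);
    out_link[len - 1] = '\0';
    return 1;                    /* number of links found */
}

/* One run of the single-threaded crawl loop; returns pages crawled. */
int crawl(const char *seed, int max_pages)
{
    char link[URL_LEN];
    int crawled = 0;
    const char *url;

    enqueue(seed);
    while (crawled < max_pages && (url = dequeue()) != NULL) {
        if (fetch_and_extract(url, link, sizeof link) > 0)
            enqueue(link);       /* would first pass Url_filter / Url_index */
        crawled++;
    }
    return crawled;
}
```

In the real system the extracted links would be checked against the Url_filter and Url_index modules before being queued; the stub skips that for brevity.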

Process

The flowchart of the entire system is as follows:

System Module

The system is divided into seven modules, following the system flow:

    1. Store: data storage, including writing pages to disk and fetching the next unanalyzed page.
    2. Url_info: URL information conversion; extracts the domain name and server IP address from a URL.
    3. Url_index: determines whether a URL has already been crawled.
    4. Url_filter: URL filtering for targeted crawling.
    5. HTTP: fetches a page given its URL.
    6. Page_parse: parses a page and extracts the URLs it contains.
    7. Main: the master program.
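To make the module descriptions concrete, here is a rough sketch of what the string-parsing half of the Url_info step might look like in C. The function name `extract_host` and the parsing details are my own assumptions; the real module would also resolve the server IP (e.g. via getaddrinfo()), which this sketch omits.

```c
#include <string.h>

/* Hypothetical sketch of Url_info's first job: pull the host name out
 * of an absolute URL such as "http://host/path".  Returns 0 on
 * success, -1 if the URL has no recognizable scheme separator. */
int extract_host(const char *url, char *host, size_t host_len)
{
    const char *start, *end;
    size_t n;

    start = strstr(url, "://");
    if (start == NULL)
        return -1;
    start += 3;                          /* skip past "://" */

    /* The host ends at the first '/', ':' or end of string. */
    end = start + strcspn(start, "/:");

    n = (size_t)(end - start);
    if (n >= host_len)
        n = host_len - 1;                /* truncate defensively */
    memcpy(host, start, n);
    host[n] = '\0';
    return 0;
}
```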

The implementation of each module will be introduced later.

Development Environment

Operating System: Ubuntu 8.04

Deployment Server: Ubuntu 8.04 Server

Compilation tools: GCC, make

IDE: Eclipse CDT 4.0.1

Version Control: Subversion
