I recently started writing a search engine project, built on the Linux platform in pure C. The project draws mainly on the following books:
- Enter the Search Engine, Liang Bin, Publishing House of Electronics Industry
- Principles, Practice, and Applications of Search Engines, Lu Liang and Zhang Bowen, Publishing House of Electronics Industry
- Search Engines: Principles, Technology and Systems, Li Xiaoming, Yan Hongfei, and Wang Jimin, Science Press
The three books overlap quite a bit and all lean toward theory, but Chinese-language references on this subject are scarce, so if you have the chance I suggest reading all of them.
Development Plan
The first step of the project is to crawl and refresh pages within a given scope. The target is all websites of Haida or on CERNET, on the order of tens of millions of records.
This plan is divided into three steps:
- Single-threaded targeted crawling
- Multi-threaded crawling
- Distributed crawling
Step 1, single-threaded targeted crawling, is now complete.
Process
The flowchart of the entire system is as follows:
System Modules
Following this process, the system is divided into seven modules (a sketch of how they fit together appears after the list):
- Store: data storage; writes fetched pages to disk and hands out the next unparsed page.
- Url_info: URL information conversion; extracts the domain name and server IP address from a URL.
- Url_index: determines whether a URL has already been crawled.
- Url_filter: URL filtering for targeted crawling.
- HTTP: fetches the page for a given URL (see the socket sketch further below).
- Page_parse: parses a page and extracts the URLs it contains.
- Main: the master program.
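To make the architecture concrete, here is a minimal sketch of one single-threaded crawl step wired through these modules. Every function below is a toy stand-in; the names, signatures, and the ".edu.cn" scope rule are assumptions for illustration, not the project's actual interfaces.

```c
/* A toy single-threaded crawl step. All names, signatures, and the
 * ".edu.cn" scope rule are illustrative assumptions, not the
 * project's real interfaces. */
#include <stdio.h>
#include <string.h>

#define MAX_SEEN 64

/* Url_filter: accept only URLs inside the assumed target scope. */
static int url_filter_accept(const char *url)
{
    return strstr(url, ".edu.cn") != NULL;
}

/* Url_index: toy "have we crawled this?" check via linear scan. */
static char seen[MAX_SEEN][128];
static int  n_seen;

static int url_index_seen(const char *url)
{
    int i;
    for (i = 0; i < n_seen; i++)
        if (strcmp(seen[i], url) == 0)
            return 1;
    return 0;
}

static void url_index_add(const char *url)
{
    if (n_seen < MAX_SEEN)
        strcpy(seen[n_seen++], url);
}

/* HTTP: stubbed fetch returning a canned page body. */
static const char *http_fetch(const char *url)
{
    (void)url;
    return "<html><a href=\"http://www.example.edu.cn/a\">a</a></html>";
}

/* Store: stubbed persistence, just logs the URL. */
static void store_save_page(const char *url, const char *body)
{
    (void)body;
    printf("stored: %s\n", url);
}

int main(void)
{
    const char *seed = "http://www.example.edu.cn/";

    /* One crawl step: filter -> dedup -> fetch -> index -> store.
       Page_parse would then extract links from the body and feed
       them back through the same checks. */
    if (url_filter_accept(seed) && !url_index_seen(seed)) {
        const char *body = http_fetch(seed);
        url_index_add(seed);
        store_save_page(seed, body);
    }
    return 0;
}
```

In the real system each of these stand-ins grows into its own module, and Url_info sits between the filter and HTTP to turn a URL into a host name and server IP.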
The implementation of each module will be introduced in later posts.
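As a preview, here is a minimal sketch of what the HTTP module has to do: resolve a host, connect on port 80, and issue a GET request. The target host, the bare HTTP/1.0 request, and the absence of timeouts and redirect handling are simplifications assumed for illustration; this is not the project's actual code.

```c
/* Minimal raw-socket page fetch: an illustrative sketch, not the
 * project's HTTP module. Host and request are hard-coded assumptions. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>

int main(void)
{
    const char *host = "www.example.com";  /* hypothetical target */
    struct addrinfo hints, *res;
    char req[256], buf[4096];
    ssize_t n;
    int fd;

    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    /* Resolve the host name; roughly the job Url_info performs. */
    if (getaddrinfo(host, "80", &hints, &res) != 0) {
        fprintf(stderr, "getaddrinfo failed\n");
        return 1;
    }

    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }

    /* HTTP/1.0 keeps the sketch simple: the server closes the
       connection when the response ends, so we just read to EOF. */
    snprintf(req, sizeof req, "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", host);
    write(fd, req, strlen(req));

    while ((n = read(fd, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    freeaddrinfo(res);
    return 0;
}
```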
Development Environment
Operating System: Ubuntu 8.04
Deployment Server: Ubuntu 8.04 Server
Build tools: GCC, make
IDE: Eclipse CDT 4.0.1
Version Control: Subversion