I. Project Intent and Crawler Overview
1. Project Intent
My college graduation project was a web crawler written in C on Linux, and now I want to improve it into something closer to an enterprise-level project. To learn the principles of the wheel by reinventing it, we will not use any third-party frameworks; that is, nothing beyond the usual Linux system programming libraries and interfaces. (I originally developed it on an Ubuntu virtual machine with PuTTY and Vim.)
But we cannot build the wheel behind closed doors and sketch it out of thin air; we should stand on the shoulders of giants, so that we can grasp the principles of the wheel better and faster, and then build faster and stronger wheels. This project is implemented based on the ideas and part of the code provided in the blogs of two senior bloggers. Their URLs are below, and reading these two blogs will be a great help for our subsequent study.
Tempting Sigma: (http://blog.csdn.net/l979951191/article/details/48650657) // This great god had already done this kind of project as a sophomore, and also writes articles on Fourier transforms and signal processing; a bow of admiration first.
Yin Cheng: (http://blog.csdn.net/itcastcpp/article/details/38883047) // A great god from Tsinghua; nothing more needs to be said.
Before you start the project, you need to know what it is about:
(1) The crawler is relatively simple in function, but as a personal learning project it is fairly complete.
(2) There are many places where the crawler could be optimized, and many of its design choices are not necessarily the best, so it is only suitable for beginners to learn from.
(3) This is a complete project: Linux-based, pure C.
(4) Since I am also using this project to learn, I believe it has real value as a learning project:
Through this project, we will learn several ideas: software framework design, code reuse, iterative development, and incremental development.
Through this project, we will grasp and consolidate the following technical points:
1. Linux processes and scheduling
2. Linux services
3. Signals
4. Socket programming
5. Linux multitasking
6. File systems
7. Regular expressions
8. Shell scripting
9. Dynamic libraries
In addition, we will pick up some broader knowledge:
1. How to use the HTTP protocol
2. How to design a system
3. How to select and use open source projects
4. How to select an I/O model
5. How to conduct system analysis
6. How to do fault-tolerant processing
7. How to conduct system testing
8. How to manage source code
The sea of stars lies ahead and the sails are hoisted; let us begin this learning journey together!
2. Crawler Overview
A web crawler is an important basic component of a search engine. Since the amount of information on the Internet is enormous, search engines let us find what we need with ease. A search engine first needs an information acquisition system, namely a web crawler, which collects web pages or other information from the Internet to local storage and then builds an index over that information. When a user enters a query, the query is analyzed, matched against the index library, and the results are processed and returned.
A web crawler is not only an important part of a search engine; it is also widely used in business systems, for example for information gathering and public opinion analysis. Data acquisition is an important precondition for big data analysis.
The workflow of a web crawler is fairly involved: according to some web page analysis algorithm, it must filter links by topic, keep the useful ones, and put them into the queue of URLs waiting to be crawled.
A web crawler starts from an initial set of seed URLs and puts them all into an ordered queue of URLs to fetch. It then takes URLs from the queue one by one, retrieves the page each URL points to using a web protocol, extracts new URLs from the retrieved pages, and places them into the queue to be fetched. Repeating this process gathers more and more pages, as the sketch below illustrates.
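To make that loop concrete, here is a minimal sketch in C. It is not the project's actual code: fetch_page() and extract_links() are hypothetical stubs standing in for the HTTP download and page-parsing modules built later in the series, and duplicate-URL detection is omitted.

/* Minimal sketch of the fetch-extract-enqueue crawl loop. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_URLS 1024

static char *queue[MAX_URLS];        /* FIFO queue of URLs waiting to be crawled */
static int head = 0, tail = 0;

static void enqueue(const char *url)
{
    if (tail < MAX_URLS)             /* a real crawler also needs a visited set */
        queue[tail++] = strdup(url);
}

static char *dequeue(void)
{
    return head < tail ? queue[head++] : NULL;
}

/* Stub: download the page a URL points to (the real version sends an HTTP request). */
static char *fetch_page(const char *url)
{
    printf("fetching %s\n", url);
    return strdup("<html>...</html>");
}

/* Stub: find links in a page, filter them by topic, and enqueue the useful ones. */
static void extract_links(const char *page)
{
    (void)page;                      /* the real version parses href attributes */
}

int main(void)
{
    enqueue("http://www.example.com/");      /* seed URL, an assumed example */

    char *url;
    while ((url = dequeue()) != NULL) {      /* repeat until no URLs remain */
        char *page = fetch_page(url);
        extract_links(page);
        free(page);
        free(url);
    }
    return 0;
}

Using a FIFO queue makes the crawl breadth-first; swapping it for a stack would make it depth-first instead.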
In the next article we will make a simple design for the crawler project and implement fetching a single web page with a simple HTTP request.
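As a preview of what such a simple HTTP request can look like, here is a minimal sketch that uses only the standard socket interfaces, in keeping with the no-third-party-framework rule. The host www.example.com, the fixed port 80, and the HTTP/1.0 request line are illustrative assumptions, not the project's final implementation, and error handling is kept to a minimum.

/* Minimal sketch: fetch one page with a plain HTTP GET over a TCP socket. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    const char *host = "www.example.com";

    /* Resolve the host name to an address. */
    struct addrinfo hints = {0}, *res;
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    int err = getaddrinfo(host, "80", &hints, &res);
    if (err != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
        return 1;
    }

    /* Create a TCP socket and connect to the web server. */
    int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (sock < 0 || connect(sock, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }
    freeaddrinfo(res);

    /* Send a minimal HTTP/1.0 GET request; with HTTP/1.0 the server closes
       the connection after sending the page, which simplifies the read loop. */
    char request[256];
    snprintf(request, sizeof(request),
             "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", host);
    write(sock, request, strlen(request));

    /* Read the response until the server closes the connection. */
    char buf[4096];
    ssize_t n;
    while ((n = read(sock, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(sock);
    return 0;
}

Compiled with gcc and run, this prints the raw HTTP response, headers included, to standard output.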