C Language Linux Server Web Crawler Project (I): Project Intention and Web Crawler Overview
I. Project Intention and Crawler Overview
1. Original Project Intention
My college project was a crawler written in C on Linux, and now I want to improve it until it looks like an enterprise-level project. To learn the principles of the wheel by reinventing it, we will not use any third-party framework (by which I mean libraries and interfaces beyond what a general Linux system programming textbook covers). My development environment is Ubuntu in a VM, accessed through PuTTY, with Vim as the editor.
However, we cannot work behind closed doors either; by standing on the shoulders of giants we can grasp the principles of the wheel better and faster, and then build a faster, stronger wheel. This project is implemented on top of the ideas and some of the code provided by two earlier bloggers. Their blog URLs are listed below, and reading these two blogs will be of great help for our later study.
Sigma: (http://blog.csdn.net/l979951191/article/details/48650657) // This blogger already had a project like this back in his sophomore year, and also writes about Fourier transforms and signal processing. Much respect.
Yin Cheng: (http://blog.csdn.net/itcastcpp/article/details/38883047) // A Tsinghua expert; no more needs to be said.
Before starting the project, you should first understand a few things about it:
(1) This crawler has a single function, but it is relatively complete as a learning project.
(2) There is a great deal in this crawler that could be optimized, and many of its solutions are not necessarily the best, so it is mainly suitable for beginners to learn from.
(3) This is a complete project based on Linux, written in pure C.
(4) Since I learned from this project myself, I believe it has real value as a learning project:
Through this project, we will learn several ideas: software frameworks, code reuse, iterative development, and incremental development.
Through this project, we will master and consolidate the following technical points:
1. Linux processes and scheduling
2. Linux services
3. Signals
4. Socket programming
5. Linux multitasking
6. File systems
7. Regular expressions
8. Shell scripting
9. Dynamic libraries
In addition, we will pick up some supplementary knowledge:
1. How to use HTTP
2. How to design a system
3. How to select and use open-source projects
4. How to select an I/O model
5. How to perform system analysis
6. How to handle fault tolerance
7. How to perform system testing
8. How to manage source code
The stars and the sea lie ahead. Let's start learning together!
2. Crawler Overview
A web crawler is an important basic component of a search engine. Because the amount of information on the Internet is enormous, search engines make it easy for us to find the information we need. A search engine first needs an information collection system, namely the web crawler, which downloads web pages and other information from the Internet to local storage and then builds an index over that information. When a user submits a query, the query is first analyzed, then matched against the index database, and finally the results are processed and returned.
Web crawlers are not only an important part of search engines; they are also widely used in business systems that need to gather data, such as information collection, public opinion analysis, and intelligence gathering. Crawling is an important prerequisite for big data analysis.
The workflow of a web crawler is relatively complex. Links unrelated to the topic must be filtered out according to certain web-page analysis algorithms, while useful links are kept and placed in the queue of URLs waiting to be crawled.
A web crawler starts from an initial set of URLs and puts them all into an ordered queue of URLs to be fetched. It then takes URLs from this queue in order, retrieves the page each URL points to via the web protocol, parses the fetched pages to extract new URLs, puts those into the queue to be fetched, and repeats this process to obtain more and more pages.
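To make this loop concrete, here is a minimal sketch in C of the URL queue and crawl loop described above. The url_node structure, the seed URL, and the commented fetch/extract steps are placeholders of my own for illustration, not code from the project; later articles will replace them with real socket and regular expression code.

```c
/* Minimal sketch of the crawl loop: a FIFO queue of URLs to fetch.
 * fetch and extract steps are left as comments (placeholders). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct url_node {
    char url[256];
    struct url_node *next;
} url_node;

static url_node *head = NULL, *tail = NULL;

/* Append a URL to the tail of the waiting queue. */
static void enqueue(const char *url)
{
    url_node *n = malloc(sizeof(url_node));
    if (!n) return;
    strncpy(n->url, url, sizeof(n->url) - 1);
    n->url[sizeof(n->url) - 1] = '\0';
    n->next = NULL;
    if (tail) tail->next = n; else head = n;
    tail = n;
}

/* Remove and return the URL node at the head of the queue (caller frees). */
static url_node *dequeue(void)
{
    url_node *n = head;
    if (n) {
        head = n->next;
        if (!head) tail = NULL;
    }
    return n;
}

int main(void)
{
    enqueue("http://www.example.com/");   /* seed URL (placeholder) */

    while (head != NULL) {
        url_node *n = dequeue();
        printf("fetching %s\n", n->url);
        /* 1. fetch the page over HTTP (to be implemented with sockets)  */
        /* 2. extract new links from the page (e.g. with regex) and      */
        /*    enqueue() each URL that has not been seen before           */
        free(n);
    }
    return 0;
}
```

With the fetch and extract steps filled in, this loop keeps running as long as new, unseen URLs are discovered, which is exactly the repeated process described above.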
In the next article, we will produce a simple design for the crawler project and fetch a web page with a simple HTTP request.