C Language Linux Server Web Crawler Project (I): Project Intention and Web Crawler Overview



I. Project Intention and Crawler Overview
1. Original Project Intention
My college project was a crawler written in C on Linux, and now I want to improve it so that it looks more like an enterprise-level project. To learn the principles of the wheel by reinventing it, we will not use any third-party frameworks (meaning no libraries or interfaces beyond what a general Linux system programming textbook covers). The development environment is Ubuntu running in a VM, accessed through PuTTY and edited with Vim.
However, we should not work behind closed doors. By standing on the shoulders of giants, we can grasp the principles of the wheel better and faster, and then build a faster and stronger wheel. This project is built on the ideas and some of the code provided by the two bloggers below, and reading their posts will be of great help for the study that follows.
Sigma: http://blog.csdn.net/l979951191/article/details/48650657 // This author had already built such a project as a sophomore and also writes about Fourier transforms and signal processing; much respect.
Yin Cheng: http://blog.csdn.net/itcastcpp/article/details/38883047 // A Tsinghua expert; nothing more needs to be said.

Before starting the project, you should first understand what it involves:
(1) This crawler has a single function, but as a learning project it is relatively complete.
(2) There is plenty of room for optimization, and many of its solutions are not necessarily the best; this crawler is therefore only suitable for beginners to learn from.
(3) This is a complete project based on Linux, written in pure C.
(4) Having learned from this project myself, I believe it has real value as a learning project:
Through this project, we will learn several ideas: software frameworks, code reuse, iterative development, and incremental development.
Through this project, we will master and consolidate the following technical points:
1. Linux processes and scheduling
2. Linux services
3. Signals
4. Socket programming
5. Linux multitasking
6. File systems
7. Regular expressions
8. Shell scripting
9. Dynamic libraries
In addition, we will learn some additional knowledge:
1. How to use HTTP
2. How to design a system
3. How to select and use open-source projects
4. How to select an I/O model
5. How to perform system analysis
6. How to handle fault tolerance
7. How to perform system testing
8. How to manage source code
The stars and the sea lie ahead of us. Let's start learning together!
2. Crawler Overview
A web crawler is an important basic component of a search engine. The amount of information on the Internet is enormous, and search engines make it easy for us to find what we need. A search engine first needs an information collection system, the web crawler, which gathers web pages and other information from the Internet onto local storage and then builds an index over it. When a user submits a query, the query is analyzed first, then matched against the index database, and finally the results are processed and returned.
Web crawlers are not only an important part of search engines; they are also widely used in business systems that require information collection, such as public opinion analysis and intelligence gathering. Information collection is an important prerequisite for analyzing big data.
The workflow of a web crawler is fairly complex: links unrelated to the topic must be filtered out according to certain web analysis algorithms, while the useful links are kept and placed in a URL queue waiting to be crawled.
A web crawler starts from an initial set of URLs and puts them all into an ordered queue of URLs to be fetched. It then takes URLs from this queue in order, downloads the pages they point to over the Web protocol, extracts new URLs from the downloaded pages, puts them into the queue of URLs to be fetched, and repeats this process to obtain more and more pages, as the sketch below illustrates.
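To make that loop concrete, here is a minimal sketch in C of a FIFO URL queue and the crawl loop around it. The helpers fetch_page() and extract_urls() are hypothetical placeholders (left as comments), not functions from the original project; the real HTTP download and link extraction are covered in later articles.

/* Minimal sketch of the crawl loop described above. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct url_node {              /* one entry in the FIFO queue of URLs */
    char *url;
    struct url_node *next;
};

static struct url_node *head = NULL, *tail = NULL;

static void enqueue(const char *url)
{
    struct url_node *n = malloc(sizeof(*n));
    n->url = strdup(url);
    n->next = NULL;
    if (tail) tail->next = n; else head = n;
    tail = n;
}

static char *dequeue(void)
{
    if (!head) return NULL;
    struct url_node *n = head;
    char *url = n->url;
    head = n->next;
    if (!head) tail = NULL;
    free(n);
    return url;                /* caller frees the string */
}

int main(void)
{
    enqueue("http://example.com/");   /* seed URL, placeholder only */

    char *url;
    while ((url = dequeue()) != NULL) {
        printf("fetching %s\n", url);
        /* page = fetch_page(url);           hypothetical: download over HTTP   */
        /* extract_urls(page, enqueue);      hypothetical: push new links onto queue */
        free(url);
    }
    return 0;
}

In the real crawler the queue also needs deduplication (so the same URL is not fetched twice) and some crawl-depth or politeness limits; this sketch only shows the basic fetch-extract-enqueue cycle.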

In the next article, we will draft a simple design for the crawler project and fetch a web page through a simple HTTP request; a preview sketch follows below.
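As a preview of that next step, here is a minimal sketch of such a request using only standard Linux system calls: open a TCP socket to a web server and send a GET request by hand. The host name example.com and the exact request text are placeholders for illustration, not the project's final code.

/* Preview sketch: fetch one page with a hand-written HTTP/1.1 GET. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* TCP */

    if (getaddrinfo("example.com", "80", &hints, &res) != 0) {
        fprintf(stderr, "getaddrinfo failed\n");
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }

    const char *req =
        "GET / HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Connection: close\r\n\r\n";
    write(fd, req, strlen(req));        /* send the request */

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);   /* print headers + HTML */

    close(fd);
    freeaddrinfo(res);
    return 0;
}

The next article will walk through this request step by step and turn the raw response into a saved web page.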

