Socket network programming - web crawler (1)

Let's continue this series with web crawlers. A web crawler is a very important part of a search engine system: it collects web pages and information from the Internet, and those pages are then indexed to support searches. The crawler therefore determines how rich and how up-to-date the engine's content is, and its performance directly affects the performance of the search engine. The basic working principle of a web crawler is:

(1) Select a URL from an initial URL set and download the page it points to;
(2) Parse the page, extract the URLs it contains, and add them to the URL set;
(3) Repeat the first two steps until the crawler reaches some stopping criterion (a rough sketch of this loop follows below).
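Purely as an illustration of those three steps, here is a rough sketch of the crawl loop in C. The queue, the download_page and extract_urls helpers, and the MAX_PAGES limit are hypothetical placeholders (stubbed out so the sketch compiles), not part of the program developed in this article; they only show how the steps fit together.

#include <stdio.h>
#include <string.h>

#define MAX_URLS  1024
#define MAX_PAGES 100          /* hypothetical stopping criterion */

/* Hypothetical placeholder: a real crawler would do an HTTP GET here. */
static int download_page(const char *url, char *page, size_t size)
{
    (void)url; (void)size;
    page[0] = '\0';
    return 0;                   /* pretend the download succeeded */
}

/* Hypothetical placeholder: a real crawler would parse href targets here. */
static int extract_urls(const char *page, char urls[][256], int max)
{
    (void)page; (void)urls; (void)max;
    return 0;                   /* pretend no new links were found */
}

int main(void)
{
    static char queue[MAX_URLS][256];
    int head = 0, tail = 0, crawled = 0;
    char page[64 * 1024];

    strcpy(queue[tail++], "http://www.example.com/");    /* initial URL set */

    while (head < tail && crawled < MAX_PAGES) {
        const char *url = queue[head++];                  /* (1) pick a URL */
        if (download_page(url, page, sizeof(page)) != 0)  /*     and download it */
            continue;

        char found[32][256];
        int n = extract_urls(page, found, 32);            /* (2) parse out new URLs */
        for (int i = 0; i < n && tail < MAX_URLS; i++)
            strcpy(queue[tail++], found[i]);              /*     add them to the set */

        crawled++;                                        /* (3) repeat until the limit */
    }
    printf("crawled %d page(s)\n", crawled);
    return 0;
}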

Of course, having only just learned network programming, I cannot tackle anything that complicated yet. In this section we will simply use C to download a single web page.

Download a webpage

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define BUF_SIZE 4096

int main(int argc, char *argv[])
{
    struct sockaddr_in servaddr;
    struct hostent *host;
    int sockfd;
    char sendbuf[BUF_SIZE], recvbuf[BUF_SIZE];
    int sendsize, recvsize;

    host = gethostbyname(argv[1]);
    if (host == NULL)
    {
        perror("DNS resolution failed");
        exit(1);
    }
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr = *((struct in_addr *)host->h_addr);
    servaddr.sin_port = htons(atoi(argv[2]));
    bzero(&(servaddr.sin_zero), 8);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd == -1)
    {
        perror("socket creation failed");
        exit(1);
    }

    if (connect(sockfd, (struct sockaddr *)&servaddr, sizeof(struct sockaddr_in)) == -1)
    {
        perror("connect failed");
        exit(1);
    }

    /* Construct an HTTP request; the header section ends with a blank line */
    sprintf(sendbuf, "GET / HTTP/1.1\r\nHost: %s\r\nConnection: keep-alive\r\n\r\n", argv[1]);
    if ((sendsize = send(sockfd, sendbuf, strlen(sendbuf), 0)) == -1)
    {
        perror("send failed");
        exit(1);
    }

    /* Get the HTTP response and print it as it arrives */
    memset(recvbuf, 0, sizeof(recvbuf));
    while ((recvsize = recv(sockfd, recvbuf, BUF_SIZE - 1, 0)) > 0)
    {
        printf("%s", recvbuf);
        memset(recvbuf, 0, sizeof(recvbuf));
    }

    close(sockfd);
    return 0;
}
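Assuming the source is saved as crawler.c (the file name is my own choice, not from the original post), it can be compiled with gcc -o crawler crawler.c and tried with ./crawler www.example.com 80, where argv[1] is the host name and argv[2] is the port.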

For how the HTTP request above is built, you can refer to material online; my blog also has a brief introduction (understanding the HTTP protocol: http://www.cnblogs.com/wunaozai/p/3733432.html).

An HTTP request consists of three parts: the request line, the message headers, and the request body. Let me briefly go through the request used above. First, GET / HTTP/1.1 means we use the GET method to fetch / (the root directory) over HTTP/1.1 (1.1 has been in use for a long time, and 2.0 is already on the way). Next is Host: the Internet host (and port) to which the request for the resource is sent; the default port is 80. And that is all a minimal HTTP request needs; nothing else has to be written, so it really is quite simple. Other header fields will be explained when they are used. One more note: each header field occupies one line, and each line is terminated by \r\n (carriage return plus line feed); the end of the header section is marked by a blank line, which is why the request above ends with \r\n\r\n.
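For reference, assuming the program is started with www.example.com as argv[1] and 80 as the port, the request it writes to the socket looks like this (every line is terminated by \r\n, and the final blank line marks the end of the header):

GET / HTTP/1.1
Host: www.example.com
Connection: keep-alive
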

Next, let's see whether the program actually works. First set up an HTTP server locally and put a small "Hello World" page on it.

What comes back first is the response header. The first line says that, over the HTTP/1.1 protocol, your request (from the client) returned 200 OK. The next line is the date, and the Server field shows Apache. Line 8, Content-Length: 86, means the HTML that follows is 86 bytes in total (don't believe it? count them). After that, let's look at what comes back when we request the blog home page instead.
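A response header of the kind described above would look roughly like this (only the status line, the Server field, and Content-Length: 86 come from the description; the Date value and the omitted lines are placeholders):

HTTP/1.1 200 OK
Date: ...
Server: Apache
...
Content-Length: 86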

When I run the program, a problem sometimes occurs: the downloaded page is incomplete. I don't know why, nor why changing BUF_SIZE to 512 makes a difference. Another problem is that I cannot fetch www.baidu.com; perhaps because I am sending too few request headers?

References:

http://www.cnblogs.com/coser/archive/2012/06/29/2570535.html

http://blog.csdn.net/gueter/article/details/1524447

Original address:

http://www.cnblogs.com/wunaozai/p/3900134.html
