Socket network programming - web crawler (1)

Let's continue this series with web crawlers. A web crawler is a very important part of a search engine system: it collects web pages and information from the Internet, and those pages are then indexed to support searches. The crawler therefore determines how rich and how up-to-date the engine's content is, and its performance directly affects the performance of the search engine. The basic working principle of a web crawler is:

(1) Select a URL from an initial URL set and download the page it points to;
(2) Parse the page, extract the URLs it contains, and add them to the URL set;
(3) Repeat the first two steps until the crawler reaches some stopping criterion (a rough sketch of this loop follows below).
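Purely as an illustration of those three steps, here is a rough sketch of the crawl loop in C. The queue, the download_page and extract_urls helpers, and the MAX_PAGES limit are hypothetical placeholders (stubbed out so the sketch compiles), not part of the program developed in this article; they only show how the steps fit together.

#include <stdio.h>
#include <string.h>

#define MAX_URLS  1024
#define MAX_PAGES 100          /* hypothetical stopping criterion */

/* Hypothetical placeholder: a real crawler would do an HTTP GET here. */
static int download_page(const char *url, char *page, size_t size)
{
    (void)url; (void)size;
    page[0] = '\0';
    return 0;                   /* pretend the download succeeded */
}

/* Hypothetical placeholder: a real crawler would parse href targets here. */
static int extract_urls(const char *page, char urls[][256], int max)
{
    (void)page; (void)urls; (void)max;
    return 0;                   /* pretend no new links were found */
}

int main(void)
{
    static char queue[MAX_URLS][256];
    int head = 0, tail = 0, crawled = 0;
    char page[64 * 1024];

    strcpy(queue[tail++], "http://www.example.com/");    /* initial URL set */

    while (head < tail && crawled < MAX_PAGES) {
        const char *url = queue[head++];                  /* (1) pick a URL */
        if (download_page(url, page, sizeof(page)) != 0)  /*     and download it */
            continue;

        char found[32][256];
        int n = extract_urls(page, found, 32);            /* (2) parse out new URLs */
        for (int i = 0; i < n && tail < MAX_URLS; i++)
            strcpy(queue[tail++], found[i]);              /*     add them to the set */

        crawled++;                                        /* (3) repeat until the limit */
    }
    printf("crawled %d page(s)\n", crawled);
    return 0;
}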

Of course, having only just learned network programming, I cannot tackle anything that complicated yet. In this section we will simply use C to download a single web page.

Download a webpage

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define BUF_SIZE 4096

int main(int argc, char *argv[])
{
    struct sockaddr_in servaddr;
    struct hostent *host;
    int sockfd;
    char sendbuf[BUF_SIZE], recvbuf[BUF_SIZE];
    int sendsize, recvsize;

    host = gethostbyname(argv[1]);
    if (host == NULL)
    {
        perror("DNS resolution failed");
        exit(1);
    }
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr = *((struct in_addr *)host->h_addr);
    servaddr.sin_port = htons(atoi(argv[2]));
    bzero(&(servaddr.sin_zero), 8);

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd == -1)
    {
        perror("socket creation failed");
        exit(1);
    }

    if (connect(sockfd, (struct sockaddr *)&servaddr, sizeof(struct sockaddr_in)) == -1)
    {
        perror("connect failed");
        exit(1);
    }

    /* Construct an HTTP request; the header section ends with a blank line */
    sprintf(sendbuf, "GET / HTTP/1.1\r\nHost: %s\r\nConnection: keep-alive\r\n\r\n", argv[1]);
    if ((sendsize = send(sockfd, sendbuf, strlen(sendbuf), 0)) == -1)
    {
        perror("send failed");
        exit(1);
    }

    /* Get the HTTP response and print it as it arrives */
    memset(recvbuf, 0, sizeof(recvbuf));
    while ((recvsize = recv(sockfd, recvbuf, BUF_SIZE - 1, 0)) > 0)
    {
        printf("%s", recvbuf);
        memset(recvbuf, 0, sizeof(recvbuf));
    }

    close(sockfd);
    return 0;
}
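Assuming the source is saved as crawler.c (the file name is my own choice, not from the original post), it can be compiled with gcc -o crawler crawler.c and tried with ./crawler www.example.com 80, where argv[1] is the host name and argv[2] is the port.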

For how the HTTP request above is built, you can refer to material online; my blog also has a brief introduction (understanding the HTTP protocol: http://www.cnblogs.com/wunaozai/p/3733432.html).

An HTTP request consists of three parts: the request line, the message headers, and the request body. Let me briefly go through the request used above. First, GET / HTTP/1.1 means we use the GET method to fetch / (the root directory) over HTTP/1.1 (1.1 has been in use for a long time, and 2.0 is already on the way). Next is Host: the Internet host (and port) to which the request for the resource is sent; the default port is 80. And that is all a minimal HTTP request needs; nothing else has to be written, so it really is quite simple. Other header fields will be explained when they are used. One more note: each header field occupies one line, and each line is terminated by \r\n (carriage return plus line feed); the end of the header section is marked by a blank line, which is why the request above ends with \r\n\r\n.
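For reference, assuming the program is started with www.example.com as argv[1] and 80 as the port, the request it writes to the socket looks like this (every line is terminated by \r\n, and the final blank line marks the end of the header):

GET / HTTP/1.1
Host: www.example.com
Connection: keep-alive
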

Next, let's see whether the program actually works. First set up an HTTP server locally and put a small "Hello World" page on it.

What comes back first is the response header. The first line says that, over the HTTP/1.1 protocol, your request (from the client) returned 200 OK. The next line is the date, and the Server field shows Apache. Line 8, Content-Length: 86, means the HTML that follows is 86 bytes in total (don't believe it? count them). After that, let's look at what comes back when we request the blog home page instead.
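A response header of the kind described above would look roughly like this (only the status line, the Server field, and Content-Length: 86 come from the description; the Date value and the omitted lines are placeholders):

HTTP/1.1 200 OK
Date: ...
Server: Apache
...
Content-Length: 86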

When I run the program, a problem sometimes occurs: the downloaded page is incomplete. I don't know why, nor why changing BUF_SIZE to 512 makes a difference. Another problem is that I cannot fetch www.baidu.com; perhaps because I am sending too few request headers?

References:

http://www.cnblogs.com/coser/archive/2012/06/29/2570535.html

http://blog.csdn.net/gueter/article/details/1524447

Original address:

http://www.cnblogs.com/wunaozai/p/3900134.html
