Write a CSDN blog backup tool by yourself: source code analysis of blogspider (3)


Author: gzshun. Original work. When reprinting, please indicate the source!
Source: http://blog.csdn.net/gzshun

 

Zhou Xingchi: You can't just let anyone cut your hair! Look, your hairstyle doesn't match your face or your figure at all. The figure and the hairstyle are completely at odds; they just don't go together!! Brother Huan! What are you doing? (Fortune Teller)

First of all, Happy New Year, everyone!
I have already written several articles in this series and shared the code with anyone who needs it. The previous articles are listed below, so you can jump to them right away:
Write a CSDN blog backup tool by yourself: blogspider

Write a CSDN blog backup tool by yourself: source code analysis of blogspider (1)

Write a CSDN blog backup tool by yourself: source code analysis of blogspider (2)

This article covers the most important part of blogspider: downloading the CSDN blog homepage, parsing it to extract the URL of every post, and adding each post to the linked list. Let's go!

1. Download the main blog page to a local index.html

To download a webpage to the local machine, the steps are:
Create a socket -> connect to the website server -> send a request -> receive the response -> save it locally
In code: connect_web -> send_request -> recv_response
Source code:

/************************************************************
 * Download the personal blog homepage
 ************************************************************/
static int download_index(blog_spider *spider_head)
{
	int ret;

	ret = connect_web(spider_head);
	if (ret < 0) {
		goto fail_download_index;
	}

	ret = send_request(spider_head);
	if (ret < 0) {
		goto fail_download_index;
	}

	ret = recv_response(spider_head);
	if (ret < 0) {
		goto fail_download_index;
	}

	close(spider_head->blog->b_sockfd);
	return 0;

fail_download_index:
	close(spider_head->blog->b_sockfd);
	return -1;
}

 

2. Establish a connection to the website server

First obtain the host information, and from it the IP address, for the host name "blog.csdn.net":

/************************************************************
 * Obtain host information based on the host name,
 * to get the IP address.
 ************************************************************/
static int get_web_host(const char *hostname)
{
	/* get the host IP */
	web_host = gethostbyname(hostname);
	if (NULL == web_host) {
#ifdef SPIDER_DEBUG
		fprintf(stderr, "gethostbyname: %s\n", strerror(errno));
#endif
		return -1;
	}

#ifdef SPIDER_DEBUG
	printf("IP: %s\n",
	       inet_ntoa(*((struct in_addr *)web_host->h_addr_list[0])));
#endif

	return 0;
}

Then initialize the socket and connect to the website server:

/************************************************************
 * Initialize the socket, and connect to the website server
 ************************************************************/
static int connect_web(const blog_spider *spider)
{
	int ret;
	struct sockaddr_in server_addr;

	/* init socket */
	spider->blog->b_sockfd = socket(AF_INET, SOCK_STREAM, 0);
	if (spider->blog->b_sockfd < 0) {
#ifdef SPIDER_DEBUG
		fprintf(stderr, "socket: %s\n", strerror(errno));
#endif
		return -1;
	}

	memset(&server_addr, 0, sizeof(server_addr));
	server_addr.sin_family = AF_INET;
	server_addr.sin_port   = htons(spider->blog->b_port);
	server_addr.sin_addr   = *((struct in_addr *)web_host->h_addr_list[0]);

	ret = connect(spider->blog->b_sockfd,
	              (struct sockaddr *)&server_addr, sizeof(server_addr));
	if (ret < 0) {
#ifdef SPIDER_DEBUG
		fprintf(stderr, "connect: %s\n", strerror(errno));
#endif
		return -1;
	}

	return 0;
}

3. Send a request to the website server

The HTTP protocol has two important methods: GET and POST. blogspider uses GET. The request sent to the website server looks like this:

GET %s HTTP/1.1\r\n
Accept: */*\r\n
Accept-Language: zh-CN\r\n
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\r\n
Host: %s:%d\r\n
Connection: Close\r\n
GET is followed by the path of the requested file; the rest are basic request headers. The header ends with a blank line, so the program can treat the consecutive "\r\n\r\n" sequence as the end mark. For more details on the HTTP protocol, search the internet.

Source code:

/************************************************************
 * Send a request to the website server
 ************************************************************/
static int send_request(const blog_spider *spider)
{
	int ret;
	char request[BUFSIZE];

	memset(request, 0, sizeof(request));
	sprintf(request,
		"GET %s HTTP/1.1\r\n"
		"Accept: */*\r\n"
		"Accept-Language: zh-CN\r\n"
		"User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\r\n"
		"Host: %s:%d\r\n"
		"Connection: Close\r\n"
		"\r\n",
		spider->blog->b_page_file,
		spider->blog->b_host,
		spider->blog->b_port);

	/* send only the request text, not the whole buffer */
	ret = send(spider->blog->b_sockfd, request, strlen(request), 0);
	if (ret < 0) {
#ifdef SPIDER_DEBUG
		fprintf(stderr, "send: %s\n", strerror(errno));
#endif
		return -1;
	}

#ifdef SPIDER_DEBUG
	printf("request:\n%s\n", request);
#endif

	return 0;
}

Zhou Xingchi: Sweeping the floor is just my cover; my real identity is a researcher (a student). (Shaolin Soccer)
Just relax and carry on...

4. Receive response messages

Once the request has been sent to the website server, the response must be received locally. Because the network can be slow, receiving the response header and body may take a while. The code therefore uses the select function together with an fd_set: when the socket becomes readable, it reads the message and saves it to a local file.

/************************************************************
 * Accept the response from the website server: the content of
 * the requested file.  Both the request sent to the server and
 * the server's response header end with a blank line, so
 * "\r\n\r\n" can be used as the end mark.
 *
 * select:
 * int select(int maxfdp1, fd_set *readset, fd_set *writeset,
 *            fd_set *exceptset, const struct timeval *timeout);
 * returns: >0 ready, -1 error, 0 timeout
 *
 * void FD_ZERO(fd_set *fdset);          // clear all bits in fdset
 * void FD_SET(int fd, fd_set *fdset);   // turn on the bit for fd
 * void FD_CLR(int fd, fd_set *fdset);   // turn off the bit for fd
 * int  FD_ISSET(int fd, fd_set *fdset); // is the bit for fd on?
 ************************************************************/
static int recv_response(const blog_spider *spider)
{
	int ret, end, count;
	char recvbuf[BUFSIZE];
	fd_set read_fds;
	struct timeval timeout;
	FILE *fp;

	/* Use a fairly long timeout; a likely cause of select failing
	 * is that the response from the website simply took too long. */
	timeout.tv_sec  = 30;
	timeout.tv_usec = 0;

	while (1) {
		FD_ZERO(&read_fds);
		FD_SET(spider->blog->b_sockfd, &read_fds);
		/* select() may modify the timeout, so reset it each round */
		timeout.tv_sec  = 30;
		timeout.tv_usec = 0;

		ret = select(spider->blog->b_sockfd + 1,
		             &read_fds, NULL, NULL, &timeout);
		if (-1 == ret) {
			/* error: return immediately */
#ifdef SPIDER_DEBUG
			fprintf(stderr, "select: %s\n", strerror(errno));
#endif
			return -1;
		} else if (0 == ret) {
			/* timed out: mark this page as not downloaded */
#ifdef SPIDER_DEBUG
			fprintf(stderr, "select timeout: %s\n",
			        spider->blog->b_title);
#endif
			goto fail_recv_response;
		}

		/* receive data */
		if (FD_ISSET(spider->blog->b_sockfd, &read_fds)) {
			end   = 0;
			count = 0;

			/* fopen may fail on irregular file names such as
			 * "3/5": in Linux '/' denotes a directory */
			fp = fopen(spider->blog->b_local_file, "w+");
			if (NULL == fp) {
				goto fail_recv_response;
			}

			spider->blog->b_download = BLOG_DOWNLOAD;

			while (read(spider->blog->b_sockfd, recvbuf, 1) == 1) {
				if (end < 4) {
					if (recvbuf[0] == '\r' || recvbuf[0] == '\n') {
						end++;
					} else {
						end = 0;
					}
					/* these bytes belong to the HTTP response
					 * header; save them here if you need them */
				} else {
					fputc(recvbuf[0], fp);
					count++;
					if (1024 == count) {
						fflush(fp);
						count = 0; /* reset so we flush every 1 KB */
					}
				}
			}

			fclose(fp);
			break;
		}
	}

	return 0;

fail_recv_response:
	spider->blog->b_download = BLOG_UNDOWNLOAD;
	return -1;
}

5. Get the URL, posting date, read count and comment count of each CSDN blog post, and add them to the spider linked list

/************************************************************
 * Analyse the personal blog homepage: obtain the URLs of all
 * articles and add each blog's information to the spider
 * linked list.
 ************************************************************/
static int analyse_index(blog_spider *spider_head)
{
	FILE *fp;
	int ret;
	int len;
	int reads, comments;
	char *posa, *posb, *posc, *posd;
	char line[BUFSIZE*4] = {0};
	char tmpbuf[BUFSIZE] = {0};
	char tmpbuf2[BUFSIZE] = {0};
	char page_file[BUFSIZE] = {0};
	char url[BUFSIZE] = {0};
	char title[BUFSIZE] = {0};
	char date[BUFSIZE] = {0};

	fp = fopen(spider_head->blog->b_local_file, "r");
	if (fp == NULL) {
#ifdef SPIDER_DEBUG
		fprintf(stderr, "fopen: %s\n", strerror(errno));
#endif
		return -1;
	}

	while (1) {
		if (feof(fp)) {
			break;
		}

		/* search for a blog entry */
		while (fgets(line, sizeof(line), fp)) {
			posa = strstr(line, HTML_ARTICLE);
			if (posa) {
				/* the blog URL */
				posa += strlen(HTML_ARTICLE) + strlen(BLOG_HREF);
				posb  = strchr(posa, '"');
				*posb = 0;
				memset(page_file, 0, sizeof(page_file));
				memset(url, 0, sizeof(url));
				strcpy(page_file, posa);
				sprintf(url, "%s%s", CSDN_BLOG_URL, posa);

				/* the blog title (same line as the address) */
				posb += 1;
				posc  = strstr(posb, BLOG_TITLE);
				posc += strlen(BLOG_TITLE);
				posd  = strstr(posc, "\">");
				*posd = 0;
				memset(title, 0, sizeof(title));
				strcpy(title, posc);

				/* the posting date */
				while (fgets(line, sizeof(line), fp)) {
					posa = strstr(line, BLOG_DATE);
					if (posa) {
						posa += strlen(BLOG_DATE);
						posb  = strstr(posa, BLOG_SPAN_END);
						*posb = 0;
						memset(date, 0, sizeof(date));
						strcpy(date, posa);
						break;
					}
				}

				/* the read count */
				while (fgets(line, sizeof(line), fp)) {
					posa = strstr(line, BLOG_READ);
					if (posa) {
						posa += strlen(BLOG_READ);
						posb  = strchr(posa, '(') + 1;
						posc  = strchr(posb, ')');
						*posc = 0;
						reads = atoi(posb);
						break;
					}
				}

				/* the comment count */
				while (fgets(line, sizeof(line), fp)) {
					posa = strstr(line, BLOG_COMMENT);
					if (posa) {
						posa += strlen(BLOG_COMMENT);
						posb  = strchr(posa, '(') + 1;
						posc  = strchr(posb, ')');
						*posc = 0;
						comments = atoi(posb);
						break;
					}
				}

				spider_head->blog->b_download = BLOG_DOWNLOAD;

				blog_spider *spider;
				ret = init_spider(&spider);
				if (ret < 0) {
					return -1;
				}

				spider->blog->b_page_file = strdup(page_file);
				spider->blog->b_url       = strdup(url);
				spider->blog->b_date      = strdup(date);
				spider->blog->b_reads     = reads;
				spider->blog->b_comments  = comments;
				spider->blog->b_seq_num   = ++g_seq_num;

				memset(tmpbuf, 0, sizeof(tmpbuf));
				sprintf(tmpbuf, "%d.%s",
				        spider->blog->b_seq_num, title);
				spider->blog->b_title = strdup(tmpbuf);

				memset(tmpbuf, 0, sizeof(tmpbuf));
				memset(tmpbuf2, 0, sizeof(tmpbuf2));
				strcpy(tmpbuf2, spider->blog->b_title);
				/* custom helper (defined elsewhere) that fixes
				 * characters illegal in file names, e.g. '/' */
				strfchr(tmpbuf2);
				sprintf(tmpbuf, "%s/%s.html", CSDN_ID, tmpbuf2);
				spider->blog->b_local_file = strdup(tmpbuf);

				/* insert the blog into the spider linked list */
				insert_spider(spider_head, spider);
				fputc('.', stdout);
			}
		}
	}

	fclose(fp);

#ifdef SPIDER_DEBUG
	printf("\nspider size = %d\n", spider_size(spider_head));
#endif

	return 0;
}

The code is clearly annotated, and reading the comments should be enough to follow it. The HTTP protocol involves a lot of knowledge points, and writing programs like this is good practice when you have spare time. blogspider is still not very efficient: when I have time I will add threads so that multiple blogs can be downloaded at the same time, improving throughput.

If you want the source code of blogspider, leave your email address and I will send it.
