Write a CSDN blog backup tool yourself: source code analysis of blogspider (1)


Author: gzshun. Original work. Please indicate the source when reprinting!
Source: http://blog.csdn.net/gzshun

The previous post, "Write a CSDN blog backup tool yourself: blogspider", introduced how to use blogspider. It is easy to use: blogspider can download your own CSDN blog to the local machine. Only the most basic functions are provided there. Over the past two days, many friends have emailed me asking for the source code of blogspider; the program is open source, so if you need it, just leave your contact information.
Today we look at the source code of blogspider. The core of it is how to request the web page files we need from the website server. Java already provides networking packages with the HTTP protocol built in, which makes an implementation much easier. Recently, during the Spring Festival rush, everyone was buying tickets on the 12306 website, and a ticket-grabbing tool appeared online: written in Java, packaged as a Google Chrome plug-in. That tool was actually written by a former colleague of mine; we all benefited from it and got tickets home for the Chinese New Year. My thanks to him here.
I learned from that Java programmer that the tool's principle is very simple. The steps are:
1. Access the website to obtain the page;
2. Receive the response message from the website server;
3. Submit a message to the server based on the selected options (hard seat, hard sleeper, etc.);
4. Obtain the result from the website.

There are two basic operations: the GET method and the POST method.
GET method: download web page content from the website server. For example, when a browser shows the news and images of the CSDN site, they have all been fetched (GET) from the website server to the local machine;
POST method: submit local data to the website server. For example, when you finish writing an article on your CSDN blog and click "post", all the information in the article is sent to the CSDN server.
blogspider mainly downloads, so it uses the GET method. Written in C this is relatively low-level, and these basic facilities all have to be implemented by hand; it is worth comparing with how an object-oriented language would do it.

Enough talk; on to the source code:

1. First, the debug macro used in the code:

/* Debug program macro */
#if 0
#define SPIDER_DEBUG
#endif

2. Some macro definitions from the code. They involve the syntax of HTML files, but the program does not need an HTML parser, only the most basic string handling:

#define BUFSIZE          1024
#define HTML_ARTICLE     ("<span class=\"link_title\">")
#define HTML_MULPAGE     ("class=\"pagelist\"")
#define BLOG_NEXT_LIST   ("article/list")
#define BLOG_TITLE       ("title=\"")
#define BLOG_HREF        ("<a href=\"")
#define BLOG_DATE        ("<span class=\"link_postdate\">")
#define BLOG_READ        ("<span class=\"link_view\"")
#define BLOG_COMMENT     ("<span class=\"link_comments\"")
#define BLOG_SPAN_HEAD   ("<span>")
#define BLOG_SPAN_END    ("</span>")
#define BLOG_RANK        ("blog_rank")
#define BLOG_LI          ("<li>")
#define BLOG_INDEX       ("index.html")
#define CSDN_BLOG_URL    ("http://blog.csdn.net")
#define CSDN_BLOG_HOST   ("blog.csdn.net")
#define CSDN_BLOG_PORT   (80)
#define BLOG_LOCK        (10)
#define BLOG_UNLOCK      (11)
#define BLOG_DOWNLOAD    (20)
#define BLOG_UNDOWNLOAD  (21)

BLOG_LOCK and BLOG_UNLOCK above are processing locks for the crawler linked list, reserved for future extension and unused for now. I wanted to process the list with multiple threads, but testing showed that contention caused connect timeouts; I will try again after the New Year.

3. Next, the structures: the crawler linked list node and the structure that stores the blog's basic information. There are quite a few fields; some are unused and reserved for later:

typedef struct tag_blog_info {
    char *b_url;          /* URL */
    char *b_host;         /* website server host name */
    char *b_page_file;    /* page file name */
    char *b_local_file;   /* name of the locally saved file */
    char *b_title;        /* blog title */
    char *b_date;         /* blog posting date */
    int   b_port;         /* URL port number */
    int   b_sockfd;       /* network socket */
    int   b_reads;        /* read count */
    int   b_comments;     /* comment count */
    int   b_download;     /* download status */
    int   b_lock;         /* handle lock */
    int   b_seq_num;      /* sequence number */
} blog_info;

typedef struct tag_blog_spider {
    blog_info *blog;
    struct tag_blog_spider *next;
} blog_spider;

typedef struct tag_blog_rank {
    int   b_page_total;   /* total number of blog pages */
    char *b_title;        /* blog title */
    char *b_page_view;    /* blog page views */
    char *b_integral;     /* blog points */
    char *b_ranking;      /* blog ranking */
    char *b_original;     /* number of original articles */
    char *b_reship;       /* number of reposted articles */
    char *b_translation;  /* number of translated articles */
    char *b_comments;     /* number of comments */
} blog_rank;

4. Using global variables in a program is not best practice; they have both advantages and disadvantages:

1. Advantage: simple to use, with no need to pass many function parameters;
2. Disadvantage: harder to maintain and less readable. This program therefore uses only three global variables.

/* global variables */
static int g_seq_num = 0;
static char csdn_id[255];
static struct hostent *web_host;

The web_host variable saves the host information for "blog.csdn.net"; its first address, web_host->h_addr_list[0], is used when initializing the socket.

5. The program defines the following functions:

static char *strrstr(const char *s1, const char *s2); /* find s2 in s1 and return the address of its last occurrence */
static char *strfchr(char *s);                        /* filter out irregular characters in s */
static int   init_spider(blog_spider **spider);       /* initialize a crawler node; a double pointer is required */
static int   init_rank(blog_rank **rank);             /* initialize the structure storing the blog's basic information */
static void  insert_spider(blog_spider *spider_head, blog_spider *spider); /* insert a node into the crawler list */
static int   spider_size(blog_spider *spider_head);   /* compute the length of the crawler list */
static void  print_spider(blog_spider *spider_head);  /* print the crawler list into a *.log file in the current directory */
static void  print_rank(blog_rank *rank);             /* print the basic blog information */
static void  free_spider(blog_spider *spider_head);   /* free the crawler list */
static void  free_rank(blog_rank *rank);              /* free the basic blog information */
static int   get_blog_info(blog_spider *spider_head, blog_rank *rank); /* get the blog title, total pages, points and ranking from the home page */
static int   analyse_index(blog_spider *spider_head); /* analyse the blogs on each page and add them to the crawler list */
static int   download_index(blog_spider *spider_head); /* download the blog index pages */
static int   download_blog(blog_spider *spider);      /* download one blog post */
static int   get_web_host(const char *hostname);      /* get the host information of "blog.csdn.net" */
static int   connect_web(const blog_spider *spider);  /* initialize the socket and connect to the website server */
static int   send_request(const blog_spider *spider); /* send a request to the website server */
static int   recv_response(const blog_spider *spider); /* receive the response message from the website server */

6. First, the two string-processing functions from the list above. These are a little neat.

/**********************************************************
 * strrstr: search for the specified string and return the
 * address of its last occurrence; implemented by hand
 **********************************************************/
static char *strrstr(const char *s1, const char *s2)
{
    int len2;
    char *ps1;

    if (!(len2 = strlen(s2))) {
        return (char *)s1;
    }

    ps1 = (char *)s1 + strlen(s1) - 1;
    ps1 = ps1 - len2 + 1;

    while (ps1 >= s1) {
        if ((*ps1 == *s2) && (strncmp(ps1, s2, len2) == 0)) {
            return (char *)ps1;
        }
        ps1--;
    }

    return NULL;
}

/**********************************************************
 * strfchr: search for irregular characters in the string
 * and truncate there; such characters must not appear in
 * a saved file name
 **********************************************************/
static char *strfchr(char *s)
{
    char *p = s;

    while (*p) {
        if (('/' == *p) || ('?' == *p)) {
            *p = 0;
            strcat(s, "XXX");
            return p;
        }
        p++;
    }

    return NULL;
}

To quote Xingye (Stephen Chow): "Kung fu is definitely suitable for men, women, and children alike. All the killing is just a misunderstanding of it. Kung fu is an art, an unyielding spirit. So I have been looking for ways to repackage kung fu, so that all of you can gain a deeper understanding of it."
Just relax and continue:

7. Initializing the crawler linked list. I split the work into many small, independent functions; this improves readability instead of piling everything into main.

/**********************************************************
 * Initialize a node of the blog crawler linked list:
 * allocate the space and zero the fields
 **********************************************************/
static int init_spider(blog_spider **spider)
{
    *spider = (blog_spider *)malloc(sizeof(blog_spider));
    if (NULL == *spider) {
#ifdef SPIDER_DEBUG
        fprintf(stderr, "malloc: %s\n", strerror(errno));
#endif
        return -1;
    }

    (*spider)->blog = (blog_info *)malloc(sizeof(blog_info));
    if (NULL == (*spider)->blog) {
#ifdef SPIDER_DEBUG
        fprintf(stderr, "malloc: %s\n", strerror(errno));
#endif
        free(*spider);
        return -1;
    }

    (*spider)->blog->b_url        = NULL;
    (*spider)->blog->b_host       = strdup(CSDN_BLOG_HOST);
    (*spider)->blog->b_page_file  = NULL;
    (*spider)->blog->b_local_file = NULL;
    (*spider)->blog->b_title      = NULL;
    (*spider)->blog->b_date       = NULL;
    (*spider)->blog->b_port       = CSDN_BLOG_PORT;
    (*spider)->blog->b_sockfd     = 0;
    (*spider)->blog->b_reads      = 0;
    (*spider)->blog->b_comments   = 0;
    (*spider)->blog->b_download   = BLOG_UNDOWNLOAD;
    (*spider)->blog->b_lock       = BLOG_UNLOCK;
    (*spider)->blog->b_seq_num    = 0;

    (*spider)->next = NULL;

    return 0;
}

/**********************************************************
 * Initialize the structure holding the blog's basic
 * information, which contains:
 * 1. total number of blog pages   2. blog title
 * 3. blog page views              4. blog points
 * 5. blog ranking                 6. number of original articles
 * 7. number of reposted articles  8. number of translated articles
 * 9. number of comments
 **********************************************************/
static int init_rank(blog_rank **rank)
{
    *rank = (blog_rank *)malloc(sizeof(blog_rank));
    if (NULL == *rank) {
#ifdef SPIDER_DEBUG
        fprintf(stderr, "malloc: %s\n", strerror(errno));
#endif
        return -1;
    }

    (*rank)->b_page_total  = 0;
    (*rank)->b_title       = NULL;
    (*rank)->b_page_view   = NULL;
    (*rank)->b_integral    = NULL;
    (*rank)->b_ranking     = NULL;
    (*rank)->b_original    = NULL;
    (*rank)->b_reship      = NULL;
    (*rank)->b_translation = NULL;
    (*rank)->b_comments    = NULL;

    return 0;
}

8. Some crawler linked-list handling functions. They are quite simple, so here they are:

/**********************************************************
 * Insert a blog crawler node at the end of the crawler
 * linked list
 **********************************************************/
static void insert_spider(blog_spider *spider_head, blog_spider *spider)
{
    blog_spider *pspider;

    pspider = spider_head;
    while (pspider->next) {
        pspider = pspider->next;
    }
    pspider->next = spider;
}

/**********************************************************
 * Return the length of the crawler linked list
 **********************************************************/
static int spider_size(blog_spider *spider_head)
{
    int count = 0;
    blog_spider *pspider;

    pspider = spider_head;
    while (pspider->next) {
        pspider = pspider->next;
        count++;
    }

    return count;
}

This is getting a bit long; the rest will have to wait for the next article...

Stephen CHOW: What are you doing here?
Zhao Wei: I want to help you compete.
Stephen CHOW: How could you help? Go back to Mars; Earth is too dangerous.
