Back up the CSDN blog body to a local archive

An old friend of mine had a new idea but no technology to build it. Fortunately he thought of me, so I promised to help him for free. It is a cloud-based project whose details I will not disclose here. While implementing it, though, I realized that one module could be put to my own use, so I extracted it: its job is to download the articles of a CSDN blog to a local machine.
I only finished a single template over the holidays. Crude as it is, it meets my own needs: I am afraid that one day I will no longer be able to reach some of the thoughts I posted on the Internet. In fact this has already happened many times. A forgotten account, an article deleted by an administrator, a closed blog, a site no longer maintained, and the writing is gone. So I have had to periodically copy the blogs I registered on various sites into local documents, including NetEase, Baidu, CSDN, and 51CTO, plus my wife's QQ space (I always publish through my wife's QQ account, since it is more convenient). So many sites, so many posts: the workload is considerable, the local archive has grown messier over time, and many articles have been missed. For the CSDN blog in particular, I have always wanted to dump the whole thing to a local machine; copying it article by article is out of the question because there are far too many posts. I also tried whole-site download tools, but the results were unsatisfactory. So I took this opportunity to reuse the gadget someone had entrusted to me for exactly this purpose. The satisfying part is that every archived article has the irrelevant content cut out, such as friendship links, visit counters, and advertisements; only the body text and its images are retained.
At first I implemented it by hand in C++, but that meant parsing the tags of the HTML document myself, which amounts to complicated string parsing. String parsing may well be called the essence of programming, but for a real project it is better to use an existing parsing library. Later I found scripting languages easier, python or Perl, even grep/awk/sed will do, but character encoding was always a big problem. On the advice of a very sharp colleague I met the Java library htmlparser, which is remarkably convenient: it abstracts the elements of an HTML document into classes, so filtering is easy to implement. Better still, you do not even have to write the filtering yourself; htmlparser ships with it, and all you need to do is override a few methods. With that, a little code is enough to download the whole blog. After the download is complete, you may want to save it as a separate PDF document. Java could do that too, but there are already plenty of ready-made tools for it, and not programming what already exists is always the right call.
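Before the full listing, here is a minimal sketch of how the library feels (my own illustration, not the author's code; the URL is a placeholder, and as noted below CSDN may reject clients that do not present a browser User-Agent). It lists every link on a page with a single filter:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;

    public class filterdemo {
        public static void main(String[] args) throws Exception {
            Parser parser = new Parser("http://blog.csdn.net/dog250"); // placeholder URL
            /* keep only <a> elements; htmlparser does the walking and filtering */
            NodeList links = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
            for (int i = 0; i < links.size(); i++) {
                LinkTag link = (LinkTag) links.elementAt(i);
                System.out.println(link.getLinkText() + " -> " + link.getLink());
            }
        }
    }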
First, let's take a look at the effect and then look at the code.
First, take a look at the generated index file:

[Figure: the local archive's index file]
Expand a month's archive directory and you will see that month's collection of articles; each article comes with a directory of its own:

[Figure: a month's archive directory, expanded]
The directory that accompanies each article holds all of the images the article contains; if the article has no images, the directory is empty:

[Figure: an article's image directory]
Open any article and you will see its title and body. All of its images are present, each linked to the corresponding file in the article's _files directory:

[Figure: an archived article, with images resolved from the local _files directory]
index.html displays the following:

[Figure: index.html rendered in a browser]
Clicking an entry in the index jumps to the corresponding article.
If you open index.html, or the .html file of any article, in a text editor, you will see that most of the links have been rewritten to local relative paths and a lot of irrelevant content has been deleted. The modification itself is simple and could be done by hand, though a program makes it easier still. The catch is that when writing the program is more trouble than editing by hand, the programming is meaningless; fortunately, htmlparser makes this easy, it is not complicated at all, so here the programming makes sense.
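To illustrate that rewriting step, here is a minimal sketch (again my own, not the author's exact code; the "<title>_files" path scheme is assumed from the layout described above) that uses htmlparser to point every <img> tag of an article at its local image directory:

    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.ImageTag;
    import org.htmlparser.util.NodeList;

    public class linkrewrite {
        /* Rewrite every <img> in the given HTML so it points into the article's
           local "<title>_files" directory, and return the modified page. */
        static String localizeimages(String html, String title) throws Exception {
            Parser parser = Parser.createParser(html, "UTF-8");
            NodeList page = parser.parse(null); // null filter: keep the whole node tree
            NodeList images = page.extractAllNodesThatMatch(new NodeClassFilter(ImageTag.class), true);
            for (int i = 0; i < images.size(); i++) {
                ImageTag img = (ImageTag) images.elementAt(i);
                String url = img.getImageURL();
                String name = url.substring(url.lastIndexOf('/') + 1); // keep just the file name
                img.setImageURL(title + "_files/" + name);
            }
            return page.toHtml(); // the ImageTag objects are shared with the tree, so the rewrite shows up here
        }
    }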
The code is very simple, basically a few blocks:
1. Traversal, three times over: traverse the month archives from the homepage, traverse the articles from each month's archive, and traverse the images from each article;
2. Parsing of key information, such as an article's title and its images, which fills in the data structures; this is done with filters;
3. Generation of the index from the information the filters filled in as a side effect (a sketch of this trick follows the list).
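Here is a minimal sketch of that filter-with-side-effects idea (hypothetical names and URL; the author's real filters appear in the listing below): as htmlparser walks the page, the filter records every image URL it accepts into a list that later steps can read.

    import org.htmlparser.Node;
    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.tags.ImageTag;
    import java.util.ArrayList;
    import java.util.List;

    public class sideeffectdemo {
        public static void main(String[] args) throws Exception {
            final List<String> imageurls = new ArrayList<String>(); // filled as a side effect
            Parser parser = new Parser("http://blog.csdn.net/dog250"); // placeholder URL
            parser.parse(new NodeFilter() {
                public boolean accept(Node node) {
                    if (node instanceof ImageTag) {
                        imageurls.add(((ImageTag) node).getImageURL()); // the side effect
                        return true;
                    }
                    return false;
                }
            });
            System.out.println(imageurls.size() + " images found");
        }
    }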

Note that the code below is completely procedural and uses none of Java's OO features: the data and methods are all static and no objects are created. All I wanted was the htmlparser API plus the niceties of a Java IDE, such as auto-completion of methods and method parameters. Veterans with formal CS training may counter, quite seriously, that C/C++ IDEs support the same feature; to that I can only reply that even assembly language can be auto-completed... In addition, there is a lot of hard-coding that really should be abstracted away or turned into variables; that is only because the tool is for my own use and I am not planning to maintain it. Finally, and this is the biggest problem, the code still has bugs, such as poor support for titles containing strange characters (a possible patch is sketched after the listing) and the lack of error logging (which is important). The code is as follows:

    import org.htmlparser.Node;
    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.filters.TagNameFilter;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.tags.*;
    import java.io.*;
    import java.net.*;
    import java.nio.*;
    import java.util.List;
    import javax.management.*;

    /* Using "test" as a class name is not standard, but it is a casually chosen name anyway. */
    public class test {
        /* list of month-name / month-archive-URL pairs */
        final static AttributeList indexlist = new AttributeList();
        /* list of article-name / article-URL pairs for the current month */
        final static AttributeList articlelist = new AttributeList();
        /* list of local-archive-path / URL pairs for the images of the current article */
        final static AttributeList resourcelist = new AttributeList();
        /* list of months and the local archive paths of each month's articles, used to generate the index */
        static AttributeList monthlist = new AttributeList();
        /* writer used to generate the local archive index */
        static OutputStreamWriter index_handle = null;
        static String proxy_addr = null;
        static int proxy_port = 3128;

        /**
         * @param url  the URL to fetch
         * @param type 1 for text, 0 for a byte array
         */
        public static byte[] getcontent(String url, int type) {
            byte ret[] = null;
            try {
                HttpURLConnection conn = null;
                InputStream urlstream = null;
                URL surl = new URL(url);
                int j = -1;
                if (proxy_addr != null) {
                    InetSocketAddress soa = new InetSocketAddress(InetAddress.getByName(proxy_addr), proxy_port);
                    Proxy proxy = new Proxy(Proxy.Type.HTTP, soa);
                    conn = (HttpURLConnection) surl.openConnection(proxy);
                } else {
                    conn = (HttpURLConnection) surl.openConnection();
                }
                /* Must identify as a Mozilla browser, otherwise CSDN rejects the connection. */
                conn.setRequestProperty("User-Agent", "Mozilla/4.0");
                conn.connect();
                urlstream = conn.getInputStream();
                if (type == 1) {
                    String stotalstring = "";
                    BufferedReader reader = new BufferedReader(new InputStreamReader(urlstream, "UTF-8"));
                    CharBuffer vv = CharBuffer.allocate(1024);
                    while ((j = reader.read(vv.array())) != -1) {
                        stotalstring += new String(vv.array(), 0, j);
                        vv.clear();
                    }
                    stotalstring = stotalstring.replace('\n', ' ');
                    stotalstring = stotalstring.replace('\r', ' ');
                    ret = stotalstring.getBytes();
                } else {
                    ByteBuffer vv = ByteBuffer.allocate(1024);
                    /* the maximum image size CSDN allows */
                    ByteBuffer buffer = ByteBuffer.allocate(5000000);
                    while ((j = urlstream.read(vv.array())) != -1) {
                        buffer.put(vv.array(), 0, j);
                        vv.clear();
                    }
                    ret = buffer.array();
                }
            } catch (Exception e) {
                e.printStackTrace(); // TODO: append to an error log
            }
            return ret;
        }

        /**
         * @param path    file path
         * @param content byte array of file content
         * @return success or failure
         */
        public static boolean writefile(String path, byte[] content) {
            try {
                FileOutputStream osw = new FileOutputStream(path);
                osw.write(content);
                osw.close();
            } catch (Exception e) {
                e.printStackTrace(); // TODO: append to an error log
                return false;
            }
            return true;
        }

        /**
         * @param path directory path
         * @return success or failure
         */
        public static boolean mkdir(String path) {
            try {
                File fp = new File(path);
                if (!fp.exists()) {
                    fp.mkdir();
                }
            } catch (Exception e) {
                e.printStackTrace(); // TODO: append to an error log
                return false;
            }
            return true;
        }

        /**
         * @param path     local archive path
         * @param url      URL of the article on the blog
         * @param articles list of the articles archived this month
         */
        public static void handlehtml(String path, String url, AttributeList articles) {
            try {
                StringBuffer text = new StringBuffer();
                NodeList nodes = handletext(new String(getcontent(url, 1)), 3);
                Node node = nodes.elementAt(0);
                String title = (String) ((List<Attribute>) resourcelist.asList()).get(0).getValue();
                String filepath = path + "/" + title;
                List<Attribute> li = resourcelist.asList();
                /* add meta information */
                text.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/>");
                text.append(
                /* ... the listing breaks off here in the source; handletext() and the
                   rest of handlehtml() were lost when the article was scraped ... */
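As for the strange-characters bug mentioned above, one possible patch (my suggestion, not part of the original listing) is to sanitize a title before it is used to build a path such as path + "/" + title:

    /* Replace characters that are illegal in file names with '_'.
       Hypothetical helper, not in the original code. */
    static String safetitle(String title) {
        return title.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
    }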


So what if you want to use the code above to back up your own CSDN blog? Simply change dog250 to your own ID. I commented out a large part of the main method; you can also expand it, and then it becomes general-purpose. Give it a try.
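For orientation, here is a minimal sketch of what such an entry point might look like (hypothetical: the original main method is largely commented out and not reproduced above; only getcontent() and mkdir() come from the listing):

    public static void main(String[] args) {
        String blogid = "dog250";            // change this to your own CSDN ID
        String root = "./" + blogid;         // assumed local archive root
        mkdir(root);
        /* Fetch the blog homepage as text; the month archives and articles
           would then be traversed from here with the filters described above. */
        byte[] home = getcontent("http://blog.csdn.net/" + blogid, 1);
        System.out.println(home == null ? "fetch failed" : "fetched " + home.length + " bytes");
    }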


 
