In the article "Back up CSDN blog posts to a local archive", the blog posts on CSDN were backed up with hard-coded rules. It works well, but many people have run into encoding problems, which are easy to solve. There are really only two encoding issues. First, if the browser's encoding is configured incorrectly, the page appears garbled in the browser and is saved garbled to disk. Second, if the operating system's encoding is configured incorrectly, the file names of the saved articles come out garbled. Configure both correctly and the problems disappear. I use the Chinese edition of Mac OS X 10.7 with Firefox and have hit no encoding problems; with Safari, however, you need to set the encoding to GB2312 or GB18030 manually.
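When saving programmatically rather than through the browser, the same class of problem can be avoided by naming the charset explicitly instead of relying on the platform default. A minimal sketch, assuming a hypothetical helper class `SaveWithCharset` (not part of the original tool), with GB18030 as an example charset:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveWithCharset {
    // Save page text with an explicit charset instead of the platform
    // default; a mismatched default is what produces garbled files.
    // GB18030 is only an example; use whatever the page declares.
    static void save(String content, Path target, Charset cs) throws IOException {
        try (Writer w = new OutputStreamWriter(Files.newOutputStream(target), cs)) {
            w.write(content);
        }
    }
}
```

Reading the file back with the same charset round-trips Chinese text intact.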
Besides encoding, the other problem is that the tool does not support blogs other than CSDN. This is not hard to fix: the general framework in the source code is already in place, and what remains is just to define some tags and callback methods. This is easy to do in Java, for example by defining a superclass like this:
abstract class Site {
    public String atl_name;
    public String atl_value;

    public abstract int ParseIMG(NodeList nlist, int index);
    public abstract int ParseTITLE(NodeList nlist, int index);
    public abstract int ParseAUTHOR(NodeList nlist, int index);
    public abstract int ParseMonthArticle(NodeList nlist, int index);
    public abstract int ParsePerArticle(NodeList nlist, int index);
    public abstract int ParsePAGE(NodeList nlist, int index);
}
Then, make an adjustment similar to the following in the source code:
public static int parseImg(NodeList nlist, int index, Site st) {
    return st.ParseIMG(nlist, index);
}
At the same time, modify the logic of the handleText method: code like the following should dispatch to different logic depending on the concrete subclass:
if (node instanceof Div)
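The change described above can be sketched as follows. The node classes are minimal stand-ins so the sketch is self-contained (the real code uses htmlparser's node types), and the names `SiteHandler`, `CsdnHandler`, and `classify` are illustrative, not the project's actual identifiers:

```java
// Hypothetical minimal stand-ins for htmlparser's node classes,
// defined here only so the sketch compiles on its own.
class Node {}
class Div extends Node {}
class ImageTag extends Node {}

// Move the per-site "what does this node mean" decision out of
// handleText and into a site-specific subclass.
abstract class SiteHandler {
    abstract String classify(Node node);
}

class CsdnHandler extends SiteHandler {
    @Override
    String classify(Node node) {
        if (node instanceof Div) return "article-body"; // this site wraps posts in divs
        if (node instanceof ImageTag) return "image";   // an embedded image to download
        return "ignore";                                 // everything else is skipped
    }
}
```

handleText then calls `classify` and acts on the result, so adding a new blog host means adding a new subclass rather than editing the shared logic.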
Finally, we can define a Site subclass for each website and implement the abstract methods accordingly. The key point is that we have abstracted several elements common to every blog site:
1. The blog title, such as "Dawn of Harvest"
2. The list of archives by month
3. The list of articles in each monthly archive
4. Each article within a monthly archive
5. Each image in an article
6. Pagination handling for lists that span multiple pages
As long as these six elements can be handled, we define a Site superclass that processes them and implement the corresponding methods for each blog site. Note that blogs which cannot be indexed by month are currently not supported.
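Putting the pieces together, a concrete subclass might look like the sketch below. NodeList is stubbed out so the example is self-contained; in the real project it would be htmlparser's org.htmlparser.util.NodeList, and each method would walk the nodes according to that site's HTML layout. The subclass name is hypothetical and the return values are placeholders:

```java
// Stub standing in for org.htmlparser.util.NodeList, only so this
// sketch compiles without the htmlparser jar.
class NodeList {}

abstract class Site {
    public String atl_name;
    public String atl_value;
    public abstract int ParseIMG(NodeList nlist, int index);
    public abstract int ParseTITLE(NodeList nlist, int index);
    public abstract int ParseAUTHOR(NodeList nlist, int index);
    public abstract int ParseMonthArticle(NodeList nlist, int index);
    public abstract int ParsePerArticle(NodeList nlist, int index);
    public abstract int ParsePAGE(NodeList nlist, int index);
}

// Hypothetical subclass for one blog host; a real implementation would
// inspect the nodes instead of returning the index unchanged.
class CnblogsSite extends Site {
    @Override public int ParseIMG(NodeList n, int i)          { return i; }
    @Override public int ParseTITLE(NodeList n, int i)        { return i; }
    @Override public int ParseAUTHOR(NodeList n, int i)       { return i; }
    @Override public int ParseMonthArticle(NodeList n, int i) { return i; }
    @Override public int ParsePerArticle(NodeList n, int i)   { return i; }
    @Override public int ParsePAGE(NodeList n, int i)         { return i; }
}
```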
The whole process simply uses HtmlParser to parse the pages, then downloads them and saves them locally according to certain rules.
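For the "save according to certain rules" step, one possible rule that mirrors the by-month archive structure is sketched below. The layout and the helper name `articlePath` are assumptions for illustration, not the original code's exact behavior:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ArchiveLayout {
    // Place each article under <root>/<month>/<title>.html, replacing
    // characters that are illegal in file names on common systems so
    // the save step cannot fail on an odd title.
    static Path articlePath(Path root, String month, String title) {
        String safe = title.replaceAll("[\\\\/:*?\"<>|]", "_");
        return root.resolve(month).resolve(safe + ".html");
    }
}
```

With a layout like this, re-running the backup overwrites each article in place instead of scattering duplicates.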