Crawling Blog Posts with Java


Objective

I recently bought a personal domain name from a cloud provider and wanted to buy a server to build my own personal site. Since that required too much preparation, I put it on hold for now and decided to first use GitHub Pages to build a static site. The process was actually quite tortuous (mainly the domain name configuration was frustrating), but overall it went smoothly. The site address is https://chenchangyuan.cn (an empty blog for now; the style is nice, and I will build it up later).

The site is built with Git, npm, and Hexo, plus the corresponding configuration on GitHub. There are plenty of tutorials online; if you have any questions, feel free to ask in the comments.

I also worked in Java for a few years, but due to my job responsibilities at the company I gradually shifted, and now I mainly do front-end development.

So I wanted to use Java to crawl my articles, and then convert the crawled HTML into Markdown (not yet implemented; guidance from fellow students is welcome).

1. Get all the URLs of your personal blog posts

The blog list page address is https://www.cnblogs.com/ccylovehs/default.html?page=1

Traverse the list pages according to the number of blog posts you have written.

Store each post's detail-page URL in a Set collection to prevent duplicates. A detail-page address looks like https://www.cnblogs.com/ccylovehs/p/9547690.html
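The deduplication relied on here is just the standard behaviour of java.util.TreeSet: adding the same URL twice keeps a single copy, and iteration proceeds in sorted order. A minimal illustration (the URLs are examples in the blog's address format):

```java
import java.util.Set;
import java.util.TreeSet;

public class DedupDemo {
    public static void main(String[] args) {
        Set<String> urls = new TreeSet<String>();
        urls.add("https://www.cnblogs.com/ccylovehs/p/9547690.html");
        urls.add("https://www.cnblogs.com/ccylovehs/p/9547690.html"); // duplicate, silently ignored
        urls.add("https://www.cnblogs.com/ccylovehs/p/9123456.html");
        System.out.println(urls.size()); // prints 2
    }
}
```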

2. Generate an HTML file from each detail-page URL

Iterate through the Set collection, generating the HTML files one by one.

Files are stored in the C://data//blog directory; the file name is taken from capture group 1 of the URL pattern.
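To show how capture group 1 yields the file name, here is a small standalone sketch using java.util.regex with the same detail-page pattern: the parenthesized group picks out just the numeric id plus the .html suffix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileNameDemo {
    public static void main(String[] args) {
        // Capture group 1 wraps the "<id>.html" tail of a detail-page URL
        Pattern p = Pattern.compile("https://www.cnblogs.com/ccylovehs/p/([0-9]+.html)");
        Matcher m = p.matcher("https://www.cnblogs.com/ccylovehs/p/9547690.html");
        if (m.find()) {
            // group(1) is just the file name: "9547690.html"
            System.out.println(m.group(1));
        }
    }
}
```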

3. Code implementation
package com.blog.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author Jack Chen
 */
public class BlogUtil {
    /**
     * URL_PAGE: cnblogs list-page URL
     * URL_PAGE_DETAIL: detail-page URL pattern
     * PAGE_COUNT: number of list pages
     * urlLists: set of all detail-page URLs (prevents duplicates)
     * p: compiled matching pattern
     */
    public final static String URL_PAGE = "https://www.cnblogs.com/ccylovehs/default.html?page=";
    public final static String URL_PAGE_DETAIL = "https://www.cnblogs.com/ccylovehs/p/([0-9]+.html)";
    public final static int PAGE_COUNT = 3;
    public static Set<String> urlLists = new TreeSet<String>();
    public final static Pattern p = Pattern.compile(URL_PAGE_DETAIL);

    public static void main(String[] args) throws Exception {
        for (int i = 1; i <= PAGE_COUNT; i++) {
            getUrls(i);
        }
        for (Iterator<String> i = urlLists.iterator(); i.hasNext();) {
            createFile(i.next());
        }
    }

    /**
     * @param url
     * @throws Exception
     */
    private static void createFile(String url) throws Exception {
        Matcher m = p.matcher(url);
        m.find();
        String fileName = m.group(1);
        String prefix = "c://data//blog//";
        File file = new File(prefix + fileName);
        PrintStream ps = new PrintStream(file);

        URL u = new URL(url);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "utf-8"));
        String str;
        while ((str = br.readLine()) != null) {
            ps.println(str);
        }
        ps.close();
        br.close();
        conn.disconnect();
    }

    /**
     * @param idx
     * @throws Exception
     */
    private static void getUrls(int idx) throws Exception {
        URL u = new URL(URL_PAGE + idx);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "utf-8"));
        String str;
        while ((str = br.readLine()) != null) {
            if (str.contains("https://www.cnblogs.com/ccylovehs/p/")) {
                Matcher m = p.matcher(str);
                if (m.find()) {
                    System.out.println(m.group(1));
                    urlLists.add(m.group());
                }
            }
        }
        br.close();
        conn.disconnect();
    }
}
4. Conclusion

If you find this useful, please move your mouse and give me a star; your encouragement is my greatest motivation.

https://github.com/chenchangyuan/getHtmlForJava

Because I do not want to generate an MD file manually for each article, the next step is to batch-convert the HTML files into MD files in order to flesh out my personal blog. To be continued ~~~
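As a starting point for that next step, here is my own rough sketch of an HTML-to-Markdown conversion, not the author's implementation. It handles only a handful of tags (h1/h2, strong, a, p) with regexes; a real converter should use a proper HTML parser or a dedicated library, since regexes break on nested markup.

```java
public class HtmlToMdSketch {
    // Very rough HTML-to-Markdown conversion covering only a few common tags.
    // Assumes flat, well-formed markup; nested or malformed HTML will mangle the output.
    public static String convert(String html) {
        String md = html;
        md = md.replaceAll("(?s)<h1[^>]*>(.*?)</h1>", "# $1\n");
        md = md.replaceAll("(?s)<h2[^>]*>(.*?)</h2>", "## $1\n");
        md = md.replaceAll("(?s)<strong[^>]*>(.*?)</strong>", "**$1**");
        md = md.replaceAll("(?s)<a[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>", "[$2]($1)");
        md = md.replaceAll("(?s)<p[^>]*>(.*?)</p>", "$1\n");
        md = md.replaceAll("<[^>]+>", ""); // strip any remaining tags
        return md.trim();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>See <a href=\"https://example.com\">link</a></p>";
        System.out.println(convert(html));
    }
}
```

Each crawled file from step 2 could be read, run through convert, and written back out with an .md extension.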
