Crawling Blog Posts with Java


Objective

I recently bought a personal domain name from a cloud provider and wanted to buy a server to build my own personal site. Since that required too much preparation, I put it on hold for now and decided to first use GitHub Pages to build a static site. The process was actually quite tortuous (mainly the domain name configuration was frustrating), but overall it went smoothly. The site address is https://chenchangyuan.cn (an empty blog for now; the style is nice, and I will build it up later).

The site is built with Git, npm, and Hexo, plus the corresponding configuration on GitHub. There are plenty of tutorials online; if you have any questions, feel free to ask in the comments.

I also worked in Java for a few years, but due to my job responsibilities at the company I gradually shifted, and now I mainly do front-end development.

So I wanted to use Java to crawl my articles, and then convert the crawled HTML into Markdown (not yet implemented; guidance from fellow students is welcome).

1. Get all the URLs of your personal blog posts

The blog list page address is https://www.cnblogs.com/ccylovehs/default.html?page=1

Traverse the list pages according to the number of blog posts you have written.

Store each post's detail-page URL in a Set collection to prevent duplicates. A detail-page address looks like https://www.cnblogs.com/ccylovehs/p/9547690.html
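The deduplication relied on here is just the standard behaviour of java.util.TreeSet: adding the same URL twice keeps a single copy, and iteration proceeds in sorted order. A minimal illustration (the URLs are examples in the blog's address format):

```java
import java.util.Set;
import java.util.TreeSet;

public class DedupDemo {
    public static void main(String[] args) {
        Set<String> urls = new TreeSet<String>();
        urls.add("https://www.cnblogs.com/ccylovehs/p/9547690.html");
        urls.add("https://www.cnblogs.com/ccylovehs/p/9547690.html"); // duplicate, silently ignored
        urls.add("https://www.cnblogs.com/ccylovehs/p/9123456.html");
        System.out.println(urls.size()); // prints 2
    }
}
```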

2. Generate an HTML file from each detail-page URL

Iterate through the Set collection, generating the HTML files one by one.

Files are stored in the C://data//blog directory; the file name is taken from capture group 1 of the URL pattern.
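To show how capture group 1 yields the file name, here is a small standalone sketch using java.util.regex with the same detail-page pattern: the parenthesized group picks out just the numeric id plus the .html suffix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileNameDemo {
    public static void main(String[] args) {
        // Capture group 1 wraps the "<id>.html" tail of a detail-page URL
        Pattern p = Pattern.compile("https://www.cnblogs.com/ccylovehs/p/([0-9]+.html)");
        Matcher m = p.matcher("https://www.cnblogs.com/ccylovehs/p/9547690.html");
        if (m.find()) {
            // group(1) is just the file name: "9547690.html"
            System.out.println(m.group(1));
        }
    }
}
```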

3. Code implementation
package com.blog.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author Jack Chen
 */
public class BlogUtil {
    /**
     * URL_PAGE: cnblogs list-page URL
     * URL_PAGE_DETAIL: detail-page URL pattern
     * PAGE_COUNT: number of list pages
     * urlLists: set of all detail-page URLs (prevents duplicates)
     * p: compiled matching pattern
     */
    public final static String URL_PAGE = "https://www.cnblogs.com/ccylovehs/default.html?page=";
    public final static String URL_PAGE_DETAIL = "https://www.cnblogs.com/ccylovehs/p/([0-9]+.html)";
    public final static int PAGE_COUNT = 3;
    public static Set<String> urlLists = new TreeSet<String>();
    public final static Pattern p = Pattern.compile(URL_PAGE_DETAIL);

    public static void main(String[] args) throws Exception {
        for (int i = 1; i <= PAGE_COUNT; i++) {
            getUrls(i);
        }
        for (Iterator<String> i = urlLists.iterator(); i.hasNext();) {
            createFile(i.next());
        }
    }

    /**
     * @param url
     * @throws Exception
     */
    private static void createFile(String url) throws Exception {
        Matcher m = p.matcher(url);
        m.find();
        String fileName = m.group(1);
        String prefix = "c://data//blog//";
        File file = new File(prefix + fileName);
        PrintStream ps = new PrintStream(file);

        URL u = new URL(url);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "utf-8"));
        String str;
        while ((str = br.readLine()) != null) {
            ps.println(str);
        }
        ps.close();
        br.close();
        conn.disconnect();
    }

    /**
     * @param idx
     * @throws Exception
     */
    private static void getUrls(int idx) throws Exception {
        URL u = new URL(URL_PAGE + idx);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "utf-8"));
        String str;
        while ((str = br.readLine()) != null) {
            if (str.contains("https://www.cnblogs.com/ccylovehs/p/")) {
                Matcher m = p.matcher(str);
                if (m.find()) {
                    System.out.println(m.group(1));
                    urlLists.add(m.group());
                }
            }
        }
        br.close();
        conn.disconnect();
    }
}
4. Conclusion

If you find this useful, please move your mouse and give me a star; your encouragement is my greatest motivation.

https://github.com/chenchangyuan/getHtmlForJava

Because I do not want to generate an MD file manually for each article, the next step is to batch-convert the HTML files into MD files in order to flesh out my personal blog. To be continued ~~~
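As a starting point for that next step, here is my own rough sketch of an HTML-to-Markdown conversion, not the author's implementation. It handles only a handful of tags (h1/h2, strong, a, p) with regexes; a real converter should use a proper HTML parser or a dedicated library, since regexes break on nested markup.

```java
public class HtmlToMdSketch {
    // Very rough HTML-to-Markdown conversion covering only a few common tags.
    // Assumes flat, well-formed markup; nested or malformed HTML will mangle the output.
    public static String convert(String html) {
        String md = html;
        md = md.replaceAll("(?s)<h1[^>]*>(.*?)</h1>", "# $1\n");
        md = md.replaceAll("(?s)<h2[^>]*>(.*?)</h2>", "## $1\n");
        md = md.replaceAll("(?s)<strong[^>]*>(.*?)</strong>", "**$1**");
        md = md.replaceAll("(?s)<a[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>", "[$2]($1)");
        md = md.replaceAll("(?s)<p[^>]*>(.*?)</p>", "$1\n");
        md = md.replaceAll("<[^>]+>", ""); // strip any remaining tags
        return md.trim();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>See <a href=\"https://example.com\">link</a></p>";
        System.out.println(convert(html));
    }
}
```

Each crawled file from step 2 could be read, run through convert, and written back out with an .md extension.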
