Objective
Recently I bought a personal domain name from a cloud provider. I had planned to buy a server as well and build my own site, but there was too much to prepare, so I put that on hold and decided to build a static site on GitHub Pages first. The process had its twists, mainly the domain name configuration, which cost me a lot of time, but overall it went smoothly. The site is at https://chenchangyuan.cn (the blog is still empty; the theme looks nice, and I will fill it in piece by piece later).
I used Git + npm + Hexo and then did the corresponding configuration on GitHub. There are plenty of tutorials online; if you have questions, feel free to leave a comment.
I worked with Java for a few years, but because of my job responsibilities at the company I gradually drifted away from it, and I now mainly do front-end development.
So I wanted to use Java to crawl my own articles and then convert the crawled HTML into Markdown (not implemented yet; advice from fellow developers is welcome).
1. Get all the URLs for your personal blog
The blog list page address is https://www.cnblogs.com/ccylovehs/default.html?page=1
Iterate over the list pages; the number of pages depends on how many posts you have written
The detail page address of each post is stored in a Set, for example https://www.cnblogs.com/ccylovehs/p/9547690.html
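The extraction step can be sketched without any network access. The class name DetailUrlExtractor, the method collectDetailUrls, and the sample HTML lines are made up for illustration; only the URL shape comes from the addresses above. This is a minimal sketch, assuming each list page is read line by line as in the full listing in section 3.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DetailUrlExtractor {
    // Detail page URL shape; dots are escaped so they only match a literal '.'.
    static final Pattern DETAIL =
            Pattern.compile("https://www\\.cnblogs\\.com/ccylovehs/p/([0-9]+\\.html)");

    // Collects every detail page URL found in the given lines.
    // A TreeSet drops duplicates and keeps the URLs sorted.
    public static Set<String> collectDetailUrls(List<String> htmlLines) {
        Set<String> urls = new TreeSet<>();
        for (String line : htmlLines) {
            Matcher m = DETAIL.matcher(line);
            while (m.find()) {
                urls.add(m.group());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical list page content: the same post linked twice.
        List<String> sample = Arrays.asList(
                "<a href=\"https://www.cnblogs.com/ccylovehs/p/9547690.html\">title</a>",
                "<a href=\"https://www.cnblogs.com/ccylovehs/p/9547690.html\">read more</a>",
                "<p>no link on this line</p>");
        System.out.println(collectDetailUrls(sample));
    }
}
```

A list page usually links to the same post more than once (title, "read more", comment count), which is why a Set is a better fit here than a List.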
2. Generate an HTML file from each detail page URL
Iterate through the Set and generate the HTML files one by one
Files are stored in the C://data//blog directory, and each file name is taken from capturing group 1 of the regular expression
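To make "capturing group 1" concrete, here is a small sketch of how the file name falls out of a detail page URL. The class name FileNameFromUrl and the method fileNameFor are hypothetical; the exception on a non-matching URL is my own addition, not something the original code does.

```java
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileNameFromUrl {
    // Group 1 is the "<digits>.html" part at the end of the detail page URL.
    static final Pattern DETAIL =
            Pattern.compile("https://www\\.cnblogs\\.com/ccylovehs/p/([0-9]+\\.html)");

    // Returns the file name for a detail page URL, or fails loudly if the URL has another shape.
    public static String fileNameFor(String url) {
        Matcher m = DETAIL.matcher(url);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a detail page URL: " + url);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        String name = fileNameFor("https://www.cnblogs.com/ccylovehs/p/9547690.html");
        System.out.println(name); // prints 9547690.html

        // The target file is the storage directory plus that name.
        File target = new File("C://data//blog", name);
        System.out.println(target.getName());
    }
}
```

Checking the result of find() before calling group(1) avoids an IllegalStateException if a URL of a different shape ever ends up in the Set.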
3. Code implementation
```java
package com.blog.util;

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author Jack Chen
 */
public class BlogUtil {

    /**
     * URL_PAGE: cnblogs list page URL
     * URL_PAGE_DETAIL: detail page URL
     * PAGE_COUNT: number of list pages
     * urlLists: Set of all detail page URLs (prevents duplicates)
     * p: matching pattern
     */
    public final static String URL_PAGE = "https://www.cnblogs.com/ccylovehs/default.html?page=";
    // The dot before "html" is escaped so that it only matches a literal '.'
    public final static String URL_PAGE_DETAIL = "https://www.cnblogs.com/ccylovehs/p/([0-9]+\\.html)";
    public final static int PAGE_COUNT = 3;
    public static Set<String> urlLists = new TreeSet<String>();
    public final static Pattern p = Pattern.compile(URL_PAGE_DETAIL);

    public static void main(String[] args) throws Exception {
        // Step 1: collect the detail page URLs from every list page
        for (int i = 1; i <= PAGE_COUNT; i++) {
            getUrls(i);
        }
        // Step 2: download each detail page into its own HTML file
        for (Iterator<String> i = urlLists.iterator(); i.hasNext();) {
            createFile(i.next());
        }
    }

    /**
     * Downloads one detail page and writes it to C://data//blog (the directory must already exist).
     *
     * @param url detail page URL
     * @throws Exception
     */
    private static void createFile(String url) throws Exception {
        Matcher m = p.matcher(url);
        m.find();
        // Capturing group 1 is the "<id>.html" part, used as the file name
        String fileName = m.group(1);
        String prefix = "C://data//blog//";
        File file = new File(prefix + fileName);
        PrintStream ps = new PrintStream(file, "utf-8");
        URL u = new URL(url);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
        String str;
        while ((str = br.readLine()) != null) {
            ps.println(str);
        }
        ps.close();
        br.close();
        conn.disconnect();
    }

    /**
     * Reads one list page and adds every detail page URL found on it to urlLists.
     *
     * @param idx list page number
     * @throws Exception
     */
    private static void getUrls(int idx) throws Exception {
        URL u = new URL(URL_PAGE + idx);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
        String str;
        while ((str = br.readLine()) != null) {
            if (str.contains("https://www.cnblogs.com/ccylovehs/p/")) {
                Matcher m = p.matcher(str);
                if (m.find()) {
                    System.out.println(m.group(1));
                    urlLists.add(m.group());
                }
            }
        }
        br.close();
        conn.disconnect();
    }
}
```
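One possible refinement of the download loop in createFile is try-with-resources, so that the reader and the output file are closed even if an exception is thrown halfway through. The sketch below is not part of the original program: the class name SavePage and the method saveLines are hypothetical, and it reads from a generic Reader instead of an HTTP connection so that it can run without network access.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;

public class SavePage {
    // Copies text line by line from source into target and returns the number of lines written.
    // Both streams are closed automatically, even when reading or writing fails.
    public static int saveLines(Reader source, Path target) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(source);
             PrintStream ps = new PrintStream(target.toFile(), "UTF-8")) {
            String line;
            while ((line = br.readLine()) != null) {
                ps.println(line);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a downloaded page; a real run would wrap conn.getInputStream() instead.
        Path tmp = Files.createTempFile("blog", ".html");
        int n = saveLines(new StringReader("<html>\n<body>hi</body>\n</html>"), tmp);
        System.out.println(n + " lines written to " + tmp);
    }
}
```

In the crawler itself, the Reader would be new InputStreamReader(conn.getInputStream(), "utf-8") and the target would be the file under C://data//blog.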
4. Conclusion
If you find this useful, please give the repository a star; your encouragement is my greatest motivation
https://github.com/chenchangyuan/getHtmlForJava
Because I do not want to create the MD files by hand one article at a time, the next step is to batch-convert the HTML files into MD files to fill out the personal blog. To be continued ~~~