Original address: https://www.cnblogs.com/xiaoMzjm/p/3894805.html
Crawling content from someone else's web page sounds like fun: in just a few steps you can get information you could not easily obtain yourself. Take the weather forecast, for example: you can hardly go out and measure it with instruments! Of course, for real weather data a web service is the better choice; this is only an example. Enough talk, let's look at the result.
Result
Let's find a weather forecast website to try it out: http://www.weather.com.cn/html/weather/101280101.shtml
The page shows the weather for today (the 6th). Let's take that as our example and extract today's weather.
The final console output:
Today: 6th weather: Thunderstorm temperature: 26°~34° Wind: Breeze
Ideas
1. Get an input stream from the URL
2. Read the page's HTML source
3. Extract the useful information with regular expressions
4. Assemble it into the desired format
The hardest part is actually step 3: if you are not fluent in regular expressions, this is where you get stuck, as I did (T_T). To extract the right data below I had to match several times; if it could be done in a single match, the code would be much shorter!
Code
1 package com.zjm.www.test;
2
3 import java.io.BufferedReader;
4 import java.io.IOException;
5 import java.io.InputStream;
6 import java.io.InputStreamReader;
7 import java.net.HttpURLConnection;
8 import java.net.URL;
9 import java.util.regex.Matcher;
10 import java.util.regex.Pattern;
11
12 /**
13  * Description: crawl today's weather from the web page
14  * @author ZJM
15  * @time 2014/8/6
16  */
17 public class Service {
18
19     /**
20      * Issue an HTTP GET request and fetch the page source
21      * @param requestUrl String the request address
22      * @return String the HTML string returned by this address
23      */
24     private static String httpRequest(String requestUrl) {
25
26         StringBuffer buffer = null;
27         BufferedReader bufferedReader = null;
28         InputStreamReader inputStreamReader = null;
29         InputStream inputStream = null;
30         HttpURLConnection httpUrlConn = null;
31
32         try {
33             // Create the GET request
34             URL url = new URL(requestUrl);
35             httpUrlConn = (HttpURLConnection) url.openConnection();
36             httpUrlConn.setDoInput(true);
37             httpUrlConn.setRequestMethod("GET");
38
39             // Get the input stream
40             inputStream = httpUrlConn.getInputStream();
41             inputStreamReader = new InputStreamReader(inputStream, "utf-8");
42             bufferedReader = new BufferedReader(inputStreamReader);
43
44             // Read the result from the input stream
45             buffer = new StringBuffer();
46             String str = null;
47             while ((str = bufferedReader.readLine()) != null) {
48                 buffer.append(str);
49             }
50
51         } catch (Exception e) {
52             e.printStackTrace();
53         } finally {
54             // Release resources
55             if (bufferedReader != null) {
56                 try {
57                     bufferedReader.close();
58                 } catch (IOException e) {
59                     e.printStackTrace();
60                 }
61             }
62             if (inputStreamReader != null) {
63                 try {
64                     inputStreamReader.close();
65                 } catch (IOException e) {
66                     e.printStackTrace();
67                 }
68             }
69             if (inputStream != null) {
70                 try {
71                     inputStream.close();
72                 } catch (IOException e) {
73                     e.printStackTrace();
74                 }
75             }
76             if (httpUrlConn != null) {
77                 httpUrlConn.disconnect();
78             }
79         }
80         return buffer.toString();
81     }
82
83     /**
84      * Filter out the useless information in the HTML string
85      * @param html String the HTML string
86      * @return String the useful data
87      */
88     private static String htmlFiter(String html) {
89
90         StringBuffer buffer = new StringBuffer();
91         String str1 = "";
92         String str2 = "";
93         buffer.append("Today:");
94
95         // Take out the useful range
96         Pattern p = Pattern.compile("(.*)(<li class=\'dn on\' data-dn=\'7d1\'>)(.*?)(</li>)(.*)");
97         Matcher m = p.matcher(html);
98         if (m.matches()) {
99             str1 = m.group(3);
100             // Match date, note: Date is included in "detailed"
Lines 34-49: fetch the page source through the URL; nothing special here.
Line 96: press F12 on the web page and inspect the HTML for "Today". It sits inside the <li class='dn on' data-dn='7d1'> element, so the first step is to discard everything outside this piece of HTML.
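On Java 7 or later, the fetching step can be written more compactly with try-with-resources, which closes the reader and stream automatically. The following is only a minimal sketch, not the author's code: the class name PageFetcher and the method names readAll and fetch are made up for illustration, and the main method reads from an in-memory stream so that it runs without network access.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this sketch only.
class PageFetcher {

    // Reads the whole stream as UTF-8 text. Like the loop in the listing,
    // it drops the line breaks, so the result is one long line.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    // Issues a GET request and returns the response body.
    // Exceptions propagate to the caller instead of being swallowed.
    static String fetch(String requestUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(requestUrl).openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration: an in-memory stream instead of a real request.
        InputStream demo = new ByteArrayInputStream(
                "<ul>\n<li>a</li>\n</ul>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(demo)); // prints <ul><li>a</li></ul>
    }
}
```

Letting the exception propagate also avoids returning from a null buffer when the request fails, which the listing's catch-and-continue style would run into.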
The regular expression (.*)(<li class=\'dn on\' data-dn=\'7d1\'>)(.*?)(</li>)(.*) can easily be seen to consist of the following 5 groups:
(.*) : matches any character except a line break, 0 to n times
(<li class=\'dn on\' data-dn=\'7d1\'>) : matches this piece of HTML code exactly once
(.*?) : .*? is the lazy form, meaning it matches any character except a line break as few times as possible
(</li>) : matches this piece of HTML code exactly once
(.*) : matches any character except a line break, 0 to n times
This way, we can use m.group(3) to get the string matched by the middle (.*?) group, which is the HTML for "Today" that we need.
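To see the five groups in action, here is a small self-contained sketch. The sample markup is invented for illustration (the real page's markup is not reproduced here), and GroupDemo and extractToday are hypothetical names. Only the pattern itself comes from the article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named for this sketch only.
class GroupDemo {

    // The five-group pattern from the article. In a Java string literal,
    // \' and ' denote the same character, so plain quotes are used here.
    static final Pattern TODAY = Pattern.compile(
            "(.*)(<li class='dn on' data-dn='7d1'>)(.*?)(</li>)(.*)");

    // Returns the text captured by group 3, or null if the input does not match.
    static String extractToday(String html) {
        Matcher m = TODAY.matcher(html);
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // Made-up sample markup; the real page differs.
        String html = "<ul><li class='dn on' data-dn='7d1'>"
                + "<h2>6th</h2><p>Thunderstorm</p></li><li>tomorrow</li></ul>";
        System.out.println(extractToday(html));
        // prints <h2>6th</h2><p>Thunderstorm</p>
    }
}
```

Note that matches() requires the entire input to match and that . does not match line breaks by default. The pattern therefore only works on a single-line string, which is what the readLine loop produces because it drops the line breaks.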
Line 101: the extracted middle section still contains many useless tags, so we have to keep stripping them, using the same method as above.
Line 106: manually stitch together the strings we need.
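One possible way to do the stripping and stitching in a single pass is to split the fragment on its tags and keep the text in between. This is a swapped-in technique, not the author's repeated group matching, and it rests on loudly stated assumptions: the fragment's markup is made up, the four text pieces are assumed to appear in the order date, weather, temperature, wind, and TodayFormatter, textPieces and format are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this sketch only.
class TodayFormatter {

    // Splits the fragment on every tag and returns the non-empty text
    // pieces between the tags, in document order.
    static List<String> textPieces(String fragment) {
        List<String> pieces = new ArrayList<>();
        for (String piece : fragment.split("<[^>]+>")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                pieces.add(trimmed);
            }
        }
        return pieces;
    }

    // Stitches the pieces into one line.
    // Assumption: the order is date, weather, temperature, wind.
    static String format(String fragment) {
        List<String> p = textPieces(fragment);
        if (p.size() < 4) {
            throw new IllegalArgumentException("unexpected fragment: " + fragment);
        }
        return "Today: " + p.get(0) + " Weather: " + p.get(1)
                + " Temperature: " + p.get(2) + " Wind: " + p.get(3);
    }

    public static void main(String[] args) {
        // Made-up fragment; \u00B0 is the degree sign.
        String fragment = "<h2>6th</h2><p>Thunderstorm</p>"
                + "<p>26\u00B0~34\u00B0</p><p>Breeze</p>";
        System.out.println(format(fragment));
    }
}
```

Splitting on tags rather than on whitespace keeps multi-word values such as "Light rain" in one piece. The approach breaks as soon as the page's layout changes, which is one more reason a web service is the more robust source for real data.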
After the processing above, a simple crawl is complete.
I am least satisfied with the regular-expression part in the middle. If any readers have good suggestions, please leave a comment; I would be grateful.
Java crawls Web page data by URL-----Regular expression