Crawling web page data by URL in Java: regular expressions

Source: Internet
Author: User

Original address: https://www.cnblogs.com/xiaoMzjm/p/3894805.html

Introduction

Scraping content from someone else's web page sounds like fun: information you could not otherwise obtain is suddenly only a few steps away. Take the weather forecast, for example. You can hardly go out and measure it with instruments yourself! Of course, for a real weather forecast a web service is the better choice; this is just an example. Enough talk, let's look at the result.

Result

Let's find a weather forecast website to try it out: http://www.weather.com.cn/html/weather/101280101.shtml

The page shows the weather for today (the 6th). Let's use it as the example and fetch today's weather.

Final console output:

Today: 6th Weather: Thunderstorm Temperature: 26°~34° Wind: Breeze

Approach

1. Get an input stream from the URL -> 2. Read the page's HTML source -> 3. Extract the useful information with regular expressions -> 4. Assemble it into the desired format
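The four steps can be sketched as a small pipeline. This is a minimal sketch rather than the article's code: the class name CrawlPipelineSketch, the method names, and the <p class="wea"> markup are assumptions made up for illustration, and the fetch step is passed in as a function so the example runs without network access.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CrawlPipelineSketch {

    // Steps 1 and 2 (URL -> HTML) are supplied by the caller, so a test can
    // pass canned HTML instead of opening a real connection.
    static String run(String url, Function<String, String> fetchHtml) {
        String html = fetchHtml.apply(url);   // steps 1 and 2: get the page source
        String weather = extract(html);       // step 3: regular expression
        return "Today: " + weather;           // step 4: assemble the output
    }

    // Step 3 on hypothetical markup: returns the text inside <p class="wea">.
    static String extract(String html) {
        Matcher m = Pattern.compile("<p class=\"wea\">(.*?)</p>").matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String sample = "<li><p class=\"wea\">Thunderstorm</p></li>";
        // Prints: Today: Thunderstorm
        System.out.println(run("http://example.invalid/", u -> sample));
    }
}
```

Keeping the fetch step separate from the parsing step makes it easy to test the regular expressions against saved HTML.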

The hardest part is actually step 3. If you are not fluent with regular expressions, this is basically where you get stuck, as I did (T_T). To extract the right data below I had to match several times; if it could be done in a single match, the code would be much shorter!

Code

  1 package com.zjm.www.test;
  2 
  3 import java.io.BufferedReader;
  4 import java.io.IOException;
  5 import java.io.InputStream;
  6 import java.io.InputStreamReader;
  7 import java.net.HttpURLConnection;
  8 import java.net.URL;
  9 import java.util.regex.Matcher;
 10 import java.util.regex.Pattern;
 11 
 12 /**
 13  * Description: crawls today's weather from a web page
 14  * @author ZJM
 15  * @time 2014/8/6
 16  */
 17 public class Service {
 18 
 19     /**
 20      * Sends an HTTP GET request and returns the page source
 21      * @param requestUrl String the request address
 22      * @return String the HTML string returned by this address
 23      */
 24     private static String httpRequest(String requestUrl) {
 25 
 26         StringBuffer buffer = null;
 27         BufferedReader bufferedReader = null;
 28         InputStreamReader inputStreamReader = null;
 29         InputStream inputStream = null;
 30         HttpURLConnection httpUrlConn = null;
 31 
 32         try {
 33             // Create the GET request
 34             URL url = new URL(requestUrl);
 35             httpUrlConn = (HttpURLConnection) url.openConnection();
 36             httpUrlConn.setDoInput(true);
 37             httpUrlConn.setRequestMethod("GET");
 38 
 39             // Get the input stream
 40             inputStream = httpUrlConn.getInputStream();
 41             inputStreamReader = new InputStreamReader(inputStream, "utf-8");
 42             bufferedReader = new BufferedReader(inputStreamReader);
 43 
 44             // Read the result from the input stream (line breaks are dropped)
 45             buffer = new StringBuffer();
 46             String str = null;
 47             while ((str = bufferedReader.readLine()) != null) {
 48                 buffer.append(str);
 49             }
 50 
 51         } catch (Exception e) {
 52             e.printStackTrace();
 53         } finally {
 54             // Release resources
 55             if (bufferedReader != null) {
 56                 try {
 57                     bufferedReader.close();
 58                 } catch (IOException e) {
 59                     e.printStackTrace();
 60                 }
 61             }
 62             if (inputStreamReader != null) {
 63                 try {
 64                     inputStreamReader.close();
 65                 } catch (IOException e) {
 66                     e.printStackTrace();
 67                 }
 68             }
 69             if (inputStream != null) {
 70                 try {
 71                     inputStream.close();
 72                 } catch (IOException e) {
 73                     e.printStackTrace();
 74                 }
 75             }
 76             if (httpUrlConn != null) {
 77                 httpUrlConn.disconnect();
 78             }
 79         }
 80         return buffer == null ? "" : buffer.toString(); // buffer stays null if the request failed early
 81     }
 82 
 83     /**
 84      * Filters the useless information out of the HTML
 85      * @param html String the HTML string
 86      * @return String the useful data
 87      */
 88     private static String htmlFiter(String html) {
 89 
 90         StringBuffer buffer = new StringBuffer();
 91         String str1 = "";
 92         String str2 = "";
 93         buffer.append("Today:");
 94 
 95         // Take out the useful range
 96         Pattern p = Pattern.compile("(.*)(<li class=\'dn on\' data-dn=\'7d1\'>)(.*?)(</li>)(.*)");
 97         Matcher m = p.matcher(html);
 98         if (m.matches()) {
 99             str1 = m.group(3);
100             // Match the date. Note: the date is included in ...

Explanation

Lines 34-49: Fetch the page source through the URL. Nothing special here.
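The same fetch step can be written more compactly with try-with-resources, which closes the streams automatically. This is a sketch under assumptions, not the author's code: the class name PageFetchSketch and the split into readAll and fetch are made up here so that the reading logic can be run against an in-memory stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class PageFetchSketch {

    // Reads every line and joins them WITHOUT line breaks, as the original
    // loop does. The later regular expressions depend on this, because "."
    // does not match a line break by default.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    // Opens the URL with a GET request and returns the page source.
    static String fetch(String requestUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(requestUrl).openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a network stream, so the demo runs offline.
        byte[] page = "<html>\n<body>ok</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        // Prints: <html><body>ok</body></html>
        System.out.println(readAll(new ByteArrayInputStream(page)));
    }
}
```

Unlike the listing, this version lets IOException propagate to the caller instead of printing the stack trace and returning an empty result.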

Line 96: Press F12 in the browser and look at the HTML for "Today". It sits inside an <li class='dn on' data-dn='7d1'> element, so the first step is to filter out everything outside this piece of HTML.

  (.*)(<li class=\'dn on\' data-dn=\'7d1\'>)(.*?)(</li>)(.*)  This regular expression clearly splits into the following five groups:

  (.*) : matches any character except a line break, zero or more times

  (<li class=\'dn on\' data-dn=\'7d1\'>) : matches this literal HTML opening tag once

  (.*?) : the lazy form of .*, which matches any character except a line break as few times as possible

  (</li>) : matches this literal HTML closing tag once

  (.*) : matches any character except a line break, zero or more times

This way, m.group(3) returns the string matched by the middle (.*?) group, which is the HTML for the "Today" weather block that we need.
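A small self-contained run of the five-group pattern makes the role of group 3 visible. The class name RegexGroupsDemo and the sample HTML (the <h1> and <p> children and the second <li>) are assumptions made up for the demo; only the pattern shape comes from the listing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexGroupsDemo {

    // Same five-group shape as the pattern on line 96 of the listing.
    private static final Pattern TODAY = Pattern.compile(
            "(.*)(<li class='dn on' data-dn='7d1'>)(.*?)(</li>)(.*)");

    // Returns the content of the "today" list item, or "" if it is absent.
    static String todayBlock(String html) {
        Matcher m = TODAY.matcher(html);
        // matches() requires the WHOLE input to match, which is why the
        // pattern needs the leading and trailing (.*) groups.
        return m.matches() ? m.group(3) : "";
    }

    public static void main(String[] args) {
        String html = "<ul><li class='dn on' data-dn='7d1'><h1>6th</h1><p>Thunderstorm</p></li>"
                + "<li class='dn' data-dn='7d2'><h1>7th</h1></li></ul>";
        // Prints: <h1>6th</h1><p>Thunderstorm</p>
        System.out.println(todayBlock(html));
    }
}
```

Because "." does not match a line break, matches() fails on input that still contains line breaks. The listing avoids this by appending the lines read from the stream without their line terminators.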

Line 101: The extracted piece of code still contains many useless tags, so we have to keep removing them, using the same method as above.

Line 106: Manually concatenate the strings we need.
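The last two steps, pulling the individual values out of the extracted block and concatenating them, might look like the sketch below. The real markup inside the block is not shown in this article, so the tag and class names (h1, wea, tem, win), the class name FieldExtractSketch, and the helper first are all hypothetical. It also uses find() with one capture group per field, which is a variation on the author's matches() approach.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FieldExtractSketch {

    // Returns capture group 1 of the first match, or "" when nothing matches.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1).trim() : "";
    }

    // Extracts each field from the block and concatenates the result line.
    // The tag and class names are hypothetical, chosen only for this sketch.
    static String format(String block) {
        String day  = first("<h1>(.*?)</h1>", block);
        String wea  = first("<p class=\"wea\">(.*?)</p>", block);
        String temp = first("<p class=\"tem\">(.*?)</p>", block);
        String wind = first("<p class=\"win\">(.*?)</p>", block);
        return "Today: " + day + " Weather: " + wea
                + " Temperature: " + temp + " Wind: " + wind;
    }

    public static void main(String[] args) {
        // \u00B0 is the degree sign, escaped to avoid source-encoding issues.
        String block = "<h1>6th</h1><p class=\"wea\">Thunderstorm</p>"
                + "<p class=\"tem\">26\u00B0~34\u00B0</p><p class=\"win\">Breeze</p>";
        System.out.println(format(block));
    }
}
```

With find(), the pattern no longer needs the surrounding (.*) groups, since it searches for the first occurrence instead of matching the whole input.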

After the processing above, a simple crawl is complete.

The regular expression part in the middle is what I am least satisfied with. If any readers have better suggestions, please leave a comment; I would be grateful.
