Using the open-source Java tools HttpClient and jsoup to capture and parse web page data

Source: Internet
Author: User


While working on a project today, we needed to display today's almanac information on a web page. The data format is as follows:

  • Gregorian date: Monday, April 11, 2016
  • Lunar date: Year of the Monkey, lunar March 5
  • Heavenly Stems and Earthly Branches (tiangan dizhi):
  • Suitable (yi): praying for a child, praying for blessings, consecration, setting up the bed
  • Avoid (ji): Yutang (Huangdao) danger day, avoid traveling

It mainly consists of the Gregorian and lunar dates, plus what the day is suitable for and what should be avoided. But we had no ready-made data on hand. What to do?

As the revolutionary predecessors put it: if we have no guns and no cannons, the enemy will make them for us! Plenty of online perpetual calendar applications already exist on the web. None of them offers a ready-made interface, but we can reach out and take the data ourselves. This is what is called data capture (scraping).

The two tools used here, HttpClient and jsoup, are briefly introduced below:

 

HttpClient began as a sub-project of Apache Jakarta Commons (it is now part of Apache HttpComponents). It provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it tracks the latest versions and recommendations of the protocol. HttpClient is used in many projects. For example, the open-source projects Cactus and HtmlUnit both use it.

HttpClient is used as follows:

1. Create an HttpClient object.

2. Create an instance of the Request Method and specify the request URL.

3. Call execute(HttpUriRequest request) on the HttpClient object to send the request. This method returns an HttpResponse.

4. Call the methods of the HttpResponse to obtain the response content.

5. Release the connection.
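The five steps above can be sketched without any extra jars by using the JDK's own java.net.http.HttpClient (Java 11+). Note that this is a different library from Apache HttpClient; it is swapped in here only because it follows the same shape and runs as-is. The class name FetchSketch, the local test server, and the demo HTML are all assumptions made for illustration, so the sketch never touches a real site.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class FetchSketch {

    /** Fetches a URL and returns the body, or null if the status is not 200. */
    static String fetch(String url) throws IOException, InterruptedException {
        // 1. Create the client object.
        HttpClient client = HttpClient.newHttpClient();
        // 2. Create the request (method GET) and give it the URL.
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // 3. Execute the request; the call returns a response object.
        HttpResponse<String> response = client.send(
                request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        // 4. Read the content from the response.
        // 5. There is no explicit connection release in this API; with Apache
        //    HttpClient this is where response.close() would go.
        return response.statusCode() == 200 ? response.body() : null;
    }

    /** Serves one hypothetical page on a random local port and fetches it. */
    static String fetchFromLocalDemo() {
        byte[] page = "<div id=\"info_nong\">demo</div>".getBytes(StandardCharsets.UTF_8);
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/rili.htm", exchange -> {
                exchange.sendResponseHeaders(200, page.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(page);
                }
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                return fetch("http://127.0.0.1:" + port + "/rili.htm");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchFromLocalDemo());
    }
}
```

Returning null for a non-200 status mirrors the "return null on failure" style of the article's own code; a real caller would probably want to report the status instead.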

 

jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

 

For more information, see the official website.

HttpClient: http://hc.apache.org/httpcomponents-client-5.0.x/index.html

Jsoup: http://jsoup.org/

  

Now straight to the code. Here we capture the data of the 2345 online perpetual calendar: http://tools.2345.com/rili.htm

First, we define an entity class, Almanac, to store the calendar data.

Almanac.java

    package com.likx.picker.util.bean;

    /**
     * Perpetual calendar entity class
     *
     * @author source blog
     * April 11, 2016
     */
    public class Almanac {
        private String solar;      /* Gregorian date, e.g. Monday, April 11, 2016 */
        private String lunar;      /* lunar date, e.g. Year of the Monkey, lunar March 5 */
        private String chineseAra; /* Heavenly Stems and Earthly Branches */
        private String should;     /* suitable, e.g. praying for blessings, consecration, setting up the bed */
        private String avoid;      /* avoid, e.g. Yutang (Huangdao) danger day, avoid traveling */

        public Almanac(String solar, String lunar, String chineseAra, String should,
                String avoid) {
            this.solar = solar;
            this.lunar = lunar;
            this.chineseAra = chineseAra;
            this.should = should;
            this.avoid = avoid;
        }

        public String getSolar() {
            return solar;
        }

        public void setSolar(String date) {
            this.solar = date;
        }

        public String getLunar() {
            return lunar;
        }

        public void setLunar(String lunar) {
            this.lunar = lunar;
        }

        public String getChineseAra() {
            return chineseAra;
        }

        public void setChineseAra(String chineseAra) {
            this.chineseAra = chineseAra;
        }

        public String getAvoid() {
            return avoid;
        }

        public void setAvoid(String avoid) {
            this.avoid = avoid;
        }

        public String getShould() {
            return should;
        }

        public void setShould(String should) {
            this.should = should;
        }
    }

 

Next comes the main program, which crawls and parses the page. Before writing it, download the required jar packages from the official websites. The code uses the HttpClient 4.x API (the org.apache.http packages).

AlmanacUtil.java

    package com.likx.picker.util;

    import java.io.IOException;
    import java.text.SimpleDateFormat;
    import java.util.Calendar;
    import java.util.Date;

    import org.apache.http.HttpEntity;
    import org.apache.http.ParseException;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import com.likx.picker.util.bean.Almanac;

    /**
     * <STRONG>Class description</STRONG>: crawling tool for the 2345 perpetual calendar <p>
     *
     * @version 1.0 <p>
     * @author source blog
     *
     * <STRONG>Creation time</STRONG>: April 11, 2016 14:15:44 <p>
     */
    public class AlmanacUtil {

        /** Utility class; not meant to be instantiated. */
        private AlmanacUtil() {
        }

        /**
         * Get the perpetual calendar information.
         *
         * @return the parsed Almanac
         */
        public static Almanac getAlmanac() {
            String url = "http://tools.2345.com/rili.htm";
            String html = pickData(url);
            Almanac almanac = analyzeHTMLByString(html);
            return almanac;
        }

        /** Crawl the web page and return its HTML, or null on failure. */
        private static String pickData(String url) {
            CloseableHttpClient httpclient = HttpClients.createDefault();
            try {
                HttpGet httpget = new HttpGet(url);
                CloseableHttpResponse response = httpclient.execute(httpget);
                try {
                    // get the response entity
                    HttpEntity entity = response.getEntity();
                    // return the body as a string if there is one
                    if (entity != null) {
                        return EntityUtils.toString(entity);
                    }
                } finally {
                    response.close();
                }
            } catch (ClientProtocolException e) {
                e.printStackTrace();
            } catch (ParseException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // close the connection and release the resources
                try {
                    httpclient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            return null;
        }

        /** Use jsoup to parse the web page. */
        private static Almanac analyzeHTMLByString(String html) {
            String solarDate, lunarDate, chineseAra, should, avoid = "";
            Document document = Jsoup.parse(html);
            // Gregorian date
            solarDate = getSolarDate();
            // lunar date (this line was garbled in the source; the substring
            // indices are a reconstruction and depend on the page's markup)
            Element eLunarDate = document.getElementById("info_nong");
            lunarDate = eLunarDate.child(0).html().substring(1, 3)
                    + eLunarDate.html().substring(11);
            // Heavenly Stems and Earthly Branches
            Element eChineseAra = document.getElementById("info_chang");
            chineseAra = eChineseAra.text();
            // suitable
            should = getSuggestion(document, "yi");
            // avoid
            avoid = getSuggestion(document, "ji");
            Almanac almanac = new Almanac(solarDate, lunarDate, chineseAra, should, avoid);
            return almanac;
        }

        /** Collect the "suitable" or "avoid" items under the element with the given id. */
        private static String getSuggestion(Document doc, String id) {
            Element element = doc.getElementById(id);
            Elements elements = element.getElementsByTag("a");
            StringBuilder sb = new StringBuilder();
            for (Element e : elements) {
                sb.append(e.text()).append(" ");
            }
            return sb.toString();
        }

        /**
         * Get the Gregorian date, formatted as year, month, day, and day of week.
         *
         * @return yyyy MM dd EEEE
         */
        private static String getSolarDate() {
            Calendar calendar = Calendar.getInstance();
            Date solarDate = calendar.getTime();
            SimpleDateFormat formatter = new SimpleDateFormat("yyyy MM dd EEEE");
            return formatter.format(solarDate);
        }
    }

 

To keep things simple and clear, I split the capturing and parsing into several independent methods. The pickData() method uses HttpClient to capture the page into a string (the same HTML you see when you choose "view source" on the page). The analyzeHTMLByString() method parses the captured string. The getSuggestion() method factors out the collection logic shared by the "suitable" and "avoid" data, which are laid out the same way on the page.

In addition, because the Gregorian date is easy to generate locally, it is not crawled from the page.
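Generating the Gregorian date locally can also be sketched with the java.time API (Java 8+) instead of Calendar and SimpleDateFormat. This is only an alternative illustration, not the article's code: the class name SolarDateSketch, the English pattern "EEEE, MMMM d, yyyy", and the fixed locale are assumptions chosen so that the output matches the sample date at the top of the article.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class SolarDateSketch {

    // Day of week, full month name, day of month, year; English names are forced
    // so the output does not depend on the machine's default locale.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("EEEE, MMMM d, yyyy", Locale.ENGLISH);

    /** Formats the given date, e.g. for display next to the scraped lunar data. */
    static String format(LocalDate date) {
        return date.format(FORMAT);
    }

    public static void main(String[] args) {
        // The date on which the article was written.
        System.out.println(format(LocalDate.of(2016, 4, 11))); // Monday, April 11, 2016
        // Today's date, which is what the page would actually show.
        System.out.println(format(LocalDate.now()));
    }
}
```

Unlike SimpleDateFormat, a DateTimeFormatter is immutable and thread-safe, so it can be kept in a static field and shared.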

 

The following is a simple test class and its result:

AlmanacUtilTest.java

    package com.likx.picker.util.test;

    import com.likx.picker.util.AlmanacUtil;
    import com.likx.picker.util.bean.Almanac;

    public class AlmanacUtilTest {

        public static void main(String args[]) {
            Almanac almanac = AlmanacUtil.getAlmanac();
            System.out.println("Gregorian date: " + almanac.getSolar());
            System.out.println("Lunar date: " + almanac.getLunar());
            System.out.println("Heavenly Stems and Earthly Branches: " + almanac.getChineseAra());
            System.out.println("Suitable: " + almanac.getShould());
            System.out.println("Avoid: " + almanac.getAvoid());
        }
    }

 

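The running result is as follows:

[Screenshot: console output of the test run]

The effect after integration into the project is as follows:

[Screenshot: the calendar shown on the project page]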

On another note, this blog has not been updated recently. Because of the technical atmosphere, I recently left the Japanese outsourcing industry and joined an Internet company. My recent takeaway is that a programmer's core competitiveness is not how many frameworks they have learned or how many tools they have mastered (although these are indispensable), but a solid foundation and the ability to learn quickly. Take today's task: going from knowing nothing about HttpClient and jsoup to finishing the demo code took little more than an hour, which would have been unimaginable for me before. Being in a place with a strong technical atmosphere is a great way to pick up skills quickly.

Of course, this is only a very simple example, and the content on this page is easy to crawl. HttpClient and jsoup can do much more. HttpClient can send not only GET requests but also POST requests, submit forms, and upload files. The most powerful feature of jsoup is its support for jQuery-like selectors. This example only uses the simplest call, document.getElementById(), to match elements, but the jsoup selector syntax is exceptionally powerful. You could call it a Java version of jQuery, for example:

 

    Elements links = doc.select("a[href]"); // a with href
    Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png
    Element masthead = doc.select("div.masthead").first(); // div with class=masthead
    Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

 

 

 

There are many more powerful features that are not listed here. If you are interested, please refer to the documentation on the official websites. New skill acquired!

 

This article is copyrighted by the author and the blog site. When reprinting, please credit the author and the original article.

Source blog http://www.cnblogs.com/lkxsnow/

 

 

 

  

 
