Implementing a crawler in Java to provide data for an app (Jsoup web crawler)

Source: Internet
Author: User

I. Requirements

I have recently been rebuilding a news app based on Material Design, and finding a data source turned out to be a problem.

Others have analyzed the APIs of the Daily, Phoenix News, and similar services, so that news data in JSON form can be fetched from the corresponding URLs. To practice writing code, I decided instead to crawl the news pages myself and build an API from the data I extract.

II. Screenshots

[Figure: the news page on the original site]

[Figure: the data obtained by the crawler, displayed in the app on a phone]

III. Crawler approach

The implementation of the app itself is covered in the articles below; this article mainly explains how to crawl the data.

Recording app operations on Android and generating an animated GIF, the whole process: http://www.jb51.net/article/78236.htm
Learning Android Material Design (RecyclerView instead of ListView): http://www.jb51.net/article/78232.htm
Android project in practice: imitating the NetEase News page (RecyclerView): http://www.jb51.net/article/78230.htm

Jsoup Introduction

Jsoup is an open source HTML parser for Java that can parse a URL or a piece of HTML text directly.

Jsoup mainly provides the following functions:

    • Parse HTML from a URL, file, or string;
    • Use DOM traversal or CSS selectors to find and extract data;
    • Manipulate HTML elements, attributes, and text;
    • Clean untrusted HTML (to prevent XSS attacks).
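As a rough illustration of the first three functions, the sketch below parses HTML from a string, finds elements with CSS selectors, and changes an attribute. The HTML fragment, the class name JsoupFeatureSketch, and its method names are made up for this example, and it assumes the jsoup library is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupFeatureSketch {
    // Hypothetical HTML fragment, not taken from the real news site
    static final String HTML =
            "<div id=\"news\"><h1 class=\"title\">Hello</h1>"
            + "<a href=\"/a.html\">link</a></div>";

    // Parse HTML from a string and read text through a CSS selector
    static String headline(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("h1.title").first().text();
    }

    // Read an attribute of the first link in the document
    static String firstHref(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[href]").first().attr("href");
    }

    // Change an attribute in the parsed tree and return the new value
    static String rewriteHref(String html, String newHref) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        link.attr("href", newHref);
        return link.attr("href");
    }

    public static void main(String[] args) {
        System.out.println(headline(HTML));
        System.out.println(firstHref(HTML));
        System.out.println(rewriteHref(HTML, "/b.html"));
    }
}
```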

IV. Crawling process

Fetching the page HTML with a GET request

The DOM tree of the news page HTML looks like this:

The following code obtains the HTML source returned by a GET request to the specified URL.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpTool {
    // CommonException is the project's own exception class (not shown in this article)
    public static String doGet(String urlStr) throws CommonException {
        URL url;
        String html = "";
        try {
            url = new URL(urlStr);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(5000);
            connection.setDoInput(true);
            // A GET request has no body, so setDoOutput(true) is not needed
            if (connection.getResponseCode() == 200) {
                InputStream in = connection.getInputStream();
                html = StreamTool.inToStringByByte(in);
            } else {
                throw new CommonException("News server return value is not 200");
            }
        } catch (Exception e) {
            e.printStackTrace();
            throw new CommonException("GET request failed");
        }
        return html;
    }
}

InputStream in = connection.getInputStream(); Converting an input stream into a string is a common requirement, so we abstract it out into a utility method.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class StreamTool {
    public static String inToStringByByte(InputStream in) throws Exception {
        ByteArrayOutputStream outStr = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        // Collect all bytes first and decode once at the end, so that a
        // multi-byte UTF-8 character split across two reads is not corrupted
        while ((len = in.read(buffer)) != -1) {
            outStr.write(buffer, 0, len);
        }
        outStr.close();
        return outStr.toString("UTF-8");
    }
}
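
One detail is worth keeping in mind when turning a stream into a string: if every buffer-full is decoded on its own, a multi-byte UTF-8 character that straddles two reads can be damaged. The sketch below compares the two approaches with a deliberately tiny buffer. The class ChunkDecodingSketch and its method names are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ChunkDecodingSketch {
    // Decodes every chunk separately; a character split across chunks is damaged
    static String decodePerChunk(InputStream in, int bufferSize) {
        try {
            byte[] buffer = new byte[bufferSize];
            StringBuilder content = new StringBuilder();
            int len;
            while ((len = in.read(buffer)) != -1) {
                content.append(new String(buffer, 0, len, StandardCharsets.UTF_8));
            }
            return content.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Collects all bytes first and decodes them once at the end
    static String decodeWhole(InputStream in, int bufferSize) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[bufferSize];
            int len;
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, each encoded as three bytes in UTF-8
        byte[] bytes = "\u65b0\u95fb".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeWhole(new ByteArrayInputStream(bytes), 2));
        System.out.println(decodePerChunk(new ByteArrayInputStream(bytes), 2));
    }
}
```

With a 1024-byte buffer the problem shows up only occasionally, which makes it easy to miss; decoding once after all bytes are collected avoids it entirely.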


V. Parsing the HTML to get the title

Using Inspect Element in Google Chrome, find the HTML code for the news headline:

<div id="article_title"> ... </div>

We need to find the element with id="article_title" in the HTML above, using the getElementById(String id) method.

String htmlStr = HttpTool.doGet(urlStr);

// Convert the fetched HTML source into a Document
Document doc = Jsoup.parse(htmlStr);

Element articleEle = doc.getElementById("article");
// Title
Element titleEle = articleEle.getElementById("article_title");
String titleStr = titleEle.text();

VI. Getting the publication date and the news source

In the same way, locate the HTML for the publication date and the source in the page.

The idea is similar to the above: use getElementById(String id) to find the element with id="article_detail", then use getElementsByTag to get its span elements. Because there are three <span>...</span> elements, the return type is Elements rather than Element.

// article_detail contains: 2016-01-15  Source:  Browse times: 177
Element detailEle = articleEle.getElementById("article_detail");
Elements details = detailEle.getElementsByTag("span");

// Publication time
String dateStr = details.get(0).text();

// News source
String sourceStr = details.get(1).text();
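
Since the page's real markup is not reproduced here, the following self-contained sketch runs the same two lookups against a made-up fragment with three spans. The HTML string, the class name DetailSpansSketch, and the method names are assumptions for illustration only; jsoup must be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DetailSpansSketch {
    // Hypothetical stand-in for the article_detail block of a news page
    static final String HTML =
            "<div id=\"article\"><div id=\"article_detail\">"
            + "<span>2016-01-15</span>"
            + "<span>Source: Example News</span>"
            + "<span>Browse times: </span>"
            + "</div></div>";

    // Look up the element by id, then collect its span children by tag name
    static Elements detailSpans(String html) {
        Document doc = Jsoup.parse(html);
        Element detail = doc.getElementById("article_detail");
        return detail.getElementsByTag("span");
    }

    // Returns {publication date, source} from the first two spans
    static String[] dateAndSource(String html) {
        Elements spans = detailSpans(html);
        return new String[] { spans.get(0).text(), spans.get(1).text() };
    }

    public static void main(String[] args) {
        String[] result = dateAndSource(HTML);
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```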

VII. Parsing the browse count

If you print details.get(2).text() from the code above, you will only get

Browse times:

The number itself is missing. Why?

Because the browse count is rendered by JavaScript. Jsoup only parses the HTML that the server returns and does not execute scripts, so it cannot see dynamically rendered data.
There are two ways to solve this problem:

    • Embed a browser engine in the crawler, execute the JS to render the page, and then crawl the result. Tools for this include Selenium, HtmlUnit, and PhantomJS.
    • Analyze the JS request and find the URL from which the data is actually requested.

Taking the second approach, if you visit the URL http://see.xidian.edu.cn/index.php/news/click/id/7428, you will get the following result:

document.write(478)

This 478 is the browse count we need. We send a GET request to that URL, take the returned string, and use a regular expression to extract the number.

// Each visit to the news page increases the browse count by 1; the count is rendered by JS
String jsStr = HttpTool.doGet(COUNT_BASE_URL + currentPage);
// Remove every non-digit character (\\D+), leaving only the number
int readTimes = Integer.parseInt(jsStr.replaceAll("\\D+", ""));
// Or use the following regular expression
// String readTimesStr = jsStr.replaceAll("[^0-9]", "");
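
Stripping all non-digits works for this response, but it would silently glue digits together if the server ever returned something else. As a slightly stricter alternative, the sketch below matches the document.write(...) shape explicitly and fails loudly otherwise. The class BrowseCountSketch and the method parseBrowseCount are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BrowseCountSketch {
    // Matches the digits inside document.write(...), tolerating optional spaces
    private static final Pattern COUNT =
            Pattern.compile("document\\.write\\s*\\(\\s*(\\d+)\\s*\\)");

    static int parseBrowseCount(String js) {
        Matcher m = COUNT.matcher(js);
        if (!m.find()) {
            throw new IllegalArgumentException("Unexpected response: " + js);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(parseBrowseCount("document.write (478)")); // prints 478
    }
}
```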

VIII. Parsing the news content

Originally I intended to get the news content as plain text, but I later found that the Android side can also display styled HTML, so the content keeps its HTML format.

Element contentEle = articleEle.getElementById("article_content");
// News body content
String contentStr = contentEle.toString();
// If you use the text() method, the HTML tags of the news body are lost.
// To display the HTML with a WebView on Android, use toString()
// String contentStr = contentEle.text();

IX. Parsing the image URLs

A web page contains many images, large and small. To get only the images that belong to the news body, it is best to first locate the news content element and then use getElementsByTag("img") to pick out the images.

// Locate the news body first, as in the previous section
Element contentEle = articleEle.getElementById("article_content");

// Collect the src attribute of every image inside the news body
Elements images = contentEle.getElementsByTag("img");
String[] imageUrls = new String[images.size()];
for (int i = 0; i < imageUrls.length; i++) {
    imageUrls[i] = images.get(i).attr("src");
}
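
The src values collected this way may be relative paths (the sample output in this article contains one that starts with /uploads/), and an app usually needs absolute URLs to load images. A minimal sketch of resolving them with the JDK's java.net.URI follows; the page URL used here is a hypothetical example, and ImageUrlSketch and toAbsolute are names invented for this illustration.

```java
import java.net.URI;

class ImageUrlSketch {
    // Resolve an img src against the URL of the page it was found on
    static String toAbsolute(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL; the real article URL pattern is not shown here
        String page = "http://see.xidian.edu.cn/html/news/7928.html";
        System.out.println(toAbsolute(page, "/uploads/image/20160114/20160114225911_34428.png"));
    }
}
```

Jsoup can do the same resolution itself through absUrl("src") when the document is parsed with a base URI, which may be more convenient if the base URL is already known at parse time.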


X. News entity class (JavaBean)

The steps above obtain the headline, publication date, browse count, news content, and so on. Naturally, we need a JavaBean that encapsulates the acquired content in an entity class.

import java.util.Arrays;

public class ArticleItem {

    private int index;
    private String[] imageUrls;
    private String title;
    private String publishDate;
    private String source;
    private int readTimes;
    private String body;

    public ArticleItem(int index, String[] imageUrls, String title, String publishDate, String source, int readTimes,
            String body) {
        this.index = index;
        this.imageUrls = imageUrls;
        this.title = title;
        this.publishDate = publishDate;
        this.source = source;
        this.readTimes = readTimes;
        this.body = body;
    }

    @Override
    public String toString() {
        return "ArticleItem [index=" + index + ",\n imageUrls=" + Arrays.toString(imageUrls) + ",\n title=" + title
                + ",\n publishDate=" + publishDate + ",\n source=" + source + ",\n readTimes=" + readTimes + ",\n body=" + body
                + "]";
    }
}




XI. Test

public static ArticleItem getNewsItem(int currentPage) throws CommonException {
    // Build the news URL from the numeric suffix
    String urlStr = ARTICLE_BASE_URL + currentPage + ".html";
    String htmlStr = HttpTool.doGet(urlStr);

    Document doc = Jsoup.parse(htmlStr);
    Element articleEle = doc.getElementById("article");
    // Title
    Element titleEle = articleEle.getElementById("article_title");
    String titleStr = titleEle.text();

    // article_detail contains: 2016-01-15  Source:  Browse times: 177
    Element detailEle = articleEle.getElementById("article_detail");
    Elements details = detailEle.getElementsByTag("span");
    // Publication time
    String dateStr = details.get(0).text();
    // News source
    String sourceStr = details.get(1).text();

    // Each visit to the news page increases the browse count by 1; the count is rendered by JS
    String jsStr = HttpTool.doGet(COUNT_BASE_URL + currentPage);
    int readTimes = Integer.parseInt(jsStr.replaceAll("\\D+", ""));
    // Or use the following regular expression
    // String readTimesStr = jsStr.replaceAll("[^0-9]", "");

    Element contentEle = articleEle.getElementById("article_content");
    // News body content; text() would drop the HTML tags, so use toString()
    // to keep the HTML for display in a WebView on Android
    String contentStr = contentEle.toString();

    Elements images = contentEle.getElementsByTag("img");
    String[] imageUrls = new String[images.size()];
    for (int i = 0; i < imageUrls.length; i++) {
        imageUrls[i] = images.get(i).attr("src");
    }

    return new ArticleItem(currentPage, imageUrls, titleStr, dateStr, sourceStr, readTimes, contentStr);
}

public static void main(String[] args) throws CommonException {
    System.out.println(getNewsItem(7928));
}

Output information

ArticleItem [index=7928,
 imageUrls=[/uploads/image/20160114/20160114225911_34428.png],
 title=Electric Courtyard 2014 development "Let the flower of Bloom the Winter campus" educational activities,
 publishDate=2016-01-14,
 source=sources: Movie news Network,
 readTimes=200,
 body=<div id="article_content">
 <p style="text-indent:2em;" align="justify"> <strong><span style="font-size:16px;line-height:1.5;">News</span></strong><span style="font-size:16px;line-height:1.5;">(Correspondent</span><strong><span style="font-size:16px;line-height:1.5;">Linda Ding Wang Judan</span></strong><span style="font-size:16px;line-height:1.5;"> ...

This article explained how to implement a web crawler with Jsoup. If you found it helpful, please give it a like.
