I. Requirements
While recently rebuilding the news app around Material Design, I ran into a data source problem. Earlier projects parsed the APIs of apps such as the Daily and Phoenix News, fetching the news as JSON from the corresponding URLs. To get more practice writing code, the author decided to crawl the news pages directly and build the API from the scraped data.
II. Screenshots
The image below shows a page from the original site.
The crawler fetches this data and the app displays it on the phone.
III. Crawler approach
The app side can be implemented by referring to the articles below; this article focuses on how to crawl the data.
Android: recording app operation to generate a GIF of the whole process: http://www.jb51.net/article/78236.htm
Learning Android Material Design (RecyclerView instead of ListView): http://www.jb51.net/article/78232.htm
Android project practice: imitating the NetEase News page (RecyclerView): http://www.jb51.net/article/78230.htm
Jsoup introduction
Jsoup is an open-source Java HTML parser that can parse HTML directly from a URL, a file, or a string.
Jsoup mainly provides the following functions:
- Parse HTML from a URL, file, or string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text;
- Clean untrusted HTML (to prevent XSS attacks).
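As a quick illustration of the first two points, here is a minimal, self-contained sketch (the HTML fragment and class name are invented for the example, and it assumes the jsoup library is on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupIntroDemo {
    // Parse an HTML string and pull out one element's text
    // via the DOM-style API.
    static String titleById(String html) {
        Document doc = Jsoup.parse(html);
        return doc.getElementById("article_title").text();
    }

    // The same lookup expressed as a CSS selector.
    static String titleBySelector(String html) {
        return Jsoup.parse(html).select("#article_title").text();
    }

    public static void main(String[] args) {
        // Tiny invented fragment standing in for a fetched news page
        String html = "<div id=\"article\"><div id=\"article_title\">Sample headline</div></div>";
        System.out.println(titleById(html));       // Sample headline
        System.out.println(titleBySelector(html)); // Sample headline
    }
}
```

Both calls return the same text; the CSS-selector form is often more convenient when the query is more complex than a single id.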
IV. Crawler process
GET request to fetch the page HTML
The DOM tree of the news page HTML looks like this:
The following code issues a GET request to the specified URL and returns the HTML source:
public static String doGet(String urlStr) throws CommonException {
    URL url;
    String html = "";
    try {
        url = new URL(urlStr);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");
        connection.setConnectTimeout(5000);
        connection.setDoInput(true);
        // Note: a GET request sends no body, so setDoOutput(true) is unnecessary
        if (connection.getResponseCode() == 200) {
            InputStream in = connection.getInputStream();
            html = StreamTool.inToStringByByte(in);
        } else {
            throw new CommonException("News server response code is not 200");
        }
    } catch (Exception e) {
        e.printStackTrace();
        throw new CommonException("GET request failed");
    }
    return html;
}
Converting the InputStream returned by connection.getInputStream() into a String is a common requirement, so we abstract it into a utility method:
public class StreamTool {
    public static String inToStringByByte(InputStream in) throws Exception {
        byte[] buffer = new byte[1024];
        int len;
        StringBuilder content = new StringBuilder();
        while ((len = in.read(buffer)) != -1) {
            content.append(new String(buffer, 0, len, "UTF-8"));
        }
        return content.toString();
    }
}
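One caveat worth noting (this variant is the author of this edit's sketch, not from the original article): decoding each 1024-byte chunk separately can corrupt a multi-byte UTF-8 character that straddles two read() calls. A safer version buffers all the bytes first and decodes once:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamToolSafe {
    // Collect all bytes first, then decode once, so a multi-byte
    // UTF-8 character split across read() chunks cannot be corrupted.
    public static String inToString(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // A small in-memory stream stands in for the network stream
        InputStream in = new ByteArrayInputStream("héllo".getBytes("UTF-8"));
        System.out.println(inToString(in)); // héllo
    }
}
```

The difference only shows up on non-ASCII pages, but a news crawler for a Chinese site will hit it constantly.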
V. Parse the HTML to get the title
Using Chrome's "Inspect Element", locate the HTML for the news headline:
<div id="article_title">
We need to find the element with id="article_title" in the HTML above, using the getElementById(String id) method:
String htmlStr = HttpTool.doGet(urlStr);
// Convert the fetched HTML source into a Document
Document doc = Jsoup.parse(htmlStr);
Element articleEle = doc.getElementById("article");
// Title
Element titleEle = articleEle.getElementById("article_title");
String titleStr = titleEle.text();
VI. Get the publish date and news source
Likewise, locate their HTML in the page. The idea is similar to the above: use getElementById(String id) to find the element with id="article_detail", then use getElementsByTag to get its span parts. Because there are three <span>…</span> elements, the return type is Elements rather than Element.
article_detail contains: 2016-01-15  Source:  Views: 177
Element detailEle = articleEle.getElementById("article_detail");
Elements details = detailEle.getElementsByTag("span");
// Publish time
String dateStr = details.get(0).text();
// News source
String sourceStr = details.get(1).text();
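To make the Element/Elements distinction concrete, here is a self-contained sketch (the HTML snippet is invented to mimic the page structure, and jsoup is assumed on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DetailDemo {
    // Invented stand-in for the article_detail block on the real page
    static final String HTML = "<div id=\"article_detail\">"
            + "<span>2016-01-15</span>"
            + "<span>Source: Example News</span>"
            + "<span>Views:</span>"
            + "</div>";

    public static void main(String[] args) {
        Element detailEle = Jsoup.parse(HTML).getElementById("article_detail");
        // getElementsByTag returns an Elements (a list), not a single Element
        Elements details = detailEle.getElementsByTag("span");
        System.out.println(details.size());        // 3
        System.out.println(details.get(0).text()); // 2016-01-15
        System.out.println(details.get(1).text()); // Source: Example News
    }
}
```

Elements implements List<Element>, so get(0), get(1), and so on index into the matched spans in document order.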
VII. Parse the view count
If you print details.get(2).text(), you will only get:
Views:
No view count? Why?
Because the view count is rendered by JavaScript. Jsoup only fetches the static HTML content and cannot get dynamically rendered data.
There are two ways to solve this problem:
- Embed a browser engine in the crawler, execute the JavaScript to render the page, and then scrape it. Tools for this include Selenium, HtmlUnit, and PhantomJS.
- Analyze the JavaScript requests and find the URL that returns the data directly.
If you visit the URL http://see.xidian.edu.cn/index.php/news/click/id/7428, you get the following result.
The 478 in it is the view count we need: issue a GET request to that URL, take the string it returns, and extract the number with a regular expression.
Note that visiting this URL increases the view count by 1; the number on the news page itself is rendered by JS.
String jsStr = HttpTool.doGet(COUNT_BASE_URL + currentPage);
// Remove all non-digits (\\D+), leaving only the number
int readTimes = Integer.parseInt(jsStr.replaceAll("\\D+", ""));
// Or use the equivalent regex:
// String readTimesStr = jsStr.replaceAll("[^0-9]", "");
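The extraction step can be sketched in isolation (the JS snippet below is an invented stand-in for whatever the click-count URL actually returns):

```java
public class ReadTimesDemo {
    // Strip everything that is not a digit, then parse what remains.
    static int parseReadTimes(String jsStr) {
        return Integer.parseInt(jsStr.replaceAll("\\D+", ""));
    }

    public static void main(String[] args) {
        // Invented sample of the JS fragment the count URL might return
        String jsStr = "$('#news_click_count').html(478);";
        System.out.println(parseReadTimes(jsStr)); // 478
    }
}
```

Note this only works because the response contains a single number; if the response ever contained two numbers, they would be concatenated and the result would be wrong.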
VIII. Parse the news content
Originally the plan was to get the news content as plain text, but it turned out the Android side can also display CSS formatting, so the content keeps its HTML format.
Element contentEle = articleEle.getElementById("article_content");
// News body content
String contentStr = contentEle.toString();
// The text() method would strip the HTML tags from the news body;
// to display the HTML in a WebView on Android, use toString() instead
// String contentStr = contentEle.text();
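The difference between text() and toString() can be seen directly (the body fragment here is invented; real pages carry inline styles as well, and jsoup is assumed on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ContentDemo {
    static Element content(String html) {
        return Jsoup.parse(html).getElementById("article_content");
    }

    public static void main(String[] args) {
        Element contentEle = content("<div id=\"article_content\"><p><b>News</b> body</p></div>");
        // toString() keeps the markup, which a WebView can render
        System.out.println(contentEle.toString());
        // text() strips every tag, leaving plain text only
        System.out.println(contentEle.text()); // News body
    }
}
```

toString() also pretty-prints the markup by default, which is harmless for WebView display.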
IX. Parse the image URLs
A news page contains many images of various sizes. To keep only the images in the news body, it is best to first locate the news content element and then use getElementsByTag("img") to filter out the images.
Element contentEle = articleEle.getElementById("article_content");
Elements images = contentEle.getElementsByTag("img");
String[] imageUrls = new String[images.size()];
for (int i = 0; i < imageUrls.length; i++) {
    imageUrls[i] = images.get(i).attr("src");
}
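A self-contained sketch of the image filtering (the HTML is invented; note the stray logo image outside the content div is deliberately excluded):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ImageDemo {
    static String[] extractImageUrls(String html) {
        Element contentEle = Jsoup.parse(html).getElementById("article_content");
        // Only <img> tags inside the news body, ignoring the rest of the page
        Elements images = contentEle.getElementsByTag("img");
        String[] imageUrls = new String[images.size()];
        for (int i = 0; i < imageUrls.length; i++) {
            imageUrls[i] = images.get(i).attr("src");
        }
        return imageUrls;
    }

    public static void main(String[] args) {
        String html = "<div id=\"article_content\">"
                + "<img src=\"/uploads/a.png\"><img src=\"/uploads/b.png\">"
                + "</div><img src=\"/logo.png\">";
        for (String url : extractImageUrls(html)) {
            System.out.println(url);
        }
    }
}
```

The src values here are site-relative; if the document is parsed with a base URI (Jsoup.parse(html, baseUri) or Jsoup.connect(url).get()), attr("abs:src") resolves them to absolute URLs.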
X. News entity class (JavaBean)
Having obtained the news headline, publish date, view count, news content, and so on, we naturally construct a JavaBean to encapsulate the scraped content in an entity class.
public class ArticleItem {
    private int index;
    private String[] imageUrls;
    private String title;
    private String publishDate;
    private String source;
    private int readTimes;
    private String body;

    public ArticleItem(int index, String[] imageUrls, String title, String publishDate,
            String source, int readTimes, String body) {
        this.index = index;
        this.imageUrls = imageUrls;
        this.title = title;
        this.publishDate = publishDate;
        this.source = source;
        this.readTimes = readTimes;
        this.body = body;
    }

    @Override
    public String toString() {
        return "ArticleItem [index=" + index + ",\n imageUrls=" + Arrays.toString(imageUrls)
                + ",\n title=" + title + ",\n publishDate=" + publishDate
                + ",\n source=" + source + ",\n readTimes=" + readTimes
                + ",\n body=" + body + "]";
    }
}
XI. Test
public static ArticleItem getNewsItem(int currentPage) throws CommonException {
    // Build the news URL from the page-number suffix
    String urlStr = ARTICLE_BASE_URL + currentPage + ".html";
    String htmlStr = HttpTool.doGet(urlStr);
    Document doc = Jsoup.parse(htmlStr);
    Element articleEle = doc.getElementById("article");
    // Title
    Element titleEle = articleEle.getElementById("article_title");
    String titleStr = titleEle.text();
    // article_detail contains: 2016-01-15  Source:  Views: 177
    Element detailEle = articleEle.getElementById("article_detail");
    Elements details = detailEle.getElementsByTag("span");
    // Publish time
    String dateStr = details.get(0).text();
    // News source
    String sourceStr = details.get(1).text();
    // Visiting this URL increases the view count by 1; the number is JS-rendered
    String jsStr = HttpTool.doGet(COUNT_BASE_URL + currentPage);
    int readTimes = Integer.parseInt(jsStr.replaceAll("\\D+", ""));
    // Or: String readTimesStr = jsStr.replaceAll("[^0-9]", "");
    // News body content: text() would strip the HTML tags, and the app
    // displays HTML in a WebView, so use toString()
    Element contentEle = articleEle.getElementById("article_content");
    String contentStr = contentEle.toString();
    Elements images = contentEle.getElementsByTag("img");
    String[] imageUrls = new String[images.size()];
    for (int i = 0; i < imageUrls.length; i++) {
        imageUrls[i] = images.get(i).attr("src");
    }
    return new ArticleItem(currentPage, imageUrls, titleStr, dateStr, sourceStr, readTimes, contentStr);
}

public static void main(String[] args) throws CommonException {
    System.out.println(getNewsItem(7928));
}
Output:
ArticleItem [index=7928,
 imageUrls=[/uploads/image/20160114/20160114225911_34428.png],
 title=The Electronics School's 2014 "Let the flower bloom in the winter campus" educational activity,
 publishDate=2016-01-14,
 source=Source: Movie News Network,
 readTimes=200,
 body=<div id="article_content"> <p style="text-indent:2em;" align="justify"><strong><span style="font-size:16px;line-height:1.5;">News</span></strong><span style="font-size:16px;line-height:1.5;">(Correspondent </span><strong><span style="font-size:16px;line-height:1.5;">Linda Ding, Wang Judan</span></strong>…]
This article has explained how to implement a web crawler with Jsoup. If it helped you, please give it a like.