Another month has slipped by before I knew it; I have been fairly busy lately. Without further ado, on to today's topic.
jsoup: Java HTML Parser, with the best of DOM, CSS, and jquery. As that tagline suggests, it is a convenient way for us Java and Android developers to parse HTML.
HTML tags
To scrape other people's HTML tags, you first need some basic HTML knowledge, such as the commonly used tags and their attributes. I won't go into that here; any related questions can be answered at www.w3school.com.cn.
Loading Web pages
The simplest approach is to load a web page directly:
Document document = Jsoup.connect("https://www.google.com").get();
Note the get() at the end. As you have probably guessed, there is a corresponding post() method. Other aspects of the HTTP request can be set as well, including headers, request parameters, the request timeout, and so on. Local files (I/O streams) can also be parsed directly.
Document document = Jsoup.connect("https://android-arsenal.com")
        .timeout(5000)
        .cookie("cookie", "cxxx")
        .header("xx", "xx")
        .userAgent("")
        .get();
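For the local-input case mentioned above, here is a minimal sketch of parsing from an I/O stream instead of the network. The class name, the sample markup, and the example.com base URI are placeholders of mine, and an in-memory stream stands in for a real file.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original project.
final class LocalParseSketch {

    // Parses HTML from any InputStream; baseUri is only used to resolve relative links.
    static Document parseStream(InputStream in, String baseUri) {
        try {
            return Jsoup.parse(in, "UTF-8", baseUri);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the <title> of an in-memory snippet (a stand-in for a local file).
    static String titleOfSample() {
        String html = "<html><head><title>Sample</title></head>"
                + "<body><p>Hello</p></body></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        return parseStream(in, "https://example.com/").title();
    }

    public static void main(String[] args) {
        System.out.println(titleOfSample());
    }
}
```

For a real file, a FileInputStream can be passed to the same method; jsoup also has a Jsoup.parse(File, String) overload.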
Basic tag parsing
That gives us a Document object. It wraps the entire requested page, and all the content we want can be obtained from it.
Suppose there is an HTML fragment that needs to be parsed:
<div class="project-info clearfix">
  <div class="header">
    <div class="title">
      <a href="/details/1/5442">RendererRecyclerViewAdapter</a>
      <a class="tags" href="/tag/199">recycler views</a>
    </div>
    <a class="badge free" href="/free">Free</a>
    <a class="badge new" href="/recent">New</a>
  </div>
  <div class="desc">
    <p>A single adapter for the whole project.</p>
    <ul>
      <li>Now you do not need to implement adapters for RecyclerView.</li>
      <li>You can easily use several types of cells in a single list.</li>
      <li>Using this library will protect you from the appearance of any business logic in an adapter.</li>
    </ul>
  </div>
  <div class="ftr l"><i class="fa fa-calendar"></i> Mar, 2017</div>
</div>
jsoup's method for finding tags is select(), and it is remarkably powerful. Let's go step by step.
Say we want to find <div class="project-info clearfix"> in a sea of tags. What we want is something like findElementByClass(), so how does jsoup express that?
It looks easy: document.select("div.project-info clearfix"). But that is not right. What does the space in the class attribute mean? It separates two class names, so in the selector each space has to be replaced with a dot (.). The correct form is document.select("div.project-info.clearfix").
Elements select = document.select("div.project-info.clearfix");
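To make the difference concrete, here is a small sketch that runs both spellings against sample markup. The class name and the second <div> are my own additions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the markup below is a trimmed-down sample.
final class MultiClassSelectorSketch {

    static final String HTML =
            "<div class=\"project-info clearfix\"><div class=\"header\">header</div></div>"
          + "<div class=\"project-info\">no clearfix here</div>";

    // Counts how many elements a CSS query matches in the sample markup.
    static int count(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // A space means "descendant": a <clearfix> tag inside div.project-info.
        System.out.println(count("div.project-info clearfix"));   // 0
        // Chained dots mean "has both classes".
        System.out.println(count("div.project-info.clearfix"));   // 1
        // A single class matches both outer divs.
        System.out.println(count("div.project-info"));            // 2
    }
}
```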
What we get back here is a collection. We then iterate over it and pull out each tag inside.
Next comes the title. Here a <div> has <a> tags nested inside it, so we need to parse the <a> tags. We want both the href and the text, and jsoup provides two matching methods: attr() and text().
// e is one element taken from iterating the collection selected above
Elements elements = e.select("div.title");
if (!elements.isEmpty()) {
    for (Element titleDiv : elements) {
        // the first <a> that carries an href is the project link
        Element first = titleDiv.select("a[href]").first();
        if (first != null) {
            String title = first.text();
            String titleUrl = first.attr("href");
            System.out.println("Name: " + title);
            System.out.println("Detail URL: " + titleUrl);
        }
        // the <a class="tags"> link holds the tag name and its URL
        Elements select1 = titleDiv.select("a.tags");
        if (!select1.isEmpty()) {
            String tag = select1.text();
            String tagUrl = select1.attr("href");
            System.out.println("tags:" + tag);
            System.out.println("tagUrl:" + tagUrl);
        }
    }
}
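One thing to watch: the href values on this page are relative (/details/1/5442). If you need a full URL, a possible approach is to pass a base URI when parsing and then ask for absUrl("href"). The sketch below assumes https://android-arsenal.com/ as the base, and the class name is made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class showing text(), attr() and absUrl() side by side.
final class LinkExtractSketch {

    static final String HTML =
            "<div class=\"title\">"
          + "<a href=\"/details/1/5442\">RendererRecyclerViewAdapter</a> "
          + "<a class=\"tags\" href=\"/tag/199\">recycler views</a>"
          + "</div>";

    // Returns {text, raw href, absolute href} of the first link, or an empty array.
    static String[] firstLink() {
        // The second argument is the base URI used to resolve relative links.
        Document document = Jsoup.parse(HTML, "https://android-arsenal.com/");
        Element link = document.select("div.title a[href]").first();
        if (link == null) {
            return new String[0];
        }
        return new String[] { link.text(), link.attr("href"), link.absUrl("href") };
    }

    public static void main(String[] args) {
        for (String part : firstLink()) {
            System.out.println(part);
        }
    }
}
```

When the Document comes from Jsoup.connect(...).get(), the base URI is already the page URL, so absUrl() should work without the extra argument.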
Nested parsing
With the <div> and <a> tags basically covered, next up is parsing <div class="desc">.
<div class="desc">
  <p>A single adapter for the whole project.</p>
  <ul>
    <li>Now you do not need to implement adapters for RecyclerView.</li>
    <li>You can easily use several types of cells in a single list.</li>
    <li>Using this library will protect you from the appearance of any business logic in an adapter.</li>
  </ul>
</div>
Here we also meet <ul> and <li>. The idea is much the same, but these tags have neither a class nor an id, so how should we parse them?
The answer again lies in select(): use a selector that specifies the nesting level.
// ">" matches only direct children of div.desc
Elements select1 = e.select("div.desc > p");
String s = select1.toString();
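The same level-by-level selector reaches the <li> items too. A rough sketch, with a made-up class name and the sample markup from above:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: collects the text of each <li> under div.desc.
final class DescListSketch {

    static final String HTML =
            "<div class=\"desc\">"
          + "<p>A single adapter for the whole project.</p>"
          + "<ul>"
          + "<li>Now you do not need to implement adapters for RecyclerView.</li>"
          + "<li>You can easily use several types of cells in a single list.</li>"
          + "<li>Using this library will protect you from the appearance of any business logic in an adapter.</li>"
          + "</ul>"
          + "</div>";

    // Walks div.desc > ul > li and returns the text of every item in order.
    static List<String> bulletTexts() {
        Document document = Jsoup.parse(HTML);
        List<String> texts = new ArrayList<>();
        for (Element li : document.select("div.desc > ul > li")) {
            texts.add(li.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        for (String text : bulletTexts()) {
            System.out.println("- " + text);
        }
    }
}
```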
For paired tags such as <dt> and <dd>, you can use the + adjacent-sibling combinator. For example, suppose I only want the name and URL that follow Tag. How should that be written?
<dt>Tag</dt>
<dd><a href="/tag/9">Background Processing</a></dd>
<dt>License</dt>
<dd><a href="http://opensource.org/licenses/Apache-2.0" rel="nofollow" target="_blank">Apache License, Version 2.0</a></dd>
The code looks like this; it uses the more advanced combined syntax of select().
Elements select4 = element.select("dt:contains(Tag) + dd");
There is not much more to explain here; the selector documentation describes it clearly. Finally, select() also supports regular-expression matching.
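Here is a small runnable sketch of the sibling selector, plus one way to read the "regular-expression matching" remark: jsoup's [attr~=regex] attribute selector. The class name, the <dl> wrapper, and the regex are my own assumptions, not taken from the project.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class for "dt + dd" and regex attribute matching.
final class SiblingSelectorSketch {

    // The <dl> wrapper is added here only to make the fragment self-contained.
    static final String HTML =
            "<dl>"
          + "<dt>Tag</dt>"
          + "<dd><a href=\"/tag/9\">Background Processing</a></dd>"
          + "<dt>License</dt>"
          + "<dd><a href=\"http://opensource.org/licenses/Apache-2.0\" rel=\"nofollow\""
          + " target=\"_blank\">Apache License, Version 2.0</a></dd>"
          + "</dl>";

    // Returns {name, url} of the link in the <dd> right after <dt>Tag</dt>.
    static String[] tagLink() {
        Document document = Jsoup.parse(HTML);
        Element link = document.select("dt:contains(Tag) + dd").select("a[href]").first();
        if (link == null) {
            return new String[0];
        }
        return new String[] { link.text(), link.attr("href") };
    }

    // Counts links whose href matches a regular expression (illustrative pattern).
    static int countTagLinksByRegex() {
        Document document = Jsoup.parse(HTML);
        return document.select("a[href~=/tag/\\d+]").size();
    }

    public static void main(String[] args) {
        String[] link = tagLink();
        System.out.println(link[0] + " -> " + link[1]);
        System.out.println(countTagLinksByRegex());
    }
}
```

Note that :contains() is a case-insensitive substring match, so a label like "Tags" would match as well.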
Adjacent sibling parsing
Another situation: the tag we need has no specific id or class, and no direct parent tag or fixed nesting relationship to rely on, as in the following:
<a id="favoriteButton" href="#" class="fa fa-star-o favorite tshadow" title="Add to favorites"></a>
<a href="/details/1/5244">ImmediateLooperScheduler</a>
<div id="githubInfoValue">
Here we only need the second <a> tag, so what do we do? This is where nextElementSibling() comes in.
// find the anchor that does have an id, then step to the element right after it
Element ssa = h1.select("a#favoriteButton").first();
Element element = ssa.nextElementSibling();
String title = element.text();
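A null-safe version of the same idea, next to what I believe is an equivalent pure-selector form using the + combinator. The class name and the <h1> wrapper are assumptions on my part; the snippet above only shows that an h1 variable exists.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class: two ways to reach the element after a known anchor.
final class NextSiblingSketch {

    // The <h1> wrapper is assumed for this sketch.
    static final String HTML =
            "<h1>"
          + "<a id=\"favoriteButton\" href=\"#\" class=\"fa fa-star-o favorite tshadow\""
          + " title=\"Add to favorites\"></a> "
          + "<a href=\"/details/1/5244\">ImmediateLooperScheduler</a>"
          + "</h1>"
          + "<div id=\"githubInfoValue\"></div>";

    // Step from the known anchor to its next element sibling.
    static String viaSibling() {
        Document document = Jsoup.parse(HTML);
        Element anchor = document.select("a#favoriteButton").first();
        if (anchor == null) {
            return "";
        }
        Element next = anchor.nextElementSibling();
        return next == null ? "" : next.text();
    }

    // Same result with the adjacent-sibling combinator in a single query.
    static String viaSelector() {
        Document document = Jsoup.parse(HTML);
        Element next = document.select("a#favoriteButton + a").first();
        return next == null ? "" : next.text();
    }

    public static void main(String[] args) {
        System.out.println(viaSibling());
        System.out.println(viaSelector());
    }
}
```

The null checks matter on real pages: first() and nextElementSibling() both return null when nothing is found.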
Fuzzy parsing
Sometimes all we know about a tag such as a <div> is what it starts with, ends with, or contains, and then we need a fuzzy search.
jsoup defines the syntax for these conditions in select(): a[href^=http] for "starts with", a[href$=.jpg] for "ends with", and a[href*=/search/] for "contains".
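A quick sketch that tries all three forms on made-up links; the class name and the markup are for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class for prefix, suffix and substring attribute selectors.
final class FuzzySelectorSketch {

    static final String HTML =
            "<a href=\"http://example.com/\">Home</a>"
          + "<a href=\"/images/logo.jpg\">Logo</a>"
          + "<a href=\"/search/?q=jsoup\">Search</a>";

    // Returns the text of the first element matching the query, or "" if none.
    static String firstText(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        Element match = document.select(cssQuery).first();
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        System.out.println(firstText("a[href^=http]"));      // href starts with "http"
        System.out.println(firstText("a[href$=.jpg]"));      // href ends with ".jpg"
        System.out.println(firstText("a[href*=/search/]"));  // href contains "/search/"
    }
}
```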
JavaScript parsing
So far we have covered ordinary tags and their content. What if I want JS-related tags and their content? It is not hard: at the end, use the data() method instead of text().
With jsoup, the main thing is writing the select() query:
final Elements script = document.select("script");
// data() returns the raw script body
String js = script.first().data();
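To see why data() is needed, here is a small sketch comparing it with text() on a script tag. The class name and the one-line script are made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class: text() vs data() on a <script> element.
final class ScriptDataSketch {

    static final String HTML =
            "<html><head><script>var page = 1;</script></head>"
          + "<body><p>Hi</p></body></html>";

    // Returns the first <script> element of the sample, or null if there is none.
    static Element firstScript() {
        Document document = Jsoup.parse(HTML);
        return document.select("script").first();
    }

    // Script bodies are stored as data nodes, so data() returns the source.
    static String scriptData() {
        Element script = firstScript();
        return script == null ? "" : script.data();
    }

    // text() only collects text nodes, so it comes back empty for a script.
    static String scriptText() {
        Element script = firstScript();
        return script == null ? "" : script.text();
    }

    public static void main(String[] args) {
        System.out.println("data: [" + scriptData() + "]");
        System.out.println("text: [" + scriptText() + "]");
    }
}
```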
Hands-on practice
Android-Arsenal: have you heard of this site, folks? It gives us Android developers a platform for exchanging information, with real-time updates on Android-related apps, development libraries, and demos. I noticed it also has its own client, so out of curiosity I downloaded it to take a look. It turns out the client simply loads the web pages, and the real problem is that ads are everywhere. That is uncomfortable (then again, if they could not make a little money from ads, why would they run this platform at all?).
So I had a brainwave: why not build an Android-Arsenal client of my own? That would make it easy to see the latest things on my phone. So I made a client and used jsoup to crawl the corresponding pages. The ad-related tags are filtered out, so it is very refreshing to use. Of course, only part of the functionality is implemented so far. If you like it, feel free to give it a star or download it and try it out!
One last plug:
Project Address: https://github.com/lovejjfg/Android-Arsenal
--Edit by Joe at 2017 03 18--
Android: crawling web page data with jsoup