Crawling web page data on Android with jsoup

Last Update:2017-03-21 Source: Internet

Author: User

A month has slipped by without my noticing; things have been a little busy lately. Enough preamble, let's get straight to today's topic.

jsoup is a Java HTML parser that combines the best of DOM, CSS, and jQuery-style methods. As that description suggests, it makes parsing HTML convenient in both Java and Android.

HTML tags

To scrape someone else's HTML you first need some basic HTML knowledge, such as the commonly used tags and their attributes. I won't cover that here; if you have questions, www.w3school.com.cn is a good place to look.

Loading Web pages

The simplest is to load a Web page directly:

  Document document = Jsoup.connect("https://www.google.com").get();

Note the get() at the end. As you might guess, there is a corresponding post() method. You can also configure the HTTP request itself, including headers, request parameters, and timeouts. jsoup can parse local files and IO streams directly as well.

Document document = Jsoup.connect("https://android-arsenal.com")
        .timeout(5000)
        .cookie("cookie", "cxxx")
        .header("xx", "xx")
        .userAgent("")
        .get();

Basic label parsing

This gives us a Document object. It wraps the entire requested page, and everything we need can be extracted from it.

Suppose we have the following HTML to parse:

<div class="project-info clearfix">
  <div class="header">
    <div class="title">
      <a href="/details/1/5442">RendererRecyclerViewAdapter</a>
      <a class="tags" href="/tag/199">Recycler Views</a>
    </div>
    <a class="badge free" href="/free">Free</a>
    <a class="badge new" href="/recent">New</a>
  </div>
  <div class="desc">
    <p>A single adapter for the whole project.</p>
    <ul>
      <li>Now you do not need to implement adapters for RecyclerView.</li>
      <li>You can easily use several types of cells in a single list.</li>
      <li>Using this library will protect you from the appearance of any business logic in an adapter.</li>
    </ul>
  </div>
  <div class="ftr l"><i class="fa fa-calendar"></i> Mar, 2017</div>
</div>

The method jsoup uses to find tags is select(), and it is remarkably powerful. Let's go through it step by step.

For example, to find <div class="project-info clearfix"> among the many tags on the page, we want something like a findElementByClass() lookup. How do we express that in jsoup?

The obvious guess is document.select("div.project-info clearfix"), but that is wrong. The space in the class attribute separates two class names, while a space in a selector means "descendant". The correct form is document.select("div.project-info.clearfix"): each space in the class attribute becomes a dot.

      Elements select = document.select("div.project-info.clearfix");
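The space-to-dot rule can be captured in a small helper. This is only a sketch in plain Java: SelectorBuilder and classSelector are hypothetical names, not part of jsoup, and it assumes the class names contain no characters that would need escaping in a CSS selector.

```java
public class SelectorBuilder {

    /**
     * Builds a selector matching elements of the given tag that carry
     * every class listed in the class attribute value.
     */
    public static String classSelector(String tag, String classAttribute) {
        StringBuilder selector = new StringBuilder(tag);
        // A class attribute is a whitespace-separated list of class names.
        for (String name : classAttribute.trim().split("\\s+")) {
            if (!name.isEmpty()) {
                selector.append('.').append(name);
            }
        }
        return selector.toString();
    }

    public static void main(String[] args) {
        // prints "div.project-info.clearfix"
        System.out.println(classSelector("div", "project-info clearfix"));
    }
}
```

The resulting string can be passed straight to document.select().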

What we get back is a collection. We iterate over it and pull out each tag we need from every element.

Start with the title. Here a <div> contains nested <a> tags, so we need to parse <a>. We want both its href and its text, and jsoup provides two methods for exactly that: attr() and text().

// e is one element of the collection selected above
Elements elements = e.select("div.title");
if (!elements.isEmpty()) {
    for (Element titleDiv : elements) {
        // First link with an href: the project name and its detail page
        Element first = titleDiv.select("a[href]").first();
        if (first != null) {
            title = first.text();
            titleUrl = first.attr("href");
            System.out.println("Name: " + title);
            System.out.println("Detail URL: " + titleUrl);
        }
        // Link with class "tags": the category and its URL
        Elements select1 = titleDiv.select("a.tags");
        if (!select1.isEmpty()) {
            tag = select1.text();
            tagUrl = select1.attr("href");
            System.out.println("tags:" + tag);
            System.out.println("tagUrl:" + tagUrl);
        }
    }
}
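Note that the href values in this markup are site-relative (for example "/details/1/5442"), so a crawler usually has to resolve them against the page URL before following them. A minimal sketch using only java.net.URI; LinkResolver is a hypothetical name, and the base URL is assumed to be the site address used earlier. jsoup can also do this for you through absUrl("href") when the document was loaded with a base URI.

```java
import java.net.URI;

public class LinkResolver {

    /** Resolves an href taken from a page against that page's URL. */
    public static String resolve(String pageUrl, String href) {
        // URI.resolve leaves hrefs that are already absolute unchanged
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // prints "https://android-arsenal.com/details/1/5442"
        System.out.println(resolve("https://android-arsenal.com/", "/details/1/5442"));
    }
}
```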

Nested parsing

That covers <div> and <a>. Next up is parsing <div class="desc">.

<div class="desc">
    <p>A single adapter for the whole project.</p>
    <ul>
        <li>Now you do not need to implement adapters for RecyclerView.</li>
        <li>You can easily use several types of cells in a single list.</li>
        <li>Using this library will protect you from the appearance of any business logic in an adapter.</li>
    </ul>
</div>

This block also contains <p>, <ul>, and <li> elements. The idea is similar, but they have neither a class nor an id, so how do we select them?

We go back to select() and use the child combinator (>) to address elements at a specific level.

// The <p> directly inside div.desc
Elements select1 = e.select("div.desc > p");
String s = select1.toString();
// The <li> items can be reached the same way, one level deeper
Elements items = e.select("div.desc > ul > li");

For paired tags such as <dt> and <dd>, use the adjacent sibling combinator +. For example, suppose we only want the tag name and URL that follow the Tag label. How should that be written?

<dt>Tag</dt>
<dd><a href="/tag/9">Background Processing</a></dd>
<dt>License</dt>
<dd><a href="http://opensource.org/licenses/Apache-2.0" rel="nofollow" target="_blank">Apache License, Version 2.0</a></dd>

The code looks like this, using a more advanced, combined form of select():

 Elements select4 = element.select("dt:contains(Tag) + dd");

There is not much to explain: the selector reads almost literally as "the <dd> immediately following a <dt> that contains the text Tag". Finally, select() also supports regular-expression matching.

Adjacent sibling parsing

Another case: the tag we need has no id or class of its own, and there is no usable parent tag or fixed nesting relationship, as in the following:

<a id="favoriteButton" href="#" class="fa fa-star-o favorite tshadow" title="Add to favorites"></a>
<a href="/details/1/5244">ImmediateLooperScheduler</a>
<div id="githubInfoValue">

Here we want the second <a> tag. The way to reach it is nextElementSibling(): select the first <a> by its id, then step to the element right after it.

Element ssa = h1.select("a#favoriteButton").first();
Element element = ssa.nextElementSibling();
String title = element.text();

Fuzzy parsing

Sometimes all we know about an attribute value is how it starts, how it ends, or a word it contains. In that case we need a fuzzy search.

jsoup defines selector syntax for each of these cases inside select(): a[href^=http] matches values that start with the given text, a[href$=.jpg] matches values that end with it, and a[href*=/search/] matches values that contain it.
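To make the three operators concrete, the sketch below mirrors their meaning with plain Java string methods. It does not call jsoup at all: AttrMatch and matches are hypothetical names used only for illustration, the sample href values are made up, and the sketch ignores finer points such as how jsoup treats letter case.

```java
public class AttrMatch {

    /** Mirrors the attribute operators ^= (prefix), $= (suffix) and *= (substring). */
    public static boolean matches(String value, String operator, String expected) {
        switch (operator) {
            case "^=":
                return value.startsWith(expected);
            case "$=":
                return value.endsWith(expected);
            case "*=":
                return value.contains(expected);
            default:
                throw new IllegalArgumentException("Unsupported operator: " + operator);
        }
    }

    public static void main(String[] args) {
        // Hypothetical href values, checked the way the selectors above would check them
        System.out.println(matches("http://example.com/page", "^=", "http")); // a[href^=http]
        System.out.println(matches("/img/logo.jpg", "$=", ".jpg"));           // a[href$=.jpg]
        System.out.println(matches("/search/?q=jsoup", "*=", "/search/"));    // a[href*=/search/]
    }
}
```

With jsoup itself none of this is needed; the selector string is passed directly to select().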

JavaScript parsing

So far we have covered ordinary tags and their content. What about <script> tags and the JavaScript inside them? It is not hard: the select() call is the same, but at the end you call data() instead of text().

final Elements script = document.select("script");
String js = script.first().data();

In practice

Have you heard of Android-Arsenal? It gives Android developers a platform for sharing information, with real-time updates on Android-related apps, development libraries, and demos. I noticed it also has its own client, so out of curiosity I downloaded it. It turned out the client simply loads the web pages, and ads were flying everywhere. That was unpleasant (though to be fair, if they could not earn a little from ads, why would they run the platform at all?).

Then it struck me: why not build an Android-Arsenal client of my own? That way I could conveniently browse the latest items on my phone. So I wrote one, using jsoup to crawl the relevant pages and filter out the ad tags, which makes for a much cleaner experience. Only part of the functionality is implemented so far. If you like it, feel free to star the project or download and try it!

One last plug:

Project Address: https://github.com/lovejjfg/Android-Arsenal

--Edit by Joe at 2017 03 18--

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
