Crawling web page data with jsoup on Android


Another month has slipped by without my noticing; I have been a little busy lately. Without further ado, let's get straight to today's topic.

jsoup is a Java HTML parser that combines the best of DOM, CSS, and jQuery-like methods. As that description suggests, it makes parsing HTML convenient in both Java and Android.

HTML tags

To scrape other people's HTML, you first need some basic HTML knowledge, such as the commonly used tags and their attributes. I won't go into that here; any related questions can be looked up on www.w3school.com.cn.

Loading Web pages

The simplest is to load a Web page directly:

  Document document = Jsoup.connect("https://www.google.com").get();

Notice the get() at the end. As you have probably guessed, there is a corresponding post() method. The HTTP request itself can also be configured, including headers, request parameters, timeouts, and so on. Local files (IO streams) can be parsed directly as well.

    Document document = Jsoup.connect("https://android-arsenal.com")
            .timeout(5000)
            .cookie("cookie", "cxxx")
            .header("xx", "xx")
            .userAgent("")
            .get();

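
For the local case, here is a minimal sketch of parsing markup without touching the network. The class name, the sample HTML, and the temp file are made up for illustration, and it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class, only for illustration.
final class LocalParseSketch {

    // Parse markup that is already in memory; no HTTP request is made.
    static String titleFromString(String html) {
        Document document = Jsoup.parse(html);
        return document.title();
    }

    // Parse a local file, stating its charset explicitly.
    static String titleFromFile(File file) throws IOException {
        Document document = Jsoup.parse(file, "UTF-8");
        return document.title();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Hello jsoup</title></head><body></body></html>";
        System.out.println(titleFromString(html));

        // Temp file created only so the sketch is self-contained.
        File file = File.createTempFile("jsoup-sketch", ".html");
        file.deleteOnExit();
        Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));
        System.out.println(titleFromFile(file));
    }
}
```

Either way you end up with the same kind of Document as with connect(...).get(), so everything that follows applies unchanged.
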
Basic tag parsing

We now have a Document object. It wraps the entire requested page, and everything we need can be retrieved from it.

Suppose we have the following HTML to parse:

    <div class="project-info clearfix">
      <div class="header">
        <div class="title">
          <a href="/details/1/5442">RendererRecyclerViewAdapter</a>
          <a class="tags" href="/tag/199">Recycler Views</a>
        </div>
        <a class="badge free" href="/free">Free</a>
        <a class="badge new" href="/recent">New</a>
      </div>
      <div class="desc">
        <p>A single adapter for the whole project.</p>
        <ul>
          <li>Now you do not need to implement adapters for RecyclerView.</li>
          <li>You can easily use several types of cells in a single list.</li>
          <li>Using this library will protect you from the appearance of any business logic in an adapter.</li>
        </ul>
      </div>
      <div class="ftr l"><i class="fa fa-calendar"></i> Mar, 2017</div>
    </div>

The method jsoup uses to find tags is select(), and it is remarkably powerful. Let's go through it step by step.

Say we want to find <div class="project-info clearfix"> in that sea of tags. What we would like is something along the lines of findElementByClass(), so how does jsoup express this?

Easy, you might think: document.select("div.project-info clearfix"). But that is not quite right. What does the space inside the class attribute mean? It separates two class names, whereas in a selector a space means "descendant of". The correct form is document.select("div.project-info.clearfix"): each space in the class attribute becomes a dot in the selector.

      Elements select = document.select("div.project-info.clearfix");

What we get back is a collection (Elements). We then iterate over it and pull out each tag inside.

Next comes the title: a <div> with <a> tags nested inside it. For each <a> we need both its href and its text, and jsoup provides the two corresponding methods, attr() and text().
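
To make the difference between the two selectors concrete, here is a small sketch. The markup is a trimmed-down version of the example above, the second <div> is made up for contrast, and the class name is hypothetical; it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, only for illustration.
final class ClassSelectorSketch {

    // Trimmed-down markup; the second <div> is invented for contrast.
    static final String HTML =
        "<div class=\"project-info clearfix\"><div class=\"title\">"
      + "<a href=\"/details/1/5442\">RendererRecyclerViewAdapter</a></div></div>"
      + "<div class=\"project-info\">no clearfix here</div>";

    // Number of elements a selector matches in the sample markup.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Both classes required: matches only the first <div>.
        System.out.println(count("div.project-info.clearfix"));
        // With a space, jsoup looks for a <clearfix> element inside div.project-info.
        System.out.println(count("div.project-info clearfix"));

        // Iterate over the collection and pull something out of each match.
        Elements projects = Jsoup.parse(HTML).select("div.project-info.clearfix");
        for (Element project : projects) {
            System.out.println(project.select("div.title > a").text());
        }
    }
}
```
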

    // e is one element from the collection returned by the select() above.
    Elements elements = e.select("div.title");
    if (!elements.isEmpty()) {
        for (Element titleDiv : elements) {
            // The first <a> with an href carries the project name and its detail URL.
            Element first = titleDiv.select("a[href]").first();
            if (first != null) {
                String title = first.text();
                String titleUrl = first.attr("href");
                System.out.println("Name: " + title);
                System.out.println("Detail URL: " + titleUrl);
            }
            // <a class="tags"> carries the category tag and its URL.
            Elements tags = titleDiv.select("a.tags");
            if (!tags.isEmpty()) {
                String tag = tags.text();
                String tagUrl = tags.attr("href");
                System.out.println("tags: " + tag);
                System.out.println("tagUrl: " + tagUrl);
            }
        }
    }

Nested parsing

That about covers <div> and <a> tags. Next up is parsing <div class="desc">.

    <div class="desc">
      <p>A single adapter for the whole project.</p>
      <ul>
        <li>Now you do not need to implement adapters for RecyclerView.</li>
        <li>You can easily use several types of cells in a single list.</li>
        <li>Using this library will protect you from the appearance of any business logic in an adapter.</li>
      </ul>
    </div>

Here we also have <ul> and <li>. The idea is much the same, but these tags have neither a class nor an id, so how should we parse them?

Once again we come back to select(), this time using its syntax for specifying the nesting level.

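    // "parent > child" matches only direct children of div.desc.
    Elements select1 = e.select("div.desc > p");
    // toString() yields the outer HTML of the match; use select1.text() for the text alone.
    String s = select1.toString();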
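
The same level-by-level syntax reaches the <li> items too. A rough sketch, with a shortened version of the markup above and a hypothetical class name, assuming jsoup is on the classpath:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, only for illustration.
final class ChildSelectorSketch {

    // Shortened version of the desc block; the item texts are abbreviated.
    static final String HTML =
        "<div class=\"desc\">"
      + "<p>A single adapter for the whole project.</p>"
      + "<ul><li>First point</li><li>Second point</li></ul>"
      + "</div>";

    // Text of the <p> that is a direct child of div.desc.
    static String summary(String html) {
        Document document = Jsoup.parse(html);
        return document.select("div.desc > p").text();
    }

    // Text of every <li>, walking down one level at a time.
    static List<String> bullets(String html) {
        List<String> result = new ArrayList<>();
        for (Element li : Jsoup.parse(html).select("div.desc > ul > li")) {
            result.add(li.text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(summary(HTML));
        System.out.println(bullets(HTML));
    }
}
```
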

For paired tags such as <dt> and <dd>, you can use the + (adjacent sibling) combinator. For example, suppose I only want the tag name and URL in the <dd> that follows the <dt> labelled Tag. How should that be written?

    <dt>Tag</dt>
    <dd><a href="/tag/9">Background Processing</a></dd>
    <dt>License</dt>
    <dd><a href="http://opensource.org/licenses/Apache-2.0" rel="nofollow" target="_blank">Apache License, Version 2.0</a></dd>

The code looks like this; it uses the more advanced nested syntax of select().

 Elements select4 = element.select("dt:contains(Tag) + dd");

There is not much to explain here; the selector reads clearly. Finally, select() also supports regular-expression matching.
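
Here is a sketch that puts both ideas together: dt:contains(Tag) + dd to reach the link, and an attribute regex ([attr~=regex]) as one way to match by pattern. The surrounding <dl>, the class name, and the regex itself are assumptions for illustration; it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class, only for illustration.
final class DefinitionListSketch {

    // The <dl> wrapper is assumed; the original snippet shows only the dt/dd pairs.
    static final String HTML =
        "<dl>"
      + "<dt>Tag</dt><dd><a href=\"/tag/9\">Background Processing</a></dd>"
      + "<dt>License</dt><dd><a href=\"http://opensource.org/licenses/Apache-2.0\">"
      + "Apache License, Version 2.0</a></dd>"
      + "</dl>";

    // Link inside the <dd> that immediately follows the <dt> containing "Tag", or null.
    static Element tagLink(String html) {
        return Jsoup.parse(html).select("dt:contains(Tag) + dd > a").first();
    }

    // Number of links whose href matches "/tag/" followed by digits only.
    static int countTagLinksByRegex(String html) {
        return Jsoup.parse(html).select("a[href~=^/tag/\\d+$]").size();
    }

    public static void main(String[] args) {
        Element link = tagLink(HTML);
        if (link != null) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
        System.out.println(countTagLinksByRegex(HTML));
    }
}
```
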

Adjacent sibling parsing

Another situation: the tag we need has no id or class of its own, and there is no direct parent tag or fixed nesting relationship to key on. For example:

    <a id="favoriteButton" href="#" class="fa fa-star-o favorite tshadow" title="Add to favorites"></a>
    <a href="/details/1/5244">ImmediateLooperScheduler</a>
    <div id="githubInfoValue">

Here we only need the second <a> tag. How do we get at it? This is where nextElementSibling() comes in.

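    // h1 is assumed to be the element that contains these tags, found by an earlier select().
    Element ssa = h1.select("a#favoriteButton").first();
    if (ssa != null) {
        // The next element sibling of the favorite button is the second <a>.
        Element element = ssa.nextElementSibling();
        if (element != null) {
            String title = element.text();
        }
    }
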
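
For comparison, the same lookup can probably be written with the + combinator from earlier as well. This sketch runs both routes on a copy of the markup; the <h1> wrapper and the class name are assumptions, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class, only for illustration.
final class SiblingSketch {

    // The <h1> wrapper is assumed; the original snippet shows only the two links.
    static final String HTML =
        "<h1>"
      + "<a id=\"favoriteButton\" href=\"#\" class=\"fa fa-star-o favorite tshadow\""
      + " title=\"Add to favorites\"></a> "
      + "<a href=\"/details/1/5244\">ImmediateLooperScheduler</a>"
      + "</h1>";

    // Route 1: find the anchor by id, then step to its next element sibling.
    static String viaMethod(String html) {
        Element button = Jsoup.parse(html).select("a#favoriteButton").first();
        Element next = (button == null) ? null : button.nextElementSibling();
        return (next == null) ? "" : next.text();
    }

    // Route 2: let the selector express "the <a> right after #favoriteButton".
    static String viaSelector(String html) {
        Element next = Jsoup.parse(html).select("a#favoriteButton + a").first();
        return (next == null) ? "" : next.text();
    }

    public static void main(String[] args) {
        System.out.println(viaMethod(HTML));
        System.out.println(viaSelector(HTML));
    }
}
```

The method route is handy when you already hold the element; the selector route keeps everything in one select() call.
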
Fuzzy parsing

Sometimes all we know is what an attribute value starts with, ends with, or contains, and then we need a fuzzy match.

jsoup defines the syntax for these conditions inside select(): a[href^=http] matches values that start with the given text, a[href$=.jpg] matches values that end with it, and a[href*=/search/] matches values that contain it.
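
A quick sketch of the three forms side by side. The links and the class name are invented for illustration, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, only for illustration.
final class AttributeMatchSketch {

    // Made-up links; the link text is just a label to tell them apart.
    static final String HTML =
        "<a href=\"http://example.com/a.jpg\">1</a>"
      + "<a href=\"/search/q\">2</a>"
      + "<a href=\"https://example.com/search/x\">3</a>"
      + "<a href=\"/img/b.jpg\">4</a>";

    // Labels of the links matched by a selector, in document order.
    static List<String> labels(String cssQuery) {
        List<String> result = new ArrayList<>();
        for (Element a : Jsoup.parse(HTML).select(cssQuery)) {
            result.add(a.text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(labels("a[href^=http]"));     // starts with "http"
        System.out.println(labels("a[href$=.jpg]"));     // ends with ".jpg"
        System.out.println(labels("a[href*=/search/]")); // contains "/search/"
    }
}
```
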

JavaScript parsing

So far we have covered ordinary tags and their content. What if I want the script tags and their content? It is not hard: the only difference is that at the end you call data() instead of text().

With jsoup, the main work is always writing the select() expression:

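    final Elements script = document.select("script");
    // first() returns null when the page has no <script>, so check before calling data().
    Element firstScript = script.first();
    String js = (firstScript != null) ? firstScript.data() : "";
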
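
To see why data() is needed, this sketch calls both methods on the same <script> element. The page content and the class name are made up, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class, only for illustration.
final class ScriptDataSketch {

    // Made-up page with one inline script.
    static final String HTML =
        "<html><head><script>var pageId = 42;</script></head>"
      + "<body><p>Hello</p></body></html>";

    // text() only collects text nodes, so a script body does not show up here.
    static String scriptText(String html) {
        Element script = Jsoup.parse(html).select("script").first();
        return (script == null) ? "" : script.text();
    }

    // data() returns the raw content of data nodes such as script bodies.
    static String scriptData(String html) {
        Element script = Jsoup.parse(html).select("script").first();
        return (script == null) ? "" : script.data();
    }

    public static void main(String[] args) {
        System.out.println("text(): [" + scriptText(HTML) + "]");
        System.out.println("data(): [" + scriptData(HTML) + "]");
    }
}
```
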
Hands-on practice

Android Arsenal: I wonder whether you have heard of this site. It gives us Android developers a platform for exchanging information, with real-time updates on Android-related apps, libraries, and demos. I noticed it also has its own client, so out of curiosity I downloaded it to take a look. It turns out the client simply loads the web pages, and worse, ads are flying all over the place. That is rather unpleasant (then again, if they could not earn a little from ads, why would they run the platform?).

So I had a flash of inspiration: why not build an Android Arsenal client of my own? That would make it easy to check the latest things on my phone. So I built one, using jsoup to crawl the relevant pages and filter out the ad-related tags, which makes for a much cleaner experience. Of course, only part of the functionality is implemented so far. If you like it, feel free to give it a star or download it and try it out!

One last plug:

Project Address: https://github.com/lovejjfg/Android-Arsenal

--Edit by Joe at 2017 03 18--
