Today I ran into yet another webpage-scraping task.
When it comes to scraping information from web pages, Jsoup is usually the first tool people reach for: its jQuery-like API is very comfortable to work with. Today, however, let's talk about one of Jsoup's shortcomings.
1. Create a new page
<script type="text/javascript">
var datas = [
    {href: "http://news.qq.com/a/20140416/017800.htm", title: "A university security guard who looks like the writer Mo Yan"},
    {href: "http://news.qq.com/a/20140416/015167.htm", title: "Man holds a dangling woman with one arm for half an hour"},
    {href: "http://news.qq.com/a/20140416/013808.htm", title: "Woman assaulted and photographed while viewing a rental apartment"},
    {href: "http://news.qq.com/a/20140416/016805.htm", title: "Australian camels enjoy ice-cold beer in summer"}
];
window.onload = function () {
    var infos = document.getElementById("infos");
    for (var i = 0; i < datas.length; i++) {
        var a = document.createElement("a");
        a.href = datas[i].href;
        a.innerText = datas[i].title;
        infos.appendChild(a);
        infos.appendChild(document.createElement("br"));
    }
};
</script>
<div id="infos"></div>
Hello Main HttpUnit!
The rendered page shows the four news links.
Inspecting the elements in the browser's developer tools, we can see the generated links in the DOM:
Seeing a page like this, you would assume Jsoup can scrape it easily; it looks like a piece of cake. So we write code like this:
@Test
public void testUseJsoup() {
    try {
        Document doc = Jsoup.connect("http://localhost:8080/strurts2fileupload/main.html")
                .timeout(5000)
                .get();
        Elements links = doc.body().getElementsByTag("a");
        for (Element link : links) {
            System.out.println(link.text() + " : " + link.attr("href"));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
You figure a few lines of code will get the job done and you can happily knock off for the day. Then you run it and find that nothing is captured at all.
So we go back to the page and open its source, that is, the HTML shown above. Suddenly it dawns on you: there is no data in the body at all, so no wonder nothing could be scraped. This is Jsoup's deficiency: Jsoup only parses the raw HTML returned by the server, so if the data on the page is filled in by JavaScript (for example via Ajax) after the page loads, Jsoup cannot capture it.
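To make this concrete, here is a minimal, self-contained sketch in plain Java (not using Jsoup; the class name, the embedded HTML string, and the regex scan are all illustrative stand-ins for what any static HTML parser effectively does). It scans the raw markup of a page like main.html, where the link data lives only inside a JavaScript array, and finds zero anchor elements:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticParseDemo {
    // Raw HTML as the server would send it: the links exist only as data
    // inside the script's array, not as <a> elements in the markup.
    static final String RAW_HTML =
        "<html><body>"
        + "<script type=\"text/javascript\">var datas = ["
        + "{href: \"http://news.qq.com/a/20140416/017800.htm\", title: \"...\"}"
        + "];</script>"
        + "<div id=\"infos\"></div>Hello Main HttpUnit!"
        + "</body></html>";

    // Count <a> tags the way a static parser sees them in the raw markup.
    static int countAnchors(String html) {
        Matcher m = Pattern.compile("<a[\\s>]").matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // The raw markup contains no <a> elements at all.
        System.out.println("anchors found: " + countAnchors(RAW_HTML));
    }
}
```

A static parser, no matter how good, can only see what is in that string, and the anchors simply are not there yet.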
Next, let me recommend another open-source project: HtmlUnit. Its name suggests it is meant for testing, but it works nicely for scraping data too.
We start by writing code similar to the Jsoup version:
@Test
public void testUseHtmlUnit() throws FailingHttpStatusCodeException,
        MalformedURLException, IOException {
    // HtmlUnit requests the page like a headless browser
    WebClient wc = new WebClient(BrowserVersion.CHROME);
    wc.getOptions().setUseInsecureSSL(true);
    wc.getOptions().setJavaScriptEnabled(true);   // enable the JS interpreter (default: true)
    wc.getOptions().setCssEnabled(false);         // disable CSS support
    wc.getOptions().setThrowExceptionOnScriptError(false); // don't throw on JS errors
    wc.getOptions().setTimeout(10000);            // connection timeout: 10 s (0 = wait indefinitely)
    wc.getOptions().setDoNotTrackEnabled(false);
    HtmlPage page = wc.getPage("http://localhost:8080/strurts2fileupload/main.html");
    DomNodeList<DomElement> links = page.getElementsByTagName("a");
    for (DomElement link : links) {
        System.out.println(link.asText() + " : " + link.getAttribute("href"));
    }
}
Let's take a look at the running results: this time all four links, with their titles and URLs, are printed.
A perfect solution! HtmlUnit is essentially a browser without a UI: it executes the JavaScript on the page before you extract the information. For more details, just Google it; this article mainly introduces the approach.