A shortcoming of Jsoup for web page scraping

Source: Internet
Author: User

 

Today I ran into another web page scraping task.

When it comes to scraping web pages, Jsoup is usually the first tool that comes to mind, and its jQuery-like selectors are very comfortable to use. Today, however, we will talk about one of its shortcomings.

1. Create a new page

 

    <script type="text/javascript">
    var datas = [
        {href: "http://news.qq.com/a/20140416/017800.htm", title: "A university guard looks like the writer Mo Yan"},
        {href: "http://news.qq.com/a/20140416/015167.htm", title: "Men's single-arm lift hanging female half an hour"},
        {href: "http://news.qq.com/a/20140416/013808.htm", title: "Women's House Rent rape shot nude photos"},
        {href: "http://news.qq.com/a/20140416/016805.htm", title: "Australian camels love to drink ice town beer summer"}
    ];
    window.onload = function () {
        var infos = document.getElementById("infos");
        for (var i = 0; i < datas.length; i++) {
            var a = document.createElement("a");
            a.href = datas[i].href;
            a.innerText = datas[i].title;
            infos.appendChild(a);
            infos.appendChild(document.createElement("br"));
        }
    };
    </script>
    Hello Main HttpUnit!

The page is displayed as follows:

 

Inspecting the elements in the browser:

Seeing a page like this, you would think Jsoup can scrape it easily. It looks like a piece of cake, so we wrote the code like this:

 

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.junit.Test;

    @Test
    public void testUserJsoup() {
        try {
            Document doc = Jsoup.connect("http://localhost:8080/strurts2fileupload/main.html")
                    .timeout(5000)
                    .get();
            // Collect every <a> element found in the parsed body.
            Elements links = doc.body().getElementsByTag("a");
            for (Element link : links) {
                System.out.println(link.text() + " " + link.attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

You would think these few lines get the job done and you can happily go home. But when you run the code, it captures nothing.

 

So let's go back to the page and open its source, that is, the HTML code above. Then it dawns on you: the body contains no data at all, so it is no wonder nothing was captured. This is the shortcoming of Jsoup. Jsoup parses the HTML that the server returns and does not execute JavaScript, so if the data on the page is generated by scripts or fetched through ajax after the page has loaded, Jsoup cannot capture it.
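Before choosing a tool, it can help to check whether the links exist in the raw page source at all. The following is a minimal sketch using only the Java standard library, not part of the original test code. The class name `RawSourceCheck`, the method `countAnchorTags`, and the sample pages (including the `<div id="infos">` container) are assumptions made for illustration. It drops script blocks, which a static parser treats as opaque text, and counts the opening `<a>` tags that remain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RawSourceCheck {
    // Script bodies are plain text to a static HTML parser, so remove them first.
    private static final Pattern SCRIPT_BLOCK =
            Pattern.compile("<script\\b.*?</script\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // An opening anchor tag: "<a" followed by whitespace, ">" or "/".
    private static final Pattern ANCHOR_OPEN =
            Pattern.compile("<a(?=[\\s>/])", Pattern.CASE_INSENSITIVE);

    /** Counts the opening <a> tags present in the raw markup, ignoring script contents. */
    static int countAnchorTags(String html) {
        String withoutScripts = SCRIPT_BLOCK.matcher(html).replaceAll("");
        Matcher m = ANCHOR_OPEN.matcher(withoutScripts);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical page whose links are only created by JavaScript at load time.
        String dynamicPage = "<html><head><script type=\"text/javascript\">"
                + "window.onload = function () {"
                + "  var a = document.createElement(\"a\");"
                + "  document.getElementById(\"infos\").appendChild(a);"
                + "};</script></head>"
                + "<body>Hello Main HttpUnit!<div id=\"infos\"></div></body></html>";

        // Hypothetical page whose links are already present in the markup.
        String staticPage = "<html><body><a href=\"/one\">one</a> <a href=\"/two\">two</a></body></html>";

        System.out.println("dynamic page anchors: " + countAnchorTags(dynamicPage));
        System.out.println("static page anchors: " + countAnchorTags(staticPage));
    }
}
```

If the count for the raw source is zero while the browser shows links, the links are most likely built by JavaScript, and a parser that does not run scripts will not find them. A regular expression is only a rough check here, not a substitute for a real HTML parser.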

Next we recommend another open-source project: HtmlUnit. Judging by its name it is meant for testing, but it also works well for capturing data.

We write code similar to the Jsoup version:

 

    import java.io.IOException;
    import java.net.MalformedURLException;
    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomElement;
    import com.gargoylesoftware.htmlunit.html.DomNodeList;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import org.junit.Test;

    @Test
    public void testUserHttpUnit() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        /** HtmlUnit request web page */
        WebClient wc = new WebClient(BrowserVersion.CHROME);
        wc.getOptions().setUseInsecureSSL(true);
        wc.getOptions().setJavaScriptEnabled(true); // enable the JS interpreter (the default is true)
        wc.getOptions().setCssEnabled(false); // disable CSS support
        wc.getOptions().setThrowExceptionOnScriptError(false); // do not throw when a script error occurs
        wc.getOptions().setTimeout(100000); // connection timeout in milliseconds (100 s); 0 waits indefinitely
        wc.getOptions().setDoNotTrackEnabled(false);
        HtmlPage page = wc.getPage("http://localhost:8080/strurts2fileupload/main.html");
        // By now the page's scripts have run, so the generated <a> elements exist in the DOM.
        DomNodeList<DomElement> links = page.getElementsByTagName("a");
        for (DomElement link : links) {
            System.out.println(link.asText() + " " + link.getAttribute("href"));
        }
    }
 
Let's take a look at the running results:

 

Problem solved. HtmlUnit is effectively a browser without a UI: it lets the JavaScript on the page run first and only then captures the information. For further details, a web search will turn up plenty; this article only aims to introduce the approach.

 
