Today I ran into yet another webpage-scraping task.
When it comes to scraping information from web pages, Jsoup is usually the first tool people reach for: its jQuery-like API is very comfortable to work with. Today, however, let's talk about one of Jsoup's shortcomings.
1. Create a new page
<script type="text/javascript">
var datas = [
    {href: "http://news.qq.com/a/20140416/017800.htm", title: "A university security guard who looks like the writer Mo Yan"},
    {href: "http://news.qq.com/a/20140416/015167.htm", title: "Man holds a dangling woman with one arm for half an hour"},
    {href: "http://news.qq.com/a/20140416/013808.htm", title: "Woman assaulted and photographed while viewing a rental apartment"},
    {href: "http://news.qq.com/a/20140416/016805.htm", title: "Australian camels enjoy ice-cold beer in summer"}
];
window.onload = function () {
    var infos = document.getElementById("infos");
    for (var i = 0; i < datas.length; i++) {
        var a = document.createElement("a");
        a.href = datas[i].href;
        a.innerText = datas[i].title;
        infos.appendChild(a);
        infos.appendChild(document.createElement("br"));
    }
};
</script>
<div id="infos"></div>
Hello Main HttpUnit!
The rendered page shows the four news links.
Inspecting the elements in the browser's developer tools, we can see the generated links in the DOM:
Seeing a page like this, you would assume Jsoup can scrape it easily; it looks like a piece of cake. So we write code like this:
@Test
public void testUseJsoup() {
    try {
        Document doc = Jsoup.connect("http://localhost:8080/strurts2fileupload/main.html")
                .timeout(5000)
                .get();
        Elements links = doc.body().getElementsByTag("a");
        for (Element link : links) {
            System.out.println(link.text() + " : " + link.attr("href"));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
You figure a few lines of code will get the job done and you can happily knock off for the day. Then you run it and find that nothing is captured at all.
So we go back to the page and open its source, that is, the HTML shown above. Suddenly it dawns on you: there is no data in the body at all, so no wonder nothing could be scraped. This is Jsoup's deficiency: Jsoup only parses the raw HTML returned by the server, so if the data on the page is filled in by JavaScript (for example via Ajax) after the page loads, Jsoup cannot capture it.
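To make this concrete, here is a minimal, self-contained sketch in plain Java (not using Jsoup; the class name, the embedded HTML string, and the regex scan are all illustrative stand-ins for what any static HTML parser effectively does). It scans the raw markup of a page like main.html, where the link data lives only inside a JavaScript array, and finds zero anchor elements:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticParseDemo {
    // Raw HTML as the server would send it: the links exist only as data
    // inside the script's array, not as <a> elements in the markup.
    static final String RAW_HTML =
        "<html><body>"
        + "<script type=\"text/javascript\">var datas = ["
        + "{href: \"http://news.qq.com/a/20140416/017800.htm\", title: \"...\"}"
        + "];</script>"
        + "<div id=\"infos\"></div>Hello Main HttpUnit!"
        + "</body></html>";

    // Count <a> tags the way a static parser sees them in the raw markup.
    static int countAnchors(String html) {
        Matcher m = Pattern.compile("<a[\\s>]").matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // The raw markup contains no <a> elements at all.
        System.out.println("anchors found: " + countAnchors(RAW_HTML));
    }
}
```

A static parser, no matter how good, can only see what is in that string, and the anchors simply are not there yet.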
Next, let me recommend another open-source project: HtmlUnit. Its name suggests it is meant for testing, but it works nicely for scraping data too.
We start by writing code similar to the Jsoup version:
@Test
public void testUseHtmlUnit() throws FailingHttpStatusCodeException,
        MalformedURLException, IOException {
    // HtmlUnit requests the page like a headless browser
    WebClient wc = new WebClient(BrowserVersion.CHROME);
    wc.getOptions().setUseInsecureSSL(true);
    wc.getOptions().setJavaScriptEnabled(true);   // enable the JS interpreter (default: true)
    wc.getOptions().setCssEnabled(false);         // disable CSS support
    wc.getOptions().setThrowExceptionOnScriptError(false); // don't throw on JS errors
    wc.getOptions().setTimeout(10000);            // connection timeout: 10 s (0 = wait indefinitely)
    wc.getOptions().setDoNotTrackEnabled(false);
    HtmlPage page = wc.getPage("http://localhost:8080/strurts2fileupload/main.html");
    DomNodeList<DomElement> links = page.getElementsByTagName("a");
    for (DomElement link : links) {
        System.out.println(link.asText() + " : " + link.getAttribute("href"));
    }
}
Let's take a look at the running results: this time all four links, with their titles and URLs, are printed.
A perfect solution! HtmlUnit is essentially a browser without a UI: it executes the JavaScript on the page before you extract the information. For more details, just Google it; this article mainly introduces the approach.