Using the Java HTML parser jsoup and jQuery to build a web application that repeatedly and automatically crawls specified elements of any web page

Source: Internet
Author: User

If you have built a content aggregation website, you will be familiar with using a program to dynamically integrate content from different pages or websites. In Java this is usually done with an HTML parser such as httpparser. The earliest integrated search on gbin1.com used httpparser to capture the search results of Google and Baidu and present the combined results to the user, which is the origin of the gbin1 domain name.

Today we introduce another excellent Java HTML parser: jsoup. This library helps you work with real-world HTML. It provides a very convenient API for extracting and manipulating data. Most importantly, it uses jQuery-like syntax to work with the DOM and CSS selectors. If you have used jQuery, you will appreciate how convenient this makes DOM handling.

Main features

jsoup implements the WHATWG HTML5 specification and parses HTML into the same DOM as modern browsers do. Main functions:

  • Fetch and parse HTML from a URL, file, or string
  • Find and extract data using DOM traversal or CSS selectors
  • Manipulate HTML elements, attributes, and text
  • Clean user-submitted content to prevent XSS attacks
  • Output tidy HTML

In short, jsoup can deal with all kinds of HTML, including markup with invalid tags, and builds a clean DOM tree from it.

Implement a capture function

Here we will implement a simple crawling function. You only need to specify the URL and the specific elements to capture, identified for example by ID or class. On the back end we use jsoup to fetch the results, and on the front end we use jQuery to present them.

Pay attention to the following points:

  1. <a> relative path problem: links on the page you crawl may use relative paths. You need to convert them to absolute paths; otherwise the links cannot be opened from the local server.
  2. <img> relative path problem: same as above, image paths also need to be converted.
  3. <img> size problem: if a captured image is very large, you need to adjust its style in code, or you can handle it with jQuery on the front end.

After downloading the jsoup jar, add it to your classpath. If you use JSP, put it in the WEB-INF/lib directory of the web application.

The Java code is as follows:

Document doc = Jsoup.connect("http://www.gbin1.com/portfolio/lastest.html").timeout(0).get();
Elements items = doc.select(".includeitem");

The code above tells jsoup to fetch the HTML from a URL. Here we use http://www.gbin1.com/portfolio/lastest.html, which lists the latest items on gbin1. If you view the source of that page, you can see that every article sits in an element with the .includeitem class. We therefore use the doc.select method to select the elements with that class.

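Note that timeout(0) is called here. A timeout of 0 means no time limit, so the request waits for as long as the page takes to respond. Without this call a finite default applies (given here as 2000 ms, that is, 2 seconds; check the Javadoc of your jsoup version, since the default has changed between releases), and a slower response makes the request fail. Also note the jQuery-like method chaining used here, which is very convenient.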

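for (Element item : items) {
    // Rewrite every link's href to its absolute URL
    Elements links = item.select("a");
    for (Element link : links) {
        link.attr("href", link.attr("abs:href"));
    }

    // Rewrite every image's src to its absolute URL
    Elements imgs = item.select("img");
    for (Element img : imgs) {
        img.attr("src", img.attr("abs:src"));
    }

    // Emit the item's inner HTML wrapped in a list item
    String html = item.html();
    out.println("<li class=\"item\">" + html + "</li>");
}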

In the code above, we process each selected .includeitem element: we find its <a> and <img> elements and rewrite their href and src attribute values to absolute URLs.

link.attr("abs:href")

The code above returns the absolute URL of the link by reading the attribute with the abs: prefix, abs:href. Similarly, abs:src returns the absolute URL of an image.
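To make the idea of resolving a relative path concrete, here is a minimal sketch that uses only the Java standard library. It is not jsoup's implementation: the class AbsoluteUrls, the method toAbsolute, and the sample link values are hypothetical names chosen for illustration, and returning an empty string for an unparsable value is a choice made for this sketch.

```java
import java.net.URI;

// Hypothetical helper, not part of jsoup: illustrates resolving a relative
// href or src against the URL of the page it was found on.
final class AbsoluteUrls {

    // Resolves a possibly relative link against the page URL.
    // Returns an empty string when either value is not a valid URI
    // (a fallback chosen for this sketch).
    static String toAbsolute(String pageUrl, String link) {
        try {
            return URI.create(pageUrl).resolve(link.trim()).toString();
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        String page = "http://www.gbin1.com/portfolio/lastest.html";
        // The link values below are made-up examples.
        System.out.println(toAbsolute(page, "detail.html"));
        System.out.println(toAbsolute(page, "../images/a.png"));
        System.out.println(toAbsolute(page, "/css/site.css"));
        System.out.println(toAbsolute(page, "http://example.com/x"));
    }
}
```

A link that is already absolute comes back unchanged, which is why the rewrite can safely be applied to every <a> and <img> element.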

After the code runs, each modified fragment is output wrapped in an <li> element.

Next we will develop the JavaScript page that controls the capture:

In this page we use the setInterval method to call the Java code above via Ajax at a specified interval. The basic code is as follows:

// run for first time
$('#msg').html('Please wait, the page is crawling...').fadeIn(400);
//$('#content').html('');
$('#content').load('siteproxy.jsp #result', {url: url, elem: element}, function(){
    $('#msg').html('Capture completed').delay(1500).fadeOut(400);
});

The code above is very simple. We use jQuery's load method to call siteproxy.jsp and then take the #result element from the page that siteproxy.jsp generates, which is the captured content. If you are not familiar with jQuery's Ajax methods, refer to this series of articles:

Beginner's Guide to jQuery Ajax Methods, Part 1

Beginner's Guide to jQuery Ajax Methods, Part 2

Beginner's Guide to jQuery Ajax Methods, Part 3

Beginner's Guide to jQuery Ajax Methods, Part 4

To make the code run the capture at a specified interval, we put the call inside setInterval, as shown below:

runid = setInterval(function getinfo(){
    $('#msg').html('Please wait, the page is crawling...').fadeIn(400);
    //$('#content').html('');
    $('#content').load('siteproxy.jsp #result', {url: url, elem: element}, function(){
        $('#msg').html('Capture completed').delay(1500).fadeOut(400);
    });
}, interval*1000);

With this approach, once the user starts the capture, it is triggered again each time the specified interval elapses.
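The same "run once, then repeat every N seconds" pattern can also be sketched on the JVM, for example if you wanted the repetition to happen on the server instead of in the browser. This is not part of the original tool: the class RepeatingCapture and its methods are hypothetical names, and the task in main only counts down a latch where a real task would fetch the page.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the original tool: repeats a task the way
// the page's "run once, then setInterval" logic does.
final class RepeatingCapture {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> handle; // plays the role of runid

    // Runs the task immediately, then again every intervalMillis milliseconds.
    // If the task throws, the executor suppresses all later runs.
    synchronized void start(Runnable task, long intervalMillis) {
        stop(); // avoid stacking timers when start is called twice
        handle = scheduler.scheduleAtFixedRate(
                task, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // Counterpart of clearInterval(runid).
    synchronized void stop() {
        if (handle != null) {
            handle.cancel(false);
            handle = null;
        }
    }

    // Releases the scheduler thread once the helper is no longer needed.
    void shutdown() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        RepeatingCapture capture = new RepeatingCapture();
        CountDownLatch threeRuns = new CountDownLatch(3);
        // Placeholder task: a real one would fetch and parse the page.
        capture.start(threeRuns::countDown, 50);
        boolean done = threeRuns.await(5, TimeUnit.SECONDS);
        capture.stop();
        capture.shutdown();
        System.out.println("ran three times: " + done);
    }
}
```

Passing an initial delay of 0 covers the "run for first time" step, so no separate first call is needed, and calling stop inside start keeps a second start from leaving the earlier timer running.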

The complete JavaScript code is as follows:

$(document).ready(function(){
    var url, element, interval, runid;
    $('#start').click(function(){
        url = $('#url').val();
        element = $('#element').val();
        interval = $('#interval').val();

        // run for first time
        $('#msg').html('Please wait, the page is crawling...').fadeIn(400);
        //$('#content').html('');
        $('#content').load('siteproxy.jsp #result', {url: url, elem: element}, function(){
            $('#msg').html('Capture completed').delay(1500).fadeOut(400);
        });

        runid = setInterval(function getinfo(){
            $('#msg').html('Please wait, the page is crawling...').fadeIn(400);
            //$('#content').html('');
            $('#content').load('siteproxy.jsp #result', {url: url, elem: element}, function(){
                $('#msg').html('Capture completed').delay(1500).fadeOut(400);
            });
        }, interval*1000);
    });

    $('#stop').click(function(){
        $('#msg').html('Capture paused').fadeIn(400).delay(1500);
        clearInterval(runid);
    });

});

After deploying the preceding JSP and HTML files, you can see the following interface:

We need to set the URL and the page element to capture. The defaults here are the URL http://www.gbin1.com/portfolio/lastest.html and the element .includeitem. Click to start crawling, and you can see the application capture the following content:

Note that the default interval is 30 seconds. Content is automatically crawled again after 30 seconds.

You can also try capturing weibo.com with the element .itemts and an interval of 10 seconds. You get the following:

You can see that the content is the same as that automatically refreshed on the Weibo homepage.

You can use this tool as a page refresher to monitor part of a website's content. Of course, you can also use it to dynamically refresh your own website and increase your Alexa ranking.

I hope you will like this tool. If you have any suggestions or questions, please leave us a message! Thank you!

Reproduced from http://www.gbin1.com/technology/javautilities/20120720jsoupjquerysnatchpage/
