Parsing DOM nodes in HTML, XML, or URL links using the Jsoup library

Source: Internet
Author: User

Jsoup is a Java HTML parser that can parse a URL or HTML text content directly. It provides a very convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.

Example of Using Jsoup
    import java.io.IOException;
    import java.io.InputStream;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import android.app.Activity;
    import android.os.Bundle;
    import android.util.Log;

    public class MainActivity extends Activity {

        private String url = "http://mobile.csdn.net/";

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);
            // Run the parsing on a background thread, not on the UI thread.
            new Thread(new Runnable() {
                @Override
                public void run() {
                    // parseHtml();
                    parseEpub();
                }
            }).start();
        }

        // Read the EPUB table of contents (fb.ncx in the assets folder).
        private void parseEpub() {
            try {
                InputStream is = getAssets().open("fb.ncx");
                int size = is.available();
                byte[] buffer = new byte[size];
                is.read(buffer);
                is.close();
                String epubText = new String(buffer, "utf-8");
                Document doc = Jsoup.parse(epubText);
                String docTitle = doc.getElementsByTag("docTitle").first().text();
                Log.i("info", docTitle);
                Elements elements = doc.getElementsByTag("navPoint");
                for (Element ele : elements) {
                    String title = ele.text();
                    String href = ele.getElementsByTag("content").first().attr("src");
                    Log.i("info", title + ":" + href);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        // Fetch a page over HTTP and list the title and link of each div.unit block.
        private void parseHtml() {
            try {
                Document doc = Jsoup.connect(url).get();
                Elements elements = doc.select("div.unit");
                for (Element ele : elements) {
                    String title = ele.getElementsByTag("h1").first().text();
                    String href = ele.getElementsByTag("h1").first()
                            .getElementsByTag("a").first().attr("href");
                    Log.i("info", title + ":" + href);
                    System.out.println("title:" + title + ", href:" + href);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
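The example above feeds the NCX file (which is XML) to Jsoup's default HTML parser. As a minimal sketch of an alternative, the same table of contents can be read with Jsoup's XML parser through Jsoup.parse(String, String, Parser) and Parser.xmlParser(), outside of Android. The class name NcxTocExample, the method readToc, and the shortened sample NCX text are assumptions made up for illustration; they are not part of the original project.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.LinkedHashMap;
import java.util.Map;

public class NcxTocExample {

    // Hypothetical, heavily shortened NCX content used only for this sketch.
    static final String SAMPLE_NCX =
            "<ncx><docTitle><text>Sample Book</text></docTitle><navMap>"
          + "<navPoint><navLabel><text>Chapter 1</text></navLabel>"
          + "<content src=\"ch1.html\"/></navPoint>"
          + "<navPoint><navLabel><text>Chapter 2</text></navLabel>"
          + "<content src=\"ch2.html\"/></navPoint>"
          + "</navMap></ncx>";

    /** Returns a map of chapter title to content src, in document order. */
    static Map<String, String> readToc(String ncxText) {
        // Parse as XML (empty base URI) instead of applying HTML parsing rules.
        Document doc = Jsoup.parse(ncxText, "", Parser.xmlParser());
        Map<String, String> toc = new LinkedHashMap<>();
        for (Element navPoint : doc.getElementsByTag("navPoint")) {
            String title = navPoint.text();
            String src = navPoint.getElementsByTag("content").first().attr("src");
            toc.put(title, src);
        }
        return toc;
    }

    public static void main(String[] args) {
        readToc(SAMPLE_NCX).forEach((title, src) -> System.out.println(title + ": " + src));
    }
}
```

The sample keeps the navPoint elements flat; with nested navPoint elements, text() on a parent would also include the titles of its children, so a real table of contents would likely need to read the navLabel of each entry instead.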

The code can be downloaded from: http://download.csdn.net/detail/adayabetter/8947361

The main functions of Jsoup are as follows:

1. Parse HTML from a URL, file, or string.
2. Use DOM methods or CSS selectors to find and extract data.
3. Manipulate HTML elements, attributes, and text.

Jsoup is released under the MIT license, so it can be used in commercial projects with confidence.

The main class hierarchy of Jsoup is as follows:

Next, we walk through several common scenarios to illustrate how Jsoup handles HTML documents gracefully.

Document Input

Jsoup can load HTML documents from strings, URLs, and local files, and generates a Document object instance. Here is the relevant code:

    // Load an HTML document directly from a string
    String html = "

Parsing and Extracting HTML Elements

This section covers the most basic functionality of an HTML parser. Jsoup differs from other open source projects in that it offers selectors, which are covered in detail in the last section, where you will see how much Jsoup achieves with very little code. Jsoup also provides traditional DOM-style element access, as in the following code:

    File input = new File("D:/test.html");
    Document doc = Jsoup.parse(input, "UTF-8", "url/");
    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }

These methods may look familiar: getElementById and getElementsByTag closely mirror the JavaScript DOM methods getElementById and getElementsByTagName, and they behave the same way. You can get the corresponding element or list of elements based on the ID of the HTML element or the node name.
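As a minimal sketch of string input combined with the DOM-style methods described above, the following parses a small inline document and collects the links inside the element with id="content". The markup, the class name DomMethodsExample, and the method contentLinks are placeholders invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class DomMethodsExample {

    // Placeholder markup for the sketch; not taken from the article.
    static final String HTML =
            "<html><head><title>Demo page</title></head><body>"
          + "<div id=\"content\">"
          + "<a href=\"/a.html\">First</a> <a href=\"/b.html\">Second</a>"
          + "</div>"
          + "<a href=\"/c.html\">Outside</a>"
          + "</body></html>";

    /** Collects "text -> href" for every link inside the element with id="content". */
    static List<String> contentLinks(String html) {
        Document doc = Jsoup.parse(html);                 // input from a string
        Element content = doc.getElementById("content");  // lookup by element ID
        List<String> result = new ArrayList<>();
        for (Element link : content.getElementsByTag("a")) { // lookup by tag name
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        contentLinks(HTML).forEach(System.out::println);
    }
}
```

The link outside the content element is not returned, because getElementsByTag is called on the content element rather than on the whole document.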
Unlike the HTMLParser project, Jsoup does not define a separate class for each HTML element. An HTML element generally consists of a node name, attributes, and text, and Jsoup provides simple methods for retrieving each of these yourself, which is why Jsoup stays so lightweight.

For element retrieval, Jsoup selectors are extremely versatile:

    File input = new File("D:\\test.html");
    Document doc = Jsoup.parse(input, "UTF-8", "url");
    Elements links = doc.select("a[href]");                // links with an href attribute
    Elements pngs = doc.select("img[src$=.png]");          // all elements referencing PNG images
    Element masthead = doc.select("div.masthead").first(); // first div element with class=masthead
    Elements resultLinks = doc.select("h3.r > a");         // a elements directly under an h3 with class=r

This is where Jsoup really impressed me. Jsoup uses the same kind of selectors as jQuery to retrieve elements. With other HTML parsers, each of the retrievals above would take many lines of code, whereas Jsoup needs only one. The Jsoup selector also supports expressions, and this "super selector" is introduced in the last section.

Modifying Data

While parsing a document, we may need to modify some of its elements. For example, we can add clickable links to all the images in the document, change link addresses, or change text. Here are some simple examples:

    doc.select("div.comments a").attr("rel", "nofollow");   // add rel=nofollow to all links in the comments
    doc.select("div.comments a").addClass("mylinkclass");   // add class=mylinkclass to all links in the comments
    doc.select("img").removeAttr("onclick");                // remove the onclick attribute from all images
    doc.select("input[type=text]").val("");                 // clear the text in all text input boxes

The idea is simple: use a Jsoup selector to find the elements, then modify them with the methods above. Everything about an element can be modified, including its attributes and text, except the tag name (for that, you can delete the element and insert a new one). After modifying, call the html() method of the Element(s) directly to get the modified HTML.

HTML Document Cleanup

Besides providing a powerful API, Jsoup also handles practical, user-facing concerns very well. Websites often let users post comments. Some users mischievously embed scripts in their comments, and these scripts can break the behavior of the whole page or, more seriously, obtain confidential information, as in XSS (cross-site scripting) attacks. Jsoup's support in this area is powerful and very simple to use. Take a look at this code:

    String unsafe = "<p><a href='url' onclick='stealCookies()'>Open source China community</a></p>";
    String safe = Jsoup.clean(unsafe, Whitelist.basic());
    // Output:
    // <p><a href="url" rel="nofollow">Open source China community</a></p>

Jsoup uses a Whitelist class (named Safelist in jsoup 1.14.1 and later) to filter HTML documents, which provides several common methods:

If none of the five filters meets your requirements, for example because you want to allow users to insert Flash animations, Whitelist can be extended, for instance with whitelist.addTags("embed", "object", "param", "span", "div"). You can also call addAttributes to allow attributes on certain elements.

Jsoup's Most Remarkable Feature: Selectors

Earlier, we briefly described how Jsoup uses selectors to retrieve elements. This section focuses on the powerful syntax of the selector itself. The following table is a detailed list of all the syntax for the Jsoup selector.
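As a rough illustration of the basic selector forms (tag name, #id, .class, [attribute], and attribute-value matches such as [attr$=value]), the following sketch counts the matches of a few queries against a small inline document. The markup, the class name BasicSelectorsExample, and the helper count are assumptions made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicSelectorsExample {

    // Placeholder markup for the sketch; not taken from the article.
    static final String HTML =
            "<div id=\"logo\" class=\"masthead\">"
          + "<img src=\"logo.png\"><img src=\"photo.jpg\">"
          + "</div>"
          + "<h3 class=\"r\"><a href=\"/one\">One</a></h3>"
          + "<p><a>No href</a> <a href=\"/two\">Two</a></p>";

    /** Returns how many elements in the sample document match the CSS query. */
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a",               // by tag name
            "#logo",           // by id
            "div.masthead",    // by tag and class
            "a[href]",         // elements that have the attribute
            "[src]",           // any element with a src attribute
            "img[src$=.png]"   // attribute value ends with .png
        };
        for (String query : queries) {
            System.out.println(query + " matches " + count(query) + " element(s)");
        }
    }
}
```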
Basic usage

The above is the most basic selector syntax. These forms can also be combined; the following are the combinations that Jsoup supports:

In addition to the basic syntax and its combinations, Jsoup also supports expressions for filtering the selected elements. The following is a list of all the expressions supported by Jsoup:
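A few combinations and filter expressions in action, as a minimal sketch. We assume here that the "expressions" mentioned above correspond to what Jsoup's own documentation calls pseudo-selectors, such as :has(...), :contains(...), and :not(...). The markup, the class name SelectorCombinationsExample, and the helper select are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class SelectorCombinationsExample {

    // Placeholder markup for the sketch; not taken from the article.
    static final String HTML =
            "<div class=\"results\">"
          + "<h3 class=\"r\"><a href=\"/one\">First result</a></h3>"
          + "<h3 class=\"r\"><span><a href=\"/nested\">Nested link</a></span></h3>"
          + "<p>Jsoup news</p><p>Other text</p>"
          + "</div>"
          + "<div><img src=\"x.png\"></div>";

    /** Runs a CSS query against the sample document. */
    static Elements select(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery);
    }

    public static void main(String[] args) {
        String[] queries = {
            "h3.r > a",           // direct children only
            "h3.r a",             // any descendant
            "a[href], img[src]",  // grouping: matches either selector
            "div:has(img)",       // divs that contain an img
            "p:contains(jsoup)",  // paragraphs whose text contains "jsoup" (case-insensitive)
            "div:not(.results)"   // divs without the results class
        };
        for (String query : queries) {
            System.out.println(query + " matches " + select(query).size() + " element(s)");
        }
    }
}
```

Note the difference between the first two queries: "h3.r > a" skips the link wrapped in a span, while "h3.r a" finds both links.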

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
