I. Introduction to jsoup
In the past, when we used Java to parse HTML documents or fragments, we usually relied on the open source library htmlparser (http://htmlparser.sourceforge.net). Now that jsoup is available, it is enough for processing HTML content: it is updated more frequently and has a more convenient API.
jsoup is a Java HTML parser that can directly parse a URL or HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like operations. It can be viewed as a Java counterpart of jQuery.
The main functions of jsoup are as follows:
- Parse HTML from a URL, file, or string;
- Use the DOM or CSS selector to find and retrieve data;
- Manipulate HTML elements, attributes, and text.
jsoup is released under the MIT license and can be safely used in commercial projects. Project site: http://jsoup.org/
II. Parse and traverse HTML documents
To process HTML, jsoup first parses the input and converts it into a Document object. jsoup supports the following kinds of input:
- Parse an HTML string
- Parse a body fragment
- Load a Document object from a URL
- Load a Document object from a file
(1) Parse an HTML string
Given an HTML string, we may need to parse it, extract its content, check whether it is well formed, or modify it. jsoup makes these tasks easy.
The static method Jsoup.parse(String html) converts an HTML string or fragment into a Document object. Example:
    String html = "<div><p align=\"center\">This is the content of the p element</p>";
    Document document = Jsoup.parse(html);
This converts the HTML string into a Document object, which we can then work with using whichever methods suit the task. Note that the HTML being parsed is not well formed: the div tag is never closed. This is not a problem for jsoup, which handles such input gracefully.
(2) Parse a body fragment
Suppose we have an HTML fragment to parse (for example, a div containing a pair of p tags, that is, an incomplete HTML document). The fragment might be a comment submitted by a user or the body section edited on a CMS page. For this we can use the Jsoup.parseBodyFragment(String html) method.
Example:
    String html = "<div><p align=\"center\">This is the content of the p element</p>";
    Document document = Jsoup.parseBodyFragment(html);
You may wonder how this differs from the previous example, since the HTML snippet is the same. The parseBodyFragment method creates an empty shell document and inserts the parsed HTML into its body element. The ordinary Jsoup.parse(String html) method usually returns the same result, but explicitly treating user input as a body fragment ensures that any bad HTML provided by the user ends up inside the body element.
The Document.body() method returns the document's body element, which corresponds to the element found by doc.getElementsByTag("body"); its child elements are available through children().
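The behaviour described above can be checked with a small sketch. It assumes jsoup is on the classpath; the class name BodyFragmentDemo and the helper bodyChildTags are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of jsoup.
class BodyFragmentDemo {

    // Parses a fragment and returns the tag names of the direct children of <body>.
    static List<String> bodyChildTags(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        List<String> tags = new ArrayList<>();
        for (Element child : doc.body().children()) {
            tags.add(child.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        // The div is deliberately left unclosed; jsoup closes it while parsing.
        String fragment = "<div><p align=\"center\">This is the content of the p element</p>";
        System.out.println(bodyChildTags(fragment));

        // Print the repaired markup that ended up inside the body element.
        Document doc = Jsoup.parseBodyFragment(fragment);
        System.out.println(doc.body().html());
    }
}
```

Whatever the fragment looks like, its top-level elements become children of body, which is why this entry point suits user-supplied snippets.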
(3) Load a Document object from a URL
Sometimes we want to fetch the content behind a URL and convert it into a Document object. In the past we might have used an HTTP client to send the request and then read the returned content. With jsoup this becomes straightforward. Example:
    Document document = Jsoup.connect("http://www.baidu.com").get();
    String title = document.title();
    String text = document.text();
The connect(String url) method creates a new Connection, and get() fetches and parses the HTML. If an error occurs while retrieving the HTML from the URL, an IOException is thrown and should be handled as appropriate.
The Connection interface also supports method chaining for building more specific requests, as follows:
    Document doc = Jsoup.connect("http://test.com")
            .data("query", "Java")
            .userAgent("Mozilla")
            .cookie("auth", "token")
            .timeout(3000)
            .post();
You can post parameters to the URL and set the user agent, cookies, timeout, and so on. This chained style is very convenient (anyone who has used jQuery will be familiar with it).
(4) Load a Document object from a file
Sometimes the HTML we want to process is stored in a file on disk, and we need to extract or parse some of its content. jsoup can handle this as follows:
    File input = new File("d:/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://test.com/");
You may wonder: the first parameter is the file and the second is the encoding, but what is the third? The third parameter is the base URI, which makes it easy to resolve relative paths. If you do not need it, you can omit it, because the method is overloaded. The string-parsing methods in the earlier parts also have overloads that accept a base URI; this is covered in more detail in the URL processing part.
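To see what the third parameter does, here is a minimal sketch that writes a small page to a temporary file and parses it with a base URI. It assumes jsoup is on the classpath; FileParseDemo and firstAbsoluteLink are illustrative names, and the abs: attribute prefix it uses is the standard jsoup way to ask for the resolved URL.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of jsoup.
class FileParseDemo {

    // Writes the HTML to a temporary file, parses it with the given base URI,
    // and returns the first link's href resolved against that base URI.
    static String firstAbsoluteLink(String html, String baseUri) {
        try {
            Path tmp = Files.createTempFile("jsoup-demo", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                File input = tmp.toFile();
                Document doc = Jsoup.parse(input, "UTF-8", baseUri);
                return doc.select("a").first().attr("abs:href");
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            // Wrap the checked exception so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/download\">Download</a></body></html>";
        // The relative href "/download" is resolved against "http://test.com/".
        System.out.println(firstAbsoluteLink(html, "http://test.com/"));
    }
}
```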
III. Data extraction
(1) Use DOM methods to traverse documents
Section II showed how to obtain a Document object. We can use this object to traverse the document, for example:
    Document doc = Jsoup.parse(input, "UTF-8", "http://test.com/");
    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }
The methods of the Document object make it easy to reach the content. The commonly used methods are as follows:
Finding elements
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key) (and related methods)
- Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
- Tree navigation: parent(), children(), child(int index)
Element data
- attr(String key) gets an attribute; attr(String key, String value) sets an attribute
- attributes() gets all attributes
- id(), className(), and classNames()
- text() gets the text content; text(String value) sets the text content
- html() gets the HTML inside the element; html(String value) sets the HTML inside the element
- outerHtml() gets the HTML of the element, including the element itself
- data() gets data content (for example, of script and style tags)
- tag() and tagName()
Manipulating HTML and text
- append(String html), prepend(String html)
- appendText(String text), prependText(String text)
- appendElement(String tagName), prependElement(String tagName)
- html(String value)
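A few of the methods listed above can be combined as in the following sketch. It assumes jsoup is on the classpath; the class DomTraversalDemo, its helper methods, and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of jsoup.
class DomTraversalDemo {

    // Sample markup made up for this example.
    static final String HTML =
            "<div id=\"content\">"
          + "<p class=\"intro\">Hello</p>"
          + "<p><a href=\"/a\">First</a> <a href=\"/b\">Second</a></p>"
          + "</div>";

    // Collects the href of every a element inside the element with the given id.
    static List<String> linkHrefs(String html, String containerId) {
        Document doc = Jsoup.parse(html);
        Element container = doc.getElementById(containerId);
        List<String> hrefs = new ArrayList<>();
        if (container == null) {
            return hrefs; // no such element: nothing to collect
        }
        for (Element link : container.getElementsByTag("a")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Returns the text of the sibling that follows the first element with the given class.
    static String textAfter(String html, String className) {
        Document doc = Jsoup.parse(html);
        Element first = doc.getElementsByClass(className).first();
        Element next = (first == null) ? null : first.nextElementSibling();
        return (next == null) ? "" : next.text();
    }

    public static void main(String[] args) {
        System.out.println(linkHrefs(HTML, "content"));
        System.out.println(textAfter(HTML, "intro"));

        // Tree navigation and modification on the same document.
        Document doc = Jsoup.parse(HTML);
        Element content = doc.getElementById("content");
        Element intro = content.child(0);
        System.out.println(intro.parent().id());       // id of the enclosing div
        content.appendElement("p").text("Added");      // new last child with text
        System.out.println(content.children().size()); // child count after the append
    }
}
```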
(2) Use selectors to find elements
Anyone who has used jQuery will have admired its powerful selectors. jsoup has equally powerful selectors, which make processing documents easy. The sample code is as follows:
    // a elements that have an href attribute
    Elements links = doc.select("a[href]");
    // img elements whose src ends with .png
    Elements pngs = doc.select("img[src$=.png]");
    // the first div with class masthead
    Element masthead = doc.select("div.masthead").first();
    // a elements that are direct children of an h3 with class r
    Elements resultLinks = doc.select("h3.r > a");
jsoup elements support a selector syntax similar to CSS (or jQuery), which enables very powerful and flexible searches.
The select method is available on Document, Element, and Elements objects. It is context sensitive, so you can filter within specific elements or chain select calls to narrow the result.
The select method returns an Elements collection, which provides a set of methods for extracting and processing the results.
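The context sensitivity of select can be sketched as follows, assuming jsoup is on the classpath. SelectorDemo, its helper methods, the gallery class name, and the sample markup are all invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of jsoup.
class SelectorDemo {

    // Sample markup made up for this example: one PNG inside the gallery, one outside.
    static final String HTML =
            "<div class=\"gallery\"><img src=\"a.png\"><img src=\"b.jpg\"></div>"
          + "<img src=\"c.png\">";

    // Chains two select calls: first narrow to the gallery, then pick PNG images in it.
    static List<String> galleryPngSources(String html) {
        Document doc = Jsoup.parse(html);
        Elements pngs = doc.select("div.gallery").select("img[src$=.png]");
        List<String> sources = new ArrayList<>();
        for (Element img : pngs) {
            sources.add(img.attr("src"));
        }
        return sources;
    }

    // Runs a selector on a single Element, so only its descendants are searched.
    static int imagesInsideGallery(String html) {
        Document doc = Jsoup.parse(html);
        Element gallery = doc.select("div.gallery").first();
        return (gallery == null) ? 0 : gallery.select("img").size();
    }

    public static void main(String[] args) {
        System.out.println(galleryPngSources(HTML));
        System.out.println(imagesInsideGallery(HTML));
        // The same selector on the whole document also matches the image outside the gallery.
        System.out.println(Jsoup.parse(HTML).select("img[src$=.png]").size());
    }
}
```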
(3) Extracting attributes, text, and HTML from elements
The usual ways to extract data with jsoup are as follows:
- To get an attribute value, use the Node.attr(String key) method.
- To get the text inside an element, use the Element.text() method.
- To get HTML, use Element.html() for the HTML inside an element, or Node.outerHtml() for the HTML including the element itself.
Example:
    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);        // parse the HTML string into a Document
    Element link = doc.select("a").first();  // find the first a element

    String text = doc.body().text();         // "An example link."
    String linkHref = link.attr("href");     // "http://example.com/"
    String linkText = link.text();           // "example"

    String linkOuterH = link.outerHtml();    // "<a href="http://example.com/"><b>example</b></a>"
    String linkInnerH = link.html();         // "<b>example</b>"
(4) URL processing
A common problem when processing HTML is converting the link addresses in a page from relative to absolute. jsoup has a mechanism for this, and the base URI mentioned earlier is what makes it work. The sample code is as follows:
    Document doc = Jsoup.connect("http://www.baidu.com/").get();
    Element link = doc.select("a").first();
    String relHref = link.attr("href");     // the value as written in the source, e.g. "/gaoji/preferences.html"
    String absHref = link.attr("abs:href"); // the resolved URL, e.g. "http://www.baidu.com/gaoji/preferences.html"
In HTML, URLs are often written relative to the document location, for example <a href="/download">...</a>. When you use the Node.attr(String key) method to get the href attribute of an a element, it returns the value exactly as written in the HTML source.
To obtain an absolute URL, add the abs: prefix to the attribute name: attr("abs:href") returns the URL resolved against the base URI.
It is therefore important to define the base URI when parsing HTML documents. If you do not want to use the abs: prefix, the Node.absUrl(String key) method provides the same functionality.
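The effect of the base URI can be tried without a network connection by parsing a string. The sketch below assumes jsoup is on the classpath; AbsUrlDemo, its resolve helper, and the example.com addresses are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of jsoup.
class AbsUrlDemo {

    // Returns { raw href, attr("abs:href"), absUrl("href") } for the first link in the HTML.
    static String[] resolve(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return new String[] {
            link.attr("href"),      // value exactly as written in the source
            link.attr("abs:href"),  // resolved against the base URI
            link.absUrl("href")     // same resolution through the dedicated method
        };
    }

    public static void main(String[] args) {
        String html = "<a href=\"/download\">Download</a>";

        // With a base URI the relative link can be resolved.
        for (String value : resolve(html, "http://example.com/docs/")) {
            System.out.println(value);
        }

        // Without a base URI there is nothing to resolve against,
        // so absUrl returns an empty string for a relative link.
        Document noBase = Jsoup.parse(html);
        System.out.println("[" + noBase.select("a").first().absUrl("href") + "]");
    }
}
```

This is why passing a base URI at parse time matters: without one, relative links cannot be made absolute later.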
IV. Data modification
(1) Setting attribute values
When processing HTML, we sometimes need to modify attribute values such as image addresses or class names.
To set an attribute, use Element.attr(String key, String value) or Elements.attr(String key, String value).
To modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods.
Elements provides methods for operating on the attributes and classes of many elements at once. For example, to add rel="nofollow" to every a element inside a div, you can use the following:
    doc.select("div.comments a").attr("rel", "nofollow");
These jsoup methods also support chaining, as follows:
    doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
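The class methods mentioned above, addClass and removeClass, can be chained with attr in the same way. This is a small sketch assuming jsoup is on the classpath; ClassAttrDemo, firstMarkedLink, and the class names old and external are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class, not part of jsoup.
class ClassAttrDemo {

    // Marks every link inside div.comments and returns the first one for inspection.
    static Element firstMarkedLink(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.select("div.comments a");
        // Each call applies to every matched element and returns the same Elements.
        links.attr("rel", "nofollow")
             .addClass("external")
             .removeClass("old");
        return links.first();
    }

    public static void main(String[] args) {
        String html = "<div class=\"comments\">"
                    + "<a href=\"http://example.com/\" class=\"old\">a link</a>"
                    + "</div>";
        Element link = firstMarkedLink(html);
        System.out.println(link.attr("rel"));
        System.out.println(link.hasClass("external"));
        System.out.println(link.hasClass("old"));
    }
}
```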
(2) Set the HTML content of an element
To set or add HTML fragments inside an element, you can do the following:
    Element div = doc.select("div").first();   // <div></div>
    div.html("<p>lorem ipsum</p>");            // <div><p>lorem ipsum</p></div>
    div.prepend("<p>First</p>");               // insert HTML at the start of the div's content
    div.append("<p>Last</p>");                 // insert HTML at the end of the div's content
    // Result: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

    Element span = doc.select("span").first(); // <span>One</span>
    span.wrap("<li><a href='http://example.com/'></a></li>"); // wrap the element in the given outer HTML
    // Result: <li><a href="http://example.com/"><span>One</span></a></li>
(3) Set the text content of an element
To modify the text content of an element, do the following:
    Element div = doc.select("div").first(); // <div></div>
    div.text("five > four");                 // <div>five &gt; four</div>
    div.prepend("First ");
    div.append(" Last");
    // now: <div>First five &gt; four Last</div>
Description
The text setter methods mirror the HTML setter methods:
Element.text(String text) clears the inner HTML of an element and replaces it with the supplied text.
Element.prepend(String first) and Element.append(String last) add content at the start and end of the element's inner HTML, respectively.
Text passed to Element.text(String text) that contains characters such as < or > is treated as literal text, not as HTML. To add literal text at either end of an element, prependText(String text) and appendText(String text) can be used.
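The difference between setting text and setting HTML can be seen in a short sketch. It assumes jsoup is on the classpath; TextVsHtmlDemo and its two helper methods are illustrative names only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of jsoup.
class TextVsHtmlDemo {

    // Returns an empty div whose content was set with text(): the value stays literal text.
    static Element divWithText(String value) {
        Document doc = Jsoup.parse("<div></div>");
        Element div = doc.select("div").first();
        div.text(value);
        return div;
    }

    // Returns an empty div whose content was set with html(): the value is parsed as markup.
    static Element divWithHtml(String value) {
        Document doc = Jsoup.parse("<div></div>");
        Element div = doc.select("div").first();
        div.html(value);
        return div;
    }

    public static void main(String[] args) {
        Element asText = divWithText("<b>bold</b>");
        System.out.println(asText.children().size()); // no child elements were created
        System.out.println(asText.text());            // the angle brackets are part of the text

        Element asHtml = divWithHtml("<b>bold</b>");
        System.out.println(asHtml.children().size()); // the b element exists as a child
        System.out.println(asHtml.text());

        // prependText adds a literal text node, so "<" is not interpreted as markup.
        asHtml.prependText("1 < 2: ");
        System.out.println(asHtml.text());
    }
}
```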