Parsing and traversing an HTML document
How to parse an HTML document:
String html = "
(For more details, see parsing an HTML string.)
The parser makes every attempt to produce a clean parse result from the HTML you provide, regardless of whether the HTML is well-formed. For example, it handles:
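As a minimal sketch of this step, the following parses an HTML string with Jsoup.parse(String). The class name, the method names, and the sample markup are illustrative assumptions and are not taken from the original example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><head><title>Sample page</title></head>"
        + "<body><p>Parsed HTML into a document.</p></body></html>";

    // Parses the sample string and returns the document title.
    static String title() {
        Document doc = Jsoup.parse(HTML);
        return doc.title();
    }

    // Parses the sample string and returns the text of the first <p>.
    static String firstParagraph() {
        Document doc = Jsoup.parse(HTML);
        return doc.body().getElementsByTag("p").first().text();
    }

    public static void main(String[] args) {
        System.out.println(title());
        System.out.println(firstParagraph());
    }
}
```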
- Unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
- Implicit tags (e.g. a bare <td>Table data</td> is wrapped automatically into a <table><tr><td>...)
- Reliably creating the document structure (an html element containing a head and a body, with only appropriate elements inside the head)
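The leniency described above can be checked with a short sketch. It covers only the unclosed-tag and document-structure cases; the class name, method names, and input strings are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseSketch {
    // Unclosed <p> tags: the second <p> start tag implicitly closes the first.
    static int paragraphCount() {
        Document doc = Jsoup.parse("<p>Lorem <p>Ipsum");
        return doc.getElementsByTag("p").size();
    }

    // A fragment without <html>, <head>, or <body> still yields a full document.
    static boolean hasHeadAndBody() {
        Document doc = Jsoup.parse("<p>Lorem <p>Ipsum");
        return doc.head() != null && doc.body() != null;
    }

    // A <title> written without a surrounding <head> ends up inside the head.
    static int titlesInHead() {
        Document doc = Jsoup.parse("<title>Sample</title><p>Body text");
        return doc.head().getElementsByTag("title").size();
    }

    public static void main(String[] args) {
        System.out.println("paragraphs: " + paragraphCount());
        System.out.println("head and body present: " + hasHeadAndBody());
        System.out.println("titles in head: " + titlesInHead());
    }
}
```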
Object model for a document
- Documents consist of Elements and TextNodes (along with a few other miscellaneous nodes; see the nodes package tree for details).
- The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
- An Element contains a list of child Nodes and has a parent Element. It also provides a filtered list of child Elements only.
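The difference between child nodes and child elements is easier to see in code. The sketch below is illustrative only: the class name, helper methods, and sample markup are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ObjectModelSketch {
    // Hypothetical sample: a <div> holding one text node and one <b> element.
    static Element sampleDiv() {
        Document doc = Jsoup.parse("<div id=\"d\">Hello <b>world</b></div>");
        return doc.getElementById("d");
    }

    // All child nodes: the TextNode "Hello " and the Element <b>.
    static int childNodeCount() {
        return sampleDiv().childNodes().size();
    }

    // Child elements only: the text node is filtered out, leaving <b>.
    static int childElementCount() {
        return sampleDiv().children().size();
    }

    // The first child node is text, not an element.
    static boolean firstChildIsTextNode() {
        Node first = sampleDiv().childNode(0);
        return first instanceof TextNode;
    }

    // Every element below the root has a parent element.
    static String parentTag() {
        return sampleDiv().parent().tagName();
    }

    public static void main(String[] args) {
        // A Document can be used wherever an Element or a Node is expected.
        Document doc = Jsoup.parse("<p>x</p>");
        Element asElement = doc;
        Node asNode = doc;
        System.out.println(asElement.nodeName() + " " + asNode.nodeName());
        System.out.println(childNodeCount() + " child nodes, "
            + childElementCount() + " child element(s)");
    }
}
```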
Data extraction
You have an HTML document that you want to extract data from, and you know its general structure. After parsing the HTML into a Document, you can work with it using DOM-like methods.
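import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

/**
 * Get HTML elements.
 * @author Bling
 * @throws IOException
 * @create date: 2014-07-13
 */
@Test
public void getDataElement() throws IOException {
    // Parse a local file; the third argument is the base URI for relative links.
    File input = new File("tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
    // Find the element with id "content", then every <a> beneath it.
    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("linkHref: " + linkHref + " ------ linkText: " + linkText);
    }
}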
Elements provide a range of DOM-like methods for finding elements and for extracting and manipulating their data. The DOM getters are contextual: called on a parent Document, they find matching elements under the document; called on a child element, they find matching elements under that child. In this way you can narrow in on the data you want.
getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key)
(and related methods)
- Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
- Graph: parent(), children(), child(int index)
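A short sketch of the finder, sibling, and graph methods listed above follows. The markup, the class name, and the helper methods are hypothetical and exist only to show each call in context.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FindElementsSketch {
    // Hypothetical sample markup: a menu with three list items.
    static final String HTML =
        "<div id=\"menu\"><ul>"
        + "<li class=\"item first\"><a href=\"/a\">A</a></li>"
        + "<li class=\"item\">B</li>"
        + "<li class=\"item last\">C</li>"
        + "</ul></div>";

    static Document doc() { return Jsoup.parse(HTML); }

    // The middle <li>, used as the starting point for sibling lookups.
    static Element middleItem() { return doc().getElementsByTag("li").get(1); }

    // Finding elements by id, class, and attribute name.
    static String tagOfMenu() { return doc().getElementById("menu").tagName(); }
    static int itemCount() { return doc().getElementsByClass("item").size(); }
    static int linkCount() { return doc().getElementsByAttribute("href").size(); }

    // Contextual lookup: search only beneath the element found by id.
    static int linksUnderMenu() {
        return doc().getElementById("menu").getElementsByTag("a").size();
    }

    // Sibling navigation, relative to the middle item.
    static int siblingCount() { return middleItem().siblingElements().size(); }
    static String firstSibling() { return middleItem().firstElementSibling().text(); }
    static String lastSibling() { return middleItem().lastElementSibling().text(); }
    static String nextSibling() { return middleItem().nextElementSibling().text(); }
    static String previousSibling() { return middleItem().previousElementSibling().text(); }

    // Graph navigation: up to the parent <ul>, then back down to its children.
    static String parentTag() { return middleItem().parent().tagName(); }
    static int listChildren() { return middleItem().parent().children().size(); }
    static String firstListChild() { return middleItem().parent().child(0).text(); }

    public static void main(String[] args) {
        System.out.println(tagOfMenu() + " contains " + itemCount() + " items");
        System.out.println("previous: " + previousSibling() + ", next: " + nextSibling());
    }
}
```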
- Methods for obtaining the element data
attr(String key) to get and attr(String key, String value) to set attributes
attributes() to get all attributes
id(), className() and classNames()
text() to get and text(String value) to set the text content
html() to get and html(String value) to set the inner HTML content
outerHtml() to get the outer HTML value
data() to get data content (e.g. of script and style tags)
tag() and tagName()
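The getters and setters above might be exercised as in the following sketch. The sample markup, class name, and helper methods are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    // Hypothetical sample markup: one link and one inline script.
    static final String HTML =
        "<a id=\"link\" class=\"ext big\" href=\"/page\">Go <b>now</b></a>"
        + "<script>var x = 1;</script>";

    static Document doc() { return Jsoup.parse(HTML); }
    static Element link() { return doc().getElementById("link"); }

    // attr(key) reads an attribute; attr(key, value) sets it.
    static String href() { return link().attr("href"); }
    static String updatedHref() {
        Element a = link();
        a.attr("href", "/other");
        return a.attr("href");
    }

    // attributes() exposes all attributes: id, class, and href here.
    static int attributeCount() { return link().attributes().size(); }

    static String id() { return link().id(); }
    static String className() { return link().className(); }
    static boolean hasBigClass() { return link().classNames().contains("big"); }

    // text() returns the combined text of the element and its descendants.
    static String text() { return link().text(); }

    // text(value) replaces the content, including any child elements.
    static int childrenAfterSetText() {
        Element a = link();
        a.text("Stop");
        return a.children().size();
    }

    // html() is the inner markup; outerHtml() also includes the <a> tag itself.
    static boolean innerHtmlKeepsBold() { return link().html().contains("<b>now</b>"); }
    static boolean outerHtmlStartsWithTag() { return link().outerHtml().startsWith("<a "); }

    // data() returns the raw content of script and style elements.
    static String scriptData() {
        return doc().getElementsByTag("script").first().data();
    }

    static String tagName() { return link().tagName(); }
    static String tagObjectName() { return link().tag().getName(); }

    public static void main(String[] args) {
        System.out.println(id() + " -> " + href() + " (" + text() + ")");
        System.out.println(scriptData());
    }
}
```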
- Methods for manipulating HTML and text
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
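To illustrate the manipulation methods above, here is a small sketch that edits a single div and then reports the resulting child elements. The markup, class name, and helper methods are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulationSketch {
    // Hypothetical starting point: a <div> containing a single paragraph.
    static Element box() {
        Document doc = Jsoup.parse("<div id=\"box\"><p>one</p></div>");
        return doc.getElementById("box");
    }

    // Applies each append/prepend variant and lists the child element tags in order.
    static String childTagsAfterEdits() {
        Element box = box();
        box.append("<p>two</p>");                 // parsed HTML added at the end
        box.prepend("<p>zero</p>");               // parsed HTML added at the start
        box.appendText(" tail");                  // plain text node at the end
        box.prependText("head ");                 // plain text node at the start
        box.appendElement("span").text("end");    // new empty element at the end
        box.prependElement("h2").text("start");   // new empty element at the start

        StringBuilder tags = new StringBuilder();
        for (Element child : box.children()) {
            if (tags.length() > 0) {
                tags.append(",");
            }
            tags.append(child.tagName());
        }
        return tags.toString();
    }

    // html(value) discards the existing children and parses the new inner HTML.
    static String textAfterReplace() {
        Element box = box();
        box.html("<em>replaced</em>");
        return box.child(0).tagName() + ":" + box.text();
    }

    public static void main(String[] args) {
        System.out.println(childTagsAfterEdits());
        System.out.println(textAfterReplace());
    }
}
```

Text nodes added with appendText and prependText do not appear in children(), which is why only element tags are listed.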
- See also: Data extraction: selector syntax (a reference for finding elements with selector syntax)
Example code on GitHub: https://github.com/Java-Group-Bling/Jsoup-learn