jsoup is a Java HTML parser. It can parse HTML directly from a URL or from HTML text, and it provides DOM methods, CSS selectors, and jQuery-like operations for extracting and manipulating data.
Main functions of jsoup:
1. Parse HTML from a URL, file, or string
2. Extract data using DOM traversal or CSS (jQuery-like) selectors
3. Manipulate HTML elements, attributes, and text
 
Jsoup parsing:
jsoup provides a series of static parsing methods that produce a Document object:

static Document parse(File in, String charsetName)
static Document parse(File in, String charsetName, String baseUri)
static Document parse(InputStream in, String charsetName, String baseUri)
static Document parse(String html)
static Document parse(String html, String baseUri)
static Document parse(URL url, int timeoutMillis)
static Document parseBodyFragment(String bodyHtml)
static Document parseBodyFragment(String bodyHtml, String baseUri)

Note:
1. baseUri is the URL against which relative URLs found in the document are resolved
2. charsetName is the character set of the input
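
The role of baseUri is easiest to see with a relative link. The following is a minimal sketch, assuming jsoup is on the classpath; the class name ParseSketch, its helper methods, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseSketch {
    // Parses an HTML string with a base URI and resolves the first link against it.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        // absUrl("href") resolves the relative href against the base URI
        return link.absUrl("href");
    }

    // Parses a body fragment; jsoup wraps it in an html/head/body skeleton.
    static String fragmentBodyHtml(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(resolveFirstLink(
                "<a href='/docs/index.html'>Docs</a>", "http://www.example.com/"));
        System.out.println(fragmentBodyHtml("<p>Hello</p>"));
    }
}
```

Without a base URI, absUrl returns an empty string for a relative href, which is why the three-argument File and InputStream overloads take one.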
********************************************************************************
 
 
   
Connection provides methods for fetching the content of web pages; I generally use it to crawl data from a page.
Note:
Connection connect(String url): static method on Jsoup; creates a connection for the given URL (must be HTTP or HTTPS)
Connection cookie(String name, String value): sets a cookie to send with the request
Connection data(Map<String, String> data): passes request parameters
Connection data(String... keyvals): passes request parameters as alternating keys and values
Document get(): sends the request as a GET and parses the returned result
Document post(): sends the request as a POST and parses the returned result
Connection userAgent(String userAgent): sets the User-Agent request header
Connection header(String name, String value): adds a request header
Connection referrer(String referrer): sets the request's referrer (the Referer header)
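
A request can be assembled and inspected before it is sent. The following is a rough sketch, assuming jsoup is on the classpath; the class name ConnectionSketch, the URL, and the header, cookie, and parameter values are placeholders. It only builds the request, so it does not touch the network.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionSketch {
    // Builds (but does not send) a request with headers, a cookie, and a parameter.
    static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")               // User-Agent header
                .referrer("http://www.example.com/")    // Referer header
                .header("Accept-Language", "en")        // arbitrary request header
                .cookie("session", "abc123")            // hypothetical cookie
                .data("q", "jsoup");                    // request parameter
    }

    public static void main(String[] args) {
        Connection conn = buildRequest("http://www.example.com/search");
        // request() exposes what will be sent
        System.out.println(conn.request().header("User-Agent"));
        System.out.println(conn.request().cookie("session"));
        // conn.get() or conn.post() would send the request and return a parsed Document
    }
}
```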
   
********************************************************************************
jsoup provides JavaScript-like DOM methods for getting HTML elements:

getElementById(String id): gets the element with the given ID
getElementsByTag(String tag): gets elements by tag name
getElementsByClass(String className): gets elements by class name
getElementsByAttribute(String key): gets elements that have the given attribute
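
The lookup methods above can be sketched as follows, assuming jsoup is on the classpath. The class name DomLookupSketch and the sample HTML are invented for illustration; the last call uses nextElementSibling(), one of jsoup's sibling-navigation methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomLookupSketch {
    // Hypothetical sample markup
    static final String HTML =
            "<div id='menu'>"
          + "<a class='nav' href='/home'>Home</a>"
          + "<a class='nav' href='/news'>News</a>"
          + "<span title='x'>Other</span>"
          + "</div>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = doc();
        Element menu = doc.getElementById("menu");              // single element or null
        Elements links = doc.getElementsByTag("a");             // all <a> elements
        Elements navs = doc.getElementsByClass("nav");          // elements with class "nav"
        Elements titled = doc.getElementsByAttribute("title");  // elements with a title attribute
        Element next = links.first().nextElementSibling();      // the element right after the first link

        System.out.println(menu.tagName());
        System.out.println(links.size() + " " + navs.size() + " " + titled.size());
        System.out.println(next.text());
    }
}
```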
 
The following methods are also available for accessing sibling elements: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
********************************************************************************
 Getting and setting element data:

attr(String key): gets the value of an attribute
attr(String key, String value): sets the value of an attribute
attributes(): gets all attributes
id(), className(), classNames(): get the ID and class values
text(): gets the text content
text(String value): sets the text content
html(): gets the inner HTML
html(String value): sets the inner HTML
outerHtml(): gets the outer HTML (the element itself plus its contents)
data(): gets the data content (for example, the contents of a script or style element)
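
A short sketch of these accessors, assuming jsoup is on the classpath. The class name ElementDataSketch and the sample markup are hypothetical, and the exact whitespace in html() and outerHtml() output depends on jsoup's pretty-printing settings.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    // Returns a freshly parsed <div id="box" class="a b"> element.
    static Element box() {
        Document doc = Jsoup.parse("<div id='box' class='a b'><p>Hi <b>there</b></p></div>");
        return doc.getElementById("box");
    }

    // data() returns the raw content of data nodes such as <script>.
    static String scriptData() {
        Document doc = Jsoup.parse("<script>var x = 1;</script>");
        return doc.select("script").first().data();
    }

    public static void main(String[] args) {
        Element box = box();
        System.out.println(box.id());          // the id attribute
        System.out.println(box.className());   // the raw class attribute
        System.out.println(box.classNames());  // the class names as a set
        System.out.println(box.text());        // combined text of the element and its children

        Element p = box.child(0);
        System.out.println(p.html());          // inner HTML of <p>
        System.out.println(p.outerHtml());     // <p> itself plus its contents

        box.attr("title", "greeting");         // set an attribute
        System.out.println(box.attr("title")); // read it back
        System.out.println(scriptData());
    }
}
```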
 
tag(): gets the Tag object; tagName(): gets the tag name
********************************************************************************
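 Manipulating HTML elements:

append(String html), prepend(String html): add HTML at the end or start of the element's contents
appendText(String text), prependText(String text): add text at the end or start of the element's contents
appendElement(String tagName), prependElement(String tagName): create a new child element at the end or start and return it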
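
These methods can be chained to build up content. The following is a minimal sketch, assuming jsoup is on the classpath; the class name ManipulateSketch and the markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ManipulateSketch {
    // Starts from an empty <div> and fills it using the append/prepend family.
    static Element build() {
        Element div = Jsoup.parse("<div id='list'></div>").getElementById("list");
        div.append("<p>second</p>");              // HTML added at the end
        div.prepend("<p>first</p>");              // HTML added at the start
        div.appendText(" tail");                  // plain text added at the end
        div.appendElement("span").text("new");    // new last child, then set its text
        div.prependElement("h1").text("Title");   // new first child, then set its text
        return div;
    }

    public static void main(String[] args) {
        System.out.println(build().outerHtml());
    }
}
```

appendElement and prependElement return the newly created child rather than the parent, which is why text(...) in the last two calls applies to the span and the h1.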
 
html(String value): replaces the element's inner HTML
********************************************************************************
 jsoup also provides selectors similar to those in jQuery.

 Using selectors to retrieve data:

tagname: locates elements by tag name, for example a
ns|tag: locates elements by tag within a namespace, for example fb|name finds <fb:name> elements
#id: locates elements by ID, for example #logo
.class: locates elements by the class attribute, for example .head
*: locates all elements
[attribute]: locates elements that have the attribute, for example [href] retrieves all elements with an href attribute
[^attr]: locates elements by attribute name prefix, for example [^data-] finds elements with HTML5 dataset attributes
[attr=value]: locates elements by attribute value, for example [width=500] finds all elements whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value]: the attribute value starts with, ends with, or contains value, respectively
[attr~=regex]: filters attribute values with a regular expression, for example img[src~=(?i)\.(png|jpe?g)]
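
The basic selectors above can be tried out with Element.select. This is a sketch under stated assumptions: jsoup is on the classpath, and the class name BasicSelectorSketch, the count helper, and the sample HTML are hypothetical.

```java
import org.jsoup.Jsoup;

public class BasicSelectorSketch {
    // Hypothetical sample markup
    static final String HTML =
            "<div id='logo' class='head'>"
          + "<img src='a.PNG' width='500'>"
          + "<img src='b.gif' data-id='7'>"
          + "</div>"
          + "<a href='http://example.com/x.pdf'>pdf</a>"
          + "<a name='top'>top</a>";

    // Returns how many elements in the sample document match the CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a", "#logo", ".head", "[href]", "[^data-]", "[width=500]",
            "[href^=http]", "[href$=.pdf]", "[href*=example]",
            "img[src~=(?i)\\.(png|jpe?g)]"   // the backslash is doubled for the Java string
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```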
 
The above is the most basic selector syntax; these selectors can also be combined.
 Combined usage:

el#id: locates an element with the given ID, for example a#logo matches <a id=logo href= ... >
el.class: locates elements with the given class, for example div.head matches <div class="head">xxxx</div>
el[attr]: locates elements that have the given attribute, for example a[href]
Any combination of the three above, for example a[href]#logo, a[name].outerlink
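
As a quick check of the combined forms, here is a small sketch assuming jsoup is on the classpath; the class name CombinedSelectorSketch, the firstText helper, and the markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class CombinedSelectorSketch {
    // Hypothetical sample markup
    static final String HTML =
            "<a id='logo' href='/'>Home</a>"
          + "<a name='top' class='outerlink'>Top</a>"
          + "<div class='head'>Header</div>"
          + "<div>Plain</div>";

    // Returns the text of the first match, or null when nothing matches.
    static String firstText(String cssQuery) {
        Element first = Jsoup.parse(HTML).select(cssQuery).first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        System.out.println(firstText("a#logo"));            // tag + id
        System.out.println(firstText("div.head"));          // tag + class
        System.out.println(firstText("a[href]#logo"));      // tag + attribute + id
        System.out.println(firstText("a[name].outerlink")); // tag + attribute + class
        System.out.println(firstText("div#logo"));          // no div has that id
    }
}
```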
 
 
********************************************************************************
 
 In addition to the basic syntax and its combinations, jsoup also supports filtering elements with pseudo-selector expressions (sibling indexes start at 0):

 :lt(n): elements whose sibling index is less than n; for example td:lt(3) matches the cells in the first three columns
 :gt(n): elements whose sibling index is greater than n; for example div p:gt(2) matches p elements inside a div whose sibling index is greater than 2
 :eq(n): elements whose sibling index equals n; for example form input:eq(1) matches an input inside a form that is the second child element of its parent
 :has(selector): elements that contain a match for the selector; for example div:has(p) matches div elements that contain a p element
 :not(selector): elements that do not match the selector; for example div:not(.logo) matches all div elements without class="logo"
 :contains(text): elements that contain the given text, case-insensitive; for example p:contains(oschina)
 :containsOwn(text): elements whose own text (not the text of their descendants) contains the given text
 :matches(regex): elements whose text matches the regular expression; for example div:matches((?i)login)
 :matchesOwn(regex): elements whose own text matches the regular expression
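
The difference between the index-based filters and between :contains and :containsOwn is easier to see on a concrete document. The following is a sketch assuming jsoup is on the classpath; the class name PseudoSelectorSketch, the count helper, and the sample HTML are invented for illustration.

```java
import org.jsoup.Jsoup;

public class PseudoSelectorSketch {
    // Hypothetical sample markup: one table row with four cells, and two divs
    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
          + "<div class='logo'><p>Login here</p></div>"
          + "<div><span>no paragraph</span></div>";

    // Returns how many elements in the sample document match the CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(3)"));               // cells with sibling index 0, 1, 2
        System.out.println(count("td:gt(2)"));               // cells with sibling index above 2
        System.out.println(count("td:eq(1)"));               // the cell with sibling index 1
        System.out.println(count("div:has(p)"));             // divs containing a <p>
        System.out.println(count("div:not(.logo)"));         // divs without class "logo"
        System.out.println(count("div:contains(login)"));    // text anywhere below the div, any case
        System.out.println(count("div:containsOwn(login)")); // only text directly inside the div
        System.out.println(count("p:matches((?i)login)"));   // regex against the element's text
    }
}
```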
  
 
 
  
 
********************************************************************************
  
 
 Using jsoup

 // URL as the input source
 Document doc = Jsoup.connect("http://www.example.com").timeout(60000).get();

 // File as the input source
 File input = new File("/tmp/input.html");
 Document doc = Jsoup.parse(input, "UTF-8", "http://www.example.com/");

 // String as the input source
 Document doc = Jsoup.parse(htmlStr);

As in JavaScript, jsoup provides the following methods for finding elements:

getElementById(String id): gets the element by ID
 
getElementsByTag(String tag): gets elements by tag name
getElementsByClass(String className): gets elements by class name
getElementsByAttribute(String key): gets elements that have the given attribute

The following methods give access to sibling elements:

siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
 
 
 
Get and set an element's data in the following ways:

attr(String key): gets the value of an attribute
attr(String key, String value): sets the value of an attribute
attributes(): gets all attributes
id(), className(), classNames(): get the ID and class values
text(): gets the text content
text(String value): sets the text content
html(): gets the inner HTML
html(String value): sets the inner HTML
outerHtml(): gets the outer HTML (the element itself plus its contents)
data(): gets the data content (for example, the contents of a script or style element)
tag(): gets the Tag object; tagName(): gets the tag name
 
The following methods are provided for manipulating HTML:

append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
 For example:

 Document doc = Jsoup.connect("http://example.com")
         .data("key1", "value1")             // multiple request parameters can be sent, as an asynchronous (AJAX) request would
         .data("key2", "value2")
         .userAgent("Mozilla")
         .cookie("cookie1", "cookieValue1")  // multiple cookies can be sent
         .cookie("cookie2", "cookieValue2")
         .timeout(3000)                      // maximum wait, in milliseconds
         .post();                            // or .get()

 Note: before making a request, work out which cookies it requires. You can view cookies in the Application panel of Chrome's developer tools (F12).
 
 
 
 
  
 
 
 
 
  
 Common jsoup methods for crawlers, with an implementation of asynchronous-style requests