Jsoup Introduction
Jsoup is a Java HTML parser. It can parse a URL or a piece of HTML text, and it offers jQuery-like operations for finding and manipulating data through the DOM. To use it, add the Jsoup jar to your project.
Jsoup can load an HTML document from a string, a URL, or a local file, produce a Document object, and then manipulate the data in that document.
For example:
// from a URL
Document docFromUrl = Jsoup.connect("http://www.cnblogs.com/wishyouhappy").get();
// from an HTML string
String html = "<html>...</html>"; // any HTML text
Document docFromString = Jsoup.parse(html);
// from a file; the third parameter is the base URI
File input = new File("d:/test.html");
Document docFromFile = Jsoup.parse(input, "UTF-8", "http://www.cnblogs.com/wishyouhappy");
Data manipulation example:
Document doc = Jsoup.connect("http://www.cnblogs.com/wishyouhappy").get();
System.out.println(doc.title());
Common functions
Parsing:
static Document parse(File in, String charsetName)
static Document parse(File in, String charsetName, String baseUri)
static Document parse(InputStream in, String charsetName, String baseUri)
static Document parse(String html)
static Document parse(String html, String baseUri)
static Document parse(URL url, int timeoutMillis)
static Document parseBodyFragment(String bodyHtml)
static Document parseBodyFragment(String bodyHtml, String baseUri)
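As a rough illustration of two of the parse methods listed above, the sketch below parses a full document with a base URI and a body fragment. The class name ParseSketch, the helper names, and the /p/1.html path are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseSketch {
    // Parse a document with a base URI so relative links can be resolved.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    // Parse a fragment; jsoup places it inside a generated body element.
    static String fragmentText(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().text();
    }

    public static void main(String[] args) {
        // The relative href is a hypothetical path, used only for illustration.
        System.out.println(resolveFirstLink(
                "<a href='/p/1.html'>post</a>",
                "http://www.cnblogs.com/wishyouhappy/"));
        // prints http://www.cnblogs.com/p/1.html
        System.out.println(fragmentText("<p>Hello</p>")); // prints Hello
    }
}
```

Without a base URI, absUrl returns an empty string for relative links, which is why the base URI overloads matter when parsing strings or files.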
URL connection:
Connection connect(String url) // create a connection for the given URL (must be http or https)
Connection cookie(String name, String value) // set a cookie to send with the request
Connection data(Map<String,String> data) // pass request parameters
Connection data(String... keyvals) // pass request parameters
Document get() // send the request as GET and parse the returned result
Document post() // send the request as POST and parse the returned result
Connection userAgent(String userAgent) // set the User-Agent request header
Connection header(String name, String value) // add a request header
Connection referrer(String referrer) // set the request referrer
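The connection methods above can be chained. Here is a minimal sketch that only builds and inspects the request without sending it, so it runs without network access. The class name ConnectionSketch, the helper prepare, and the header values are assumptions for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionSketch {
    // Build a configured connection; nothing is sent until get() or post() is called.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")              // illustrative User-Agent value
                .header("Accept-Language", "en")       // extra request header
                .referrer("http://www.cnblogs.com/")   // request source
                .cookie("auth", "token")               // cookie sent with the request
                .data("query", "Java")                 // request parameter
                .timeout(3000);                        // timeout in milliseconds
    }

    public static void main(String[] args) {
        Connection conn = prepare("http://www.cnblogs.com/wishyouhappy");
        Connection.Request req = conn.request();
        System.out.println(req.header("User-Agent")); // prints Mozilla/5.0
        System.out.println(req.cookie("auth"));       // prints token
        System.out.println(req.timeout());            // prints 3000
        // Document doc = conn.get(); // would actually send the request
    }
}
```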
Get HTML elements:
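getElementById(String id) // get the element with the given ID
getElementsByTag(String tagName) // get elements by tag name
getElementsByClass(String className) // get elements by class
getElementsByAttribute(String key) // get elements that have the given attribute
siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling() // navigate between sibling elements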
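A small sketch of these lookup and sibling methods, run against an inline HTML string so no network is needed. The class name FindSketch and the sample markup are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FindSketch {
    // Sample markup (hypothetical): two links and a span inside one div.
    static final String HTML =
            "<div id='nav'><a class='item' href='/a'>A</a>"
          + "<a class='item' href='/b'>B</a><span title='t'>C</span></div>";

    static Document sample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element nav = doc.getElementById("nav");
        System.out.println(doc.getElementsByTag("a").size());           // prints 2
        System.out.println(doc.getElementsByClass("item").size());      // prints 2
        System.out.println(doc.getElementsByAttribute("title").size()); // prints 1

        Element first = nav.child(0);                           // the first <a>
        System.out.println(first.nextElementSibling().text());  // prints B
        System.out.println(first.lastElementSibling().text());  // prints C
        System.out.println(first.siblingElements().size());     // prints 2
    }
}
```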
Get and set element values:
attr(String key) // get an attribute value
attr(String key, String value) // set an attribute value
attributes() // get all attributes
id(), className(), classNames()
text() // get the text value
text(String value) // set the text value
html() // get the inner HTML
html(String value) // set the inner HTML
outerHtml()
data()
tag() // get the Tag
tagName() // get the tag name
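To make the getters and setters concrete, here is a short sketch on a single link element. The class name ValueSketch and the sample markup are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ValueSketch {
    // Returns a freshly parsed sample element (hypothetical markup).
    static Element sample() {
        Document doc = Jsoup.parse(
                "<a id='logo' class='big link' href='/home'>Home <b>page</b></a>");
        return doc.getElementById("logo");
    }

    public static void main(String[] args) {
        Element a = sample();
        System.out.println(a.attr("href"));   // prints /home
        System.out.println(a.id());           // prints logo
        System.out.println(a.className());    // prints big link
        System.out.println(a.tagName());      // prints a
        System.out.println(a.text());         // prints Home page
        System.out.println(a.html());         // inner HTML, including the <b> child

        a.attr("href", "/index");             // set an attribute
        a.text("Start");                      // replace the content with plain text
        System.out.println(a.outerHtml());    // the element itself plus its content
    }
}
```

The setters return the element, so calls such as attr(key, value) and text(value) can be chained.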
Add elements:
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
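The sketch below shows one way these methods might be combined to grow a list. The class name AppendSketch and the list markup are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AppendSketch {
    // Start from a one-item list and add content before and after it.
    static Element build() {
        Document doc = Jsoup.parseBodyFragment("<ul id='list'><li>two</li></ul>");
        Element list = doc.getElementById("list");
        list.prepend("<li>one</li>");                         // HTML added as first child
        list.appendElement("li").text("three").appendText("!"); // new last child, then extra text
        list.child(1).prependText("number ");                 // text added at the start of "two"
        return list;
    }

    public static void main(String[] args) {
        Element list = build();
        System.out.println(list.children().size()); // prints 3
        System.out.println(list.child(0).text());   // prints one
        System.out.println(list.child(1).text());   // prints number two
        System.out.println(list.child(2).text());   // prints three!
    }
}
```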
Selector:
| Selector | Description |
| --- | --- |
| tagname | Locate by tag name, for example a |
| ns\|tag | Locate by tag within a namespace, for example fb\|name finds <fb:name> elements |
| #id | Locate by element ID, for example #logo |
| .class | Locate by the element's class attribute, for example .head |
| [attribute] | Locate by attribute, for example [href] retrieves all elements that have an href attribute |
| [^attr] | Locate by attribute name prefix, for example [^data-] finds elements with HTML5 dataset attributes |
| [attr=value] | Locate by attribute value, for example [width=500] finds all elements whose width attribute is 500 |
| [attr^=value], [attr$=value], [attr*=value] | The attribute value begins with, ends with, or contains value, respectively |
| [attr~=regex] | Filter attribute values with a regular expression, for example img[src~=(?i)\.(png\|jpe?g)] |
| * | Locate all elements |
| el#id | Locate an element with the given ID, for example a#logo matches <a id=logo href= ... > |
| el.class | Locate an element with the given class, for example div.head matches <div class="head">xxxx</div> |
| el[attr] | Locate all elements of that type that define the attribute, for example a[href] |
| Any combination of the above | For example a[href]#logo, a[name].outerlink |
| ancestor child | Child elements that descend from ancestor at any depth |
| parent > child | Child elements that are direct children of parent |
| siblingA + siblingB | Sibling B elements immediately preceded by sibling A |
| siblingA ~ siblingX | Sibling X elements preceded by sibling A |
| :lt(n) | Elements whose sibling index is less than n, for example td:lt(3) finds the first three cells of each row |
| :gt(n) | Elements whose sibling index is greater than n, for example div p:gt(2) finds p elements inside a div after the third sibling |
| :eq(n) | Elements whose sibling index equals n, for example form input:eq(1) |
| :has(selector) | Elements that contain an element matching the selector, for example div:has(p) finds divs containing a p element |
| :not(selector) | Elements that do not match the selector, for example div:not(.logo) finds all divs without class="logo" |
| :contains(text) | Elements that contain the given text, case-insensitive, for example p:contains(oschina) |
| :containsOwn(text) | Elements whose own text (not that of their descendants) contains the given text |
| :matches(regex) | Elements whose text matches the regular expression, for example div:matches((?i)login) |
| :matchesOwn(regex) | Elements whose own text matches the regular expression |
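Several of the selectors in the table can be tried with Element.select. The following is a sketch against a small inline document. The class name SelectorSketch, the count helper, and the sample markup are all invented for this example.

```java
import org.jsoup.Jsoup;

public class SelectorSketch {
    // Hypothetical page: a header div with a logo link, and a content div.
    static final String HTML =
            "<div class='head'><a id='logo' href='/index.html'><img src='logo.PNG'></a></div>"
          + "<div class='content'><p>Please login</p><p>second</p><p>third</p>"
          + "<a href='http://example.com/x.pdf'>doc</a></div>";

    // Number of elements in the sample page matched by a CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a[href]"));                        // prints 2
        System.out.println(count("a#logo"));                         // prints 1
        System.out.println(count("div.content > p"));                // prints 3
        System.out.println(count("div.content p:lt(2)"));            // prints 2
        System.out.println(count("img[src~=(?i)\\.(png|jpe?g)]"));   // prints 1
        System.out.println(count("div:has(p)"));                     // prints 1
        System.out.println(count("div:not(.head)"));                 // prints 1
        System.out.println(count("p:contains(LOGIN)"));              // prints 1
    }
}
```

Note that :lt(2) counts the element's position among its siblings, so here it keeps the first two p elements rather than testing how many p elements the div holds.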
Example:
package jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Created by: wish
 * Created: June 13, 2014 PM 1:22:49
 */
public class BlogCatch {

    /**
     * Sends a POST request with parameters, headers, and a cookie, then prints the page title.
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // getArticleTitle("http://www.cnblogs.com/wishyouhappy");
        Document doc = Jsoup.connect("http://www.cnblogs.com/wishyouhappy")
                .data("query", "Java")    // request parameters
                .userAgent("I'm jsoup")   // set the User-Agent
                .cookie("auth", "token")  // set a cookie
                .timeout(3000)            // set the connection timeout in milliseconds
                .post();
        System.out.println(doc.title());
    }

    /**
     * Prints the body of the document parsed from the given HTML string.
     * @param html
     */
    @SuppressWarnings("unused")
    private static void getBlogBodyByString(String html) {
        Document doc = Jsoup.parse(html);
        System.out.println(doc.body());
    }

    /**
     * Prints the document body loaded from a URL.
     * @param url
     * @throws IOException
     */
    @SuppressWarnings("unused")
    private static void getBlogBodyByUrl(String url) throws IOException {
        // load the HTML document directly from the URL
        Document doc2 = Jsoup.connect(url).get();
        String body = doc2.body().toString();
        System.out.println(body);
    }

    /**
     * Prints the title and link of each post on the blog.
     * @param url
     */
    public static void getArticleTitle(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements listDiv = doc.getElementsByAttributeValue("class", "postTitle");
            for (Element element : listDiv) {
                Elements links = element.getElementsByTag("a");
                for (Element link : links) {
                    String linkHref = link.attr("href");
                    String linkText = link.text().trim();
                    System.out.println(linkHref);
                    System.out.println(linkText);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Prints the content of the specified blog post.
     * @param url
     */
    public static void getBlog(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements listDiv = doc.getElementsByAttributeValue("class", "postBody");
            for (Element element : listDiv) {
                System.out.println(element.html());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}