Jsoup is a Java library used primarily for parsing HTML.
In a crawler, after a framework such as HttpClient has fetched the source of a web page, we still need to extract the content we want from that source.
An HTML parser such as Jsoup makes this easy.
Jsoup can also fetch a page's source directly from a URL, but it supports only the HTTP and HTTPS protocols, so it is less flexible than a dedicated HTTP client.
Its main use is therefore parsing HTML.
The HTML to be parsed can come from a string, a URL, or a file.
org.jsoup.Jsoup converts the input HTML into an org.jsoup.nodes.Document object, from which the desired elements are then extracted.
org.jsoup.nodes.Document extends org.jsoup.nodes.Element, which in turn extends org.jsoup.nodes.Node, so a Document offers a rich set of methods for getting at the elements of the HTML.
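◇ Parsing an HTML string
String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);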
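As a minimal, self-contained sketch of the string case: the class name ParseStringSketch, the helper titleOf, and the sample markup are invented for illustration; only Jsoup.parse and Document.title come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringSketch {
    // Parses an in-memory HTML string and returns the text of its <title> element.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```

Jsoup is tolerant of incomplete markup, so the same call also works on fragments that lack html, head, or body tags.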
◇ Fetching HTML from a URL
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
The Jsoup.connect("xxx") method returns an org.jsoup.Connection object.
On the Connection object we call get() or post() to execute the request. Before the request is executed,
we can use the Connection object to set request information such as headers, cookies, the request timeout, a proxy, and so on, in order to simulate the behavior of a browser.
Document doc = Jsoup.connect("http://example.com")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000) // timeout in milliseconds (example value)
    .post();
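The following sketch shows how the request settings accumulate on the Connection before anything is sent. It deliberately stops short of executing the request, so it runs without network access. The class name ConnectionSketch, the helper build, and the 3000 ms timeout are assumptions made for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionSketch {
    // Builds, but does not send, a POST request that mimics a browser.
    static Connection build(String url) {
        return Jsoup.connect(url)
                .data("query", "Java")          // form parameter
                .userAgent("Mozilla")           // User-Agent header
                .cookie("auth", "token")        // request cookie
                .timeout(3000)                  // milliseconds (example value)
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        Connection conn = build("http://example.com");
        Connection.Request req = conn.request();
        System.out.println(req.method() + " " + req.url() + " timeout=" + req.timeout() + "ms");
        // conn.execute().parse() would send the request and return the Document.
    }
}
```

Inspecting conn.request() is a convenient way to check what will be sent when debugging a crawler.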
◇ Loading HTML from a file
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
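To illustrate what the third argument (the base URI) is for, here is a hedged sketch that writes sample markup to a temporary file, parses it, and resolves a relative link. The class name ParseFileSketch, the helper firstAbsoluteLink, and the sample markup are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseFileSketch {
    // Writes html to a temp file, parses it, and returns the absolute URL of the first link
    // (or an empty string if there is no link).
    static String firstAbsoluteLink(String html, String baseUri) {
        try {
            File input = File.createTempFile("jsoup-sketch", ".html");
            try {
                Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
                Document doc = Jsoup.parse(input, "UTF-8", baseUri);
                Element link = doc.select("a[href]").first();
                return link == null ? "" : link.absUrl("href");
            } finally {
                input.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstAbsoluteLink("<a href='/about'>About</a>", "http://example.com/"));
    }
}
```

Without a base URI, absUrl cannot resolve relative links and returns an empty string.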
Once we have the Document object, the next step is to extract the elements we want from it.
Document provides many methods for getting specific elements.
◇ Using DOM methods
getElementById(String id): gets an element by ID
getElementsByTag(String tagName): gets elements by tag name
getElementsByClass(String className): gets elements by class name
getElementsByAttribute(String key): gets elements that have the named attribute
getElementsByAttributeValue(String key, String value): gets elements whose named attribute has the given value
getAllElements(): gets all elements
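The DOM-style methods listed above can be tried out on a small document. This is only a sketch: the class name DomLookupSketch and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomLookupSketch {
    // Sample markup invented for this sketch.
    static final String HTML =
            "<div id=logo class=masthead><a href='/home' title='Home'>Home</a></div>"
          + "<div class=content><p>One</p><p>Two</p></div>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = doc();
        Element logo = doc.getElementById("logo");                  // single element or null
        Elements paragraphs = doc.getElementsByTag("p");            // all <p> elements
        Elements mastheads = doc.getElementsByClass("masthead");    // class="masthead"
        Elements withHref = doc.getElementsByAttribute("href");     // any element with href
        Elements homeLinks = doc.getElementsByAttributeValue("title", "Home");
        System.out.println(logo.tagName() + " " + paragraphs.size() + " " + mastheads.size()
                + " " + withHref.size() + " " + homeLinks.size());
    }
}
```

Note that getElementById returns null when nothing matches, whereas the getElementsBy... methods return an empty Elements collection.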
◇ Finding elements with CSS- or jQuery-like selectors
The Element class provides the following method:
public Elements select(String cssQuery)
It finds the matching elements from a selector string similar to CSS or jQuery.
Example:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // images with a .png extension
Element masthead = doc.select("div.masthead").first(); // the first div with class masthead
Elements resultLinks = doc.select("h3 > a"); // a elements directly under an h3 element
Selector syntax (the full reference is in org.jsoup.select.Selector):
tagname: finds elements by tag, e.g. a
ns|tag: finds elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: finds elements by ID, e.g. #logo
.class: finds elements by class name, e.g. .masthead
[attribute]: finds elements that have the attribute, e.g. [href]
[^attr]: finds elements that have an attribute name with the given prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: finds elements whose attribute has the given value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: finds elements whose attribute value starts with, ends with, or contains the given value, e.g. [href*=/path/]
[attr~=regex]: finds elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: matches all elements
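The basic selectors above can be checked against a small document. In this sketch the class name AttributeSelectorSketch, the helper count, and the sample markup are hypothetical; the selector strings are the ones from the list.

```java
import org.jsoup.Jsoup;

public class AttributeSelectorSketch {
    // Sample markup invented for this sketch.
    static final String HTML =
            "<div id=logo class=masthead>"
          + "<img src='a.png'><img src='b.JPG'><img src='c.gif'>"
          + "<a href='/path/page' data-id='7'>in</a>"
          + "<a href='http://other.example/'>out</a>"
          + "</div>";

    // Returns how many elements of the sample document match cssQuery.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "#logo", ".masthead", "[href]", "[^data-]",
            "img[src$=.png]", "[href*=/path/]", "a[href^=http]",
            "img[src~=(?i)\\.(png|jpe?g)]"   // backslash doubled for the Java string
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```

The regular-expression form is the most flexible of these: with the (?i) flag it matches both a.png and b.JPG.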
Combining selectors
el#id: element with an ID, e.g. div#logo
el.class: element with a class, e.g. div.masthead
el[attr]: element with an attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: finds descendant elements of an element, e.g. body p finds all p elements anywhere under the body element
parent > child: finds the direct children of a parent element, e.g. div.content > p finds p elements directly under div.content, and body > * finds all direct children of the body tag
siblingA + siblingB: finds the sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: finds sibling X elements preceded by sibling A, e.g. h1 ~ p
el, el, el: groups several selectors and finds the unique elements that match any of them, e.g. div.masthead, div.logo
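The difference between the descendant, child, and sibling combinators is easiest to see by counting matches. This is a sketch under assumed markup; the class name CombinatorSketch and the helper count are invented for illustration.

```java
import org.jsoup.Jsoup;

public class CombinatorSketch {
    // Sample markup invented for this sketch.
    static final String HTML =
            "<div class=head><h1>Title</h1><p>Intro</p><p>More</p></div>"
          + "<div class=content>"
          + "<p>Body <a href='/x' class=highlight>link</a></p>"
          + "<section><p>Nested</p></section>"
          + "</div>";

    // Returns how many elements of the sample document match cssQuery.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div.content p"));          // descendants at any depth
        System.out.println(count("div.content > p"));        // direct children only
        System.out.println(count("div.head + div"));         // div immediately after div.head
        System.out.println(count("h1 ~ p"));                 // p siblings that follow an h1
        System.out.println(count("a[href].highlight"));      // combined attribute and class
        System.out.println(count("div.head, div.content"));  // union of two selectors
    }
}
```

Here "div.content p" also finds the paragraph nested inside the section element, while "div.content > p" does not.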
Pseudo selectors
:lt(n): finds elements whose sibling index (their position relative to the parent node in the DOM tree) is less than n, e.g. td:lt(3) finds the first three cells of each row
:gt(n): finds elements whose sibling index is greater than n, e.g. div p:gt(2) finds p elements inside a div whose sibling index is greater than 2
:eq(n): finds elements whose sibling index is equal to n, e.g. form input:eq(1) finds input elements inside a form whose sibling index is 1
:has(selector): finds elements that contain an element matching the selector, e.g. div:has(p) finds divs that contain a p element
:not(selector): finds elements that do not match the selector, e.g. div:not(.logo) finds all divs that do not have class="logo"
:contains(text): finds elements that contain the given text; the search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text): finds elements whose own text (not that of their descendants) contains the given text
:matches(regex): finds elements whose text matches the given regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): finds elements whose own text matches the given regular expression
Note: the indexes used by the pseudo selectors above start at 0, so the first element has index 0, the second has index 1, and so on.
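The zero-based indexing and the "own text" distinction are easy to get wrong, so here is a small sketch that exercises them. The class name PseudoSelectorSketch, the helper count, and the sample markup are assumptions made for illustration.

```java
import org.jsoup.Jsoup;

public class PseudoSelectorSketch {
    // Sample markup invented for this sketch.
    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
          + "<div class=logo><p>jsoup rocks</p></div>"
          + "<div><span>Login here</span></div>";

    // Returns how many elements of the sample document match cssQuery.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(3)"));               // cells with index 0, 1, 2
        System.out.println(count("td:gt(2)"));               // cells with index above 2
        System.out.println(count("td:eq(1)"));               // the second cell
        System.out.println(count("div:has(p)"));             // divs containing a p
        System.out.println(count("div:not(.logo)"));         // divs without class logo
        System.out.println(count("p:contains(JSOUP)"));      // case-insensitive text match
        System.out.println(count("div:contains(login)"));    // text anywhere inside the div
        System.out.println(count("div:containsOwn(login)")); // text directly in the div only
        System.out.println(count("div:matches((?i)login)")); // regex against the text
    }
}
```

The last div holds its text inside a span, so :contains matches the div but :containsOwn does not.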
With the selectors above we get an Elements object, which extends ArrayList<Element> and holds all the matching Element objects.
The next step is to take what we actually want out of each Element.
The usual methods are:
◇ Element.text()
Gets the text inside an element.
◇ Element.html() or Node.outerHtml()
Gets the HTML content of an element; html() returns the inner HTML and outerHtml() includes the element's own tag.
◇ Node.attr(String key)
Gets the value of an attribute, for example the value of href in <a href="">.
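A short sketch of these extraction methods on a single link follows. The class name ExtractSketch, the helper firstLink, and the sample markup are hypothetical; absUrl is shown in addition because it depends on the base URI passed to Jsoup.parse.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ExtractSketch {
    // Sample markup invented for this sketch.
    static final String HTML =
            "<div id=box><a href='/docs' title='Docs'>Read <b>the</b> docs</a></div>";

    // Parses the sample markup with a base URI and returns its first link.
    static Element firstLink() {
        return Jsoup.parse(HTML, "http://example.com/").select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink();
        System.out.println(link.text());          // combined text, tags stripped
        System.out.println(link.html());          // inner HTML
        System.out.println(link.outerHtml());     // HTML including the <a> tag itself
        System.out.println(link.attr("href"));    // raw attribute value
        System.out.println(link.absUrl("href"));  // attribute resolved against the base URI
    }
}
```

attr returns an empty string, not null, when the attribute is missing, so no null check is needed before using the result.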
HTML cleanup
Jsoup.clean, which is backed by the org.jsoup.safety.Cleaner class, sanitizes HTML against a whitelist of allowed tags and attributes and so helps prevent XSS attacks. (In jsoup 1.14.2 and later the Whitelist class is named Safelist.)
Example:
String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
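The same cleanup can be written against current releases as follows. This sketch assumes jsoup 1.14.2 or later, where the class is named Safelist; on older versions substitute Whitelist. The class name CleanSketch and the helper sanitize are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanSketch {
    // Removes everything not allowed by the basic safelist (simple text formatting and links).
    // Assumes jsoup 1.14.2+; older versions use org.jsoup.safety.Whitelist instead.
    static String sanitize(String unsafe) {
        return Jsoup.clean(unsafe, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe =
                "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        System.out.println(sanitize(unsafe));
    }
}
```

The onclick handler is dropped because it is not on the safelist, and the basic safelist adds rel="nofollow" to links.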
Comprehensive example:
This example uses HttpClient to fetch the HTML of a page and then uses Jsoup to parse it and extract the Geek Headlines content.
package com.csdn;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HttpClientJsoupTest01 {

    // url: pass in "http://www.csdn.net/"
    public void get(String url) {
        CloseableHttpClient client = HttpClients.createDefault(); // create a default HTTP client
        HttpGet get = new HttpGet(url);                           // define a GET request
        CloseableHttpResponse response = null;                    // holds the response
        try {
            response = client.execute(get);
            // Print the response status code; 200 indicates success
            System.out.println(response.getStatusLine().getStatusCode());
            HttpEntity entity = response.getEntity();             // get the response entity
            String html = EntityUtils.toString(entity);           // convert the entity body to a string

            /*
             * Next, use Jsoup to parse the HTML obtained above and get the
             * titles under Geek Headlines on the CSDN home page.
             */
            Document document = Jsoup.parse(html);                // convert the HTML into a Document
            // Select the list that holds the headline entries
            Element element = document.select("div.wrap .left .hot_blog ul").first();
            if (element == null) {
                // The page layout did not match the selector; nothing to print
                return;
            }
            Elements elements = element.select("a");              // get the collection of links
            for (Element element2 : elements) {
                System.out.println("Title: " + element2.attr("title")
                        + " -->> Address: " + element2.attr("href"));
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {                           // execute() may have failed
                    response.close();
                }
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
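The parsing half of the example can be separated from the HTTP half so that it can be exercised without network access. The following is only a sketch: the class name HeadlineExtractorSketch, the helper extract, and the sample markup are invented for illustration, and the real CSDN page structure may differ from the selector used here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HeadlineExtractorSketch {
    // Returns title -> address for every link inside the first element matching listSelector.
    // Returns an empty map when the selector matches nothing.
    static Map<String, String> extract(String html, String listSelector) {
        Map<String, String> result = new LinkedHashMap<>();
        Document document = Jsoup.parse(html);
        Element list = document.select(listSelector).first();
        if (list == null) {
            return result;
        }
        for (Element a : list.select("a[href]")) {
            result.put(a.attr("title"), a.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch; it only imitates the structure the selector expects.
        String html = "<div class='wrap'><div class='left'><div class='hot_blog'><ul>"
                + "<li><a title='First' href='http://example.com/1'>First</a></li>"
                + "<li><a title='Second' href='http://example.com/2'>Second</a></li>"
                + "</ul></div></div></div>";
        extract(html, "div.wrap .left .hot_blog ul")
                .forEach((title, href) -> System.out.println("Title: " + title + " -->> Address: " + href));
    }
}
```

Keeping the extraction in its own method means the HTML string can come from HttpClient, from Jsoup.connect, or from a saved file.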