Common jsoup methods for web crawling, with an asynchronous request implementation

Source: Internet
Author: User

jsoup is a Java HTML parser that can parse HTML directly from a URL or from HTML text. It provides an API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

Main features of jsoup:
1. Parse HTML from a URL, a file, or a string
2. Find and extract data using DOM traversal or CSS selectors (a jQuery-like selector syntax)
3. Manipulate HTML elements, attributes, and text

Parsing with jsoup: the Jsoup class provides a series of static parse methods that produce a Document object.
static Document parse(File in, String charsetName)
static Document parse(File in, String charsetName, String baseUri)
static Document parse(InputStream in, String charsetName, String baseUri)
static Document parse(String html)
static Document parse(String html, String baseUri)
static Document parse(URL url, int timeoutMillis)
static Document parseBodyFragment(String bodyHtml)
static Document parseBodyFragment(String bodyHtml, String baseUri)
Notes:
1. baseUri is the base URL against which relative URLs in the document are resolved.
2. charsetName is the character set of the input.
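To illustrate the parse methods and the role of baseUri, here is a minimal sketch. The class name ParseExample, its helper methods, and the sample HTML are made up for illustration; only Jsoup.parse, Jsoup.parseBodyFragment, select, absUrl, and body are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class ParseExample {
    // Parses an HTML string with a base URI and returns the absolute URL of the first link.
    static String firstLinkAbsUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        // absUrl resolves the (possibly relative) href against baseUri
        return link == null ? "" : link.absUrl("href");
    }

    // Parses a body fragment; jsoup wraps it in an html/head/body skeleton.
    static String fragmentBodyText(String bodyHtml) {
        Document doc = Jsoup.parseBodyFragment(bodyHtml);
        return doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/docs/index.html\">Docs</a></body></html>";
        System.out.println(firstLinkAbsUrl(html, "http://www.example.com/"));
        System.out.println(fragmentBodyText("<p>Hello</p>"));
    }
}
```

Without a baseUri, absUrl has nothing to resolve a relative href against and returns an empty string, which is why the two-argument parse is usually the safer choice for pages loaded from a string or file.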
************************************************************************
Connection provides methods for fetching web page content; I mostly use it to scrape data from pages.
static Connection Jsoup.connect(String url)  creates a connection for the given URL (must be HTTP or HTTPS)
Connection cookie(String name, String value)  sets a cookie to send with the request
Connection data(Map<String, String> data)  passes request parameters
Connection data(String... keyvals)  passes request parameters as alternating keys and values
Document get()  sends the request as a GET and parses the returned result
Document post()  sends the request as a POST and parses the returned result
Connection userAgent(String userAgent)  sets the User-Agent request header
Connection header(String name, String value)  adds a request header
Connection referrer(String referrer)  sets the request referrer (the Referer header)
************************************************************************
jsoup provides JavaScript-like DOM methods for getting HTML elements:
getElementById(String id)  gets the element with the given ID
getElementsByTag(String tag)  gets elements by tag name
getElementsByClass(String className)  gets elements by class name
getElementsByAttribute(String key)  gets elements that have the given attribute
The following methods give access to sibling elements: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
************************************************************************
Getting and setting element data:
attr(String key)  gets the value of an attribute
attr(String key, String value)  sets the value of an attribute
attributes()  gets all attributes
id(), className(), classNames()  get the ID, the class attribute value, and the set of class names
text()  gets the text content
text(String value)  sets the text content
html()  gets the inner HTML
html(String value)  sets the inner HTML
outerHtml()  gets the outer HTML (the element itself plus its contents)
data()  gets the data content, for example of a script or style element
tag() gets the Tag object and tagName() gets the tag name
************************************************************************
Manipulating HTML elements:
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
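The manipulation methods listed above can be sketched as follows. This is only an illustration under assumptions: the class ManipulateExample, its method names, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class ManipulateExample {
    // Builds a three-item list by appending and prepending to an empty <ul>.
    static Element buildMenu() {
        Document doc = Jsoup.parseBodyFragment("<ul id=\"menu\"></ul>");
        Element menu = doc.getElementById("menu");
        menu.append("<li>second</li>");          // parsed as HTML, added as the last child
        menu.prepend("<li>first</li>");          // parsed as HTML, added as the first child
        menu.appendElement("li").text("third");  // creates a new <li>, then sets its text
        return menu;
    }

    // appendText adds a text node, so markup in the argument is not parsed as HTML.
    static Element appendLiteralText() {
        Document doc = Jsoup.parseBodyFragment("<div id=\"note\"></div>");
        Element note = doc.getElementById("note");
        note.appendText("<b>not bold</b>");
        return note;
    }

    public static void main(String[] args) {
        for (Element li : buildMenu().children()) {
            System.out.println(li.text());
        }
        System.out.println(appendLiteralText().select("b").isEmpty());
    }
}
```

The difference between append and appendText matters when the inserted string comes from untrusted input: appendText keeps it as plain text.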
************************************************************************
jsoup also provides jQuery-like selectors. Using selectors to retrieve data:
tagname  locates elements by tag name, for example a
ns|tag  locates elements by tag in a namespace, for example fb|name finds <fb:name> elements
#id  locates elements by ID, for example #logo
.class  locates elements by class name, for example .head
*  matches all elements
[attribute]  locates elements that have the attribute, for example [href] finds all elements with an href attribute
[^attr]  locates elements by attribute name prefix, for example [^data-] finds elements with HTML5 dataset attributes
[attr=value]  locates elements by attribute value, for example [width=500] finds all elements whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value]  locate elements whose attribute value starts with, ends with, or contains value
[attr~=regex]  filters attribute values with a regular expression, for example img[src~=(?i)\.(png|jpe?g)]
The above is the most basic selector syntax. These forms can also be combined:
el#id  locates an element with the given ID, for example a#logo matches <a id=logo href= ... >
el.class  locates elements with the given class, for example div.head matches <div class="head">xxxx</div>
el[attr]  locates elements that have the given attribute, for example a[href]
Any combination of the three above, for example a[href]#logo, a[name].outerlink
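As a rough illustration of the selector syntax above, the following sketch runs several of the listed queries against a small document. The class SelectorExample, the count helper, and the sample HTML are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup.
class SelectorExample {
    // Sample markup invented for this sketch.
    static final String HTML =
          "<div class=\"head\"><a id=\"logo\" href=\"/\">Home</a></div>"
        + "<a name=\"n1\" class=\"outerlink\" href=\"http://example.org/\">Out</a>"
        + "<img src=\"photo.JPG\" width=\"500\">"
        + "<img src=\"icon.gif\" data-size=\"small\">";

    // Returns how many elements in the sample document match the CSS query.
    static int count(String cssQuery) {
        Elements found = Jsoup.parse(HTML).select(cssQuery);
        return found.size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a#logo", "div.head", "a[href]", "a[name].outerlink",
            "[^data-]", "[width=500]", "a[href^=http]",
            "img[src~=(?i)\\.(png|jpe?g)]"
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```

Note that the backslash in the regular expression has to be doubled inside a Java string literal.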

************************************************************************
Beyond the basic syntax and its combinations, jsoup also supports filtering elements with pseudo-selector expressions. Sibling indexes are zero-based and relative to the parent element.
:lt(n)  elements whose sibling index is less than n, for example td:lt(3) matches the first three cells of each row
:gt(n)  elements whose sibling index is greater than n, for example div p:gt(2) matches p elements inside a div whose index is greater than 2
:eq(n)  elements whose sibling index equals n, for example form input:eq(1) matches an input inside a form whose index is 1
:has(selector)  elements that contain a match for the selector, for example div:has(p) matches div elements containing a p element
:not(selector)  elements that do not match the selector, for example div:not(.logo) matches all div elements that do not have class="logo"
:contains(text)  elements that contain the given text, case-insensitive, for example p:contains(oschina)
:containsOwn(text)  elements whose own text (not the text of their descendants) contains the given text
:matches(regex)  elements whose text matches the regular expression, for example div:matches((?i)login)
:matchesOwn(regex)  elements whose own text matches the regular expression
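The pseudo-selectors above can be tried out with a sketch like the following. The class PseudoSelectorExample, its helpers, and the sample HTML are hypothetical and exist only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of jsoup.
class PseudoSelectorExample {
    // Sample markup invented for this sketch: one table row and two divs.
    static final String HTML =
          "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
        + "<div><p>Login here</p></div>"
        + "<div class=\"logo\"><span>brand</span></div>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    // Number of elements matching the query.
    static int count(String cssQuery) {
        return doc().select(cssQuery).size();
    }

    // Combined text of the elements matching the query.
    static String text(String cssQuery) {
        return doc().select(cssQuery).text();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(3)"));               // cells with index 0, 1, 2
        System.out.println(text("td:eq(1)"));                // the second cell
        System.out.println(count("div:has(p)"));             // divs that contain a p
        System.out.println(count("div:not(.logo)"));         // divs without class "logo"
        System.out.println(count("div:contains(login)"));    // text of descendants counts
        System.out.println(count("div:containsOwn(login)")); // only the div's own text counts
    }
}
```

The last two queries show the difference between :contains and :containsOwn: the text "Login here" belongs to the p element, not directly to the enclosing div.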

************************************************************************

Using jsoup

// URL as the input source
Document doc = Jsoup.connect("http://www.example.com").timeout(60000).get();
// File as the input source
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.example.com/");
// String as the input source
Document doc = Jsoup.parse(htmlStr);

Like JavaScript, jsoup provides the following methods:
getElementById(String id)  gets an element by ID
getElementsByTag(String tag)  gets elements by tag name
getElementsByClass(String className)  gets elements by class name
getElementsByAttribute(String key)  gets elements that have the given attribute
The following methods give access to sibling elements:
siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
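A small sketch of the DOM-style getters and sibling methods listed above. The class DomExample and the sample list markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class DomExample {
    // Sample markup invented for this sketch.
    static final String HTML =
          "<ul id=\"nav\">"
        + "<li class=\"item\">one</li>"
        + "<li class=\"item\" title=\"t\">two</li>"
        + "<li class=\"item\">three</li>"
        + "</ul>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    // The only element carrying a title attribute: the middle list item.
    static Element middleItem() {
        return doc().getElementsByAttribute("title").first();
    }

    public static void main(String[] args) {
        Document doc = doc();
        System.out.println(doc.getElementById("nav").tagName());
        System.out.println(doc.getElementsByTag("li").size());
        System.out.println(doc.getElementsByClass("item").size());

        Element middle = middleItem();
        System.out.println(middle.previousElementSibling().text());
        System.out.println(middle.nextElementSibling().text());
        System.out.println(middle.siblingElements().size());
    }
}
```

getElementById returns a single Element (or null if nothing matches), while the getElementsBy methods return an Elements collection.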

Get and set element data with the following methods:
attr(String key)  gets the value of an attribute
attr(String key, String value)  sets the value of an attribute
attributes()  gets all attributes
id(), className(), classNames()  get the ID, the class attribute value, and the set of class names
text()  gets the text content
text(String value)  sets the text content
html()  gets the inner HTML
html(String value)  sets the inner HTML
outerHtml()  gets the outer HTML (the element itself plus its contents)
data()  gets the data content, for example of a script or style element
tag() gets the Tag object and tagName() gets the tag name
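The getters and setters above might be exercised as in the sketch below. The class ElementDataExample and the sample markup are hypothetical; pretty-printing is switched off here only so that the html() and outerHtml() results are not reindented.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class ElementDataExample {
    // Sample markup invented for this sketch.
    static final String HTML =
          "<div id=\"box\" class=\"a b\">"
        + "<p>Hello <b>world</b></p>"
        + "<script>var x = 1;</script>"
        + "</div>";

    static Document doc() {
        Document doc = Jsoup.parse(HTML);
        doc.outputSettings().prettyPrint(false); // keep html()/outerHtml() output compact
        return doc;
    }

    // Sets an attribute and new text on the paragraph, then returns it.
    static Element editedParagraph() {
        Element p = doc().select("p").first();
        p.attr("title", "greeting"); // attr(key, value) sets an attribute
        p.text("Hi <there>");        // text(value) replaces the content with plain text
        return p;
    }

    public static void main(String[] args) {
        Document doc = doc();
        Element box = doc.getElementById("box");
        Element p = doc.select("p").first();
        System.out.println(box.id() + " / " + box.className() + " / " + box.classNames());
        System.out.println(p.text());      // text only, tags stripped
        System.out.println(p.html());      // inner HTML
        System.out.println(p.outerHtml()); // the <p> element itself plus its contents
        System.out.println(doc.select("script").first().data()); // script body via data()
        System.out.println(p.tagName());
    }
}
```

For script and style elements, the content is exposed through data() rather than text().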
The following methods manipulate HTML:
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)

For example:
Document doc = Jsoup.connect("http://example.com")
        .data("key1", "value1")             // multiple request parameters can be sent
        .data("key2", "value2")
        .userAgent("Mozilla")
        .cookie("cookie1", "cookieValue1")  // multiple cookies can be sent
        .cookie("cookie2", "cookieValue2")
        .timeout(3000)                      // maximum wait, in milliseconds
        .post();                            // or .get()

Note: before sending a request, work out which cookies it requires. You can inspect cookies in the Application tab of Chrome's developer tools (F12).
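The get() and post() calls block the calling thread until the response arrives. One possible way to issue such a request asynchronously is to run it on a thread pool and expose the result as a CompletableFuture. The sketch below is an assumption about how that could be done, not something jsoup provides itself: the class AsyncFetch, the fetchAsync helper, and the pool size are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of jsoup.
class AsyncFetch {
    // Runs a blocking jsoup call on the given pool and returns a future for its result.
    static CompletableFuture<Document> fetchAsync(Callable<Document> request, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return request.call();
            } catch (Exception e) {
                // surface checked exceptions (such as IOException) through the future
                throw new CompletionException(e);
            }
        }, pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is arbitrary
        try {
            CompletableFuture<Document> future = fetchAsync(
                () -> Jsoup.connect("http://example.com")
                        .data("key1", "value1")
                        .userAgent("Mozilla")
                        .cookie("cookie1", "cookieValue1")
                        .timeout(3000)
                        .post(),
                pool);

            // The calling thread is free to do other work here; the callbacks run later.
            future.thenAccept(doc -> System.out.println(doc.title()))
                  .exceptionally(err -> {
                      System.err.println("Request failed: " + err);
                      return null;
                  })
                  .join(); // wait only so the demo does not exit early
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing the request as a Callable keeps the helper independent of any particular URL, so several pages can be fetched in parallel by calling fetchAsync once per request on the same pool.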

