Jsoup Web Scraping

Source: Internet
Author: User

Jsoup Introduction

Jsoup is an HTML parser that can parse both URLs and raw HTML text. It offers a jQuery-like API for finding data through the DOM and manipulating it. To use it, add the Jsoup jar to your project.

Jsoup can load HTML documents from strings, URLs, and local files, build Document objects from them, and manipulate the data in those documents.

For example (each snippet is a separate alternative):

// From a URL
Document doc = Jsoup.connect("http://www.cnblogs.com/wishyouhappy").get();

// From an HTML string
String html = "...";
Document doc = Jsoup.parse(html);

// From a file; the third parameter is the base URI
File input = new File("d:/test.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.cnblogs.com/wishyouhappy");
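The string-loading case can be tried without any network access. The sketch below is illustrative only: the sample markup and the class name `ParseFromStringDemo` are assumptions, not part of the original article. It shows how the base URI passed to `Jsoup.parse` is used to resolve a relative link.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseFromStringDemo {
    // Hypothetical sample markup, not taken from the article.
    static final String HTML =
        "<html><head><title>Sample page</title></head>"
        + "<body><a id=\"home\" href=\"/wishyouhappy\">Home</a></body></html>";

    static Document load() {
        // The second argument is the base URI used to resolve relative links.
        return Jsoup.parse(HTML, "http://www.cnblogs.com/");
    }

    public static void main(String[] args) {
        Document doc = load();
        Element link = doc.getElementById("home");
        System.out.println(doc.title());          // page title
        System.out.println(link.attr("href"));    // href exactly as written in the markup
        System.out.println(link.absUrl("href"));  // href resolved against the base URI
    }
}
```

Without a base URI, `absUrl` has nothing to resolve a relative link against, which is why the file and string overloads accept one.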

Data manipulation example:

Document doc = Jsoup.connect("http://www.cnblogs.com/wishyouhappy").get();
System.out.println(doc.title());

Common functions

Parsing:

static Document parse(File in, String charsetName)
static Document parse(File in, String charsetName, String baseUri)
static Document parse(InputStream in, String charsetName, String baseUri)
static Document parse(String html)
static Document parse(String html, String baseUri)
static Document parse(URL url, int timeoutMillis)
static Document parseBodyFragment(String bodyHtml)
static Document parseBodyFragment(String bodyHtml, String baseUri)
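As a rough illustration of the fragment overloads, the sketch below parses a partial snippet rather than a full page. The fragment text and the class name `ParseVariantsDemo` are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseVariantsDemo {
    // Hypothetical fragment, not taken from the article.
    static final String FRAGMENT = "<p>Hello <a href=\"post/1\">post</a></p>";

    static Document asFragment() {
        // parseBodyFragment places the markup inside the body of an otherwise empty document.
        return Jsoup.parseBodyFragment(FRAGMENT, "http://www.cnblogs.com/wishyouhappy/");
    }

    public static void main(String[] args) {
        Document doc = asFragment();
        // Number of top-level elements that ended up in the body.
        System.out.println(doc.body().children().size());
        // Relative link resolved against the base URI given above.
        System.out.println(doc.select("a").first().absUrl("href"));
    }
}
```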

URL connections:

Connection connect(String url)                 // create a connection to the given URL (must be http or https)
Connection cookie(String name, String value)   // set a cookie to send with the request
Connection data(Map<String, String> data)      // pass request parameters
Connection data(String... keyvals)             // pass request parameters
Document get()                                 // send the request as GET and parse the result
Document post()                                // send the request as POST and parse the result
Connection userAgent(String userAgent)         // set the User-Agent request header
Connection header(String name, String value)   // add a request header
Connection referrer(String referrer)           // set the request referrer
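The connection methods chain, so a request can be configured fully before it is sent. The following sketch only builds the request and inspects it, so it needs no network. The User-Agent string, the extra header, and the class name `ConnectionSetupDemo` are assumptions chosen for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ConnectionSetupDemo {
    static Connection build() {
        return Jsoup.connect("http://www.cnblogs.com/wishyouhappy")
                .userAgent("Mozilla/5.0 (jsoup example)") // hypothetical User-Agent string
                .referrer("http://www.cnblogs.com/")
                .header("Accept-Language", "en")          // hypothetical extra header
                .cookie("auth", "token")
                .data("query", "Java")
                .timeout(3000);                           // milliseconds
    }

    public static void main(String[] args) {
        // request() exposes what has been configured so far without sending anything.
        Connection.Request req = build().request();
        System.out.println(req.url());
        System.out.println(req.timeout());
        System.out.println(req.cookie("auth"));
        System.out.println(req.header("User-Agent"));
        // Calling build().get() or build().post() would send the request and
        // return the parsed Document; both can throw IOException.
    }
}
```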

Getting HTML elements:

getElementById(String id)              // get an element by ID
getElementsByTag(String tag)           // get elements by tag name
getElementsByClass(String className)   // get elements by class
getElementsByAttribute(String key)     // get elements that have the given attribute
siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
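A small sketch of these lookup and sibling methods on an in-memory document follows. The list markup and the class name `FindElementsDemo` are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FindElementsDemo {
    // Hypothetical sample markup: three list items with different attributes.
    static Document sample() {
        return Jsoup.parse(
            "<ul>"
            + "<li id=\"one\" class=\"item\">A</li>"
            + "<li class=\"item\">B</li>"
            + "<li data-x=\"1\">C</li>"
            + "</ul>");
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element first = doc.getElementById("one");

        System.out.println(doc.getElementsByTag("li").size());           // all <li> elements
        System.out.println(doc.getElementsByClass("item").size());       // elements with class "item"
        System.out.println(doc.getElementsByAttribute("data-x").size()); // elements that have data-x
        System.out.println(first.nextElementSibling().text());           // the item right after #one
        System.out.println(first.lastElementSibling().text());           // the last item in the list
        System.out.println(first.siblingElements().size());              // siblings, excluding #one itself
    }
}
```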

Getting and setting element values:

attr(String key)                 // get an attribute value
attr(String key, String value)   // set an attribute value
attributes()                     // get all attributes
id(), className(), classNames()
text()                           // get the text value
text(String value)               // set the text value
html()                           // get the inner HTML
html(String value)               // set the inner HTML
outerHtml()
data()
tag()                            // get the Tag
tagName()                        // get the tag name
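The getter and setter pairs share a name and differ only in their arguments. A minimal sketch, using made-up markup and a hypothetical class name `ElementValuesDemo`:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementValuesDemo {
    // Returns the single <a> element from a hypothetical fragment.
    static Element link() {
        Document doc = Jsoup.parseBodyFragment(
            "<div id=\"box\"><a class=\"ext big\" href=\"/old\">old <b>link</b></a></div>");
        return doc.select("a").first();
    }

    public static void main(String[] args) {
        Element a = link();
        System.out.println(a.attr("href"));   // read an attribute
        System.out.println(a.text());         // combined text of the element and its children
        System.out.println(a.html());         // inner HTML, including the <b> child
        System.out.println(a.classNames());   // class attribute split into a set

        a.attr("href", "/new");               // overwrite the attribute
        a.text("plain");                      // replace all children with plain text
        System.out.println(a.outerHtml());    // the element itself plus its content
        System.out.println(a.tagName());
    }
}
```

Note that `text(String)` discards any child elements, so the `<b>` tag is gone after the call.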

Adding elements:

append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
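To see how the append and prepend variants order their results, here is a short sketch that starts from an empty `div`. The markup and the class name `AppendDemo` are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class AppendDemo {
    static Element build() {
        // Hypothetical starting point: an empty div.
        Element div = Jsoup.parseBodyFragment("<div id=\"list\"></div>").getElementById("list");

        div.appendElement("p").text("second");  // new <p> added as the last child
        div.prependElement("p").text("first");  // new <p> added as the first child
        div.append("<span>tail</span>");        // parsed HTML added at the end
        div.prependText("intro ");              // text node added at the start; not an element
        return div;
    }

    public static void main(String[] args) {
        Element div = build();
        System.out.println(div.children().size());  // element children only
        System.out.println(div.child(0).text());
        System.out.println(div.child(2).tagName());
        System.out.println(div.outerHtml());
    }
}
```

`appendElement` and `prependElement` return the newly created child, which is why `.text(...)` can be chained directly onto them.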

Selector:

tagname          Find elements by tag name, for example a
ns|tag           Find elements by tag in a namespace, for example fb|name finds <fb:name> elements
#id              Find an element by ID, for example #logo
.class           Find elements by class name, for example .head
[attribute]      Find elements that have an attribute, for example [href] finds all elements with an href attribute
[^attr]          Find elements by attribute name prefix, for example [^data-] finds elements with HTML5 dataset attributes
[attr=value]     Find elements by attribute value, for example [width=500] finds all elements whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value]   Find elements whose attribute value starts with, ends with, or contains value, respectively
[attr~=regex]    Find elements whose attribute value matches a regular expression, for example img[src~=(?i)\.(png|jpe?g)]
*                Find all elements

el#id            Element with the given ID, for example a#logo matches <a id=logo href=...>
el.class         Element with the given class, for example div.head matches <div class="head">xxxx</div>
el[attr]         Element with the given attribute, for example a[href]
Any combination of the above three, for example a[href]#logo, a[name].outerlink

The following selectors combine elements by their relationship in the document tree:
ancestor child        child elements that descend from ancestor at any depth
parent > child        child elements that are direct children of parent
siblingA + siblingB   siblingB elements immediately preceded by siblingA
siblingA ~ siblingX   siblingX elements preceded by siblingA
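The selector forms listed so far can be checked against a small in-memory document. In this sketch the markup, the class name `SelectorSyntaxDemo`, and the `count` helper are all hypothetical; only `Jsoup.parse` and `select` come from Jsoup.

```java
import org.jsoup.Jsoup;

class SelectorSyntaxDemo {
    // Hypothetical sample markup covering IDs, classes, attributes, and siblings.
    static final String HTML =
        "<div id=\"nav\" class=\"head\">"
        + "<ul><li><a id=\"logo\" href=\"/a\">A</a></li></ul>"
        + "<a name=\"x\" class=\"outerlink\">B</a>"
        + "</div>"
        + "<h1>T</h1>"
        + "<p width=\"500\">one</p>"
        + "<p data-id=\"7\">two</p>"
        + "<img src=\"pic.PNG\"><img src=\"doc.pdf\">";

    // Helper for this example: how many elements does the query match?
    static int count(String query) {
        return Jsoup.parse(HTML).select(query).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a#logo", "div.head", "a[href]", "a[name].outerlink",
            "[^data-]", "[width=500]",
            "img[src~=(?i)\\.(png|jpe?g)]",
            "div a", "div > a", "h1 + p", "h1 ~ p"
        };
        for (String q : queries) {
            System.out.println(q + " matches " + count(q));
        }
    }
}
```

The pair `div a` and `div > a` is the one worth comparing: the first also reaches the link nested inside the list, while the second only sees the link that is a direct child of the `div`.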

:lt(n)               Elements whose sibling index (zero-based) is less than n; for example, td:lt(3) matches the first three cells of each row
:gt(n)               Elements whose sibling index is greater than n; for example, div p:gt(2) matches p elements inside a div whose sibling index is greater than 2
:eq(n)               Elements whose sibling index equals n; for example, form input:eq(1) matches an input inside a form whose sibling index is 1
:has(selector)       Elements that contain an element matching the selector; for example, div:has(p) matches every div that contains a p element
:not(selector)       Elements that do not match the selector; for example, div:not(.logo) matches every div that does not have class="logo"
:contains(text)      Elements that contain the given text, case-insensitive; for example, p:contains(oschina)
:containsOwn(text)   Elements whose own text (not the text of their descendants) contains the given text
:matches(regex)      Elements whose text matches the regular expression; for example, div:matches((?i)login)
:matchesOwn(regex)   Elements whose own text matches the regular expression
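The index-based and text-based pseudo selectors are easy to misread, so a quick experiment helps. The sketch below uses made-up markup and a hypothetical class `PseudoSelectorDemo` with a `count` helper; it is meant as a way to explore the behaviour, not as a reference.

```java
import org.jsoup.Jsoup;

class PseudoSelectorDemo {
    // Hypothetical sample markup: two divs and a one-row table.
    static final String HTML =
        "<div class=\"logo\"><span>brand</span></div>"
        + "<div><p>Login here</p></div>"
        + "<table><tr><td>1</td><td>2</td><td>3</td><td>4</td></tr></table>";

    // Helper for this example: how many elements does the query match?
    static int count(String query) {
        return Jsoup.parse(HTML).select(query).size();
    }

    // Helper for this example: combined text of the matched elements.
    static String text(String query) {
        return Jsoup.parse(HTML).select(query).text();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(3)"));                // cells with sibling index 0, 1, 2
        System.out.println(text("td:gt(2)"));                 // cell with sibling index 3
        System.out.println(text("td:eq(1)"));                 // cell with sibling index 1
        System.out.println(count("div:has(p)"));              // divs that contain a <p>
        System.out.println(count("div:not(.logo)"));          // divs without class "logo"
        System.out.println(count("p:contains(login)"));       // case-insensitive text match
        System.out.println(count("div:containsOwn(login)"));  // the text belongs to <p>, not the div
        System.out.println(count("div:matches((?i)login)"));  // regex over the div's full text
    }
}
```

The contrast between `:contains` and `:containsOwn` shows up on the second `div`: its text comes entirely from the nested `p`, so the "own" variant does not match it.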

Example:

package jsoup;

/**
 * Created by: Wish
 * Created: June 13, 2014 PM 1:22:49
 */
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BlogCatch {

    /**
     * Main
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // getArticleTitle("http://www.cnblogs.com/wishyouhappy");
        Document doc = Jsoup.connect("http://www.cnblogs.com/wishyouhappy")
                .data("query", "Java")     // request parameters
                .userAgent("I'm Jsoup")    // set the User-Agent
                .cookie("auth", "token")   // set a cookie
                .timeout(3000)             // set the connection timeout
                .post();
        System.out.println(doc.title());
    }

    /**
     * Gets the body of an HTML document passed in as a string.
     */
    @SuppressWarnings("unused")
    private static void getBlogBodyByString(String html) {
        Document doc = Jsoup.parse(html);
        System.out.println(doc.body());
    }

    /**
     * Gets the document body by URL.
     * @param url
     * @throws IOException
     */
    @SuppressWarnings("unused")
    private static void getBlogBodyByUrl(String url) throws IOException {
        // Load the HTML document directly from the URL
        Document doc2 = Jsoup.connect(url).get();
        String title = doc2.body().toString();
        System.out.println(title);
    }

    /**
     * Gets the post titles and links on the blog.
     * @param url
     */
    public static void getArticleTitle(String url) {
        Document doc;
        try {
            doc = Jsoup.connect(url).get();
            Elements listDiv = doc.getElementsByAttributeValue("class", "postTitle");
            for (Element element : listDiv) {
                Elements links = element.getElementsByTag("a");
                for (Element link : links) {
                    String linkHref = link.attr("href");
                    String linkText = link.text().trim();
                    System.out.println(linkHref);
                    System.out.println(linkText);
                }
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    /**
     * Gets the content of the specified blog post.
     * @param url
     */
    public static void getBlog(String url) {
        Document doc;
        try {
            doc = Jsoup.connect(url).get();
            Elements listDiv = doc.getElementsByAttributeValue("class", "postBody");
            for (Element element : listDiv) {
                System.out.println(element.html());
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
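The title-and-link extraction in the example can also be written with a single selector instead of nested loops. The sketch below is one possible variant, not the author's code: it runs against hypothetical markup that imitates a post-title block, so the class names in the HTML and the class `PostTitleSelectorDemo` are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class PostTitleSelectorDemo {
    // Hypothetical markup imitating a blog index page.
    static final String HTML =
        "<div class=\"postTitle\"><a href=\"/p/1.html\"> First post </a></div>"
        + "<div class=\"postTitle\"><a href=\"/p/2.html\">Second post</a></div>"
        + "<div class=\"postBody\"><a href=\"/other\">ignored</a></div>";

    static String titles() {
        Document doc = Jsoup.parse(HTML, "http://www.cnblogs.com/");
        StringBuilder out = new StringBuilder();
        // One selector replaces the lookup by class plus the inner lookup by tag.
        for (Element link : doc.select("div.postTitle a[href]")) {
            out.append(link.text().trim())
               .append(" -> ")
               .append(link.absUrl("href"))
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(titles());
    }
}
```

Against a live page, the same selector could be applied to the Document returned by `Jsoup.connect(url).get()`, provided the page actually uses that class name.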
