Parsing HTML using Jsoup
Jsoup parses HTML, so we first have to fetch the HTML with HttpClient.
For that we add the HttpClient-related jar packages to the project,
along with the Commons IO jar package.
We write the basic HttpClient code to fetch the page, then parse the page content to get a Document object.
From the Document we get the title and the DOM element with a specified ID.
Code example:
package com.zhi.jsoup1;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Demo {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // 1. create the client instance
        HttpGet httpGet = new HttpGet("https://home.cnblogs.com/u/mengxinrenyu/"); // 2. create the GET request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0");
        CloseableHttpResponse httpResponse = httpClient.execute(httpGet); // 3. execute the request
        HttpEntity entity = httpResponse.getEntity(); // 4. get the response entity
        String content = EntityUtils.toString(entity, "utf-8"); // 5. read the page content
        httpResponse.close();
        httpClient.close();

        Document doc = Jsoup.parse(content); // parse the page to get a Document object
        Elements elements = doc.getElementsByTag("title"); // all DOM elements whose tag is "title"
        Element element = elements.get(0); // the first matching element
        String title = element.text(); // the text of the element
        System.out.println("title: " + title);

        element = doc.getElementById("top_left"); // the DOM element with id=top_left
        String menu = element.text(); // the text of the element
        System.out.println("Navigation: " + menu);
    }
}
The page I requested is only shown after logging in, so running this code fails with an error.
Because the request is for a page under a logged-in account, the site returns a page that prompts for login instead. That page has no element with the requested ID, so getElementById returns null and calling text() on it throws a NullPointerException (NPE).
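One way to avoid the NullPointerException is to check the result of getElementById before calling text() on it. The following is only a minimal sketch, not the original author's code: the class name NullSafeLookup, the helper textById, and the inline HTML standing in for a login page are all made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NullSafeLookup {

    // Hypothetical helper: returns the text of the element with the given id,
    // or the fallback string when the page has no such element.
    static String textById(Document doc, String id, String fallback) {
        Element element = doc.getElementById(id); // null when the id is absent
        return element != null ? element.text() : fallback;
    }

    public static void main(String[] args) {
        // Made-up stand-in for a login page that lacks the "top_left" element.
        String loginPage = "<html><head><title>Login</title></head>"
                + "<body><form id=\"login\"></form></body></html>";
        Document doc = Jsoup.parse(loginPage);

        // Prints the fallback instead of throwing an NPE.
        System.out.println("Navigation: " + textById(doc, "top_left", "(not found)"));
    }
}
```

With a guard like this the program keeps running when the expected element is missing, and the fallback text makes it obvious that the page was not the one we expected.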
Let's try a different page instead: the news page.
Code example:
// same package and imports as the previous example
public class Demo {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // 1. create the client instance
        HttpGet httpGet = new HttpGet("https://news.cnblogs.com/"); // 2. create the GET request
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0");
        CloseableHttpResponse httpResponse = httpClient.execute(httpGet); // 3. execute the request
        HttpEntity entity = httpResponse.getEntity(); // 4. get the response entity
        String content = EntityUtils.toString(entity, "utf-8"); // 5. read the page content
        httpResponse.close();
        httpClient.close();

        Document doc = Jsoup.parse(content); // parse the page to get a Document object
        Elements elements = doc.getElementsByTag("title"); // all DOM elements whose tag is "title"
        Element element = elements.get(0); // the first matching element
        String title = element.text(); // the text of the element
        System.out.println("title: " + title);

        element = doc.getElementById("top_mini_nav_block"); // the DOM element with id=top_mini_nav_block
        String menu = element.text(); // the text of the element
        System.out.println("Navigation: " + menu);
    }
}
Run the program to print the page title and the navigation text.
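The parsing step can also be tried without any network request, which makes it easier to experiment. The sketch below is an illustration under stated assumptions rather than part of the original program: the class name OfflineParseDemo, the two helper methods, and the inline HTML are hypothetical, and it swaps in Jsoup's doc.title() and selectFirst with a CSS id selector in place of getElementsByTag and getElementById. It assumes a jsoup release recent enough to include selectFirst.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OfflineParseDemo {

    // Hypothetical helper: the text of the <title> element, or "" if there is none.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Hypothetical helper: the text of the element with id=top_mini_nav_block,
    // or "" when the page has no such element.
    static String navTextOf(String html) {
        Document doc = Jsoup.parse(html);
        Element nav = doc.selectFirst("#top_mini_nav_block"); // null if no match
        return nav == null ? "" : nav.text();
    }

    public static void main(String[] args) {
        // Made-up HTML that only mimics the structure the example looks for.
        String html = "<html><head><title>News</title></head><body>"
                + "<div id=\"top_mini_nav_block\">"
                + "<a href=\"/\">Home</a> <a href=\"/n/\">Latest</a>"
                + "</div></body></html>";

        System.out.println("title: " + titleOf(html));
        System.out.println("Navigation: " + navTextOf(html));
    }
}
```

Once the lookups behave as expected on a small inline string, the same calls can be applied to the content fetched with HttpClient.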
Jsoup code example: parsing a web page and extracting text