JAVA: parsing HTML -- jsoup

Source: Internet
Author: User
Tags html interpreter set cookie

Parsing HTML (or XML) with jsoup takes very little code. jsoup has a powerful selector engine for retrieving page elements, and it can read HTML from several sources, for example remotely from a server or from a local file.

The following are two simple examples:


// Method 1: fetch the page from a given URL
try {
    String sum_content = "";
    Document doc = Jsoup.connect("http://fashion.sina.com.cn/s/ce/2013-12-27/092831654.shtml").get();
    Element content = doc.getElementById("artibody");
    Elements tags = content.getElementsByTag("p");
    for (Element tag : tags) { // traverse all the p tags under artibody
        String attr = tag.attr("class");
        if (attr.equals("")) { // keep only paragraphs without a class attribute
            sum_content += tag.text();
        }
    }
    System.out.println(sum_content);
} catch (IOException e) {
    e.printStackTrace();
}


// Method 2: read from a local file (R.raw.sina_all_opml is a local XML file)
private List<HashMap<String, String>> getRssList(String id) {
    List<HashMap<String, String>> list = new ArrayList<HashMap<String, String>>();
    // Get the file that holds the full RSS list
    InputStream is = getResources().openRawResource(R.raw.sina_all_opml); // read the input stream
    Document doc;
    try {
        doc = Jsoup.parse(is, "UTF-8", ""); // read the file as UTF-8
        Element parent = doc.getElementById(id);
        Elements outlines = parent.getElementsByTag("outline"); // get all outline elements
        for (Element outline : outlines) {
            if (!outline.attr("text").equals("")) {
                String title = outline.attr("text"); // read the value of the "text" attribute directly
                String xmlUrl = outline.attr("xmlUrl");
                HashMap<String, String> map = new HashMap<String, String>();
                map.put("title", title);
                map.put("xmlUrl", xmlUrl);
                list.add(map);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return list;
}


The following is an excerpt from Baidu Baike, not my opinion:

The main functions of jsoup are as follows:


1. Parse HTML from a URL, file, or string;
2. Use DOM methods or CSS selectors to find and retrieve data;
3. Manipulate HTML elements, attributes, and text.
jsoup is released under the MIT license and can be safely used in commercial projects.
(Figure omitted: the main class hierarchy of jsoup.)


/**************************************




Document input


jsoup can load HTML documents from strings, URLs, and local files, producing a Document object in each case.
The following code shows each approach:
// Parse an HTML document directly from a string
String html = "<html><head><title>Open source Chinese community</title></head>"
        + "<body><p>Here is an article about the jsoup project.</p></body></html>";
Document doc = Jsoup.parse(html);
// Load an HTML document directly from a URL
doc = Jsoup.connect("http://blog.csdn.net/lyc666666666/article/details/netmask/").get();
String title = doc.title();
doc = Jsoup.connect("http://blog.csdn.net/lyc666666666/article/details/netmask/")
        .data("query", "Java")   // request parameter
        .userAgent("I'm jsoup")  // set the User-Agent
        .cookie("auth", "token") // set a cookie
        .timeout(3000)           // set the connection timeout (milliseconds)
        .post();                 // access the URL with the POST method
// Load an HTML document from a file
File input = new File("D:/test.html");
doc = Jsoup.parse(input, "UTF-8", "http://blog.csdn.net/lyc666666666/article/details/netmask/");
Note the third parameter of parse in the last input mode. Why specify a URL here? An HTML document usually contains many references, such as links, images, external scripts, and CSS files. The third parameter, named baseURL, tells jsoup what to resolve those references against: when the document refers to an external file with a relative path, jsoup can turn that path into an absolute URL by combining it with this baseURL (for example when you call absUrl("href") or read the attribute with the abs: prefix).
For example, a relative link such as <a href="/path">open-source software</a> resolves to the same path under the host of the baseURL.
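
The base URL behaviour described above can be sketched as follows. This is a minimal illustration, not code from the original article: the class name BaseUriDemo, the markup, and the http://example.com/ base URL are all assumptions. It parses a string with a base URI and asks jsoup for the absolute form of a relative link.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriDemo {
    // Parses the HTML with the given base URI and returns the first link's
    // href resolved to an absolute URL.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        // Hypothetical markup and base URL, chosen only for illustration.
        String html = "<p><a href=\"/project\">Open-source software</a></p>";
        System.out.println(firstAbsoluteLink(html, "http://example.com/"));
    }
}
```

Note that attr("href") on the same element still returns the relative value as written in the document; the base URL only comes into play when the absolute form is requested.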




**************************************** ********************************
Parse and extract HTML elements


This part covers the most basic function of an HTML parser, but jsoup takes an approach different from other open-source projects: selectors. We describe the jsoup selector in detail in the last part; in this section you will see how little code jsoup needs.
jsoup also provides the traditional DOM methods for finding elements. Consider the following code:
File input = new File("D:/test.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://blog.csdn.net/lyc666666666/article/details/netmask/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a"); // all links inside the content element
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
The jsoup methods may look familiar. That's right: getElementById and getElementsByTag have the same names and behavior as their JavaScript counterparts, returning the matching element or list of elements for a given HTML element id or tag name.
Unlike the htmlparser project, jsoup does not define a separate class for each kind of HTML element. An HTML element generally consists of a tag name, attributes, and text, and jsoup provides simple methods to retrieve each of these, which is one reason jsoup stays slim.
For element retrieval, jsoup selectors are extremely versatile:
File input = new File("D:/test.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://blog.csdn.net/lyc666666666/article/details/");
Elements links = doc.select("a[href]"); // links with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // all img elements whose src ends in .png
Element masthead = doc.select("div.masthead").first();
// the first div with class = masthead
Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of h3 class = r
This is where jsoup really impressed me. jsoup uses jQuery-style selectors to retrieve elements. With another HTML parser, each of the retrievals above would take many lines of code, whereas jsoup needs only one.
The jsoup selector also supports expressions. We introduce this more powerful selector syntax in the last section.
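
To try the selectors above without a file on disk, the following sketch runs them against a small in-memory document. The class name SelectorDemo and the sample markup are assumptions made up for this illustration; only the selector strings come from the text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Hypothetical sample page used only for this illustration.
    static final String HTML =
        "<div class=\"masthead\"><img src=\"logo.png\"><img src=\"photo.jpg\"></div>"
      + "<h3 class=\"r\"><a href=\"/first\">First</a></h3>"
      + "<p><a>No href</a> <a href=\"/second\">Second</a></p>";

    // Returns how many elements of the sample page match the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {"a[href]", "img[src$=.png]", "div.masthead", "h3.r > a"};
        for (String query : queries) {
            System.out.println(query + " matches " + count(query) + " element(s)");
        }
    }
}
```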


**************************************** ********************************
Modify data


When parsing a document, we may also need to modify some of its elements. For example, we can wrap every image in a clickable link, change link addresses, or modify text.
Below are some simple examples:
doc.select("div.comments a").attr("rel", "nofollow");
// add the rel = nofollow attribute to all links in the comments
doc.select("div.comments a").addClass("mylinkclass");
// add class = mylinkclass to all links in the comments
doc.select("img").removeAttr("onclick"); // delete the onclick attribute of all images
doc.select("input[type=text]").val(""); // clear the value of all text input boxes
The principle is very simple: find the elements with a jsoup selector, then modify them with the methods above. An element's attributes and text can both be changed, and elements can be removed or new ones inserted.
After the modification, call the html() method of Element(s) to obtain the modified HTML.
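
As a rough end-to-end sketch of the select-then-modify pattern, the code below applies the same calls to an in-memory fragment and returns the resulting HTML. The class name ModifyDemo, the method name tidyComments, and the sample markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifyDemo {
    // Applies the modifications discussed above and returns the body HTML.
    static String tidyComments(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("div.comments a").attr("rel", "nofollow"); // mark comment links
        doc.select("img").removeAttr("onclick");              // drop inline handlers
        doc.select("input[type=text]").val("");               // clear text inputs
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Hypothetical comment markup.
        String html = "<div class=\"comments\">"
                + "<a href=\"/u/1\">user</a>"
                + "<img src=\"a.png\" onclick=\"track()\">"
                + "<input type=\"text\" value=\"draft\">"
                + "</div>";
        System.out.println(tidyComments(html));
    }
}
```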


**************************************** ********************************
HTML document cleanup


Besides its powerful API, jsoup is also thoughtful about practical needs. Websites often let users submit comments. Some mischievous users may put scripts into their comments, and these scripts can break the behavior of the whole page or, more seriously, steal confidential information, as in cross-site scripting (XSS) attacks.
jsoup handles this case well and is easy to use. Take a look at the following code:
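// illustrative input: a link that carries an onclick handler
String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Open source Chinese community</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// Output:
// <p><a href="http://example.com/" rel="nofollow">Open source Chinese community</a></p>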

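jsoup uses a Whitelist class to filter HTML documents. This class provides several ready-made filters:

1. none(): allows only text nodes; all HTML is stripped;
2. simpleText(): allows only simple text formatting (b, em, i, strong, u);
3. basic(): allows a fuller range of text markup such as a, blockquote, code, li, ol, p, pre, and ul, with rel=nofollow enforced on links, but no images;
4. basicWithImages(): the same as basic(), plus img tags;
5. relaxed(): allows a full range of text and structural body HTML, including tables, headings, div, and img.

If none of these five filters meets your requirements, for example because you want to let users insert Flash animations, Whitelist can be extended: whitelist.addTags("embed", "object", "param", "span", "div"); You can also use addAttributes to allow additional attributes on specific elements.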
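
The following sketch extends a ready-made filter and cleans a hostile comment with it. It assumes a recent jsoup release, where the class is named Safelist (from jsoup 1.14.2 on); with older releases, substitute Whitelist, the name used in the text above. The class name CleanDemo and the sample input are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanDemo {
    // Cleans user-supplied HTML with the basic filter, extended to keep span and div.
    static String cleanComment(String unsafe) {
        Safelist allowed = Safelist.basic().addTags("span", "div");
        return Jsoup.clean(unsafe, allowed);
    }

    public static void main(String[] args) {
        // Hypothetical comment containing an inline handler and a script element.
        String unsafe = "<div><p onclick=\"steal()\">Nice <span>post</span>!</p>"
                + "<script>steal()</script></div>";
        System.out.println(cleanComment(unsafe));
    }
}
```

The cleaned result keeps the div, p, and span elements but drops the onclick attribute and the script element, since neither is on the list of allowed markup.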




**************************************** ********************************
The uniqueness of jsoup -- Selector


We have briefly introduced how jsoup uses selectors to retrieve elements. This section focuses on the powerful syntax of the selector. The following table lists all the syntax details of the jsoup selector.
Basic usage


These are the most basic selector syntaxes. These syntaxes can also be used in combination. The following is a combination of syntaxes supported by jsoup:


In addition to some basic syntaxes and combinations of these syntaxes, jsoup also supports filtering and selecting elements using expressions. The following is a list of all expressions supported by jsoup:
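
As a small, non-exhaustive illustration of combined selectors and expressions, the sketch below uses a few forms from jsoup's selector syntax: the child combinator, an attribute-prefix match, grouping with a comma, and the :has, :contains, and :eq expressions. The class name ExpressionDemo and the sample markup are assumptions for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ExpressionDemo {
    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
        "<ul id=\"menu\">"
      + "<li class=\"item\"><a href=\"http://example.com/a\">Alpha</a></li>"
      + "<li class=\"item\"><a href=\"/b\">Beta</a></li>"
      + "<li>Gamma</li>"
      + "</ul>";

    // Returns how many elements of the sample markup match the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "#menu > li",         // li elements that are direct children of #menu
            "a[href^=http]",      // links whose href starts with "http"
            "li.item, a",         // grouping: anything matching either selector
            "li:has(a)",          // li elements that contain a link
            "li:contains(gamma)", // li elements whose text contains "gamma" (case-insensitive)
            "li:eq(2)"            // li elements at sibling index 2 (zero-based)
        };
        for (String query : queries) {
            System.out.println(query + " matches " + count(query) + " element(s)");
        }
    }
}
```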


**************************************** *********************************/



This is the official api: http://jsoup.org/apidocs/


This is a jar package: http://pan.baidu.com/s/1gd3fwjh

