Parsing HTML Files in Java with the Open Source Jsoup Library: Examples

Last Update:2017-01-19 Source: Internet

Author: User

HTML is the core of the web: every page you see on the Internet is HTML, whether it is generated dynamically by JavaScript, JSP, PHP, ASP, or some other web technology. Your browser parses the HTML and renders it for you. But what if you need to parse an HTML document in a Java program and look for certain elements, tags, or attributes, or check whether a particular element exists? If you have programmed in Java for years, you have probably parsed XML with DOM or SAX parsers, but you may never have done any HTML parsing. Ironically, outside of servlets and other Java web technologies, Java applications rarely need to parse HTML documents. Worse still, the core JDK does not include an HTTP or HTML library, at least none that I know of. That is why, when it comes to parsing HTML files, many Java programmers have to Google first to find out how to get the value of an HTML tag in Java. When I had this need, I was sure some open source library would do the job, but I did not expect one as cool and full-featured as Jsoup. It not only reads and parses HTML documents, it also lets you extract any element from an HTML file, along with its attributes and CSS classes, and modify them. With Jsoup you can do almost anything with an HTML document. We'll see an example of how to download and parse HTML from the Google homepage or any other URL in Java.

What's the Jsoup library?

Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers such as Chrome and Firefox do. Here are some useful features of the Jsoup library:

1. Jsoup can retrieve and parse HTML from a URL, file, or string.
2. Jsoup can find and extract data, using DOM traversal or CSS selectors.
3. Jsoup can modify HTML elements, attributes, and text.
4. Jsoup can clean user-submitted content against a safe whitelist to prevent XSS attacks.
5. Jsoup can also output tidy HTML.
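
Features 2 to 4 above can be sketched roughly as follows. This is a minimal illustration, not code from the original article: the class name, the sample markup, and the URLs are made up for the example, and it assumes a recent Jsoup release in which the whitelist class is named Safelist (older releases call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupFeatureSketch {

    // Hypothetical sample markup, not taken from the article.
    static final String SAMPLE =
            "<html><head><title>Demo</title></head><body>"
            + "<div id=\"links\"><a href=\"/docs\">Docs</a> <a href=\"/faq\">FAQ</a></div>"
            + "</body></html>";

    // Feature 2: find elements with a CSS selector and extract their text.
    static String linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        return String.join(",", doc.select("div#links a[href]").eachText());
    }

    // Feature 3: change an attribute and the text of the first link.
    static String retargetFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst("a[href]");
        first.attr("href", "/manual").text("Manual");
        return first.outerHtml();
    }

    // Feature 4: drop any markup that is not on the allow-list.
    // Assumption: Safelist is available (newer Jsoup); older versions use Whitelist.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(linkTexts(SAMPLE));
        System.out.println(retargetFirstLink(SAMPLE));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Each helper parses the input afresh, so the modification in retargetFirstLink does not affect the other calls.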

Jsoup is designed to deal with all varieties of HTML found in the wild, from pristine, validating markup to incomplete, invalid tag soup. One of its core strengths is robustness.
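
As a small illustration of that robustness, the sketch below feeds Jsoup a fragment with two unclosed paragraph tags and no html, head, or body. The class and method names are hypothetical and chosen only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParsingSketch {

    // Counts the <p> elements Jsoup builds from the given markup.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Returns the text of the first <p> element in the parsed document.
    static String firstParagraphText(String html) {
        return Jsoup.parse(html).select("p").first().text();
    }

    // Returns true when the parsed document has both a head and a body.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        String broken = "<p>Java <p>Scala"; // two paragraphs, neither one closed
        System.out.println(paragraphCount(broken));
        System.out.println(firstParagraphText(broken));
        System.out.println(hasHeadAndBody(broken));
        // Print the repaired body markup to see how the parser closed the tags.
        System.out.println(Jsoup.parse(broken).body().html());
    }
}
```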

Using Jsoup in Java for HTML parsing

In this Java HTML parsing tutorial, we will see three different examples of parsing and traversing HTML with Jsoup. In the first example, we parse an HTML string whose content is given as a Java string literal. In the second example, we download an HTML document from the web, and in the third example, we load a sample HTML file, login.html, for parsing. This file is a sample HTML document with a title tag and, in its body, a div that contains a form. The form has input tags for the username and password, plus submit and reset buttons for further action. It is proper, valid HTML, that is, all tags and attributes are properly closed. Here is our sample HTML file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Login Page</title>
</head>
<body>
<div id="login" class="simple">
<form action="login.do">
Username : <input id="username" type="text" /><br>
Password : <input id="password" type="password" /><br>
<input id="submit" type="submit" />
<input id="reset" type="reset" />
</form>
</div>
</body>
</html>

Parsing HTML with Jsoup is very simple: you call the static method Jsoup.parse() and pass it your HTML string. Jsoup provides several overloaded parse() methods that can read HTML from a string, a file, a base URI, a URL, or even an InputStream. If the content is not UTF-8 encoded, you can also specify the character encoding so that the HTML is read correctly. The parse(String html) method parses the input HTML into a new Document. In Jsoup, Document extends Element, which in turn extends Node. TextNode also extends Node. As long as you pass in a non-null string, you are guaranteed a successful, sensible parse, with a Document that contains at least a head and a body element. Once you have this Document, you can get the data you want by calling the appropriate methods on Document and on its parent classes Element and Node.
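
A few of those overloads and guarantees can be sketched as follows. The class name, helper names, sample markup, and the example.com base URI are assumptions made for illustration. The sketch wraps the checked IOException so that the helpers are easy to call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ParseOverloadsSketch {

    // parse(String html, String baseUri): the base URI lets Jsoup resolve relative links.
    static String absoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.selectFirst("a").absUrl("href");
    }

    // parse(InputStream in, String charsetName, String baseUri): decode bytes with an explicit charset.
    static String titleFromStream(byte[] bytes, String charsetName) {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, charsetName, "").title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Even an empty string yields a Document with head and body elements.
    static boolean emptyInputStillHasBody() {
        Document doc = Jsoup.parse("");
        Element asElement = doc;  // Document extends Element
        Node asNode = asElement;  // Element extends Node
        return doc.head() != null && doc.body() != null && asNode.childNodeSize() > 0;
    }

    public static void main(String[] args) {
        System.out.println(absoluteLink("<a href='/about'>About</a>", "https://example.com/"));
        byte[] latin1 = "<title>Caf\u00e9</title>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(titleFromStream(latin1, "ISO-8859-1"));
        System.out.println(emptyInputStillHasBody());
    }
}
```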

Java programs that parse HTML documents

The following is a complete Java program that parses an HTML string, an HTML file downloaded from the web, and an HTML file from the local file system. You can run it from the Eclipse IDE or any other IDE, or even from the command line, with the Jsoup JAR on the classpath. In Eclipse it is very easy: copy this code, create a new Java project, right-click the src folder, and paste. Eclipse will create the proper package and a Java source file with the same name, so there is very little work. If you already have a sample Java project, it takes just one step.

The program shows three different examples of parsing and traversing HTML. In the first example, we parse a string that contains HTML directly. In the second, we parse an HTML file downloaded from a URL, and in the third we load an HTML document from the local file system and parse it. The first and third examples use the parse method to obtain a Document object, which you can query to extract any tag value or attribute value. The second example uses the Jsoup.connect method, which creates a connection to the URL, downloads the HTML, and parses it. This method also returns a Document, which can be used for subsequent queries and for getting the value of any tag or attribute.

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Java program to parse/read HTML documents from a String, a URL and a File
 * using the Jsoup library. Jsoup is an open source library which allows Java
 * developers to parse HTML files and extract elements, manipulate data and
 * change style using DOM, CSS and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String args[]) {

        // Parse HTML String using Jsoup library
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>Jsoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></td></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to Jsoup: " + htmlString);
        System.out.println("After parsing, title: " + title);
        System.out.println("After parsing, heading: " + h1);

        // Jsoup Example 2 - Reading HTML page from URL
        Document doc;
        try {
            doc = Jsoup.connect("http://google.com/").get();
            title = doc.title();
        } catch (IOException e) {
            // On failure, title keeps the value from the first example
            e.printStackTrace();
        }

        System.out.println("Jsoup can read HTML page from URL, title: " + title);

        // Jsoup Example 3 - Parsing an HTML file in Java
        // Wrong: this would parse the literal text "login.html" as HTML
        // Document htmlFile = Jsoup.parse("login.html", "ISO-8859-1");
        Document htmlFile = null;
        try {
            // Right: pass a File and the charset used by the file
            htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to query if the file could not be read
        }
        title = htmlFile.title();
        Element div = htmlFile.getElementById("login");
        String cssClass = div.className(); // getting class from HTML element

        System.out.println("Jsoup can also parse HTML file directly");
        System.out.println("title: " + title);
        System.out.println("Class of div tag: " + cssClass);
    }

}

Output:

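Input HTML String to Jsoup: <!DOCTYPE html><html><head><title>Jsoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></td></tr></table></body></html>
After parsing, title: Jsoup Example
After parsing, heading: HelloWorld
Jsoup can read HTML page from URL, title: Google
Jsoup can also parse HTML file directly
title: Login Page
Class of div tag: simple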

The advantage of Jsoup is that it is robust. The Jsoup HTML parser makes every attempt to create a clean parse from the HTML you provide, regardless of whether that HTML is well-formed. It handles unclosed tags (e.g., <p>Java <p>Scala is parsed as <p>Java</p> <p>Scala</p>) and implicit tags (e.g., a naked <td>Java is great</td> is wrapped into a <table><tr><td>), and it reliably creates the document structure (an html element containing a head and a body, with only appropriate elements within the head).

That is all on how to parse HTML in Java. Jsoup is an excellent, robust open source library that makes it easy to read HTML documents, body fragments, and HTML strings, and to parse HTML content directly from the web. In this article, we learned how to get the value of a specific HTML tag in Java: in the first example we extracted the text of the title and h1 tags, and in the third example we learned how to get an attribute value from an HTML tag by extracting the CSS class of the div. In addition to powerful jQuery-style calls such as html.body().getElementsByTag("h1").text(), which you can use to extract any HTML tag, Jsoup also provides convenience methods such as Document.title() and Element.className() for quick access to the title and the CSS class. Have fun with Jsoup; we will see more examples of this API soon.
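
As one more illustration of extracting arbitrary tags and attribute values, the sketch below reads the id and type of every input in the login form, along with the form's action. It is an assumption-laden example rather than part of the original program: the class and method names are hypothetical, and an abbreviated copy of the login.html form is inlined as a string so that no file is needed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FormInputSketch {

    // Abbreviated copy of the login form from the sample file, inlined for the sketch.
    static final String LOGIN_FORM =
            "<div id=\"login\" class=\"simple\"><form action=\"login.do\">"
            + "<input id=\"username\" type=\"text\"/>"
            + "<input id=\"password\" type=\"password\"/>"
            + "<input id=\"submit\" type=\"submit\"/>"
            + "</form></div>";

    // Maps each input's id attribute to its type attribute, in document order.
    static Map<String, String> inputTypesById(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> types = new LinkedHashMap<>();
        for (Element input : doc.select("div#login form input[id]")) {
            types.put(input.id(), input.attr("type"));
        }
        return types;
    }

    // Reads the action attribute of the first form in the document.
    static String formAction(String html) {
        return Jsoup.parse(html).selectFirst("form").attr("action");
    }

    public static void main(String[] args) {
        System.out.println(inputTypesById(LOGIN_FORM)); // {username=text, password=password, submit=submit}
        System.out.println(formAction(LOGIN_FORM));     // login.do
    }
}
```

A LinkedHashMap is used so that the inputs come back in the order they appear in the form.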

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
