HtmlCleaner usage and an introduction to XPath syntax

Last Update:2017-01-19 Source: Internet

Author: User

Tags gettext xpath

When you write programs or web crawlers, you often need to parse HTML to extract useful data. A good tool helps a great deal here, and many are available online, such as HtmlCleaner and HtmlParser.

After comparing the two in use, I find HtmlCleaner better than HtmlParser; its XPath support is especially useful.

The following example uses HtmlCleaner. The requirement is to extract the title, the links with name="my_href", and the content of every li under the div with class="d_1".

I. Using HtmlCleaner

1. HtmlCleaner

HtmlCleaner is an open-source HTML parser written in Java. It reorders the elements of an HTML document and produces well-formed HTML. By default, it follows rules similar to those most web browsers use to build the Document Object Model. However, users can provide custom tag and rule sets for filtering and matching.

Home page: http://htmlcleaner.sourceforge.net/

Download: http://www.jb51.net/softs/364983.html

2. A basic example, followed by crawling airport information from Wikipedia

html-clean-demo.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

HtmlCleanerDemo.java

package com.chenlb;

import java.io.File;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/** HtmlCleaner usage example. */
public class HtmlCleanerDemo {
	public static void main(String[] args) throws Exception {
		HtmlCleaner cleaner = new HtmlCleaner();
		TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");
		// Select by tag name: the title
		Object[] ns = node.getElementsByName("title", true);
		if (ns.length > 0) {
			System.out.println("title=" + ((TagNode) ns[0]).getText());
		}
		System.out.println("ul/li:");
		// Select by XPath
		ns = node.evaluateXPath("//div[@class='d_1']//li");
		for (Object on : ns) {
			TagNode n = (TagNode) on;
			System.out.println("\ttext=" + n.getText());
		}
		System.out.println("a:");
		// Select by attribute value
		ns = node.getElementsByAttValue("name", "my_href", true, true);
		for (Object on : ns) {
			TagNode n = (TagNode) on;
			System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
		}
	}
}

The argument to cleaner.clean() can be a file, a URL, or a string of HTML content. The most commonly used methods are evaluateXPath, getElementsByAttValue, and getElementsByName. HtmlCleaner also copes well with non-standard HTML.

Crawling airport information from Wikipedia

import java.io.UnsupportedEncodingException;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Project-specific classes of the original author; their source is not shown in this article.
import com.moore.index.BabyStory;
import com.moore.util.HttpClientUtil;

/**
 * Use: TODO
 *
 * @author bbdtek
 */
public class ParserAirport {
	private static Logger log = LoggerFactory.getLogger(ParserAirport.class);

	/**
	 * @param args
	 * @throws UnsupportedEncodingException
	 * @throws XPatherException
	 */
	public static void main(String[] args) throws UnsupportedEncodingException, XPatherException {
		String url = "http://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E6%9C%BA%E5%9C%BA%E5%88%97%E8%A1%A8";
		// Fetch the page source as a string
		String contents = HttpClientUtil.getUtil().getCon(url);
		HtmlCleaner hc = new HtmlCleaner();
		TagNode tn = hc.clean(contents);
		// Rows of the airport table
		String xpath = "//div[@class='mw-content-ltr']//table[@class='wikitable sortable']//tbody//tr[@align='right']";
		Object[] objarr = null;
		objarr = tn.evaluateXPath(xpath);
		if (objarr != null && objarr.length > 0) {
			for (Object obj : objarr) {
				TagNode tntr = (TagNode) obj;
				// Links in the left-aligned cells of this row
				String xptr = "//td[@align='left']//a";
				Object[] objarrtr = null;
				objarrtr = tntr.evaluateXPath(xptr);
				if (objarrtr != null && objarrtr.length > 0) {
					for (Object obja : objarrtr) {
						TagNode tna = (TagNode) obja;
						String str = tna.getText().toString();
						log.info(str);
					}
				}
			}
		}
	}
}
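The two-step pattern above (select the rows with one XPath, then run a second XPath inside each row) can be sketched with the JDK's own javax.xml.xpath API, so it runs without the HtmlCleaner jar. This is only an illustration under assumptions: the class name NestedXPathSketch, the method linkTexts, and the tiny table with "Airport A" and "Airport B" are made up for the sketch and are not the real Wikipedia markup. Note one difference in this API: an inner expression starting with // restarts at the document root, so the sketch writes .// to stay inside the current row.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NestedXPathSketch {
    // Hypothetical, well-formed stand-in for the table (not the real Wikipedia markup).
    static final String XML =
        "<table>"
      + "<tr align=\"right\"><td align=\"left\"><a>Airport A</a></td><td>1</td></tr>"
      + "<tr align=\"right\"><td align=\"left\"><a>Airport B</a></td><td>2</td></tr>"
      + "<tr><td align=\"left\"><a>Other link</a></td></tr>"
      + "</table>";

    /** Selects rows with rowExpr, then evaluates cellExpr with each row as the context node. */
    static List<String> linkTexts(String xml, String rowExpr, String cellExpr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(rowExpr, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                // The row is the context node for the inner expression.
                NodeList links = (NodeList) xpath.evaluate(cellExpr, row, XPathConstants.NODESET);
                for (int j = 0; j < links.getLength(); j++) {
                    texts.add(links.item(j).getTextContent());
                }
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Relative inner path: only the links inside the current row.
        System.out.println(linkTexts(XML, "//tr[@align='right']", ".//td[@align='left']//a"));
        // Absolute inner path: searches the whole document again for every row.
        System.out.println(linkTexts(XML, "//tr[@align='right']", "//td[@align='left']//a").size());
    }
}
```

With the relative inner path the sketch returns one link per matching row; with the absolute one every row returns all three links in the document, which is usually not what a crawler wants.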

II. A first look at XPath

1. Introduction to XPath

XPath is a language for finding information in an XML document. It can be used to navigate through the elements and attributes of an XML document.

2. XPath node selection

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path or steps.

The most useful path expressions are listed below:

Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes anywhere below the current node, no matter how deep they are.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.
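The expressions in this table can be tried out with the XPath engine bundled in the JDK (javax.xml.xpath). The following is a minimal sketch, not HtmlCleaner code: the class name XPathBasics, the helper select, and the two-book bookstore document are assumptions made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathBasics {
    // Hypothetical sample document, made up for this sketch.
    static final String XML =
        "<bookstore>"
      + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
      + "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>"
      + "</bookstore>";

    /** Evaluates an XPath expression and returns the text content of each selected node. */
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(XML, "/bookstore/book/title")); // absolute path from the root
        System.out.println(select(XML, "//price"));               // price elements at any depth
        System.out.println(select(XML, "//title/@lang"));         // attribute nodes
        System.out.println(select(XML, "//price/.."));            // parent (book) of each price
    }
}
```

For an attribute node, getTextContent() returns the attribute value; for an element it returns the concatenated text of its descendants.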

Some common expressions with predicates

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of the bookstore element.
/bookstore/book[last()]	Selects the last book element that is a child of the bookstore element.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of the bookstore element.
/bookstore/book[position()<3]	Selects the first two book elements that are children of the bookstore element.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of the bookstore element whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements of the bookstore element whose price element has a value greater than 35.00.
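The predicates in this table can be checked the same way with the JDK's javax.xml.xpath API. Again this is only a sketch under assumptions: the class name XPathPredicates, the helper select, and the four-book sample document (titles, lang values, and prices) are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPredicates {
    // Hypothetical sample data: four books, the last title has no lang attribute.
    static final String XML =
        "<bookstore>"
      + "<book><title lang=\"eng\">Everyday Italian</title><price>30.00</price></book>"
      + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
      + "<book><title lang=\"en\">XQuery Kick Start</title><price>49.99</price></book>"
      + "<book><title>Learning XML</title><price>39.95</price></book>"
      + "</bookstore>";

    /** Evaluates an XPath expression and returns the text content of each selected node. */
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "/bookstore/book[1]/title",            // first book
            "/bookstore/book[last()]/title",       // last book
            "/bookstore/book[last()-1]/title",     // second-to-last book
            "/bookstore/book[position()<3]/title", // first two books
            "//title[@lang]",                      // titles that have a lang attribute
            "//title[@lang='eng']",                // titles with lang equal to eng
            "/bookstore/book[price>35.00]/title"   // titles of books priced above 35.00
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + select(XML, expression));
        }
    }
}
```

Positions in XPath predicates start at 1, not 0, which is why book[1] is the first book.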

This article is an English version of an article originally published in Chinese on aliyun.com.
