HtmlCleaner Usage and XPath Syntax Basics (Java)


When programming, and especially when writing a web crawler, you often need to parse HTML to extract useful data. A good tool helps a lot here, and there are many such tools available, for example HtmlCleaner and HtmlParser.

After using both, HtmlCleaner feels better than HtmlParser; in particular, HtmlCleaner's XPath support is very handy.

The following example uses HtmlCleaner. The requirement is: extract the title, the link with name="my_href", and the content of all li elements under the div with class="d_1".

I. Using HtmlCleaner

1. HtmlCleaner

HtmlCleaner is an open-source HTML document parser written in Java. It rearranges the elements of an HTML document and produces well-formed HTML. By default it follows rules similar to those most web browsers use to build the Document Object Model. However, users can supply custom tags and rule sets for filtering and matching.

Home Address: http://htmlcleaner.sourceforge.net/

Download Address: http://www.jb51.net/softs/364983.html
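
As a rough sketch of what those custom filtering rules look like in code (a minimal example assuming the CleanerProperties options omitComments and pruneTags that recent HtmlCleaner releases provide; the markup is a placeholder):

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class CleanerPropsDemo {
	public static void main(String[] args) throws Exception {
		HtmlCleaner cleaner = new HtmlCleaner();
		CleanerProperties props = cleaner.getProperties();
		props.setOmitComments(true);        // drop HTML comments from the cleaned output
		props.setPruneTags("script,style"); // remove script and style elements entirely
		TagNode root = cleaner.clean("<html><body><!-- note --><script>x()</script><p>kept</p></body></html>");
		System.out.println(root.getText()); // prints only the text that survives cleaning
	}
}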


2. A basic example, and grabbing airport information from Wikipedia

Html-clean-demo.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
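
The demo below only depends on a few pieces of this file: the title, a link named my_href, and li items inside a div with class d_1. A minimal body with those pieces (assumed content, purely for illustration) might look like:

<html>
  <head><title>html clean demo</title></head>
  <body>
    <a name="my_href" href="http://www.example.com/">example link</a>
    <div class="d_1">
      <ul>
        <li>item one</li>
        <li>item two</li>
      </ul>
    </div>
  </body>
</html>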

HtmlCleanerDemo.java

package com.chenlb;

import java.io.File;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/** HtmlCleaner usage example. */
public class HtmlCleanerDemo {

	public static void main(String[] args) throws Exception {
		HtmlCleaner cleaner = new HtmlCleaner();
		TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

		// Take elements by tag name.
		Object[] ns = node.getElementsByName("title", true);
		if (ns.length > 0) {
			System.out.println("title=" + ((TagNode) ns[0]).getText());
		}

		System.out.println("ul/li:");
		// Take elements by XPath.
		ns = node.evaluateXPath("//div[@class='d_1']//li");
		for (Object on : ns) {
			TagNode n = (TagNode) on;
			System.out.println("\ttext=" + n.getText());
		}

		System.out.println("a:");
		// Take elements by attribute value.
		ns = node.getElementsByAttValue("name", "my_href", true, true);
		for (Object on : ns) {
			TagNode n = (TagNode) on;
			System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
		}
	}
}

The argument to cleaner.clean() can be a File, a URL, or string content. The most commonly used methods are probably evaluateXPath, getElementsByAttValue, and getElementsByName. In addition, HtmlCleaner copes well with non-standard HTML.
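
For instance, a minimal sketch of the other input sources (assuming the clean(URL) and clean(String) overloads of the same HtmlCleaner API; the URL and markup below are placeholders):

import java.net.URL;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class CleanSourcesDemo {
	public static void main(String[] args) throws Exception {
		HtmlCleaner cleaner = new HtmlCleaner();
		// Clean directly from a URL (placeholder address).
		TagNode fromUrl = cleaner.clean(new URL("http://htmlcleaner.sourceforge.net/"));
		// Clean from in-memory string content.
		TagNode fromString = cleaner.clean("<html><body><div class='d_1'><ul><li>hi</li></ul></div></body></html>");
		System.out.println(fromUrl.getName() + " / " + fromString.getName());
	}
}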

Grabbing airport information from Wikipedia (ParserAirport.java):

import java.io.UnsupportedEncodingException;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.moore.util.HttpClientUtil;

/**
 * Usage: TODO
 *
 * @author bbdtek
 */
public class ParserAirport {

	private static Logger log = LoggerFactory.getLogger(ParserAirport.class);

	/**
	 * @param args
	 * @throws UnsupportedEncodingException
	 * @throws XPatherException
	 */
	public static void main(String[] args) throws UnsupportedEncodingException, XPatherException {
		String url = "http://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E6%9C%BA%E5%9C%BA%E5%88%97%E8%A1%A8";
		String contents = HttpClientUtil.getUtil().getCon(url);
		HtmlCleaner hc = new HtmlCleaner();
		TagNode tn = hc.clean(contents);
		// Rows of the sortable airport table.
		String xpath = "//div[@class='mw-content-ltr']//table[@class='wikitable + sortable']//tbody//tr[@align='right']";
		Object[] objArr = tn.evaluateXPath(xpath);
		if (objArr != null && objArr.length > 0) {
			for (Object obj : objArr) {
				TagNode tnTr = (TagNode) obj;
				// Links in the left-aligned cells of each row.
				String xpTr = "//td[@align='left']//a";
				Object[] objArrTr = tnTr.evaluateXPath(xpTr);
				if (objArrTr != null && objArrTr.length > 0) {
					for (Object objA : objArrTr) {
						TagNode tnA = (TagNode) objA;
						String str = tnA.getText().toString();
						log.info(str);
					}
				}
			}
		}
	}
}

II. A First Look at XPath

1. Introduction to XPath:

XPath is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document.

2. XPath node selection


XPath uses path expressions to select nodes in an XML document. A node is selected by following a path or a series of steps.

The most useful path expressions are listed below:

Expression   Description
nodename     Selects all child nodes of the named node.
/            Selects from the root node.
//           Selects nodes in the document, starting from the current node, that match the selection, regardless of their position.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.

Some common path expressions and their results:

Path expression                       Result
/bookstore/book[1]                    Selects the first book element that is a child of bookstore.
/bookstore/book[last()]               Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]             Selects the last but one book element that is a child of bookstore.
/bookstore/book[position()<3]         Selects the first two book elements that are children of bookstore.
//title[@lang]                        Selects all title elements that have an attribute named lang.
//title[@lang='eng']                  Selects all title elements that have a lang attribute with the value 'eng'.
/bookstore/book[price>35.00]          Selects all book elements of the bookstore element whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title    Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.
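
To see a couple of these expressions run, here is a small, self-contained sketch using the standard javax.xml.xpath API; the bookstore XML below is made up purely for illustration:

import java.io.ByteArrayInputStream;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathBookstoreDemo {
	public static void main(String[] args) throws Exception {
		// A tiny bookstore document, invented for this demo.
		String xml = "<bookstore>"
				+ "<book><title lang='eng'>Everyday Italian</title><price>30.00</price></book>"
				+ "<book><title lang='eng'>Learning XML</title><price>39.95</price></book>"
				+ "</bookstore>";
		Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
				.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
		XPath xpath = XPathFactory.newInstance().newXPath();

		// //title[@lang] : every title element that carries a lang attribute.
		NodeList withLang = (NodeList) xpath.evaluate("//title[@lang]", doc, XPathConstants.NODESET);
		System.out.println("titles with a lang attribute: " + withLang.getLength());

		// /bookstore/book[price>35.00]/title : titles of books costing more than 35.00.
		NodeList expensive = (NodeList) xpath.evaluate("/bookstore/book[price>35.00]/title", doc, XPathConstants.NODESET);
		for (int i = 0; i < expensive.getLength(); i++) {
			System.out.println("expensive book: " + expensive.item(i).getTextContent());
		}
	}
}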
