When writing a program or a web crawler, you often need to parse HTML to extract useful data. A good tool helps a great deal here, and many are available online, such as HtmlCleaner and HtmlParser.
Having compared the two in use, I find HtmlCleaner better than HtmlParser, and HtmlCleaner's XPath support is especially useful.
The following example uses HtmlCleaner. The requirement is to extract the title, the link with name="my_href", and the content of every li under the div with class="d_1".
I. Using HtmlCleaner
1. HtmlCleaner
HtmlCleaner is an open-source HTML parser written in Java. It reorders the elements of an HTML document and produces a well-formed document. By default it follows rules similar to those most web browsers use to build the Document Object Model. Users can also supply custom tag and rule sets for tag filtering and balancing.
Home page: http://htmlcleaner.sourceforge.net/
Download: http://www.jb51.net/softs/364983.html
2. A basic example, and crawling airport information from Wikipedia
html-clean-demo.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
HtmlCleanerDemo.java
package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * HtmlCleaner usage example.
 */
public class HtmlCleanerDemo {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        // Parse the demo file, reading it as GBK-encoded text.
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

        // Select by tag name: the title.
        Object[] ns = node.getElementsByName("title", true);
        if (ns.length > 0) {
            System.out.println("title=" + ((TagNode) ns[0]).getText());
        }

        System.out.println("ul/li:");
        // Select by XPath: every li under the div with class d_1.
        ns = node.evaluateXPath("//div[@class='d_1']//li");
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\ttext=" + n.getText());
        }

        System.out.println("a:");
        // Select by attribute value: elements with name="my_href".
        ns = node.getElementsByAttValue("name", "my_href", true, true);
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
        }
    }
}
The argument to cleaner.clean() can be a file, a URL, or a string holding the HTML. The most commonly used methods are probably evaluateXPath, getElementsByAttValue, and getElementsByName. HtmlCleaner is also fairly tolerant of non-standard HTML.
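As a rough illustration of the string form of clean(), the sketch below parses markup held in memory and then uses the three methods named above. It assumes an HtmlCleaner 2.x jar is on the classpath. The class name CleanFromString and the sample markup are made up for this example and are not taken from the demo file.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

// Sketch only: requires the HtmlCleaner jar on the classpath.
class CleanFromString {

    // Hypothetical sample markup; </body> and </html> are left out on purpose.
    static final String HTML =
        "<html><head><title>demo page</title></head><body>"
      + "<div class=\"d_1\"><ul><li>first</li><li>second</li></ul></div>"
      + "<a name=\"my_href\" href=\"http://example.com/\">a link</a>";

    /** Text of the first title element, or an empty string if there is none. */
    static String title(TagNode root) {
        TagNode[] titles = root.getElementsByName("title", true);
        return titles.length > 0 ? titles[0].getText().toString() : "";
    }

    /** Number of li elements under the div with class d_1. */
    static int countListItems(TagNode root) {
        try {
            return root.evaluateXPath("//div[@class='d_1']//li").length;
        } catch (XPatherException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** href of the first element whose name attribute is my_href, or null. */
    static String href(TagNode root) {
        TagNode[] links = root.getElementsByAttValue("name", "my_href", true, true);
        return links.length > 0 ? links[0].getAttributeByName("href") : null;
    }

    public static void main(String[] args) {
        // clean(String) parses HTML that is already in memory.
        TagNode root = new HtmlCleaner().clean(HTML);
        System.out.println("title=" + title(root));
        System.out.println("li count=" + countListItems(root));
        System.out.println("href=" + href(root));
    }
}
```

Passing a java.io.File or java.net.URL to clean() instead of the string leaves the rest of the code unchanged.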
Crawling airport information from Wikipedia
import java.io.UnsupportedEncodingException;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.moore.index.BabyStory;
import com.moore.util.HttpClientUtil;

/**
 * Use: TODO
 *
 * @author bbdtek
 */
public class ParserAirport {

    private static Logger log = LoggerFactory.getLogger(ParserAirport.class);

    /**
     * @param args
     * @throws UnsupportedEncodingException
     * @throws XPatherException
     */
    public static void main(String[] args) throws UnsupportedEncodingException, XPatherException {
        String url = "http://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E6%9C%BA%E5%9C%BA%E5%88%97%E8%A1%A8";
        // HttpClientUtil is the author's own helper; it returns the page body as a string.
        String contents = HttpClientUtil.getUtil().getCon(url);
        HtmlCleaner hc = new HtmlCleaner();
        TagNode tn = hc.clean(contents);
        // Rows of the airport table. The class value is assumed to be "wikitable sortable";
        // the source text shows it as 'wikitable + sortable'.
        String xPath = "//div[@class='mw-content-ltr']//table[@class='wikitable sortable']//tbody//tr[@align='right']";
        Object[] objArr = null;
        objArr = tn.evaluateXPath(xPath);
        if (objArr != null && objArr.length > 0) {
            for (Object obj : objArr) {
                TagNode tnTr = (TagNode) obj;
                // Links inside the left-aligned cells of this row.
                String xpTr = "//td[@align='left']//a";
                Object[] objArrTr = null;
                objArrTr = tnTr.evaluateXPath(xpTr);
                if (objArrTr != null && objArrTr.length > 0) {
                    for (Object objA : objArrTr) {
                        TagNode tnA = (TagNode) objA;
                        String str = tnA.getText().toString();
                        log.info(str);
                    }
                }
            }
        }
    }
}
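The listing above works in two steps: it first selects the table rows, then runs a second expression against each row to collect its links. The sketch below shows the same two-step pattern with the JDK's built-in javax.xml.xpath API rather than HtmlCleaner, so it runs without extra libraries. The class name RowLinks and the small table are hypothetical stand-ins, not the real Wikipedia markup. One difference to be aware of: in standard XPath an expression that begins with // is evaluated from the document root even when a row is passed as the context, so the per-row expression here starts with a dot.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RowLinks {

    // Hypothetical, simplified stand-in for the airport table.
    static final String XML =
        "<table>"
      + "<tr align=\"right\"><td align=\"left\"><a>Airport A</a></td><td>1</td></tr>"
      + "<tr><td align=\"left\"><a>Header link</a></td></tr>"
      + "<tr align=\"right\"><td align=\"left\"><a>Airport B</a></td><td>2</td></tr>"
      + "</table>";

    /** Collects the link text from every right-aligned row. */
    static List<String> linkTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: select the rows of interest.
            NodeList rows = (NodeList) xpath.evaluate(
                    "//tr[@align='right']", doc, XPathConstants.NODESET);

            List<String> out = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                // Step 2: the leading dot restricts the search to this row.
                NodeList links = (NodeList) xpath.evaluate(
                        ".//td[@align='left']//a", rows.item(i), XPathConstants.NODESET);
                for (int j = 0; j < links.getLength(); j++) {
                    out.add(links.item(j).getTextContent());
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot process document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(linkTexts(XML)); // prints [Airport A, Airport B]
    }
}
```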
II. A first look at XPath
1. Introduction to XPath
XPath is a language for finding information in an XML document. It can be used to navigate the elements and attributes of an XML document.
2. Selecting nodes with XPath
XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, that is, a sequence of steps.
The most useful path expressions are listed below:
Expression | Description
nodename | Selects the child elements of the current node that have this name.
/ | Selects from the root node.
// | Selects matching nodes anywhere below the starting point, regardless of their depth.
. | Selects the current node.
.. | Selects the parent of the current node.
@ | Selects attributes.
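To make the table concrete, here is a minimal sketch that evaluates each kind of expression with the JDK's built-in javax.xml.xpath API. The class name XPathBasics, the helper methods, and the two-book sample document are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathBasics {

    // Hypothetical sample document, used only for illustration.
    static final String XML =
        "<bookstore>"
      + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
      + "<book><title lang=\"eng\">Learning XML</title><price>39.95</price></book>"
      + "</bookstore>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Evaluates expr with the given node (or document) as the current node. */
    static NodeList select(Object context, String expr) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, context, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        Node bookstore = select(doc, "/bookstore").item(0);   // "/" starts at the root
        Node firstTitle = select(doc, "//title").item(0);     // "//" matches at any depth

        System.out.println(select(bookstore, "book").getLength());            // prints 2
        System.out.println(select(doc, "//title").getLength());               // prints 2
        System.out.println(select(doc, "//@lang").getLength());               // prints 2
        System.out.println(select(firstTitle, ".").item(0).getTextContent()); // prints Harry Potter
        System.out.println(select(firstTitle, "..").item(0).getNodeName());   // prints book
    }
}
```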
Some common expressions:
Path expression | Result
/bookstore/book[1] | Selects the first book element that is a child of the bookstore element.
/bookstore/book[last()] | Selects the last book element that is a child of the bookstore element.
/bookstore/book[last()-1] | Selects the second-to-last book element that is a child of the bookstore element.
/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element.
//title[@lang] | Selects all title elements that have an attribute named lang.
//title[@lang='eng'] | Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00] | Selects all book elements of the bookstore element whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title | Selects all title elements of the book elements of the bookstore element whose price element has a value greater than 35.00.
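The predicates in the table can be tried out with a short program. This is a sketch using the JDK's javax.xml.xpath API. The class name XPathPredicates and the four-book sample document are invented for illustration. Where a table entry selects book elements, /title is appended so that the output stays readable.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPredicates {

    // Hypothetical sample data; the last title deliberately has no lang attribute.
    static final String XML =
        "<bookstore>"
      + "<book><title lang=\"eng\">Everyday Italian</title><price>30.00</price></book>"
      + "<book><title lang=\"eng\">Harry Potter</title><price>29.99</price></book>"
      + "<book><title lang=\"fr\">Le Petit Prince</title><price>49.99</price></book>"
      + "<book><title>Learning XML</title><price>39.95</price></book>"
      + "</bookstore>";

    /** Returns the text content of every node matched by expr. */
    static List<String> texts(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(texts("/bookstore/book[1]/title"));            // [Everyday Italian]
        System.out.println(texts("/bookstore/book[last()]/title"));       // [Learning XML]
        System.out.println(texts("/bookstore/book[last()-1]/title"));     // [Le Petit Prince]
        System.out.println(texts("/bookstore/book[position()<3]/title")); // [Everyday Italian, Harry Potter]
        System.out.println(texts("//title[@lang]"));                      // [Everyday Italian, Harry Potter, Le Petit Prince]
        System.out.println(texts("//title[@lang='eng']"));                // [Everyday Italian, Harry Potter]
        System.out.println(texts("/bookstore/book[price>35.00]/title"));  // [Le Petit Prince, Learning XML]
    }
}
```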