In a previous project, I parsed HTML with regular expressions, step by step: first stripping extraneous HTML comments and JavaScript code, then using more regular expressions to find the parts to extract. Doing this with regular expressions is tedious, especially if you are not familiar with regular expressions or the HTML to be processed is complex.
Now we can use HtmlAgilityPack, a .NET class library for parsing HTML. HtmlAgilityPack supports parsing HTML with XPath, so it is necessary to learn both HtmlAgilityPack's API and XPath.
HtmlAgilityPack is an open source .NET class library. Its home page is http://htmlagilitypack.codeplex.com/, where you can download the latest version of the library and the API manual, as well as an auxiliary tool for debugging.
A Concise Introduction to XPath
XPath uses path expressions to select a node or a set of nodes in an XML document. A node is selected by following a path, which is made up of steps.
The most useful path expressions are listed below:
nodename: Selects all child nodes of the named node.
/: Selects from the root node.
//: Selects nodes in the document that match the selection, starting from the current node, regardless of their location.
.: Selects the current node.
..: Selects the parent of the current node.
For example, there is the following XML:
<?xml version="1.0" encoding="utf-8"?>
<articles>
  <Article>
    <Title>Cow B's resume is God horse, so magical.</Title>
    <Url>http://chebazi.net/showtopic-401.aspx</Url>
    <CreateAt type="en">2011-04-07</CreateAt>
  </Article>
  <Article>
    <Title lang="Eng">
      "Kung Fu Panda 2" US 2011 adventure Action Animation blockbuster
    </Title>
    <Url>http://chebazi.net/showtopic-109.aspx</Url>
    <CreateAt type="ZH-CN">
      November 23, 2010
    </CreateAt>
  </Article>
  <Article>
    <Title>
      is a man's Must see, girls do not enter!!!
    </Title>
    <Url>http://chebazi.net/showtopic-396.aspx</Url>
    <CreateAt type="ZH-CN">
      June 12, 2011
    </CreateAt>
  </Article>
  <Article>
    <Title lang="Eng">
      Ambiguous
    </Title>
    <Url>http://www.iofeng.com/</Url>
    <CreateAt type="ZH-CN">
      2007-09-08
    </CreateAt>
  </Article>
</articles>
For the XML file above, here are some path expressions with predicates and the result of each expression. Element names in XPath are case-sensitive and must match the XML exactly.
/articles/Article[1]: Selects the first Article element that is a child of the articles element.
/articles/Article[last()]: Selects the last Article element that is a child of the articles element.
/articles/Article[last()-1]: Selects the second-to-last Article element that is a child of the articles element.
/articles/Article[position()<3]: Selects the first two Article elements that are children of the articles element.
//Title[@lang]: Selects all Title elements that have an attribute named lang.
//CreateAt[@type='ZH-CN']: Selects all CreateAt elements that have a type attribute with a value of ZH-CN.
/articles/Article[Order>2]: Selects all Article elements of the articles element that have an Order child element with a value greater than 2. (The sample XML above contains no Order element, so this expression only illustrates the syntax.)
/articles/Article[Order<3]/Title: Selects the Title elements of all Article elements of the articles element that have an Order child element with a value less than 3.
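The predicate expressions above can be tried out with the XmlDocument class from the .NET base class library, without HtmlAgilityPack. The following is only a minimal sketch: the XML string is a shortened, made-up stand-in for the sample file (titles "First", "Second", "Third"), and the XPathDemo class and its Count and FirstText helpers are hypothetical names introduced for illustration.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // Shortened, hypothetical stand-in for the sample XML (assumed element names:
    // articles, Article, Title, CreateAt).
    private const string Xml = @"<articles>
  <Article><Title lang=""Eng"">First</Title><CreateAt type=""en"">2011-04-07</CreateAt></Article>
  <Article><Title>Second</Title><CreateAt type=""ZH-CN"">2010-11-23</CreateAt></Article>
  <Article><Title lang=""Eng"">Third</Title><CreateAt type=""ZH-CN"">2007-09-08</CreateAt></Article>
</articles>";

    private static XmlDocument LoadSample()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc;
    }

    // Number of nodes matched by the XPath expression.
    public static int Count(string xpath)
    {
        return LoadSample().SelectNodes(xpath).Count;
    }

    // Text of the first matching node, or null when nothing matches.
    public static string FirstText(string xpath)
    {
        XmlNode node = LoadSample().SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(FirstText("/articles/Article[1]/Title"));        // First
        Console.WriteLine(FirstText("/articles/Article[last()]/Title"));   // Third
        Console.WriteLine(FirstText("/articles/Article[last()-1]/Title")); // Second
        Console.WriteLine(Count("/articles/Article[position()<3]"));       // 2
        Console.WriteLine(Count("//Title[@lang]"));                        // 2
        Console.WriteLine(Count("//CreateAt[@type='ZH-CN']"));             // 2
        Console.WriteLine(Count("/articles/Article[Order>2]"));            // 0, no Order element
    }
}
```

Note that XPath positions start at 1, not 0, and that a predicate on a missing child element simply matches nothing rather than raising an error.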
A Brief Introduction to the HtmlAgilityPack API
The classes most commonly used in HtmlAgilityPack are HtmlDocument, HtmlNodeCollection, HtmlNode, and HtmlWeb.
The usual process is to get the HTML first. You can load static content through HtmlDocument's Load() or LoadHtml() method, or you can use HtmlWeb's Get() or Load() method to load the HTML for a URL on the network.
Once you have an HtmlDocument instance, use its DocumentNode property. This is the root node of the whole HTML document, and it is itself an HtmlNode. You can then call HtmlNode's SelectNodes() method, which returns multiple nodes as an HtmlNodeCollection, or HtmlNode's SelectSingleNode() method, which returns a single HtmlNode.
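As a rough illustration of this load-then-select flow, the sketch below parses an HTML string with LoadHtml(), takes DocumentNode, and queries it with SelectNodes() and SelectSingleNode(). The HapDemo class, its method names, and the sample markup are made up for this example; it also assumes the HtmlAgilityPack assembly is referenced, and relies on SelectNodes() returning null (not an empty collection) when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HapDemo
{
    // Hypothetical sample markup, used only for illustration.
    public const string SampleHtml =
        "<table>" +
        "<tr><td><a href='/a.htm'>Game A</a></td></tr>" +
        "<tr><td><a href='/b.htm'>Game B</a></td></tr>" +
        "</table>";

    // Returns "text -> href" for every link found inside a table cell.
    public static List<string> ExtractLinks(string htmlText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlText);               // parse static HTML from a string
        HtmlNode root = doc.DocumentNode;     // root node of the whole document

        var result = new List<string>();
        HtmlNodeCollection anchors = root.SelectNodes("//td/a");
        if (anchors == null)                  // no match: null, so guard before iterating
            return result;

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            result.Add(a.InnerText.Trim() + " -> " + href);
        }
        return result;
    }

    // Returns the text of the first link in the document, or null if there is none.
    public static string FirstLinkText(string htmlText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlText);
        HtmlNode first = doc.DocumentNode.SelectSingleNode("//a");
        return first == null ? null : first.InnerText;
    }

    public static void Main()
    {
        foreach (string line in ExtractLinks(SampleHtml))
            Console.WriteLine(line);
        Console.WriteLine(FirstLinkText(SampleHtml));
    }
}
```

To fetch a live page instead of a string, the same selection code should work on the document returned by new HtmlWeb().Load(url).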
HtmlAgilityPack in Practice
The goal is to extract the link and text of each item in a column of the page http://www.hao123.com/game.htm.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Text;
using HtmlAgilityPack;

public class Category
{
    public string Subject { get; set; }
    public string IndexUrl { get; set; }
}

public partial class _Default : System.Web.UI.Page
{
    // Key point: these paths differ from site to site and must be worked out for each page.
    private const string CategoryListXPath = "//html[1]/body[1]/div[3]/center[1]/div[1]/table[1]/tr";
    private const string CategoryNameXPath = "//td/a[1]";
    private const string ChooseXPath = "//a[1]";

    protected void Button1_Click(object sender, EventArgs e)
    {
        Uri url = new Uri(this.TextBox1.Text.Trim());
        Uri uriCategory = null;
        string str;

        // Download the page; the target site is encoded as gb2312.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader read = new StreamReader(stream, Encoding.GetEncoding("gb2312")))
        {
            str = read.ReadToEnd();
        }

        HtmlDocument html = new HtmlDocument();
        html.LoadHtml(str);
        HtmlNode rootNode = html.DocumentNode;
        HtmlNodeCollection categoryNodeList = rootNode.SelectNodes(CategoryListXPath);
        List<Category> list = new List<Category>();

        // SelectNodes returns null when nothing matches, so guard before iterating.
        if (categoryNodeList != null)
        {
            foreach (HtmlNode categoryNode in categoryNodeList)
            {
                // Re-parse the row on its own so that "//" paths search only inside this row.
                HtmlNode temp = HtmlNode.CreateNode(categoryNode.OuterHtml);
                HtmlNodeCollection singleList = temp.SelectNodes(CategoryNameXPath);
                if (singleList == null)
                    continue;

                foreach (HtmlNode node in singleList)
                {
                    HtmlNode createNode = HtmlNode.CreateNode(node.OuterHtml);
                    HtmlNode reNode = createNode.SelectSingleNode(ChooseXPath);
                    if (reNode == null || reNode.Attributes["href"] == null)
                        continue;

                    // Resolve a possibly relative href against the page URL.
                    if (!Uri.TryCreate(url, reNode.Attributes["href"].Value, out uriCategory))
                        continue;

                    Category category = new Category();
                    category.Subject = reNode.InnerText;
                    category.IndexUrl = uriCategory.ToString();
                    list.Add(category);
                }
            }
        }

        // Render the collected links as table rows.
        StringBuilder re = new StringBuilder();
        foreach (Category cate in list)
        {
            re.AppendFormat("<tr><td><a href=\"{0}\">{1}</a></td></tr>", cate.IndexUrl, cate.Subject);
        }
        this.Literal1.Text = string.Format("<table>{0}</table>", re);
    }

    protected void Page_Load(object sender, EventArgs e)
    {
    }
}
C# Spider Program HTML Parsing Tool: HtmlAgilityPack