HtmlAgilityPack: An HTML Parsing Tool for C# Spider Programs

Last Update: 2015-08-21 Source: Internet

Author: User

Tags xpath


In a previous project, I parsed HTML with regular expressions: first stripping out irrelevant HTML comments and JavaScript code step by step, then using further regular expressions to find the parts to be extracted. This is a tedious process, especially if you are not familiar with regular expressions or the HTML to be processed is complex.

Now we can use HtmlAgilityPack, a .NET class library for parsing HTML. HtmlAgilityPack supports parsing HTML with XPath, so using it requires learning both the HtmlAgilityPack API and XPath.

HtmlAgilityPack is an open-source .NET class library. Its home page is http://htmlagilitypack.codeplex.com/, where you can download the latest version of the library and the API manual, as well as an auxiliary tool for debugging.
A Concise Introduction to XPath
XPath uses path expressions to select a node or a set of nodes in an XML document. Nodes are selected by following a path or steps.
The most useful path expressions are listed below:
nodename: Selects all child elements of the current node that have this name.
/: Selects from the root node.
//: Selects matching nodes anywhere in the document, starting from the current node, regardless of their location.
.: Selects the current node.
..: Selects the parent of the current node.
For example, consider the following XML:
<?xml version="1.0" encoding="utf-8"?>
<Articles>
  <Article>
    <Title>Cow B's resume is God horse, so magical.</Title>
    <Url>http://chebazi.net/showtopic-401.aspx</Url>
    <CreateAt type="en">2011-04-07</CreateAt>
  </Article>
  <Article>
    <Title lang="eng">
      "Kung Fu Panda 2" US 2011 adventure Action Animation blockbuster
    </Title>
    <Url>http://chebazi.net/showtopic-109.aspx</Url>
    <CreateAt type="zh-cn">
      November 23, 2010
    </CreateAt>
  </Article>
  <Article>
    <Title>
      is a man's Must see, girls do not enter!!!
    </Title>
    <Url>http://chebazi.net/showtopic-396.aspx</Url>
    <CreateAt type="zh-cn">
      June 12, 2011
    </CreateAt>
  </Article>
  <Article>
    <Title lang="eng">
      Ambiguous
    </Title>
    <Url>http://www.iofeng.com/</Url>
    <CreateAt type="zh-cn">
      2007-09-08
    </CreateAt>
  </Article>
</Articles>

For the XML above, here are some path expressions with predicates and the result of each expression:
/Articles/Article[1]: Selects the first Article element that is a child of the Articles element.
/Articles/Article[last()]: Selects the last Article element that is a child of the Articles element.
/Articles/Article[last()-1]: Selects the second-to-last Article element that is a child of the Articles element.
/Articles/Article[position()<3]: Selects the first two Article elements that are children of the Articles element.
//Title[@lang]: Selects all Title elements that have an attribute named lang.
//CreateAt[@type='zh-cn']: Selects all CreateAt elements whose type attribute has the value zh-cn.
/Articles/Article[Order>2]: Selects all Article elements of the Articles element that have an Order child element with a value greater than 2.
/Articles/Article[Order<3]/Title: Selects the Title elements of all Article elements of the Articles element that have an Order child element with a value less than 3.
Note that XPath is case-sensitive, so element and attribute names in an expression must match the document exactly. The sample XML contains no Order element, so the last two expressions only illustrate the syntax and select nothing from it.
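
The predicate expressions above can be tried out with the XPath support in .NET's System.Xml namespace. The following is a minimal sketch and not part of the original article: the class name XPathDemo, the helper SelectText, and the shortened sample data are assumptions made for illustration, and element names use one consistent casing because XPath is case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathDemo
{
    // Shortened, hypothetical copy of the sample XML (titles abbreviated, Url elements omitted).
    public const string SampleXml = @"<Articles>
  <Article>
    <Title>Resume</Title>
    <CreateAt type=""en"">2011-04-07</CreateAt>
  </Article>
  <Article>
    <Title lang=""eng"">Kung Fu Panda 2</Title>
    <CreateAt type=""zh-cn"">November 23, 2010</CreateAt>
  </Article>
  <Article>
    <Title>Must see</Title>
    <CreateAt type=""zh-cn"">June 12, 2011</CreateAt>
  </Article>
  <Article>
    <Title lang=""eng"">Ambiguous</Title>
    <CreateAt type=""zh-cn"">2007-09-08</CreateAt>
  </Article>
</Articles>";

    // Evaluates an XPath expression against the sample and returns the
    // trimmed text of every selected node.
    public static List<string> SelectText(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            result.Add(node.InnerText.Trim());
        }
        return result;
    }

    public static void Main()
    {
        string[] expressions =
        {
            "/Articles/Article[1]/Title",
            "/Articles/Article[last()]/Title",
            "/Articles/Article[position()<3]/Title",
            "//Title[@lang]",
            "//CreateAt[@type='zh-cn']",
            "/Articles/Article[Order>2]" // no Order element in the sample, so nothing is selected
        };

        foreach (string xpath in expressions)
        {
            Console.WriteLine(xpath + " -> " + string.Join(" | ", SelectText(xpath)));
        }
    }
}
```

Running Main prints one line per expression, which makes it easy to compare what each predicate selects.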

A Brief Introduction to the HtmlAgilityPack API
The classes most commonly used in HtmlAgilityPack are HtmlDocument, HtmlNodeCollection, HtmlNode, and HtmlWeb.
The usual process is to get the HTML first. Static content can be loaded with HtmlDocument's Load() or LoadHtml() method, and the HTML at a URL on the network can be loaded with HtmlWeb's Get() or Load() method.
Once you have an HtmlDocument instance, use its DocumentNode property, which is the root node of the whole HTML document and is itself an HtmlNode. From there, HtmlNode's SelectNodes() method returns an HtmlNodeCollection containing multiple HtmlNode objects, and HtmlNode's SelectSingleNode() method returns a single HtmlNode.
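
The load-then-select flow described above can be sketched as follows. This is an illustrative example rather than code from the original article: the class name HapDemo, the helper ExtractLinks, and the inline HTML are hypothetical, and it assumes the HtmlAgilityPack package is referenced by the project.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack library is referenced

public static class HapDemo
{
    // Returns (href, text) pairs for every <a> element that has an href attribute.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // static content; new HtmlWeb().Load(url) would fetch a page instead

        var links = new List<KeyValuePair<string, string>>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            // SelectNodes returns null, not an empty collection, when nothing matches.
            return links;
        }

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", string.Empty);
            links.Add(new KeyValuePair<string, string>(href, a.InnerText.Trim()));
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical input, used only to exercise the helper.
        const string html =
            "<html><body><ul>" +
            "<li><a href='/a.htm'>First</a></li>" +
            "<li><a href='/b.htm'>Second</a></li>" +
            "<li><a>No link</a></li>" +
            "</ul></body></html>";

        foreach (KeyValuePair<string, string> link in ExtractLinks(html))
        {
            Console.WriteLine(link.Key + " -> " + link.Value);
        }
    }
}
```

The null check matters in practice: iterating the result of SelectNodes directly throws a NullReferenceException whenever the XPath expression matches nothing.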
HtmlAgilityPack in Practice
The goal is to extract the link and text of each item listed in a column of the page http://www.hao123.com/game.htm.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Text;
using HtmlAgilityPack;

public class Category
{
    public string Subject { get; set; }
    public string IndexUrl { get; set; }
}

public partial class _Default : System.Web.UI.Page
{
    // Key point: these XPath expressions depend on the page structure,
    // so each site has to be analyzed to work out its own paths.
    private const string CategoryListXPath = "//html[1]/body[1]/div[3]/center[1]/div[1]/table[1]/tr";
    private const string CategoryNameXPath = "//td/a[1]";
    private const string ChooseXPath = "//a[1]";

    protected void Button1_Click(object sender, EventArgs e)
    {
        Uri url = new Uri(this.TextBox1.Text.Trim());
        Uri uriCategory = null;

        // Download the page and decode it as gb2312.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        string str;
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.GetEncoding("gb2312")))
        {
            str = reader.ReadToEnd();
        }

        HtmlDocument html = new HtmlDocument();
        html.LoadHtml(str);
        HtmlNode rootNode = html.DocumentNode;

        List<Category> list = new List<Category>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection categoryNodeList = rootNode.SelectNodes(CategoryListXPath);
        if (categoryNodeList != null)
        {
            foreach (HtmlNode categoryNode in categoryNodeList)
            {
                // Re-parse the row on its own so that the "//" paths below
                // search only this row instead of the whole document.
                HtmlNode temp = HtmlNode.CreateNode(categoryNode.OuterHtml);
                HtmlNodeCollection singleList = temp.SelectNodes(CategoryNameXPath);
                if (singleList == null)
                    continue;

                foreach (HtmlNode node in singleList)
                {
                    HtmlNode createNode = HtmlNode.CreateNode(node.OuterHtml);
                    HtmlNode reNode = createNode.SelectSingleNode(ChooseXPath);
                    if (reNode == null)
                        continue;

                    // Resolve the (possibly relative) href against the page URL.
                    string href = reNode.GetAttributeValue("href", string.Empty);
                    if (!Uri.TryCreate(url, href, out uriCategory))
                        continue;

                    Category category = new Category();
                    category.Subject = reNode.InnerText;
                    category.IndexUrl = uriCategory.ToString();
                    list.Add(category);
                }
            }
        }

        // Render the extracted links as a simple HTML table.
        StringBuilder re = new StringBuilder();
        foreach (Category cate in list)
        {
            re.AppendFormat("<tr><td><a href=\"{0}\">{1}</a></td></tr>", cate.IndexUrl, cate.Subject);
        }
        this.Literal1.Text = string.Format("<table>{0}</table>", re);
    }

    protected void Page_Load(object sender, EventArgs e)
    {
    }
}


