Article Source: HTML Agility Pack parsing HTML page
In many applications we need to fetch data, especially by crawling parts of the Web. Crawling is simply the process of programmatically retrieving different Web pages and then analyzing and filtering their contents. For example, price-comparison sites crawl data from several shopping sites at the same time and save it to a database. In general, crawling these pages means parsing the HTML that is downloaded.
.NET provides several classes for accessing and downloading remote Web pages, such as the WebClient class and the HttpWebRequest class. These classes make it easy to request a remote page over HTTP and download it, but their HTML parsing capabilities are very weak, and in the past developers had to fall back on rudimentary methods such as String.IndexOf, String.Substring, or regular expressions to pick the markup apart.
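For comparison, here is a minimal sketch of that string-based approach; the URL and the idea of extracting the <title> element are purely illustrative, not part of the original article.
Here is the code snippet:
using System;
using System.Net;

class CrudeParserDemo
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Download the raw HTML of the page.
            string html = client.DownloadString("http://example.com/");

            // Locate the <title> element by hand -- fragile, but typical of string-based parsing.
            int start = html.IndexOf("<title>", StringComparison.OrdinalIgnoreCase);
            int end = html.IndexOf("</title>", StringComparison.OrdinalIgnoreCase);
            if (start >= 0 && end > start)
            {
                string title = html.Substring(start + "<title>".Length, end - start - "<title>".Length);
                Console.WriteLine(title.Trim());
            }
        }
    }
}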
Another way to parse HTML is to use the open-source toolkit HTML Agility Pack (http://htmlagilitypack.codeplex.com/), which is designed to make reading and writing HTML documents as simple as possible. The package parses the HTML into a DOM (Document Object Model); with just a few lines of code, developers can walk the DOM from the document's head down to its child nodes. The HTML Agility Pack can also locate specific nodes in the DOM through XPath, and it includes a class for downloading Web pages from a remote site, which means developers can use it to download and parse HTML pages at the same time.
This article shows how to use the HTML Agility Pack to download and parse Web pages through a few examples; the sample code can be downloaded from the attachment.
Preparatory work
You can download the HTML Agility Pack from http://htmlagilitypack.codeplex.com/; note that it requires ASP.NET 3.5 or later. After downloading, the toolkit is a single assembly, HtmlAgilityPack.dll. To use it, just place this DLL in the Bin directory of your site or project. The version used here is 1.4.
Here are three examples that illustrate the use of the HTML Agility Pack.
Example one: listing the meta tags of a remote Web page
Crawling a Web page typically involves downloading a specified page and extracting a specified piece of information from it. This first example shows how to use the HTML Agility Pack to download a Web page and loop through the <meta> tags that have both a name and a content attribute.
The classes of the HTML Agility Pack all live in the HtmlAgilityPack namespace, so reference this namespace before use, as follows:
Here is the code snippet:
using HtmlAgilityPack;
To download a Web page from a Web site, you can use the Load method of the HtmlWeb class; of course, you first need to create a new HtmlWeb instance, as follows:
Here is the code snippet:
var webGet = new HtmlWeb();
var document = webGet.Load(url);
The Load method returns an HtmlDocument object. In the code above, we assign the returned HtmlDocument to a local variable named document. The HtmlDocument class represents a complete HTML document and exposes a DocumentNode property, which returns an HtmlNode object representing the root node of the document.
The HtmlNode class has several straightforward properties that are mainly used to traverse the DOM, including:
ParentNode: accesses the parent node
ChildNodes: accesses the child nodes
NextSibling: the next sibling of an element (that is, the next element at the same level)
PreviousSibling: the previous sibling of an element (that is, the previous element at the same level)
For inspecting the node itself, there are the following properties:
Name: gets or sets the name of the node. For HTML elements it returns the tag name; for example, a <body> tag returns "body", a <p> tag returns "p", and so on
Attributes: returns a collection of all the attributes of the element
InnerHtml: gets or sets the HTML content inside the element
InnerText: returns the text of the node
NodeType: indicates the type of node, which can be Document, Element, Comment, or Text
Of course, there are many other ways to get information about a specific node; for example, the Ancestors method returns all ancestor nodes, and the SelectNodes method returns a collection of nodes matching an XPath expression.
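To make these members concrete, here is a minimal sketch of traversing a loaded document; the URL and the nodes inspected are only for demonstration and are not from the original article.
Here is the code snippet:
using System;
using HtmlAgilityPack;

class TraversalDemo
{
    static void Main()
    {
        var webGet = new HtmlWeb();
        var document = webGet.Load("http://example.com/");   // hypothetical URL

        // Walk the direct children of the root node and print basic information about each.
        foreach (HtmlNode node in document.DocumentNode.ChildNodes)
        {
            Console.WriteLine("Name: {0}, NodeType: {1}", node.Name, node.NodeType);
        }

        // SelectNodes returns all nodes matching an XPath expression, or null if there are none.
        var headings = document.DocumentNode.SelectNodes("//h1");
        if (headings != null)
        {
            foreach (var h1 in headings)
                Console.WriteLine("Heading text: " + h1.InnerText.Trim());
        }
    }
}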
With these methods and properties, we have several ways to get all the <meta> tags in an HTML document. This example uses the SelectNodes method. The following statement invokes SelectNodes on the DocumentNode property of the document object with the XPath expression "//meta" and returns all the <meta> tags in the document.
Here is the code snippet:
var metaTags = document.DocumentNode.SelectNodes("//meta");
If there is no <meta> tag in the document, the metaTags variable will be null; if there is at least one, metaTags will be a collection of HtmlNode objects, and we can iterate over them and display their attributes, as in the following code:
Here is the code snippet:
if (metaTags != null)
{
    foreach (var tag in metaTags)
    {
        if (tag.Attributes["name"] != null && tag.Attributes["content"] != null)
        {
            // ... output tag.Attributes["name"].Value and tag.Attributes["content"].Value ...
        }
    }
}
In the code above, foreach is first used to loop over each element of the metaTags collection, and each element's name and content attributes are then checked for null; if neither is null, their values can simply be output. As you can see, there is no need for regular expressions, which is very convenient.
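For reference, a minimal sketch of the completed loop might look like the following. It assumes the code runs in a Web Forms code-behind (so Response and Server are available) and that document is the HtmlDocument loaded earlier; the original sample may display the values differently.
Here is the code snippet:
var metaTags = document.DocumentNode.SelectNodes("//meta");
if (metaTags != null)
{
    foreach (var tag in metaTags)
    {
        // Only show <meta> tags that carry both a name and a content attribute.
        if (tag.Attributes["name"] != null && tag.Attributes["content"] != null)
        {
            Response.Write(Server.HtmlEncode(tag.Attributes["name"].Value) + ": " +
                           Server.HtmlEncode(tag.Attributes["content"].Value) + "<br />");
        }
    }
}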
The figure below shows the result of the code above: the user enters an address to visit and clicks the Submit button, and the HTML Agility Pack downloads the page's content, then uses the method described above to obtain and display the contents of the <meta> tags.
Example two: listing the links in a remote page
The example above shows how to use the SelectNodes method with XPath to find specific nodes. Another approach is to use LINQ syntax. Methods of the HtmlNode class such as those returning a node's ancestors or descendants actually return IEnumerable<HtmlNode>, and if you are familiar with LINQ you know that it handles IEnumerable very naturally, so it is also easy to use LINQ to query the nodes of an HTML document.
To demonstrate how to access nodes with LINQ, this example shows how to get the text and URL of every hyperlink (<a> tag) on a page. The initial code is the same as in the first example: create an HtmlWeb object:
Here is the code snippet:
var webGet = new HtmlWeb();
var document = webGet.Load(url);
Next, use the Descendants method of the document object together with LINQ syntax to get all the links on the specified page. To be exact, we want all the <a> tags that have an href attribute; of course, we also require that these tags have actual content rather than just whitespace. Finally, an anonymous type with two properties, Url and Text, is returned:
Here is the code snippet:
var linksOnPage = from lnks in document.DocumentNode.Descendants()
where lnks.Name == "a" &&
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
};
As you can see, that is the LINQ syntax in use. Now we can use an ASP.NET control to display the contents of linksOnPage; the code in this article uses a ListView control named lvLinks:
Here is the code snippet:
lvLinks.DataSource = linksOnPage;
lvLinks.DataBind();
The ListView markup on the front end is simple, as follows:
Here is the code snippet:
<asp:ListView ID="lvLinks" runat="server">
<LayoutTemplate>
<ul>
<asp:PlaceHolder runat="server" ID="itemPlaceholder" />
</ul>
</LayoutTemplate>
<ItemTemplate>
<li>
<%# Eval("Text") %> - <%# Eval("Url") %>
</li>
</ItemTemplate>
</asp:ListView>
The result after running is shown in the figure.
Example three: modifying and saving an HTML document
The two examples above show how to use the HTML Agility Pack to crawl a Web page and parse it, but in some cases you also need to modify the document's DOM structure and save it to disk. This example combines a bit of the previous two: it asks the user to enter the address of a Web page, then crawls the page and modifies it in the following two ways.
1. While the program reads the document, it dynamically adds a new <div> element node, making it the first child node of the <body> tag.
2. All links in the document are changed to open in a new window, which is done by setting the target attribute of each <a> tag to _blank.
When the modifications are complete, the document is saved to the user's local disk. Again, the first step is the same as in the previous two examples:
Here is the code snippet:
var webGet = new HtmlWeb();
var document = webGet.Load(url);
Next, LINQ syntax is used to find the <body> element. The following code means: find the first node among all descendant nodes of the document node whose name is "body", or null if no such node exists.
Here is the code snippet:
var body = document.DocumentNode.Descendants()
.Where(n => n.Name == "body")
.FirstOrDefault();
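As an aside (my own alternative, not the article's code), the same lookup can also be written with the XPath helper SelectSingleNode, which returns the first matching node or null:
Here is the code snippet:
// Equivalent lookup using XPath instead of LINQ.
var body = document.DocumentNode.SelectSingleNode("//body");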
If the <body> tag is found, we create a new HTML element node (the variable is named messageElement), specify its style attribute, and set its Name to "div" to indicate that it is a <div> tag. Finally, we assign its InnerHtml property and, of course, insert the newly created element node at the beginning of the <body> tag:
Here is the code snippet:
if (body != null)
{
var messageElement = new HtmlNode(HtmlNodeType.Element, document, 0);
messageElement.Attributes.Add("style", "width:95%;border:solid black 2px;background-color:#ffc;font-size:xx-large;text-align:center");
messageElement.Name = "div";
messageElement.InnerHtml = "Hello! This page was modified by the Html Agility Pack! Click on a link below... it should open in a new window!";
body.ChildNodes.Insert(0, messageElement);
}
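A slightly more idiomatic alternative (my own sketch, not the article's code) is to let the document create the element and then prepend it to the body; this relies on the same body and document variables as above:
Here is the code snippet:
if (body != null)
{
    // CreateElement builds a node owned by this document; PrependChild inserts it as the first child.
    var messageElement = document.CreateElement("div");
    messageElement.SetAttributeValue("style",
        "width:95%;border:solid black 2px;background-color:#ffc;font-size:xx-large;text-align:center");
    messageElement.InnerHtml = "Hello! This page was modified by the Html Agility Pack!";
    body.PrependChild(messageElement);
}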
Next, the SelectNodes method is used to return all the <a> tags that have an href attribute, and each link's target attribute is set to _blank:
Here is the code snippet:
var linksThatDoNotOpenInNewWindow = document.DocumentNode.SelectNodes("//a[@href]");
if (linksThatDoNotOpenInNewWindow != null)
{
foreach (var link in linksThatDoNotOpenInNewWindow)
if (link.Attributes["target"] == null)
link.Attributes.Add("target", "_blank");
else
link.Attributes["target"].Value = "_blank";
}
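Incidentally (this is my own simplification, not part of the original sample), the add-or-update branch can be collapsed into a single call, since HtmlNode.SetAttributeValue creates the attribute if it does not exist and overwrites it otherwise:
Here is the code snippet:
foreach (var link in linksThatDoNotOpenInNewWindow)
{
    // Adds target="_blank" or overwrites an existing target attribute.
    link.SetAttributeValue("target", "_blank");
}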
Finally, we call the Save method to save the changes we made. In this article the file is saved to the ModifiedPages directory, and a GUID is used to generate the file name, as in the following code:
Here is the code snippet:
var fileName = string.Format("~/ModifiedPages/{0}.htm", Guid.NewGuid().ToString());
document.Save(Server.MapPath(fileName));
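If you only need the modified markup as a string rather than a file on disk (again, an aside rather than part of the original example), the document's root node exposes it directly:
Here is the code snippet:
// The complete, modified HTML as a string, which could then be written anywhere.
string modifiedHtml = document.DocumentNode.OuterHtml;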
The figure shows the result of running example three. You can see that the page has been modified and saved: at the top of the page the content we wanted has indeed been added, and if you try the links on the page you will find that they open in new windows. Note that because the 4guysfromrolla Web page uses relative paths, and we did not save its images this time, it is normal that the images cannot be seen on this page.