Open source project HTML Agility Pack for fast parsing of HTML

Last Update:2016-12-20 Source: Internet

Author: User

This is an excellent library. I used to parse HTML with HtmlParser, which is handy but slow. I came across HTML Agility Pack today, tried it, and was pleasantly surprised by how well it works. I recommend it.

Here are some simple usage tips. I hope they are useful to everyone; I am still learning the library myself.

Why HTML Agility Pack? (hereinafter referred to as HAP)

.NET offers many options for parsing HTML files; Microsoft itself provides MSHTML for manipulating HTML. After some searching, however, HTML Agility Pack stood out: it is the most recommended C# HTML parser on Stack Overflow. HAP is open source, easy to use, and fast at parsing.

How do I use HAP?

1. Download http://htmlagilitypack.codeplex.com/

2. Unzip

3. In the Visual Studio solution, right-click the project, choose Add Reference, select HtmlAgilityPack.dll in the unzipped folder, and click OK

4. At the top of the code file, add: using HtmlAgilityPack;

Done! A first example, which loads a page and visits every link that has an href attribute:

HtmlWeb webClient = new HtmlWeb();
HtmlDocument doc = webClient.Load("http://xxx");
// SelectNodes returns null when nothing matches, hence the null check
HtmlNodeCollection hrefList = doc.DocumentNode.SelectNodes(".//a[@href]");
if (hrefList != null)
{
    foreach (HtmlNode href in hrefList)
    {
        HtmlAttribute att = href.Attributes["href"];
        DoSomething(att.Value);
    }
}

Q: How do I select HTML nodes by ID?

A: Use @id='xxx' in the XPath expression, for example:

HtmlNode bugSum = doc.DocumentNode.SelectSingleNode("//h2[@id='summary']");

Q: How do I get the text content or HTML content of a node?

A: Use the InnerText property for the text, and InnerHtml or OuterHtml for the markup:

node.InnerText.Trim()
node.InnerHtml
node.OuterHtml
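
The difference between the three properties is easiest to see on a small fragment. The following is a minimal sketch rather than code from the original article: the sample markup, the class name TextVsHtmlDemo, and the helper Describe are made up for illustration, and it assumes the HtmlAgilityPack assembly is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class TextVsHtmlDemo
{
    // Hypothetical helper: returns InnerText (trimmed), InnerHtml and OuterHtml
    // of the first node matching the XPath, or null when nothing matches.
    public static string[] Describe(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }
        return new[] { node.InnerText.Trim(), node.InnerHtml, node.OuterHtml };
    }

    public static void Main()
    {
        string[] parts = Describe(
            "<h2 id=\"summary\"> Bug <b>summary</b> </h2>",
            "//h2[@id='summary']");

        Console.WriteLine(parts[0]); // text only, tags stripped, whitespace trimmed
        Console.WriteLine(parts[1]); // markup inside the h2, without the h2 tag itself
        Console.WriteLine(parts[2]); // the h2 tag together with its content
    }
}
```

InnerText is usually what you want for display or comparison; InnerHtml and OuterHtml are useful when the nested tags matter.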

Q: How do I find nodes under the HTML tree structure?

A: For example, to find the first table under the div with id=container, starting from the root node:

HtmlNode table = doc.DocumentNode.SelectSingleNode("//div[@id='container']/table[1]");

Note that a leading "/" in the path starts the search from the root node. Two slashes "//" match child nodes at any depth, while a single slash "/" matches only the first layer of child nodes (that is, it does not look at grandchildren). A leading dot slash "./" starts the search from the current node rather than the root node. For example, the next line finds all tr elements that are direct children of the table:

HtmlNodeCollection tr = table.SelectNodes("./tr");
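
The scoping rules above can be checked on a small document. This is a sketch under assumptions: the markup, the class name XPathScopeDemo, and the helpers Count and FirstTable are invented for illustration. It relies on SelectNodes returning null by default when nothing matches, so the helper converts that to zero.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Sample markup invented for this sketch: one table inside the
    // container div and a second, unrelated table after it.
    public const string Html =
        "<div id=\"container\"><table>" +
        "<tr><td>a</td></tr><tr><td>b</td></tr>" +
        "</table></div>" +
        "<table><tr><td>other</td></tr></table>";

    // Hypothetical helper: number of nodes matched, 0 when SelectNodes returns null.
    public static int Count(HtmlNode start, string xpath)
    {
        HtmlNodeCollection nodes = start.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    // Hypothetical helper: the first table under the container div.
    public static HtmlNode FirstTable()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.DocumentNode.SelectSingleNode("//div[@id='container']/table[1]");
    }

    public static void Main()
    {
        HtmlNode table = FirstTable();
        Console.WriteLine(Count(table, "./tr"));  // rows that are direct children of this table
        Console.WriteLine(Count(table, "./td"));  // cells are grandchildren, so none match
        Console.WriteLine(Count(table, ".//td")); // cells anywhere below this table
        Console.WriteLine(Count(table, "//tr"));  // whole document, including the other table
    }
}
```

The last call is the one that most often surprises people: an expression starting with "//" searches the whole document even when it is called on a child node, so use ".//" to stay inside the current node.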

Q: How do I get the ID of a node?

A: Very simple: node.Id

Q: If the HTML is held in a string, can HTML Agility Pack still process it?

A: Yes. Load the string first, then work with the document the same way:

// Load the original HTML
string html = "Some HTML stuff";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

Q: I loaded some HTML and processed it, for example changing the content of some nodes and deleting others. Why hasn't the result changed?

A: You may have forgotten to save your changes back to HTML. Assuming the HTML is held in a string:

// Load the original HTML
string html = "Some HTML stuff";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Make some changes
DoSomething();
// Save the changes (needs using System.IO; and using System.Text;)
var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
{
    doc.Save(writer);
}
// sb.ToString() now holds the modified HTML; the original string is unchanged
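
To make the round trip concrete, here is a minimal sketch with an actual change in place of DoSomething. The class name SaveChangesDemo, the helper ReplaceContent, and the sample markup are assumptions made for illustration only.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class SaveChangesDemo
{
    // Hypothetical helper: replaces the content of the first node matching
    // the XPath and returns the whole document serialized back to a string.
    public static string ReplaceContent(string html, string xpath, string newInnerHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node != null)
        {
            node.InnerHtml = newInnerHtml; // the change lives in doc, not in the input string
        }

        var sb = new StringBuilder();
        using (var writer = new StringWriter(sb))
        {
            doc.Save(writer); // serialize the modified tree
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string original = "<h2 id=\"summary\">old</h2>";
        string updated = ReplaceContent(original, "//h2[@id='summary']", "new");

        Console.WriteLine(original); // still the old markup
        Console.WriteLine(updated);  // markup with the replaced content
    }
}
```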

Q: How do I remove the outer HTML tag and keep only its content?

A: Use the RemoveChild method. Suppose the node is <a href=xxx>abcd</a> and you want to keep abcd without the surrounding <a></a>. First get this HTML node; assuming it is called link:

link.ParentNode.RemoveChild(link, true);

The parameter true keeps the grandchildren, which here is the content abcd; false deletes the node together with its children.
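
A minimal sketch of both settings, assuming the sample markup shown in the code; the class name UnwrapDemo and the helper RemoveLink are hypothetical names introduced for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class UnwrapDemo
{
    // Hypothetical helper: removes the first link that has an href attribute
    // and returns the resulting markup of the whole document.
    public static string RemoveLink(string html, bool keepGrandChildren)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a[@href]");
        if (link != null)
        {
            link.ParentNode.RemoveChild(link, keepGrandChildren);
        }
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        string html = "<div><a href=\"xxx\">abcd</a></div>";

        Console.WriteLine(RemoveLink(html, true));  // the a tag is gone, abcd stays in the div
        Console.WriteLine(RemoveLink(html, false)); // the a tag and abcd are both gone
    }
}
```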

There are many more rules than these. The source code is available online, so you can study it yourself. Page source sometimes comes out garbled; this is a character set problem, and it can be solved by writing a method that detects the encoding automatically.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
