This is a very good library. I used to parse HTML with HtmlParser, which is handy but slow. I happened to come across the Html Agility Pack today, gave it a try, and was pleasantly surprised. I recommend it.
Here are some simple usage tips. I hope they are useful to everyone; I am still learning the library myself.
Why Html Agility Pack? (hereinafter referred to as HAP)
.NET offers many options for parsing HTML files, and Microsoft itself provides MSHTML for manipulating HTML. After some searching, however, the Html Agility Pack stood out: it is the most recommended C# HTML parser on Stack Overflow. HAP is open source, easy to use, and fast at parsing.
How do I use HAP?
1. Download http://htmlagilitypack.codeplex.com/
2. Unzip
3. In your Visual Studio solution, right-click the project, choose Add Reference, select HtmlAgilityPack.dll in the unzipped folder, and click OK
4. Add using HtmlAgilityPack; at the top of your code file
Done!
- HtmlWeb webClient = new HtmlWeb();
- HtmlDocument doc = webClient.Load("http://xxx");
- // Select every <a> element that has an href attribute
- HtmlNodeCollection hrefList = doc.DocumentNode.SelectNodes(".//a[@href]");
- // SelectNodes returns null when nothing matches
- if (hrefList != null)
- {
-     foreach (HtmlNode href in hrefList)
-     {
-         HtmlAttribute att = href.Attributes["href"];
-         DoSomething(att.Value);
-     }
- }
Q: How do I select HTML nodes by ID?
A: Use @id='xxx' in the XPath expression, e.g.,
- HtmlNode bugSum = doc.DocumentNode.SelectSingleNode("//h2[@id='summary']");
Q: How do I get the text content or HTML content of a node?
- node.InnerText.Trim()   // text content only, tags stripped
- node.InnerHtml          // HTML inside the node
- node.OuterHtml          // HTML including the node's own tags
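To see how the three properties differ, here is a minimal sketch run against a made-up snippet. The markup, the id "box", and the class name NodeContentDemo are my own assumptions for illustration only.

```csharp
using System;
using HtmlAgilityPack;

public static class NodeContentDemo
{
    // Loads a hypothetical snippet and returns the div with id "box".
    public static HtmlNode LoadBox()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"box\"> Hello <b>world</b> </div>");
        return doc.DocumentNode.SelectSingleNode("//div[@id='box']");
    }

    public static void Main()
    {
        HtmlNode node = LoadBox();
        Console.WriteLine(node.InnerText.Trim()); // text only: Hello world
        Console.WriteLine(node.InnerHtml);        // markup inside the div, without the div tags
        Console.WriteLine(node.OuterHtml);        // the div tags plus everything inside
    }
}
```

InnerText drops the <b> tags, while InnerHtml and OuterHtml keep them; only OuterHtml includes the div itself.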
Q: How do I find nodes under the HTML tree structure?
A: For example, to find the first table under the div with id=container, starting from the root node:
- HtmlNode table = doc.DocumentNode.SelectSingleNode("//div[@id='container']/table[1]");
Note that a path beginning with "//" searches from the root of the document. The double slash "//" matches descendants at any depth, while a single slash "/" matches only direct children (that is, it does not look at grandchildren). A leading dot slash "./" starts the search from the current node rather than the root node. For example, the next line finds all tr elements that are direct children of the table:
- HtmlNodeCollection tr = table.SelectNodes("./tr");
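The difference between "./", ".//" and "//" is easiest to see when the same node is queried with each. The following is a small sketch on made-up markup with two tables; the helper CountFromContainerTable and the sample HTML are hypothetical names of my own.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical sample: one table inside div#container, one table outside it.
    private const string Html =
        "<div id='container'><table><tr><td>inner</td></tr></table></div>" +
        "<table><tr><td>outer</td></tr></table>";

    // Evaluates the given XPath on the table inside #container and counts the matches.
    public static int CountFromContainerTable(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNode table = doc.DocumentNode.SelectSingleNode("//div[@id='container']/table[1]");
        HtmlNodeCollection rows = table.SelectNodes(xpath);
        return rows == null ? 0 : rows.Count; // SelectNodes returns null when nothing matches
    }

    public static void Main()
    {
        Console.WriteLine(CountFromContainerTable("./tr"));  // direct children of this table
        Console.WriteLine(CountFromContainerTable(".//tr")); // descendants of this table
        Console.WriteLine(CountFromContainerTable("//tr"));  // every tr in the whole document
    }
}
```

Even though the last query is called on the table node, its leading "//" makes it search the whole document, so it also picks up the row of the second table. When you mean "below this node", start the path with a dot.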
Q: How do I get the ID of a node?
A: Very simple: node.Id
Q: If the HTML is held in a string, can the Html Agility Pack still process it?
A: Yes. Load the string first, then work with it in the same way:
- // Load the original HTML
- string html = "Some HTML stuff";
- HtmlDocument doc = new HtmlDocument();
- doc.LoadHtml(html);
Q: I loaded some HTML and processed it, for example changing the content of some nodes and deleting others. Why hasn't the result changed?
A: Maybe you forgot to save your changes back to HTML. Assuming the HTML is held in a string:
- // Load the original HTML
- string html = "Some HTML stuff";
- HtmlDocument doc = new HtmlDocument();
- doc.LoadHtml(html);
- // Make some changes
- DoSomething();
- // Save the changes; the updated HTML ends up in sb
- var sb = new StringBuilder();
- using (var writer = new StringWriter(sb))
- {
-     doc.Save(writer);
- }
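Here is a complete round trip as a rough sketch, assuming a made-up snippet with an h2 to rewrite and a span to delete. The method name EditAndSave, the ids, and the class names are hypothetical; only the load, change, and save pattern comes from the steps above.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class SaveChangesDemo
{
    // Loads html, edits two nodes, and returns the saved result as a new string.
    public static string EditAndSave(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Change the content of one node
        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h2[@id='summary']");
        if (heading != null) heading.InnerHtml = "New";

        // Delete another node entirely
        HtmlNode banner = doc.DocumentNode.SelectSingleNode("//span[@class='banner']");
        if (banner != null) banner.ParentNode.RemoveChild(banner);

        // Write the modified document out; the input string itself is never touched
        var sb = new StringBuilder();
        using (var writer = new StringWriter(sb))
        {
            doc.Save(writer);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string original = "<div><h2 id=\"summary\">Old</h2><span class=\"banner\">advert</span></div>";
        string updated = EditAndSave(original);
        Console.WriteLine(original); // still the unmodified input
        Console.WriteLine(updated);  // reflects the edits
    }
}
```

The point is that the original string stays as it was; the changes live in the HtmlDocument until you write them out. If you only need the string, reading doc.DocumentNode.OuterHtml should work as well.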
Q: How do I get rid of the outer HTML tag leaving only content?
A: Use the RemoveChild method. Suppose the node is <a href=xxx>abcd</a> and you want to keep abcd without the <a></a> tags. You first need to get this HTML node; assuming it is called link:
- link.ParentNode.RemoveChild(link, true);
The parameter true means the removed node's children are kept (they are grandchildren of the parent), which here is the content abcd; false means the node is deleted together with its children.
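A small sketch comparing the two values of the second parameter, again on a made-up snippet. The surrounding div and the helper name RemoveLink are my own assumptions.

```csharp
using System;
using HtmlAgilityPack;

public static class UnwrapDemo
{
    // Removes the <a> element from a hypothetical snippet and returns the resulting HTML.
    public static string RemoveLink(bool keepGrandChildren)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div>see <a href=\"xxx\">abcd</a> now</div>");
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a");
        link.ParentNode.RemoveChild(link, keepGrandChildren);
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(RemoveLink(true));  // the <a> tags are gone, the text abcd stays
        Console.WriteLine(RemoveLink(false)); // the <a> element and its text are both gone
    }
}
```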
There are many more rules than these. The source code is available online if you want to study it. You may also run into garbled text, which is a character-set problem; writing a method that detects the encoding automatically is enough to solve it.
Open source project HTML Agility Pack for fast parsing of HTML