Summary
During development you will often run into this situation: the server returns HTML, but the client needs to display plain text, so the HTML has to be parsed to extract the text inside it. There are many ways to do this, such as writing your own regular expressions, but for markup that follows no fixed pattern that approach is awkward. HTML Agility Pack is an open source component that parses HTML quickly and lets you query it with XPath.
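As a minimal sketch of the "HTML in, plain text out" case described above, the following assumes the HtmlAgilityPack NuGet package is referenced. The class name PlainTextDemo, the helper ToPlainText, and the sample markup are made up for illustration; HtmlDocument.LoadHtml, DocumentNode.InnerText and HtmlEntity.DeEntitize are library members.

```csharp
using System;
using HtmlAgilityPack;

static class PlainTextDemo
{
    // Hypothetical helper: returns the text content of an HTML fragment.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of the network

        // InnerText concatenates all text nodes under the root;
        // DeEntitize converts entities such as &amp; back to characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }

    static void Main()
    {
        // Made-up markup, for illustration only
        string html = "<div>Tom &amp; Jerry <b>cartoon</b></div>";
        Console.WriteLine(ToPlainText(html)); // Tom & Jerry cartoon
    }
}
```

Note that InnerText joins the text of neighboring elements without adding separators, so real pages may need extra whitespace handling.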
An example
Component URL: http://htmlagilitypack.codeplex.com/. The component can be installed through NuGet.
As an example, let's parse the article list on the cnblogs home page. Looking at the HTML of that list shows that each article title is a link (an a element) whose class is titlelnk.
Get the titles of all articles
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;

namespace HtmlAgilityPackDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize the web request client
            HtmlWeb webClient = new HtmlWeb();
            // Download the page and load it into a document
            HtmlDocument doc = webClient.Load("http://www.cnblogs.com/");
            // Find the title links; SelectNodes returns null when nothing matches
            HtmlNodeCollection titleNodes = doc.DocumentNode.SelectNodes("//a[@class='titlelnk']");
            if (titleNodes != null)
            {
                foreach (var item in titleNodes)
                {
                    Console.WriteLine(item.InnerText);
                }
            }
            // Keep the console window open
            Console.Read();
        }
    }
}
Output: the title of each article on the home page is printed on its own line.
I once wrote a small tool that did this kind of matching with regular expressions. Compared with using this component, that was far more troublesome.
In the code above, the [@class='xxx'] part selects nodes by an attribute of the HTML tag. Other attributes work the same way; for example, to search by ID you can write h3[@id='xxxx'].
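The attribute filters can be tried offline on a small string. This is only a sketch: the class AttributeSelectDemo, the helper FirstText, and the sample markup (including the id value news) are assumptions for illustration, while LoadHtml and SelectSingleNode come from the library.

```csharp
using System;
using HtmlAgilityPack;

static class AttributeSelectDemo
{
    // Hypothetical helper: text of the first node matching the XPath, or null if none.
    public static string FirstText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    static void Main()
    {
        // Made-up markup, for illustration only
        string html =
            "<div>" +
            "<h3 id='news'>Latest news</h3>" +
            "<a class='titlelnk' href='/post/1'>First post</a>" +
            "</div>";

        Console.WriteLine(FirstText(html, "//h3[@id='news']"));       // Latest news
        Console.WriteLine(FirstText(html, "//a[@class='titlelnk']")); // First post
    }
}
```

Keep in mind that [@class='titlelnk'] compares the whole attribute value, so an element with several classes would need something like contains(@class, 'titlelnk') instead.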
The content of a node can be obtained in the following ways:
node.InnerText
node.InnerHtml
node.OuterHtml
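To see how the three properties differ, here is a small sketch run against a made-up fragment. The class NodeContentDemo, the helper Describe, and the markup are assumptions for illustration only.

```csharp
using System;
using HtmlAgilityPack;

static class NodeContentDemo
{
    // Hypothetical helper: returns InnerText, InnerHtml and OuterHtml of the first match.
    public static string[] Describe(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return new[] { node.InnerText, node.InnerHtml, node.OuterHtml };
    }

    static void Main()
    {
        string html = "<div id=\"box\">Hello <b>world</b></div>";
        string[] parts = Describe(html, "//div[@id='box']");

        Console.WriteLine(parts[0]); // text only:           Hello world
        Console.WriteLine(parts[1]); // markup inside node:  Hello <b>world</b>
        Console.WriteLine(parts[2]); // node plus its tag:   <div id="box">Hello <b>world</b></div>
    }
}
```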
[C#] Parsing HTML with HTML Agility Pack