Parsing HTML with Html Agility Pack
It has been a long time. Haha, today I will share Html Agility Pack, a class library for parsing HTML. It is useful when you want to extract part of the content of a web page. As the example, I will use the article list of my CSDN blog.
Open the page and use Firebug to find the content area of the article list.
As shown in the preceding figure, we have found where the desired content sits in the HTML.
The plan is to fetch the HTML first and then use Html Agility Pack to pick out what we want.
1. Obtain the Html of the webpage
// Requires: using System.Collections.Generic; using System.Net;

#region Get article list + GetHtml(string url)
/// <summary>
/// Get the article list. Added by shuaibi.
/// </summary>
/// <param name="url">Page address</param>
/// <returns>Article list</returns>
public List<Model> GetHtml(string url)
{
    // WebClient and the stream are both IDisposable, so dispose them when done.
    using (var myWebClient = new WebClient())
    using (var myStream = myWebClient.OpenRead(url))
    {
        return GetMessage(myStream); // GetMessage is the method shown in step 2
    }
}
#endregion
2. Use Html Agility Pack to find out what we want
// Requires: using System; using System.Collections.Generic; using System.IO;
//           using System.Linq; using System.Text; using HtmlAgilityPack;

#region Process article information + GetMessage(Stream myStream)
/// <summary>
/// Process the document information. Added by shuaibi, 2015-03-08.
/// </summary>
/// <param name="myStream">Web page data stream</param>
/// <returns>Article list</returns>
private static List<Model> GetMessage(Stream myStream)
{
    var document = new HtmlDocument();
    document.Load(myStream, Encoding.UTF8);
    var rootNode = document.DocumentNode;
    var messageNodeList = rootNode.SelectNodes(MessageListXPath);

    // SelectNodes can return null instead of an empty collection when nothing matches.
    if (messageNodeList == null) return new List<Model>();

    return messageNodeList
        // Re-parse each item as its own fragment so that the XPaths
        // beginning with "/" are resolved against that item alone.
        .Select(messageNode => HtmlNode.CreateNode(messageNode.OuterHtml))
        .Select(temp => new Model
        {
            Title = temp.SelectSingleNode(MessageNameXPath).InnerText,
            Href = "http://blog.csdn.net" + temp.SelectSingleNode(MessageNameXPath).Attributes["href"].Value,
            Content = temp.SelectSingleNode(MessageContxtXPath).InnerText,
            Time = Convert.ToDateTime(temp.SelectSingleNode(MessageTimeXPath).InnerText),
            ComeFrom = "csdn"
        })
        .ToList();
}
#endregion
After reading the two methods above, did you notice anything missing? Haha, I have not yet said how to get this class library, or what the constants used in the second method are.
First, how to get Html Agility Pack: download it from http://htmlagilitypack.codeplex.com/, unzip the package to find HtmlAgilityPack.dll, then right-click the project and choose Add Reference.
Then there are the constants.
1. The following expression selects every div whose class attribute starts with list_item article_item.
/// <summary>
/// Select the article list
/// </summary>
private const string MessageListXPath = "//div[starts-with(@class,'list_item article_item')]";
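To see what the starts-with() predicate does and does not match, here is a small sketch that runs the same expression against a hand-written snippet. It uses System.Xml from the .NET base class library instead of Html Agility Pack, which only works because the sample is well-formed XML. The sample markup and the names XPathListDemo, SampleMarkup, and MatchingClasses are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathListDemo
{
    // Same expression as MessageListXPath in the article.
    public const string MessageListXPath =
        "//div[starts-with(@class,'list_item article_item')]";

    // Hypothetical, well-formed stand-in for the blog list page.
    public const string SampleMarkup =
        "<body>" +
        "<div class='list_item article_item'>first</div>" +
        "<div class='list_item article_item odd'>second</div>" +
        "<div class='sidebar list_item article_item'>prefix differs</div>" +
        "<div class='list_item'>too short</div>" +
        "</body>";

    // Returns the class attribute of every div selected by the expression.
    public static List<string> MatchingClasses(string markup)
    {
        var document = new XmlDocument();
        document.LoadXml(markup);

        var result = new List<string>();
        foreach (XmlNode node in document.SelectNodes(MessageListXPath))
        {
            result.Add(node.Attributes["class"].Value);
        }
        return result;
    }
}
```

Calling XPathListDemo.MatchingClasses(XPathListDemo.SampleMarkup) keeps only the first two divs. The predicate compares the whole attribute string from its first character, so a div with extra classes after the prefix still matches, while one with a class placed before it does not.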
2. The following expression selects the title of each item in the set obtained above.
/// <summary>
/// Select the title. Explanation: under the first div, take the first div,
/// then the first h1, then the first span, then the first a tag.
/// </summary>
private const string MessageNameXPath = "/div[1]/div[1]/h1[1]/span[1]/a[1]";
3. In the same way, this expression selects the content of each item.
/// <summary>
/// Select the content. Explanation: under the first div, take the second div.
/// </summary>
private const string MessageContxtXPath = "/div[1]/div[2]";
4. Obtain the release time.
/// <summary>
/// Select the time. Explanation: under the first div, take the span inside the third div.
/// </summary>
private const string MessageTimeXPath = "/div[1]/div[3]/span";
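The three per-item paths all begin with "/", so they are resolved from the root of whatever document the node belongs to. That is why each list item is re-parsed as a fragment of its own before they are applied. The sketch below imitates that step with System.Xml rather than Html Agility Pack. The item markup, the inner class names, and the names ItemXPathDemo, SampleItem, and Extract are assumptions for illustration, not the real CSDN page.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ItemXPathDemo
{
    // Same expressions as in the article.
    public const string MessageNameXPath = "/div[1]/div[1]/h1[1]/span[1]/a[1]";
    public const string MessageContxtXPath = "/div[1]/div[2]";
    public const string MessageTimeXPath = "/div[1]/div[3]/span";

    // Hypothetical markup for a single list item, kept well-formed for XmlDocument.
    public const string SampleItem =
        "<div class='list_item article_item'>" +
        "<div class='article_title'><h1><span><a href='/u/post/1'>Hello HAP</a></span></h1></div>" +
        "<div class='article_description'>A short summary.</div>" +
        "<div class='article_manage'><span>2015-03-08 10:30</span></div>" +
        "</div>";

    // Loads one item as a standalone document and reads the four fields from it.
    public static Dictionary<string, string> Extract(string itemMarkup)
    {
        var fragment = new XmlDocument();
        fragment.LoadXml(itemMarkup);

        // The item div is now the document root, so "/div[1]" addresses it directly.
        XmlNode link = fragment.SelectSingleNode(MessageNameXPath);

        return new Dictionary<string, string>
        {
            { "Title", link.InnerText },
            { "Href", "http://blog.csdn.net" + link.Attributes["href"].Value },
            { "Content", fragment.SelectSingleNode(MessageContxtXPath).InnerText },
            { "Time", fragment.SelectSingleNode(MessageTimeXPath).InnerText }
        };
    }
}
```

ItemXPathDemo.Extract(ItemXPathDemo.SampleItem) returns the title text, the absolute link, the summary, and the raw time string. Positional paths like these are short to write but brittle: if the site inserts one extra div, every index shifts, so selecting by class name would likely survive layout changes better.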
The code above is based on the page structure shown in the first image.
This is only my second post, so please forgive any shortcomings. I hope we can all make progress together.
The code for the entity class is attached below.
using System;

namespace MessageHelper
{
    public class Model
    {
        public string Title { get; set; }     // Title
        public string Content { get; set; }   // Content
        public string Href { get; set; }      // Article link
        public string ComeFrom { get; set; }  // Source
        public DateTime Time { get; set; }    // Release time
    }
}