Hello, long time no see, haha. Today I'd like to share a class library for parsing HTML: the HTML Agility Pack. It is useful when you want to pull some content out of a web page. As an example, I'll extract the list of articles from my CSDN blog.
Open the page and use Firebug to find the content area of the article list.
As shown in the picture above, we have found where the content we want sits in the HTML.
So the first step is to get the HTML, and the second is to use the HTML Agility Pack to find what we want.
1. Get the HTML from the web page
// Requires: using System.Collections.Generic; using System.Net;
#region Get the article list +GetHtml(string url)
/// <summary>
/// Gets the list of articles. Added by Shuaibi, 2015-03-08
/// </summary>
/// <param name="url">Page address</param>
/// <returns>List of articles</returns>
public List<Model> GetHtml(string url)
{
    // The using blocks dispose the client and close the stream even if parsing throws.
    using (var myWebClient = new WebClient())
    using (var myStream = myWebClient.OpenRead(url))
    {
        return GetMessage(myStream); // calls the method shown in step 2
    }
}
#endregion
2. Use the HTML Agility Pack to find what we want
// Requires: using System; using System.Collections.Generic; using System.IO;
//           using System.Linq; using System.Text; using HtmlAgilityPack;
#region Process article information +GetMessage(Stream myStream)
/// <summary>
/// Processes article information. Added by Shuaibi, 2015-03-08
/// </summary>
/// <param name="myStream">Web page data stream</param>
/// <returns>List of articles</returns>
private static List<Model> GetMessage(Stream myStream)
{
    var document = new HtmlDocument();
    document.Load(myStream, Encoding.UTF8);
    var rootNode = document.DocumentNode;
    var messageNodeList = rootNode.SelectNodes(MessageListXPath);
    // SelectNodes returns null (not an empty collection) when nothing matches.
    if (messageNodeList == null) return new List<Model>();
    return messageNodeList
        // Re-parse each item as its own document so the XPaths starting with "/div[1]" resolve against that item.
        .Select(messageNode => HtmlNode.CreateNode(messageNode.OuterHtml))
        .Select(temp => new Model
        {
            Title = temp.SelectSingleNode(MessageNameXPath).InnerText,
            Href = "http://blog.csdn.net" + temp.SelectSingleNode(MessageNameXPath).Attributes["href"].Value,
            Content = temp.SelectSingleNode(MessageContxtXPath).InnerText,
            Time = Convert.ToDateTime(temp.SelectSingleNode(MessageTimeXPath).InnerText),
            ComeFrom = "csdn"
        }).ToList();
}
#endregion
If you read the methods and steps above carefully, did you notice a problem? Haha, after all this talk I still haven't said how to get the library, or what the variables used in the second method are.
First, how to get the HTML Agility Pack: download it from http://htmlagilitypack.codeplex.com/. After unpacking the archive you will find HtmlAgilityPack.dll; right-click the project and add a reference to it.
Then there is the question of those variables.
1. The following constant selects every div whose class starts with "list_item article_item"
/// <summary>
/// XPath for the list of articles
/// </summary>
private const string MessageListXPath = "//div[starts-with(@class, 'list_item article_item')]";
2. The following constant selects the title of each item in the collection obtained above
/// <summary>
/// XPath for the title. Explanation: the first div, then its first div, then the first h1, then the first span, then the first a tag
/// </summary>
private const string MessageNameXPath = "/div[1]/div[1]/h1[1]/span[1]/a[1]";
3. Same idea as above. This one gets the content
/// <summary>
/// XPath for the content. Explanation: the first div, then its second div
/// </summary>
private const string MessageContxtXPath = "/div[1]/div[2]";
4. This one gets the release time
/// <summary>
/// XPath for the time. Explanation: the first div, then its third div, then the span
/// </summary>
private const string MessageTimeXPath = "/div[1]/div[3]/span";
The XPath expressions above are based on the structure shown in the first picture.
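To try the four XPath expressions without going to the network, here is a minimal sketch that runs them against an inline HTML string. The markup in SampleHtml, the class name XPathDemo and the method Extract are made up for illustration; the sample only imitates the div layout described above and is not copied from the real CSDN page. It assumes the HtmlAgilityPack assembly is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical markup: one list item holding a title div, a content div and a time div.
    public const string SampleHtml =
        "<html><body>" +
        "<div class='list_item article_item'>" +
        "<div><h1><span><a href='/demo/article/details/1'>First post</a></span></h1></div>" +
        "<div>Summary of the first post</div>" +
        "<div><span>2015-03-08 10:00</span></div>" +
        "</div>" +
        "</body></html>";

    // Returns one row per list item: title, href, content, time (all as strings).
    public static List<string[]> Extract(string html)
    {
        var rows = new List<string[]>();
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var items = document.DocumentNode.SelectNodes(
            "//div[starts-with(@class, 'list_item article_item')]");
        if (items == null) return rows; // SelectNodes returns null when nothing matches

        foreach (var item in items)
        {
            // Re-parse the item so that "/div[1]" refers to this item's own div.
            var single = HtmlNode.CreateNode(item.OuterHtml);
            var link = single.SelectSingleNode("/div[1]/div[1]/h1[1]/span[1]/a[1]");
            var content = single.SelectSingleNode("/div[1]/div[2]");
            var time = single.SelectSingleNode("/div[1]/div[3]/span");
            if (link == null || content == null || time == null) continue; // skip items with a different layout

            rows.Add(new[]
            {
                link.InnerText,
                link.GetAttributeValue("href", ""),
                content.InnerText,
                time.InnerText
            });
        }
        return rows;
    }

    public static void Main()
    {
        foreach (var row in Extract(SampleHtml))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

Skipping items whose layout does not match is one possible choice; with positional XPaths like these, a small change in the page structure makes SelectSingleNode return null, so some guard of this kind is usually worth having.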
This is only my second post, so please forgive anything I explained badly. I hope we can become friends and improve together, hehe.
Finally, here is the code for the entity class.
using System;

namespace MessageHelper
{
    public class Model
    {
        public string Title { get; set; }     // Title
        public string Content { get; set; }   // Content
        public string Href { get; set; }      // Article link
        public string ComeFrom { get; set; }  // Source
        public DateTime Time { get; set; }    // Release time
    }
}