Parsing HTML with the HTML Agility Pack

Source: Internet
Author: User

Hello, long time no see! Today I want to share a class library for parsing HTML: the HTML Agility Pack. It is useful when you want to pull some content out of a web page. As an example, I will fetch the article list from my CSDN blog.

Open the page and use Firebug to locate the content area of the article list.

As shown in the picture above, we have found where the desired content sits in the HTML.

So the first step is to download the HTML, and the second is to use the HTML Agility Pack to pick out what we want.

1. Get the HTML from the web page

    #region Get the article list + GetHtml(string url)
    /// <summary>
    /// Gets the list of articles. Added: Shuaibi, 2015-03-08
    /// </summary>
    /// <param name="url">Page address</param>
    /// <returns>List of articles</returns>
    public List<Model> GetHtml(string url)
    {
        var myWebClient = new WebClient();
        var myStream = myWebClient.OpenRead(url);
        var list = GetMessage(myStream); // calls the GetMessage method from step 2
        if (myStream != null) myStream.Close();
        return list;
    }
    #endregion

2. Use the HTML Agility Pack to find what we want

    #region Process article information + GetMessage(Stream myStream)
    /// <summary>
    /// Processes the article information. Added: Shuaibi, 2015-03-08
    /// </summary>
    /// <param name="myStream">Web page data stream</param>
    /// <returns></returns>
    private static List<Model> GetMessage(Stream myStream)
    {
        var document = new HtmlDocument();
        document.Load(myStream, Encoding.UTF8);
        var rootNode = document.DocumentNode;
        var messageNodeList = rootNode.SelectNodes(MessageListXPath);
        return messageNodeList
            // Re-parse each item so that it becomes the root of its own small document
            .Select(messageNode => HtmlNode.CreateNode(messageNode.OuterHtml))
            .Select(temp => new Model
            {
                Title = temp.SelectSingleNode(MessageNameXPath).InnerText,
                Href = "http://blog.csdn.net" + temp.SelectSingleNode(MessageNameXPath).Attributes["href"].Value,
                Content = temp.SelectSingleNode(MessageContxtXPath).InnerText,
                Time = Convert.ToDateTime(temp.SelectSingleNode(MessageTimeXPath).InnerText),
                ComeFrom = "csdn"
            })
            .ToList();
    }
    #endregion

If you read the methods and steps above carefully, you may have noticed a couple of gaps. Haha, I have talked for quite a while without saying how to get the library, or what those few variables in the second method are.

First, how to get the HTML Agility Pack: download it from http://htmlagilitypack.codeplex.com/. After unpacking the archive you will find HtmlAgilityPack.dll; in your project, right-click and add a reference to it.

Next, the question of those few variables.

1. The following line selects every div whose class starts with list_item article_item

    /// <summary>
    /// Gets the list of articles
    /// </summary>
    private const string MessageListXPath = "//div[starts-with(@class, 'list_item article_item')]";

2. The following line gets the title of each item in the collection obtained above

    /// <summary>
    /// Gets the title. Explanation: the first div, then its first div, then the first h1, then the first span, then the first a tag
    /// </summary>
    private const string MessageNameXPath = "/div[1]/div[1]/h1[1]/span[1]/a[1]";

3. Same as above; this one gets the content

    /// <summary>
    /// Gets the content. Explanation: the first div, then its second div
    /// </summary>
    private const string MessageContxtXPath = "/div[1]/div[2]";

4. This one gets the release time

    /// <summary>
    /// Gets the time. Explanation: the first div, then its third div, then the span
    /// </summary>
    private const string MessageTimeXPath = "/div[1]/div[3]/span";

The XPath expressions above are based on the page structure shown in the first image.

This is only my second post, so please forgive anything I explained poorly. I hope we can become friends and make progress together, hehe.

Finally, here is the code for the entity class
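To see how these four expressions behave, here is a minimal sketch that runs them against a small made-up HTML fragment rather than the real CSDN page. The markup in SampleHtml, the class name XPathSketch, and the Parse method are assumptions for illustration only, not part of the original project. Unlike the GetMessage method above, it uses relative XPath (starting with ./) from each item node instead of re-creating every node with HtmlNode.CreateNode, and it returns an empty list when nothing matches, because SelectNodes returns null in that case.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using HtmlAgilityPack; // requires a reference to HtmlAgilityPack.dll

public static class XPathSketch
{
    // Made-up markup: it only imitates the nesting described above.
    public const string SampleHtml = @"
<div id='article_list'>
  <div class='list_item article_item'>
    <div><h1><span><a href='/demo/article/1'>First post</a></span></h1></div>
    <div>Summary of the first post</div>
    <div><span>2015-03-08 10:00</span></div>
  </div>
  <div class='list_item article_item'>
    <div><h1><span><a href='/demo/article/2'>Second post</a></span></h1></div>
    <div>Summary of the second post</div>
    <div><span>2015-03-09 21:30</span></div>
  </div>
</div>";

    public static List<(string Title, string Href, string Content, DateTime Time)> Parse(string html)
    {
        var result = new List<(string Title, string Href, string Content, DateTime Time)>();

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Every div whose class attribute starts with "list_item article_item".
        var items = document.DocumentNode.SelectNodes(
            "//div[starts-with(@class, 'list_item article_item')]");
        if (items == null) return result; // SelectNodes returns null when nothing matches

        foreach (var item in items)
        {
            // Relative paths: "./" starts at the current item div.
            var link = item.SelectSingleNode("./div[1]/h1[1]/span[1]/a[1]");
            var content = item.SelectSingleNode("./div[2]");
            var time = item.SelectSingleNode("./div[3]/span");
            if (link == null || content == null || time == null) continue; // skip unexpected markup

            result.Add((
                link.InnerText.Trim(),
                "http://blog.csdn.net" + link.GetAttributeValue("href", ""),
                content.InnerText.Trim(),
                DateTime.ParseExact(time.InnerText.Trim(), "yyyy-MM-dd HH:mm",
                                    CultureInfo.InvariantCulture)));
        }
        return result;
    }

    public static void Main()
    {
        foreach (var a in Parse(SampleHtml))
            Console.WriteLine($"{a.Time:yyyy-MM-dd HH:mm} | {a.Title} | {a.Href} | {a.Content}");
    }
}
```

The null checks matter on a real page: if the site changes its layout, SelectSingleNode returns null and code that reads InnerText directly would throw a NullReferenceException.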

    using System;

    namespace MessageHelper
    {
        public class Model
        {
            public string Title { get; set; }     // Title
            public string Content { get; set; }   // Content
            public string Href { get; set; }      // Article link
            public string ComeFrom { get; set; }  // Source
            public DateTime Time { get; set; }    // Release time
        }
    }
