Parsing HTML with Html Agility Pack

Source: Internet
Author: User

Hello, it has been a long time. Today I will share Html Agility Pack, a class library for parsing HTML. It is useful when you need to extract part of the content of a web page. As an example, I will use the article list of my CSDN blog.

Open the page and use Firebug to locate the content area of the article list.

As shown in the preceding figure, we have found where the desired content sits in the HTML.

There are two steps: first fetch the HTML of the page, then use Html Agility Pack to pick out the parts we want.

1. Obtain the Html of the webpage

    // Required using directives (top of file):
    using System.Collections.Generic;
    using System.Net;

    #region Get article list + GetHtml(string url)
    /// <summary>
    /// Gets the article list. Added by shuaibi.
    /// </summary>
    /// <param name="url">Page address</param>
    /// <returns>Article list</returns>
    public List<Model> GetHtml(string url)
    {
        // Both the WebClient and the response stream are disposed when the blocks exit.
        using (var myWebClient = new WebClient())
        using (var myStream = myWebClient.OpenRead(url))
        {
            return GetMessage(myStream); // GetMessage is the method shown in step 2
        }
    }
    #endregion

2. Use Html Agility Pack to find out what we want

    // Required using directives (top of file):
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    using HtmlAgilityPack;

    #region Process article information + GetMessage(Stream myStream)
    /// <summary>
    /// Processes the document information. Added by shuaibi, 2015-03-08.
    /// </summary>
    /// <param name="myStream">Web page data stream</param>
    /// <returns>Article list</returns>
    private static List<Model> GetMessage(Stream myStream)
    {
        var document = new HtmlDocument();
        document.Load(myStream, Encoding.UTF8);
        var rootNode = document.DocumentNode;

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var messageNodeList = rootNode.SelectNodes(MessageListXPath);
        if (messageNodeList == null) return new List<Model>();

        return messageNodeList
            // Re-parse each item as its own fragment, so the absolute XPath
            // constants ("/div[1]/...") are evaluated against that item only.
            .Select(messageNode => HtmlNode.CreateNode(messageNode.OuterHtml))
            .Select(temp => new Model
            {
                Title = temp.SelectSingleNode(MessageNameXPath).InnerText,
                Href = "http://blog.csdn.net" + temp.SelectSingleNode(MessageNameXPath).Attributes["href"].Value,
                Content = temp.SelectSingleNode(MessageContxtXPath).InnerText,
                Time = Convert.ToDateTime(temp.SelectSingleNode(MessageTimeXPath).InnerText),
                ComeFrom = "csdn"
            })
            .ToList();
    }
    #endregion

Having read the two methods above, you may have noticed two gaps: I have not yet said how to get the class library, and the XPath variables used in the second method have not been defined.

To get Html Agility Pack, download it from http://htmlagilitypack.codeplex.com/ and decompress the package; it contains HtmlAgilityPack.dll. In your project, right-click and choose Add Reference to reference that DLL.

Next, the variables used in the second method.

1. The following expression selects every div whose class attribute starts with list_item article_item.

    /// <summary>
    /// XPath for the article list
    /// </summary>
    private const string MessageListXPath = "//div[starts-with(@class,'list_item article_item')]";
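The starts-with test can be tried in isolation. The following is a minimal sketch, not the author's code: it evaluates the same expression with System.Xml.Linq against a small, well-formed sample instead of Html Agility Pack. The sample markup, the class name StartsWithDemo, and the helper MatchingTexts are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class StartsWithDemo
{
    // Hypothetical, well-formed sample markup; this is not the real CSDN page.
    public const string Sample =
        "<root>" +
        "<div class='list_item article_item'>first</div>" +
        "<div class='list_item article_item odd'>second</div>" +
        "<div class='sidebar'>ignored</div>" +
        "</root>";

    // Returns the text of every element matched by the XPath expression.
    public static List<string> MatchingTexts(string xml, string xpath)
    {
        var doc = XDocument.Parse(xml);
        return doc.XPathSelectElements(xpath).Select(e => e.Value).ToList();
    }

    public static void Main()
    {
        var texts = MatchingTexts(
            Sample, "//div[starts-with(@class,'list_item article_item')]");
        Console.WriteLine(string.Join(", ", texts));
    }
}
```

Because starts-with compares only the beginning of the attribute value, a div carrying an extra trailing class still matches, while a div whose class list begins with something else does not.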

2. The following expression selects the title of each item in the set obtained above.

    /// <summary>
    /// XPath for the title. Explanation: in the first div, take the first div,
    /// then the first h1, the first span, and the first a tag under it.
    /// </summary>
    private const string MessageNameXPath = "/div[1]/div[1]/h1[1]/span[1]/a[1]";

3. As above, this one selects the content to extract.

    /// <summary>
    /// XPath for the content. Explanation: the second div inside the first div.
    /// </summary>
    private const string MessageContxtXPath = "/div[1]/div[2]";

4. Obtain the release time.

    /// <summary>
    /// XPath for the time. Explanation: the span inside the third div of the first div.
    /// </summary>
    private const string MessageTimeXPath = "/div[1]/div[3]/span";
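To see how the three positional paths above resolve, here is a rough sketch under stated assumptions. It uses System.Xml.Linq rather than Html Agility Pack, and the sample item (its class names, link, and text) is invented to mirror the structure the comments describe, not copied from the real page. The item is parsed as its own document, so /div[1] refers to the item itself.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class ItemPathDemo
{
    // Hypothetical single list item; the real page markup may differ.
    public const string SampleItem =
        "<div class='list_item article_item'>" +
          "<div class='article_title'><h1><span>" +
            "<a href='/someuser/article/details/1'>Sample title</a>" +
          "</span></h1></div>" +
          "<div class='article_description'>Sample summary</div>" +
          "<div class='article_manage'><span>2015-03-08 10:00</span></div>" +
        "</div>";

    // Parses the fragment as a standalone document and returns the first match (or null).
    public static XElement Select(string xml, string xpath)
    {
        return XDocument.Parse(xml).XPathSelectElement(xpath);
    }

    public static void Main()
    {
        var link = Select(SampleItem, "/div[1]/div[1]/h1[1]/span[1]/a[1]");
        Console.WriteLine(link.Value);                     // title text
        Console.WriteLine((string)link.Attribute("href")); // relative link
        Console.WriteLine(Select(SampleItem, "/div[1]/div[2]").Value);      // content
        Console.WriteLine(Select(SampleItem, "/div[1]/div[3]/span").Value); // time text
    }
}
```

Positional paths like these are short but brittle: if the site inserts one more div before the title block, every index shifts, so it is worth re-checking them whenever the page layout changes.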

The XPath expressions above are based on the structure shown in the first figure.

Please forgive any mistakes in this post. I hope we can all make progress together.

The code of the entity class is attached below.

    using System;

    namespace MessageHelper
    {
        public class Model
        {
            public string Title { get; set; }    // Title
            public string Content { get; set; }  // Content
            public string Href { get; set; }     // Article link
            public string ComeFrom { get; set; } // Source
            public DateTime Time { get; set; }   // Release time
        }
    }
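As a small illustration of how the extracted strings end up in this class, here is a sketch under assumptions. ModelFactory, FromRawText, and the timestamp layout "yyyy-MM-dd HH:mm" are hypothetical and not part of the original code; the original calls Convert.ToDateTime, which depends on the current culture, whereas this sketch parses with an explicit format.

```csharp
using System;
using System.Globalization;

public class Model
{
    public string Title { get; set; }
    public string Content { get; set; }
    public string Href { get; set; }
    public string ComeFrom { get; set; }
    public DateTime Time { get; set; }
}

public static class ModelFactory
{
    // Site prefix, as used in the article's GetMessage method.
    private const string BaseUrl = "http://blog.csdn.net";

    // Assumed timestamp layout; adjust it to whatever the page actually renders.
    private const string TimeFormat = "yyyy-MM-dd HH:mm";

    // Builds a Model from the raw strings taken out of the page.
    public static Model FromRawText(string title, string relativeHref,
                                    string content, string timeText)
    {
        return new Model
        {
            Title = title.Trim(),
            Href = BaseUrl + relativeHref,
            Content = content.Trim(),
            Time = DateTime.ParseExact(timeText.Trim(), TimeFormat,
                                       CultureInfo.InvariantCulture),
            ComeFrom = "csdn"
        };
    }

    public static void Main()
    {
        var m = FromRawText("  Sample title ", "/someuser/article/details/1",
                            " Sample summary ", "2015-03-08 10:00");
        Console.WriteLine(m.Title);
        Console.WriteLine(m.Href);
        Console.WriteLine(m.Time.Year);
    }
}
```

Parsing with a fixed format fails loudly (FormatException) when the page changes its date layout, which is usually easier to diagnose than a silently misread date.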

 
