Previously, I used regular expressions to crawl pages. I had a hard time crawling some things.
On the Internet, I saw someone else recommended an htmlagilitypack item. I found some information on the Internet and wrote an example of crawling web pages. ASP. NET MVC 4 for the framework. Let's first look at the effect.
Demo address: http://www.5imvc.com/Html Alternate address: http://linfei721.s5.jutuo.net/html
First download the plug-in, which is available in nuget.
Create model
/// <Summary> /// Page capture results /// </Summary> Public Class Result { /// <Summary> /// Link /// </Summary> Public String URL { Get ; Set ;} /// <Summary> /// Title /// </Summary> Public String Title { Get ; Set ;} /// <Summary> /// Avatar address /// </Summary> Public String IMG { Get ; Set ;} /// <Summary> /// Body content /// </Summary> Public String Content { Get ; Set ;}}
Controllers:
Import namespace:
UsingHtmlagilitypack;
Public Actionresult index (){ Return View (getlist ());} /// <Summary> /// Capture Method /// </Summary> /// <Returns> </returns> Public List <result> Getlist () {list <Result> List = New List <result> (); # Region Old-fashioned regular crawling // System. net. webrequest Req = system. net. webrequest. Create (" Http://www.cnblogs.com/ "); // System. net. webresponse res = Req. getresponse (); // Getresponse blocks until the response arrives // System. Io. Stream receivestream = res. getresponsestream (); // Read the stream into a string // System. Io. streamreader sr = new system. Io. streamreader (receivestream ); // String resultstring = Sr. readtoend (); // String regstr = " // Matchcollection matches = RegEx. Matches (resultstring, regstr, regexoptions. multiline ); // Foreach (match item in matches) // { // List. Add (New Result {url = item. Groups [1]. Value, Title = item. Groups [2]. Value }); // } # Endregion Htmlweb = New Htmlweb (); htmldocument htmldoc = Htmlweb. Load ( @" Http://www.cnblogs.com/ " ); // Select the blog Home PageArticleList Htmldoc. documentnode. selectnodes ( " // Div [@ ID = 'Post _ list']/Div [@ class = 'Post _ item'] " ). Asparallel (). tolist (). foreach (AC => { // Capture images. If there is no space, store the variables. Htmlnode node = ac. selectsinglenode ( " . // P [@ class = 'Post _ item_summary ']/A/img " ); List. Add ( New Result {URL = Ac. selectsinglenode ( " . // A [@ class = 'titlelnk '] " ). Attributes [ " Href " ]. Value, title = Ac. selectsinglenode ( " . // A [@ class = 'titlelnk '] " ). Innertext, // If the image is blank, the default image is displayed. IMG = node = Null ? Virtualpathutility. toabsolute ( " ~ /Content/img/avatar.png " ): Node. attributes [ " SRC " ]. Value, content = Ac. selectsinglenode ( " . // P [@ class = 'Post _ item_summary '] " ). Innertext });}); Return List ;}
View:
@ Model ienumerable <result>
@{ Foreach ( VaR ItemIn Model ){ <Div Class = " Newsitem " > <Div> " @ Item. img " Class = " How.mg " Alt = " News " /> <H3> <a href = " @ Item. url " Target = " _ Blank " > @ Item. Title </a> @ Item. Content </P> </div> <p> <a href = " @ Item. url " Title = "" > View the full text </a> </P> </div> }}