While building a chat room, I wrote a class for its news-reading feature that captures information (such as the latest headlines, news sources, titles, and content) from web pages. This article introduces how to use this class to capture the information you need from a web page, taking the titles and links on the blog home page as an example:
The DOM tree of the home page shows that you only need to extract the div elements whose class is post_item, and then extract from each one the a element whose class is titlelnk. This can be implemented with the following functions:
// Members of the HtmlTag class.
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;

/// <summary>
/// Finds all tags in the HTML text whose name is tagName and whose attrName attribute equals attrValue.
/// Example: FindTagByAttr(html, "div", "class", "demo")
/// returns the div tags whose class is demo.
/// </summary>
public static List<HtmlTag> FindTagByAttr(string html, string tagName, string attrName, string attrValue)
{
    // Matches an opening tag such as <div ... class="demo" ...> (single or double quotes).
    string format = string.Format(
        @"<{0}\s[^<>]*{1}\s*=\s*(\x27|\x22){2}(\x27|\x22)[^<>]*>",
        tagName, attrName, attrValue);
    return FindTag(html, tagName, format);
}

public static List<HtmlTag> FindTag(string html, string name, string format)
{
    // reg finds the opening tag we want; tagReg finds any opening or closing tag with the same name.
    Regex reg = new Regex(format, RegexOptions.IgnoreCase);
    Regex tagReg = new Regex(string.Format(@"<(\/|)({0})(\s[^<>]*|)>", name), RegexOptions.IgnoreCase);

    List<HtmlTag> tags = new List<HtmlTag>();
    int start = 0;

    while (true)
    {
        Match match = reg.Match(html, start);
        if (!match.Success)
        {
            break;
        }

        start = match.Index + match.Length;
        Match tagMatch = null;
        int beginTagCount = 1; // nesting depth: the opening tag just matched counts as 1

        // Scan forward until the closing tag that balances the opening tag.
        while (true)
        {
            tagMatch = tagReg.Match(html, start);
            if (!tagMatch.Success)
            {
                tagMatch = null;
                break;
            }
            start = tagMatch.Index + tagMatch.Length;
            if (tagMatch.Groups[1].Value == "/") beginTagCount--;
            else beginTagCount++;
            if (beginTagCount == 0) break;
        }

        if (tagMatch == null)
        {
            break; // no balancing closing tag: stop searching
        }

        // The inner HTML is everything between the opening tag and its closing tag.
        HtmlTag tag = new HtmlTag(name, match.Value,
            html.Substring(match.Index + match.Length, tagMatch.Index - match.Index - match.Length));
        tags.Add(tag);
    }

    return tags;
}
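To make the attribute pattern concrete, here is a minimal sketch that builds the same regular expression as FindTagByAttr and runs it against a small piece of markup. The class name PatternDemo, the helper BuildPattern, and the sample HTML are assumptions made for illustration; they are not part of the original source.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Hypothetical helper: builds the same opening-tag pattern as FindTagByAttr.
    public static string BuildPattern(string tagName, string attrName, string attrValue)
    {
        return string.Format(
            @"<{0}\s[^<>]*{1}\s*=\s*(\x27|\x22){2}(\x27|\x22)[^<>]*>",
            tagName, attrName, attrValue);
    }

    public static void Main()
    {
        // Made-up sample markup, not copied from the real page.
        string html = "<div id='post_list'><div class=\"post_item\">"
                    + "<a class=\"titlelnk\" href=\"/p/1.html\">First post</a></div></div>";

        string pattern = BuildPattern("div", "class", "post_item");
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);

        Console.WriteLine(m.Success); // True
        Console.WriteLine(m.Value);   // <div class="post_item">
    }
}
```

Note that the pattern compares the whole attribute value, so a tag written as class="post_item featured" would not match, and a value containing regular-expression metacharacters would need Regex.Escape before being inserted into the pattern.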
With the functions above, you can extract the HTML tags you need. To capture a page, you also need a function that downloads it:
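// Member of the HtmlTag class.
// Requires: using System.IO; using System.Net; using System.Text;

public static string GetHtml(string url)
{
    try
    {
        HttpWebRequest req = HttpWebRequest.Create(url) as HttpWebRequest;
        req.Timeout = 30 * 1000; // 30 seconds
        HttpWebResponse response = req.GetResponse() as HttpWebResponse;
        Stream stream = response.GetResponseStream();

        // Copy the response body into memory in 4 KB chunks.
        MemoryStream buffer = new MemoryStream();
        byte[] temp = new byte[4096];
        int count = 0;
        while ((count = stream.Read(temp, 0, 4096)) > 0)
        {
            buffer.Write(temp, 0, count);
        }

        // Decode only the bytes actually written; GetBuffer() also exposes unused capacity.
        return Encoding.GetEncoding(response.CharacterSet)
            .GetString(buffer.GetBuffer(), 0, (int)buffer.Length);
    }
    catch
    {
        // Any failure (network error, unknown charset) yields an empty string.
        return string.Empty;
    }
}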
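Two details of the decoding step are easy to get wrong: MemoryStream.GetBuffer() returns the whole internal array, which may be longer than the data written to it, and Encoding.GetEncoding throws when the charset name is empty or unknown. The sketch below illustrates one way to handle both. The class name DecodeSketch, the helper Decode, and the UTF-8 fallback are assumptions for illustration, not the original author's code.

```csharp
using System;
using System.IO;
using System.Text;

public static class DecodeSketch
{
    // Hypothetical helper: decodes the first `length` bytes of `data`,
    // falling back to UTF-8 when the charset name is missing or unknown.
    public static string Decode(byte[] data, int length, string charset)
    {
        Encoding enc = Encoding.UTF8;
        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                enc = Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 fallback.
            }
        }
        return enc.GetString(data, 0, length);
    }

    public static void Main()
    {
        MemoryStream buffer = new MemoryStream();
        byte[] bytes = Encoding.UTF8.GetBytes("<p>hi</p>");
        buffer.Write(bytes, 0, bytes.Length);

        // The internal array is at least as long as the data, often longer.
        Console.WriteLine(buffer.GetBuffer().Length >= buffer.Length); // True

        // Passing the real length avoids decoding the unused tail of the array.
        Console.WriteLine(Decode(buffer.GetBuffer(), (int)buffer.Length, "utf-8"));
        Console.WriteLine(Decode(buffer.GetBuffer(), (int)buffer.Length, null));
    }
}
```

Whether falling back to UTF-8 is appropriate depends on the pages being fetched; returning an empty string, as GetHtml does, is the more conservative choice.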
The following example captures article titles and links to show how to use the HtmlTag class to capture web page information:
using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        // Download the home page.
        string html = HtmlTag.GetHtml("http://www.cnblogs.com");

        // Locate the container div with id="post_list".
        List<HtmlTag> tags = HtmlTag.FindTagByAttr(html, "div", "id", "post_list");
        if (tags.Count > 0)
        {
            // Each article is a div with class="post_item".
            List<HtmlTag> item_tags = tags[0].FindTagByAttr("div", "class", "post_item");
            foreach (HtmlTag item_tag in item_tags)
            {
                // The title link is the a element with class="titlelnk".
                List<HtmlTag> a_tags = item_tag.FindTagByAttr("a", "class", "titlelnk");
                if (a_tags.Count > 0)
                {
                    Console.WriteLine("Title: {0}", a_tags[0].InnerHTML);
                    Console.WriteLine("Link: {0}", a_tags[0].GetAttribute("href"));
                    Console.WriteLine("");
                }
            }
        }
    }
}
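The program above relies on HtmlTag members that the article does not list: the three-argument constructor, the inner HTML property, and GetAttribute. The following is only a guess at what a minimal version might look like. The class name HtmlTagSketch, the property names, and the attribute regex are all assumptions; the real class in the downloadable source may differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the parts of HtmlTag that the article does not show.
public class HtmlTagSketch
{
    public string Name { get; private set; }
    public string BeginTag { get; private set; }
    public string InnerHTML { get; private set; }

    public HtmlTagSketch(string name, string beginTag, string innerHtml)
    {
        Name = name;
        BeginTag = beginTag;
        InnerHTML = innerHtml;
    }

    // Returns the value of a quoted attribute in the opening tag, or "" if it is absent.
    public string GetAttribute(string attrName)
    {
        // Group 1 captures the opening quote; \1 requires the same quote to close the value.
        string pattern = string.Format(
            @"\s{0}\s*=\s*(\x27|\x22)(?<value>[^\x27\x22]*)\1",
            Regex.Escape(attrName));
        Match m = Regex.Match(BeginTag, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["value"].Value : string.Empty;
    }

    public static void Main()
    {
        // Made-up opening tag and inner HTML for illustration.
        HtmlTagSketch a = new HtmlTagSketch(
            "a", "<a class=\"titlelnk\" href=\"/p/1.html\">", "First post");

        Console.WriteLine("Title: {0}", a.InnerHTML);            // Title: First post
        Console.WriteLine("Link: {0}", a.GetAttribute("href"));  // Link: /p/1.html
    }
}
```

An instance-level FindTagByAttr(tagName, attrName, attrValue) could then simply call the static version with InnerHTML as the text to search, which is how the nested lookups in Main would narrow the search to one element at a time.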
Running the program prints a Title line and a Link line for each article found on the home page.
Source: http://www.cnblogs.com/lucc/archive/2010/05/18/1738718.html