Web page capture

Source: Internet
Author: User

While building a chat room, I wrote a class for its news-reading feature that captures information from web pages (such as the latest headlines, news sources, titles, and content). This article shows how to use this class to capture the information you need from a web page, taking the post titles and links on the blog home page as an example:

Looking at the DOM tree of the home page, you only need to extract the div elements whose class is post_item, and then, inside each of them, the link tag whose class is titlelnk. This can be implemented with the following functions:

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    /// <summary>
    /// Finds all tags in the HTML text whose name is tagName and whose
    /// attrName attribute has the value attrValue.
    /// Example: FindTagByAttr(html, "div", "class", "demo")
    /// returns the div tags whose class is "demo".
    /// </summary>
    public static List<HtmlTag> FindTagByAttr(string html, string tagName, string attrName, string attrValue)
    {
        // \x27 is a single quote and \x22 is a double quote.
        string format = string.Format(
            @"<{0}\s[^<>]*{1}\s*=\s*(\x27|\x22){2}(\x27|\x22)[^<>]*>",
            tagName, attrName, attrValue);
        return FindTag(html, tagName, format);
    }

    public static List<HtmlTag> FindTag(string html, string name, string format)
    {
        // reg matches the begin tag we are looking for;
        // tagReg matches any begin or end tag with the same name.
        Regex reg = new Regex(format, RegexOptions.IgnoreCase);
        Regex tagReg = new Regex(
            string.Format(@"<(\/|)({0})(\s[^<>]*|)>", name),
            RegexOptions.IgnoreCase);
        List<HtmlTag> tags = new List<HtmlTag>();
        int start = 0;
        while (true)
        {
            Match match = reg.Match(html, start);
            if (!match.Success)
                break;

            start = match.Index + match.Length;
            Match tagMatch = null;
            int beginTagCount = 1; // the begin tag just matched is still open
            while (true)
            {
                tagMatch = tagReg.Match(html, start);
                if (!tagMatch.Success)
                {
                    tagMatch = null;
                    break;
                }
                start = tagMatch.Index + tagMatch.Length;
                // An end tag closes one level of nesting; a begin tag opens one.
                if (tagMatch.Groups[1].Value == "/")
                    beginTagCount--;
                else
                    beginTagCount++;
                if (beginTagCount == 0)
                    break;
            }

            if (tagMatch == null)
                break; // no matching end tag was found

            // The inner HTML is the text between the begin tag and its end tag.
            HtmlTag tag = new HtmlTag(
                name,
                match.Value,
                html.Substring(
                    match.Index + match.Length,
                    tagMatch.Index - match.Index - match.Length));
            tags.Add(tag);
        }
        return tags;
    }
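To see why the nesting counter matters, here is a minimal, self-contained sketch of the same technique. It is not the author's class: `TagDepthDemo` and `InnerHtmlOf` are hypothetical names, the method returns a plain string instead of an HtmlTag object, and the arguments are passed through `Regex.Escape` as an extra precaution. It finds the first matching begin tag, then walks the following begin and end tags of the same name until the depth returns to zero.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagDepthDemo
{
    // Hypothetical helper: returns the inner HTML of the first <name> element whose
    // attrName attribute equals attrValue, or null if no balanced element is found.
    public static string InnerHtmlOf(string html, string name, string attrName, string attrValue)
    {
        Regex begin = new Regex(
            string.Format(@"<{0}\s[^<>]*{1}\s*=\s*(\x27|\x22){2}(\x27|\x22)[^<>]*>",
                Regex.Escape(name), Regex.Escape(attrName), Regex.Escape(attrValue)),
            RegexOptions.IgnoreCase);
        Regex anyTag = new Regex(
            string.Format(@"<(/|)({0})(\s[^<>]*|)>", Regex.Escape(name)),
            RegexOptions.IgnoreCase);

        Match first = begin.Match(html);
        if (!first.Success) return null;

        int contentStart = first.Index + first.Length;
        int pos = contentStart;
        int depth = 1; // the begin tag we just matched is open
        while (true)
        {
            Match m = anyTag.Match(html, pos);
            if (!m.Success) return null; // unbalanced markup
            pos = m.Index + m.Length;
            if (m.Groups[1].Value == "/") depth--; else depth++;
            if (depth == 0)
                return html.Substring(contentStart, m.Index - contentStart);
        }
    }

    public static void Main()
    {
        string html = "<div class=\"outer\"><div>inner</div> tail</div><div>after</div>";
        Console.WriteLine(InnerHtmlOf(html, "div", "class", "outer"));
        // prints: <div>inner</div> tail
    }
}
```

Without the counter, the search would stop at the first `</div>` and cut the result off after "inner". Like any regex-based approach, this assumes reasonably well-formed markup; unclosed tags of the same name will throw the count off.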

With the functions above, you can extract the required HTML tags. To capture a page, you also need a function that downloads it:

    using System.IO;
    using System.Net;
    using System.Text;

    public static string GetHtml(string url)
    {
        try
        {
            HttpWebRequest req = HttpWebRequest.Create(url) as HttpWebRequest;
            req.Timeout = 30 * 1000; // 30 seconds
            HttpWebResponse response = req.GetResponse() as HttpWebResponse;
            Stream stream = response.GetResponseStream();
            MemoryStream buffer = new MemoryStream();
            byte[] temp = new byte[4096];
            int count = 0;
            // Copy the response body into memory, 4 KB at a time.
            while ((count = stream.Read(temp, 0, 4096)) > 0)
            {
                buffer.Write(temp, 0, count);
            }
            // ToArray returns only the bytes written; GetBuffer would also
            // include the unused tail of the internal buffer.
            return Encoding.GetEncoding(response.CharacterSet).GetString(buffer.ToArray());
        }
        catch
        {
            // Any network or decoding failure yields an empty page.
            return string.Empty;
        }
    }
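The download function decodes the page with whatever `response.CharacterSet` reports. When a server sends an incomplete or unusual Content-Type header, that value can be empty or name an encoding the runtime does not recognize, and the catch block then returns an empty string. A minimal sketch of a fallback, assuming UTF-8 is an acceptable default; `EncodingHelper` and `ResolveEncoding` are hypothetical names, not part of the original class:

```csharp
using System;
using System.Text;

public static class EncodingHelper
{
    // Hypothetical helper: maps a charset name from an HTTP response to an Encoding,
    // falling back to UTF-8 (an assumption) when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try
        {
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Thrown when the name is not a supported encoding.
            return Encoding.UTF8;
        }
    }

    public static void Main()
    {
        // Each of these lines prints: utf-8
        Console.WriteLine(ResolveEncoding("utf-8").WebName);
        Console.WriteLine(ResolveEncoding(null).WebName);
        Console.WriteLine(ResolveEncoding("no-such-charset").WebName);
    }
}
```

In the download function, the final return statement could call `ResolveEncoding(response.CharacterSet)` in place of `Encoding.GetEncoding(response.CharacterSet)`.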

The following example captures article titles and links to show how to use the HtmlTag class to capture web page information:

    class Program
    {
        static void Main(string[] args)
        {
            string html = HtmlTag.GetHtml("http://www.cnblogs.com");
            // The post list is the div whose id is post_list.
            List<HtmlTag> tags = HtmlTag.FindTagByAttr(html, "div", "id", "post_list");
            if (tags.Count > 0)
            {
                // Each post is a div whose class is post_item.
                List<HtmlTag> item_tags = tags[0].FindTagByAttr("div", "class", "post_item");
                foreach (HtmlTag item_tag in item_tags)
                {
                    // The title link is the <a class="titlelnk"> tag inside each post.
                    List<HtmlTag> a_tags = item_tag.FindTagByAttr("a", "class", "titlelnk");
                    if (a_tags.Count > 0)
                    {
                        Console.WriteLine("Title: {0}", a_tags[0].InnerHTML);
                        Console.WriteLine("Link: {0}", a_tags[0].GetAttribute("href"));
                        Console.WriteLine("");
                    }
                }
            }
        }
    }
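The example relies on `GetAttribute`, which belongs to the HtmlTag class but is not listed in this article. As a rough idea of how such a method could work, the sketch below pulls an attribute value out of a begin tag with a regular expression. `AttributeDemo` and its method are hypothetical and may differ from the author's implementation:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeDemo
{
    // Hypothetical helper: returns the value of attrName inside a begin tag such as
    // <a class="titlelnk" href="...">, or an empty string if the attribute is absent.
    public static string GetAttribute(string beginTag, string attrName)
    {
        // \x27 is a single quote and \x22 a double quote; the backreference \1
        // requires the closing quote to be the same as the opening one.
        Regex reg = new Regex(
            string.Format(@"\s{0}\s*=\s*(\x27|\x22)([^\x27\x22]*)\1", Regex.Escape(attrName)),
            RegexOptions.IgnoreCase);
        Match m = reg.Match(beginTag);
        return m.Success ? m.Groups[2].Value : string.Empty;
    }

    public static void Main()
    {
        string tag = "<a class=\"titlelnk\" href=\"http://example.com/post/1\" target=\"_blank\">";
        Console.WriteLine(GetAttribute(tag, "href"));   // prints: http://example.com/post/1
        Console.WriteLine(GetAttribute(tag, "class"));  // prints: titlelnk
    }
}
```

This sketch only handles quoted values that do not themselves contain a quote character, which is enough for typical href and class attributes.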

When run, the program prints the title and link of each post on the home page, separated by blank lines.


Source: http://www.cnblogs.com/lucc/archive/2010/05/18/1738718.html

