Web page capture

Last Update: 2018-12-06 Source: Internet

Author: User

While building a chat room, I wrote a class for its news-reading feature that captures information from web pages, such as the latest headlines, news sources, titles, and content. This article shows how to use that class, HtmlTag, to capture the information you need from a web page, using the post titles and links on the cnblogs home page as an example.

The DOM tree of the home page shows that you only need to extract each div whose class is post_item, and then extract the a tag inside it whose class is titlelnk. The following functions do this:

    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;

    /// <summary>
    /// Searches the HTML text for every tagName tag whose attrName attribute equals attrValue.
    /// Example: FindTagByAttr(html, "div", "class", "demo")
    /// returns every div tag whose class is "demo".
    /// </summary>
    public static List<HtmlTag> FindTagByAttr(string html, string tagName, string attrName, string attrValue)
    {
        // Matches an opening tag that carries the attribute, quoted with ' (\x27) or " (\x22).
        string format = string.Format(
            @"<{0}\s[^<>]*{1}\s*=\s*(\x27|\x22){2}(\x27|\x22)[^<>]*>",
            tagName, attrName, attrValue);
        return FindTag(html, tagName, format);
    }

    public static List<HtmlTag> FindTag(string html, string name, string format)
    {
        Regex reg = new Regex(format, RegexOptions.IgnoreCase);
        // Matches any opening or closing tag with this name; Groups[1] is "/" for a closing tag.
        Regex tagReg = new Regex(string.Format(@"<(\/|)({0})(\s[^<>]*|)>", name), RegexOptions.IgnoreCase);

        List<HtmlTag> tags = new List<HtmlTag>();
        int start = 0;

        while (true)
        {
            Match match = reg.Match(html, start);
            if (!match.Success)
                break;

            start = match.Index + match.Length;
            Match tagMatch = null;
            int beginTagCount = 1;   // nesting depth of tags with the same name

            // Walk forward until the closing tag that balances the matched opening tag.
            while (true)
            {
                tagMatch = tagReg.Match(html, start);
                if (!tagMatch.Success)
                {
                    tagMatch = null;
                    break;
                }
                start = tagMatch.Index + tagMatch.Length;
                if (tagMatch.Groups[1].Value == "/")
                    beginTagCount--;
                else
                    beginTagCount++;
                if (beginTagCount == 0)
                    break;
            }

            if (tagMatch == null)
                break;   // no balancing closing tag was found

            HtmlTag tag = new HtmlTag(name, match.Value,
                html.Substring(match.Index + match.Length, tagMatch.Index - match.Index - match.Length));
            tags.Add(tag);
        }

        return tags;
    }
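To see what the pattern built by FindTagByAttr actually matches, here is a small self-contained sketch. The class name AttrPatternDemo, the helper names, and the sample markup are made up for illustration; only the pattern string itself comes from the function above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttrPatternDemo
{
    // Builds the same opening-tag pattern that FindTagByAttr passes to FindTag.
    public static string BuildPattern(string tagName, string attrName, string attrValue)
    {
        return string.Format(
            @"<{0}\s[^<>]*{1}\s*=\s*(\x27|\x22){2}(\x27|\x22)[^<>]*>",
            tagName, attrName, attrValue);
    }

    // Counts how many opening tags in html match the pattern (case-insensitive).
    public static int CountMatches(string html, string tagName, string attrName, string attrValue)
    {
        Regex reg = new Regex(BuildPattern(tagName, attrName, attrValue), RegexOptions.IgnoreCase);
        return reg.Matches(html).Count;
    }

    public static void Main()
    {
        // Made-up sample markup, not taken from the real page.
        string html =
            "<div class=\"post_item\"><a class='titlelnk' href=\"/p/1\">First</a></div>" +
            "<div class=\"other\"><a class=\"titlelnk\" href=\"/p/2\">Second</a></div>";

        Console.WriteLine(BuildPattern("a", "class", "titlelnk"));
        Console.WriteLine(CountMatches(html, "a", "class", "titlelnk"));
        Console.WriteLine(CountMatches(html, "div", "class", "post_item"));
    }
}
```

Two limitations are worth keeping in mind. The pattern requires the attribute value to match exactly, so a tag with class="post_item featured" would not be found, and the arguments are inserted into the pattern unescaped, so a value containing regex metacharacters would probably need Regex.Escape first.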

With the above functions, you can extract the required HTML tags. To capture a page, you also need a function that downloads it:

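    // Requires: using System.IO; using System.Net; using System.Text;
    public static string GetHtml(string url)
    {
        try
        {
            HttpWebRequest req = HttpWebRequest.Create(url) as HttpWebRequest;
            req.Timeout = 30 * 1000;   // 30 seconds
            using (HttpWebResponse response = req.GetResponse() as HttpWebResponse)
            using (Stream stream = response.GetResponseStream())
            using (MemoryStream buffer = new MemoryStream())
            {
                byte[] temp = new byte[4096];
                int count;
                // Parentheses are required: assign the byte count first, then compare it with 0.
                while ((count = stream.Read(temp, 0, temp.Length)) > 0)
                {
                    buffer.Write(temp, 0, count);
                }
                // ToArray returns only the bytes written; GetBuffer would also include unused capacity.
                return Encoding.GetEncoding(response.CharacterSet).GetString(buffer.ToArray());
            }
        }
        catch
        {
            // Any network or decoding failure yields an empty page.
            return string.Empty;
        }
    }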
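The download function decodes the body with whatever charset the response reports, and any failure there ends in the catch block with an empty string. As a hedged sketch of one way to soften that, the hypothetical helper below (Decode, in a made-up CharsetDemo class) falls back to UTF-8 when the charset is missing or unknown. The fallback choice is an assumption, not something the original article does.

```csharp
using System;
using System.Text;

public static class CharsetDemo
{
    // Decodes body with the named charset, falling back to UTF-8 (an assumed default)
    // when the name is null, empty, or not recognised by the platform.
    public static string Decode(byte[] body, string charset)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                encoding = Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: keep the UTF-8 fallback.
            }
        }
        return encoding.GetString(body);
    }

    public static void Main()
    {
        byte[] body = Encoding.UTF8.GetBytes("<title>caf\u00e9</title>");

        Console.WriteLine(Decode(body, "utf-8"));
        Console.WriteLine(Decode(body, null));
        Console.WriteLine(Decode(body, "no-such-charset"));
    }
}
```

In GetHtml, such a helper could stand in for the direct Encoding.GetEncoding(response.CharacterSet) call, passing buffer.ToArray() as the body.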

The following example captures article titles and links to show how the HtmlTag class is used to capture web page information:

    // Requires: using System; using System.Collections.Generic;
    class Program
    {
        static void Main(string[] args)
        {
            string html = HtmlTag.GetHtml("http://www.cnblogs.com");

            // The post list is the div with id="post_list".
            List<HtmlTag> tags = HtmlTag.FindTagByAttr(html, "div", "id", "post_list");
            if (tags.Count > 0)
            {
                // Each post is a div with class="post_item".
                List<HtmlTag> item_tags = tags[0].FindTagByAttr("div", "class", "post_item");
                foreach (HtmlTag item_tag in item_tags)
                {
                    // The title link is the a tag with class="titlelnk".
                    List<HtmlTag> a_tags = item_tag.FindTagByAttr("a", "class", "titlelnk");
                    if (a_tags.Count > 0)
                    {
                        Console.WriteLine("Title: {0}", a_tags[0].InnerHTML);
                        Console.WriteLine("Link: {0}", a_tags[0].GetAttribute("href"));
                        Console.WriteLine("");
                    }
                }
            }
        }
    }
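The example calls InnerHTML and GetAttribute on HtmlTag objects, but the class itself is not listed in this article. The following is a hypothetical minimal stand-in, named HtmlTagSketch to make clear it is not the author's class. The constructor argument order mirrors the new HtmlTag(name, match.Value, innerHtml) call in FindTag; the property names other than InnerHTML and the attribute regex are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the article's HtmlTag class; not the original source.
public class HtmlTagSketch
{
    public string Name { get; private set; }       // tag name, e.g. "a"
    public string BeginTag { get; private set; }   // full opening tag text
    public string InnerHTML { get; private set; }  // markup between opening and closing tag

    public HtmlTagSketch(string name, string beginTag, string innerHtml)
    {
        Name = name;
        BeginTag = beginTag;
        InnerHTML = innerHtml;
    }

    // Returns the value of a quoted attribute in the opening tag,
    // or an empty string when the attribute is absent.
    public string GetAttribute(string attrName)
    {
        Regex reg = new Regex(
            string.Format(@"\s{0}\s*=\s*(\x27|\x22)([^\x27\x22]*)(\x27|\x22)", Regex.Escape(attrName)),
            RegexOptions.IgnoreCase);
        Match m = reg.Match(BeginTag);
        return m.Success ? m.Groups[2].Value : string.Empty;
    }

    public static void Main()
    {
        // Made-up sample tag, not taken from the real page.
        HtmlTagSketch tag = new HtmlTagSketch(
            "a", "<a class=\"titlelnk\" href=\"/p/1\">", "First post");

        Console.WriteLine("Title: {0}", tag.InnerHTML);
        Console.WriteLine("Link: {0}", tag.GetAttribute("href"));
    }
}
```

An instance FindTagByAttr(tagName, attrName, attrValue), as used in the example, could then presumably be a thin wrapper that runs the static FindTagByAttr over InnerHTML.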

When run, the program prints the title and link of each post it finds, with a blank line after each pair.

Source: http://www.cnblogs.com/lucc/archive/2010/05/18/1738718.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness, ownership, or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
