Address: http://www.cnblogs.com/huangxincheng/archive/2012/11/08/2759752.html
This article covers several issues to watch out for when crawling web pages.
I. Web page updates
We know that the information on most web pages is constantly updated, which means our crawler needs to fetch the new content periodically. But how should we interpret "periodically", that is, how often should the page be fetched?
In fact, this period is simply the page's cache time: there is no need to re-crawl a page within its cache window, as that would only put extra pressure on the server.
For example, suppose I want to crawl the cnblogs homepage. First I clear the page cache and load it fresh.
From the Last-Modified and Expires headers we can see that the cache time on cnblogs is 2 minutes, and the Date header shows the current server time. If I
refresh the page again, the browser sends that date back in an If-Modified-Since request header, asking the server whether the browser's cached copy has expired.
In the end the server finds that If-Modified-Since >= Last-Modified, so it returns 304, although the request also carries a surprising amount of cookie data...
In actual development, once we know a site's cache policy, we can make the crawler wait 2 minutes between fetches. Of course, the interval can be made configurable and maintained by the data team.
Now let's simulate this with a crawler:
using System;
using System.Net;

namespace LeleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            DateTime prevDateTime = DateTime.MinValue;

            for (int i = 0; i < 10; i++)
            {
                try
                {
                    var url = "http://cnblogs.com";

                    var request = (HttpWebRequest)WebRequest.Create(url);

                    request.Method = "Head";

                    if (i > 0)
                    {
                        request.IfModifiedSince = prevDateTime;
                    }

                    request.Timeout = 3000;

                    var response = (HttpWebResponse)request.GetResponse();

                    var code = response.StatusCode;

                    // If the server returns 200, the page is considered updated:
                    // remember the server time at that moment.
                    if (code == HttpStatusCode.OK)
                    {
                        prevDateTime = Convert.ToDateTime(response.Headers[HttpResponseHeader.Date]);
                    }

                    Console.WriteLine("Current server status code: {0}", code);
                }
                catch (WebException ex)
                {
                    if (ex.Response != null)
                    {
                        var code = (ex.Response as HttpWebResponse).StatusCode;

                        Console.WriteLine("Current server status code: {0}", code);
                    }
                }
            }
        }
    }
}
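The 2-minute window itself can be read straight off the response headers: the difference between Expires and Date is the cache lifetime the server grants, and the crawler should not fetch more often than that. A minimal sketch of the arithmetic (the header values here are made-up examples, not real cnblogs responses):

```csharp
using System;

class CacheWindow
{
    static void Main()
    {
        // Hypothetical Date and Expires header values, as a server might send them.
        var date    = DateTime.Parse("Thu, 08 Nov 2012 03:00:00 GMT").ToUniversalTime();
        var expires = DateTime.Parse("Thu, 08 Nov 2012 03:02:00 GMT").ToUniversalTime();

        // Cache lifetime = Expires - Date; re-crawling sooner than this gains nothing.
        TimeSpan cacheLifetime = expires - date;

        Console.WriteLine(cacheLifetime.TotalMinutes); // prints 2
    }
}
```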
II. Web page encoding problems
Sometimes after we have captured a web page and are about to parse it, everything turns out to be garbled. Really annoying, as shown below.
Perhaps we vaguely remember that there is a charset attribute in the HTML meta tag that records the encoding, and another key point is that
the encoding is also recorded in the Response.CharacterSet property. Let's try again.
Sorry, still garbled, and it hurts. This time we need to check the HTTP headers to see what is actually being exchanged, and why the browser can display the page normally
while the crawler cannot.
After checking the HTTP headers, we finally get it: the browser declares that it can handle three compression methods, gzip, deflate, and sdch, and the server sends back gzip-compressed content.
This is also one of the common web performance optimizations we should know about.
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

namespace LeleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            // ISO-8859-1 is the default many servers report; fall back to GB2312 for Chinese pages.
            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            // If the body is gzip-compressed, decompress it before decoding the text.
            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();
        }
    }
}
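Incidentally, HttpWebRequest can also take care of the decompression for you: setting the AutomaticDecompression property makes the framework send the Accept-Encoding header and transparently unzip the response, so the manual GZipStream branch above disappears. A sketch under that assumption (same example URL as above; the hard-coded gb2312 encoding is just for this demo):

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class AutoDecompress
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://www.sohu.com/");

        // The framework adds "Accept-Encoding: gzip, deflate" and decompresses transparently.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            Console.WriteLine(sr.ReadToEnd().Length);
        }
    }
}
```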
III. Web page parsing
Now that the web page has been obtained with a great deal of effort, the next step is to parse it. Regular expression matching is of course an option, but the workload is heavy; the tool more highly praised in the industry is
HtmlAgilityPack, a parser that loads HTML into an XML-like document and then extracts the specified content with XPath, which greatly improves development speed, and its performance
is fine too. After all, Agility means agile. As for XPath itself, the two diagrams from w3cschool below are enough to understand it.
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using HtmlAgilityPack;

namespace LeleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();

            sr.Close();

            HtmlDocument document = new HtmlDocument();

            document.LoadHtml(html);

            // Extract the title
            var title = document.DocumentNode.SelectSingleNode("//title").InnerText;

            // Extract the keywords
            var keywords = document.DocumentNode
                .SelectSingleNode("//meta[@name='keywords']")
                .Attributes["content"].Value;
        }
    }
}
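SelectSingleNode returns only the first match; when you want a list of elements, SelectNodes with an XPath like //a[@href] returns all of them. A small self-contained sketch against an inline HTML string (it assumes the HtmlAgilityPack package is referenced; the markup is made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

class LinkList
{
    static void Main()
    {
        var html = "<html><body>" +
                   "<a href='/a.html'>First</a>" +
                   "<a href='/b.html'>Second</a>" +
                   "</body></html>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns every node matching the XPath (or null if nothing matches).
        foreach (var link in document.DocumentNode.SelectNodes("//a[@href]"))
        {
            Console.WriteLine("{0} -> {1}", link.InnerText, link.Attributes["href"].Value);
        }
        // prints:
        // First -> /a.html
        // Second -> /b.html
    }
}
```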
Okay, that wraps things up. Time for bed...