Playing with crawlers: small details during crawling

Source: Internet
Author: User

Address: http://www.cnblogs.com/huangxincheng/archive/2012/11/08/2759752.html

 

This article describes several issues to pay attention to while crawling pages.

I. Webpage updates

We know that the information on most web pages is constantly updated, which means we need to fetch the new content periodically. But how long should this "period" between two fetches be? In fact, it is the page's cache time: there is no need to crawl the page again while it is still within its cache window, because that only puts needless pressure on the server.

For example, suppose I want to crawl the cnblogs homepage. First I clear the page cache and look at the response headers. From Last-Modified and Expires we can see that cnblogs caches the page for 2 minutes, and the Date header shows the current server time.

If I now refresh the page, the browser sends that date back in an If-Modified-Since request header, asking the server whether its cached copy has expired. The server compares the two: if If-Modified-Since >= Last-Modified, it returns 304 Not Modified. (Incidentally, there is an awful lot of cookie information riding along in these requests...)

In actual development, if we know a site's cache policy, we can have the crawler revisit it every 2 minutes. Of course, such intervals can be configured and maintained by the data team.
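As a rough illustration of how a crawl interval could be derived from the cache headers instead of being hard-coded, here is a minimal sketch. The `CacheWindow` class, the `SecondsUntilExpiry` helper, and the sample header values are my own invention for illustration, not from the original article:

```csharp
using System;

class CacheWindow
{
    // Computes how long (in seconds) a fetched page stays fresh,
    // given the server's Date and Expires response headers.
    public static double SecondsUntilExpiry(string dateHeader, string expiresHeader)
    {
        var serverNow = DateTime.Parse(dateHeader);
        var expires = DateTime.Parse(expiresHeader);
        return (expires - serverNow).TotalSeconds;
    }

    static void Main()
    {
        // Mimics the cnblogs example: Expires is two minutes after Date.
        var window = SecondsUntilExpiry(
            "Thu, 08 Nov 2012 12:00:00 GMT",
            "Thu, 08 Nov 2012 12:02:00 GMT");

        Console.WriteLine(window); // 120
    }
}
```

The difference between the two server-supplied timestamps is used deliberately, so the calculation does not depend on the crawler machine's own clock being in sync with the server.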

Now let's simulate this with a crawler.

using System;
using System.Net;

namespace LeleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            DateTime prevDateTime = DateTime.MinValue;

            for (int i = 0; i < 10; i++)
            {
                try
                {
                    var url = "http://cnblogs.com";

                    var request = (HttpWebRequest)WebRequest.Create(url);

                    // A HEAD request: we only want the status code, not the body.
                    request.Method = "Head";

                    if (i > 0)
                        request.IfModifiedSince = prevDateTime;

                    request.Timeout = 3000;

                    var response = (HttpWebResponse)request.GetResponse();

                    var code = response.StatusCode;

                    // If the server returns 200, the page is considered updated;
                    // remember the server time for the next round.
                    if (code == HttpStatusCode.OK)
                    {
                        prevDateTime = Convert.ToDateTime(response.Headers[HttpResponseHeader.Date]);
                    }

                    Console.WriteLine("Status code from the server: {0}", code);
                }
                catch (WebException ex)
                {
                    // A 304 Not Modified response surfaces as a WebException.
                    if (ex.Response != null)
                    {
                        var code = (ex.Response as HttpWebResponse).StatusCode;

                        Console.WriteLine("Status code from the server: {0}", code);
                    }
                }
            }
        }
    }
}

II. Webpage encoding problems

Sometimes we have successfully fetched a page, but when we go to parse it, the whole thing turns out to be garbled. It's really maddening.

We may vaguely remember that the HTML meta tag has a charset attribute that records the encoding, and, just as importantly, the encoding is also recorded in the Response.CharacterSet property. Let's try again.

Sorry, still garbled, and it hurts. This time we need to look at the actual HTTP headers exchanged with the site to see why the browser can display the page normally while our crawler cannot.

After inspecting the HTTP headers, we finally get it: the browser advertises, via Accept-Encoding, that it can handle three compression schemes (gzip, deflate, and sdch), and the server responds with gzip-compressed content. This is one of the most common web performance optimizations, so our crawler should decompress accordingly.

using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

namespace LeleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            // ISO-8859-1 is the default CharacterSet when none is declared;
            // for Chinese sites, fall back to gb2312 in that case.
            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            // Wrap the response stream in a GZipStream when the body is gzip-compressed.
            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();
        }
    }
}
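As an aside, HttpWebRequest can also be asked to handle the decompression for us via its AutomaticDecompression property, which both sends the Accept-Encoding header and transparently inflates the response. A minimal sketch (the URL is just the same sample site used above):

```csharp
using System;
using System.Net;

namespace LeleApplication2
{
    public class AutoDecompressDemo
    {
        static void Main(string[] args)
        {
            var request = (HttpWebRequest)WebRequest.Create("http://www.sohu.com/");

            // Advertise gzip/deflate support and let the framework
            // decompress the response stream transparently.
            request.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;

            Console.WriteLine(request.AutomaticDecompression);
        }
    }
}
```

With this set, GetResponseStream() already yields plain HTML, so the manual GZipStream branch above becomes unnecessary.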

III. Web page parsing

Now that the page has finally been obtained after so much hard work, the next step is to parse it. Regular expression matching is certainly one way to do it, but the workload is heavy; the tool highly praised by the industry is HtmlAgilityPack.

HtmlAgilityPack is a parsing tool that parses HTML as if it were XML, so you can use XPath to extract the content you want, greatly improving development speed with good performance.

After all, "agility" lives up to its name. For XPath itself, the two diagrams on w3cschool are enough to get you started.

using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using HtmlAgilityPack;

namespace LeleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();

            sr.Close();

            HtmlDocument document = new HtmlDocument();

            document.LoadHtml(html);

            // Extract the title
            var title = document.DocumentNode.SelectSingleNode("//title").InnerText;

            // Extract the keywords
            var keywords = document.DocumentNode
                .SelectSingleNode("//meta[@name='keywords']")
                .Attributes["content"].Value;
        }
    }
}
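To see the XPath side in isolation, here is a minimal sketch that parses a tiny hand-written page (so no network access is needed) and pulls out every link. It assumes the HtmlAgilityPack NuGet package is referenced; the sample HTML string and the class name XPathDemo are made up for illustration:

```csharp
using System;
using HtmlAgilityPack;

namespace LeleApplication2
{
    public class XPathDemo
    {
        static void Main(string[] args)
        {
            // A tiny in-memory page instead of a live site.
            var html = "<html><head><title>demo</title></head>" +
                       "<body><a href='/a'>A</a><a href='/b'>B</a><a name='x'>C</a></body></html>";

            var document = new HtmlDocument();
            document.LoadHtml(html);

            // '//a[@href]' matches only anchors that actually carry an href attribute,
            // so the third <a> above is skipped.
            var links = document.DocumentNode.SelectNodes("//a[@href]");

            foreach (var link in links)
                Console.WriteLine(link.GetAttributeValue("href", string.Empty));
        }
    }
}
```

The same pattern extends naturally: swap the XPath expression for `//meta[@name='keywords']` or `//title` and you get the extractions shown in the full program above.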

Okay, time to call it a day and go to bed...
