Address: http://www.cnblogs.com/huangxincheng/archive/2012/11/08/2759752.html
This article covers several issues to watch out for when crawling web pages.
I. Web page updates
We know that the information on most web pages is constantly updated, which means our crawler needs to fetch the new content periodically. But how should we interpret "periodically", that is, how often should the page be fetched?
In fact, this period is simply the page's cache time: there is no need to re-crawl a page within its cache window, as that would only put extra pressure on the server.
For example, suppose I want to crawl the cnblogs homepage. First I clear the page cache and load it fresh.
From the Last-Modified and Expires headers we can see that the cache time on cnblogs is 2 minutes, and the Date header shows the current server time. If I
refresh the page again, the browser sends that date back in an If-Modified-Since request header, asking the server whether the browser's cached copy has expired.
In the end the server finds that If-Modified-Since >= Last-Modified, so it returns 304, although the request also carries a surprising amount of cookie data...
In actual development, once we know a site's cache policy, we can make the crawler wait 2 minutes between fetches. Of course, the interval can be made configurable and maintained by the data team.
Now let's simulate this with a crawler:
using System;
using System.Net;

namespace LeleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            DateTime prevDateTime = DateTime.MinValue;

            for (int i = 0; i < 10; i++)
            {
                try
                {
                    var url = "http://cnblogs.com";

                    var request = (HttpWebRequest)WebRequest.Create(url);

                    request.Method = "Head";

                    if (i > 0)
                    {
                        request.IfModifiedSince = prevDateTime;
                    }

                    request.Timeout = 3000;

                    var response = (HttpWebResponse)request.GetResponse();

                    var code = response.StatusCode;

                    // If the server returns 200, the page is considered updated:
                    // remember the server time at that moment.
                    if (code == HttpStatusCode.OK)
                    {
                        prevDateTime = Convert.ToDateTime(response.Headers[HttpResponseHeader.Date]);
                    }

                    Console.WriteLine("Current server status code: {0}", code);
                }
                catch (WebException ex)
                {
                    if (ex.Response != null)
                    {
                        var code = (ex.Response as HttpWebResponse).StatusCode;

                        Console.WriteLine("Current server status code: {0}", code);
                    }
                }
            }
        }
    }
}
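The 2-minute window itself can be read straight off the response headers: the difference between Expires and Date is the cache lifetime the server grants, and the crawler should not fetch more often than that. A minimal sketch of the arithmetic (the header values here are made-up examples, not real cnblogs responses):

```csharp
using System;

class CacheWindow
{
    static void Main()
    {
        // Hypothetical Date and Expires header values, as a server might send them.
        var date    = DateTime.Parse("Thu, 08 Nov 2012 03:00:00 GMT").ToUniversalTime();
        var expires = DateTime.Parse("Thu, 08 Nov 2012 03:02:00 GMT").ToUniversalTime();

        // Cache lifetime = Expires - Date; re-crawling sooner than this gains nothing.
        TimeSpan cacheLifetime = expires - date;

        Console.WriteLine(cacheLifetime.TotalMinutes); // prints 2
    }
}
```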
II. Web page encoding problems
Sometimes after we have captured a web page and are about to parse it, everything turns out to be garbled. Really annoying, as shown below.
Perhaps we vaguely remember that there is a charset attribute in the HTML meta tag that records the encoding, and another key point is that
the encoding is also recorded in the Response.CharacterSet property. Let's try again.
Sorry, still garbled, and it hurts. This time we need to check the HTTP headers to see what is actually being exchanged, and why the browser can display the page normally
while the crawler cannot.
After checking the HTTP headers, we finally get it: the browser declares that it can handle three compression methods, gzip, deflate, and sdch, and the server sends back gzip-compressed content.
This is also one of the common web performance optimizations we should know about.
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

namespace LeleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            // ISO-8859-1 is the default many servers report; fall back to GB2312 for Chinese pages.
            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            // If the body is gzip-compressed, decompress it before decoding the text.
            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();
        }
    }
}
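Incidentally, HttpWebRequest can also take care of the decompression for you: setting the AutomaticDecompression property makes the framework send the Accept-Encoding header and transparently unzip the response, so the manual GZipStream branch above disappears. A sketch under that assumption (same example URL as above; the hard-coded gb2312 encoding is just for this demo):

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class AutoDecompress
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://www.sohu.com/");

        // The framework adds "Accept-Encoding: gzip, deflate" and decompresses transparently.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            Console.WriteLine(sr.ReadToEnd().Length);
        }
    }
}
```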
III. Web page parsing
Now that the web page has been obtained with a great deal of effort, the next step is to parse it. Regular expression matching is of course an option, but the workload is heavy; the tool more highly praised in the industry is
HtmlAgilityPack, a parser that loads HTML into an XML-like document and then extracts the specified content with XPath, which greatly improves development speed, and its performance
is fine too. After all, Agility means agile. As for XPath itself, the two diagrams from w3cschool below are enough to understand it.
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using HtmlAgilityPack;

namespace LeleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();

            sr.Close();

            HtmlDocument document = new HtmlDocument();

            document.LoadHtml(html);

            // Extract the title
            var title = document.DocumentNode.SelectSingleNode("//title").InnerText;

            // Extract the keywords
            var keywords = document.DocumentNode
                .SelectSingleNode("//meta[@name='keywords']")
                .Attributes["content"].Value;
        }
    }
}
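SelectSingleNode returns only the first match; when you want a list of elements, SelectNodes with an XPath like //a[@href] returns all of them. A small self-contained sketch against an inline HTML string (it assumes the HtmlAgilityPack package is referenced; the markup is made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

class LinkList
{
    static void Main()
    {
        var html = "<html><body>" +
                   "<a href='/a.html'>First</a>" +
                   "<a href='/b.html'>Second</a>" +
                   "</body></html>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns every node matching the XPath (or null if nothing matches).
        foreach (var link in document.DocumentNode.SelectNodes("//a[@href]"))
        {
            Console.WriteLine("{0} -> {1}", link.InnerText, link.Attributes["href"].Value);
        }
        // prints:
        // First -> /a.html
        // Second -> /b.html
    }
}
```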
Okay, that wraps things up. Time for bed...