I have long wanted to record the daily visit traffic of my blog. I just applied for a GAE (Google App Engine) application and started learning Python again. The plan is to use Python to fetch the visit counts and save them locally, then deploy the script to GAE and use GAE's cron service to track the traffic day by day. OK, let's start ~
First, a simple web-page fetching program:
import sys, urllib2

req = urllib2.Request("http://blog.csdn.net/nevasun")
fd = urllib2.urlopen(req)
while True:
    data = fd.read(1024)
    if not len(data):
        break
    sys.stdout.write(data)
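The loop above streams the response in 1 KB chunks instead of reading it all at once, stopping when a read returns nothing. The same pattern works on any file-like object; here is a minimal, network-free sketch where an in-memory io.BytesIO buffer stands in for the object returned by urlopen:

```python
import io

# io.BytesIO plays the role of the file-like response from urlopen
fd = io.BytesIO(b"0123456789" * 300)  # ~3 KB of fake page data

chunks = []
while True:
    data = fd.read(1024)   # read at most 1024 bytes per iteration
    if not len(data):      # an empty read means end of stream
        break
    chunks.append(data)

body = b"".join(chunks)    # the reassembled page body
```

Reading in fixed-size chunks keeps memory use bounded even for large pages, which matters once the script runs under GAE's limits.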
Run it, and the terminal prints urllib2.HTTPError: HTTP Error 403: Forbidden. What is the problem?
The site does not allow crawlers. You can add a User-Agent header to the request to disguise it as a browser. Add and modify:
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request("http://blog.csdn.net/nevasun", headers=headers)
Try again: HTTP Error 403 is gone, but all the Chinese characters come out garbled. Why?
Because the page is UTF-8 encoded while the terminal uses the local system encoding, so the content has to be converted before printing:
import sys, urllib2

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request("http://blog.csdn.net/nevasun", headers=headers)
content = urllib2.urlopen(req).read()       # UTF-8 bytes
type = sys.getfilesystemencoding()          # local encoding format
print content.decode("UTF-8").encode(type)  # convert encoding format

OK, now Chinese pages can be fetched. The next step is to build a simple application on GAE ~
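The conversion is a two-step dance: decode the UTF-8 bytes into a unicode string, then encode that string into whatever the local system expects. A minimal sketch of just the conversion, using a hypothetical page fragment instead of a live request (the Chinese text below is an assumed sample, not from the blog):

```python
import sys

# Hypothetical page fragment as UTF-8 bytes, as urlopen(...).read() would return it
content = u"\u4e2d\u6587\u535a\u5ba2".encode("utf-8")  # "Chinese blog"

local = sys.getfilesystemencoding() or "utf-8"  # local system encoding
text = content.decode("utf-8")                  # UTF-8 bytes -> unicode string
out = text.encode(local)                        # unicode string -> local bytes
```

Decoding first is the key step: printing the raw UTF-8 bytes on a terminal with a different encoding is exactly what produced the garbled characters.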
From: http://linux.chinaitlab.com/Python/878184.html