Python crawls Chinese Web pages

Source: Internet
Author: User

I have long wanted to record the daily traffic of my blog. I recently registered a GAE (Google App Engine) application and started learning Python again. The plan is to first write a Python script that saves the access records locally, and then deploy it to GAE, where the cron service that GAE provides can collect the traffic numbers every day. OK, let's start.

First, a simple program that fetches a web page:

    import sys, urllib2

    req = urllib2.Request("http://blog.csdn.net/nevasun")
    fd = urllib2.urlopen(req)

    # Read the response in 1024-byte chunks and echo each chunk to stdout.
    while True:
        data = fd.read(1024)
        if not len(data):
            break
        sys.stdout.write(data)
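The read loop itself does not depend on the network, so it can be tried in isolation. The following is a minimal sketch, assuming Python 3 (the original uses Python 2 and urllib2); the helper name `copy_in_chunks` is hypothetical, and an in-memory `io.BytesIO` object stands in for the response that `urlopen` would return.

```python
import io


def copy_in_chunks(source, sink, chunk_size=1024):
    """Copy a binary file-like object to sink in fixed-size chunks.

    Returns the total number of bytes copied.
    """
    total = 0
    while True:
        data = source.read(chunk_size)
        if not data:  # an empty bytes object signals end of stream
            break
        sink.write(data)
        total += len(data)
    return total


# io.BytesIO stands in for the HTTP response (an assumption for this demo).
fake_response = io.BytesIO(b"<html>" + b"x" * 2000 + b"</html>")
output = io.BytesIO()  # in a real script this could be sys.stdout.buffer
copied = copy_in_chunks(fake_response, output)
```

Reading in fixed-size chunks keeps memory use bounded for large pages; for a small blog page, a single `read()` call works just as well.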

Running this urllib2 program in the terminal fails with urllib2.HTTPError: HTTP Error 403: Forbidden. What is the problem?

The site rejects requests that look like they come from a crawler. Adding a User-Agent header to the request makes it look like it comes from a browser. Add the header and modify the request as follows:

    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request("http://blog.csdn.net/nevasun", headers=headers)
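For readers on Python 3, where urllib2 was merged into `urllib.request`, the same header trick might look like the sketch below. The names `USER_AGENT`, `build_request` and `fetch` are hypothetical, chosen only for this illustration. The request object is built and inspected without touching the network; `fetch` is defined but not called here.

```python
import urllib.request

# Same browser string as in the article; any realistic value should do.
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
              "Gecko/20091201 Firefox/3.5.6")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download the page body as bytes (performs a real network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


req = build_request("http://blog.csdn.net/nevasun")
# Request normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Whether a given site accepts the request still depends on that site's own rules; the header only changes how the client identifies itself.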

Try again. The HTTP Error 403 is gone, but all the Chinese characters are garbled. Why?

The site is UTF-8 encoded, so the content has to be converted to the local system's encoding before it is printed:

    import sys, urllib2

    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request("http://blog.csdn.net/nevasun", headers=headers)

    content = urllib2.urlopen(req).read()  # raw bytes, UTF-8 encoded
    local_encoding = sys.getfilesystemencoding()  # local encoding
    print content.decode("UTF-8").encode(local_encoding)  # convert to the local encoding

OK, the Chinese page can now be captured correctly. The next step is to build a simple application on GAE.
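The decode-then-encode step can also be tried offline. The sketch below assumes Python 3; the helper name `to_local_bytes` is hypothetical, and a short hard-coded string stands in for the downloaded page. In Python 3, `print` accepts the decoded `str` directly, so the explicit re-encoding is mainly useful when writing bytes to a file or to a terminal with a known encoding.

```python
import sys


def to_local_bytes(raw, source_encoding="utf-8", target_encoding=None):
    """Decode raw bytes and re-encode them for the local system."""
    if target_encoding is None:
        target_encoding = sys.getfilesystemencoding()
    text = raw.decode(source_encoding)
    # errors="replace" substitutes "?" for characters that the target
    # encoding cannot represent, instead of raising UnicodeEncodeError.
    return text.encode(target_encoding, errors="replace")


raw = "中文网页".encode("utf-8")  # stand-in for urlopen(req).read()
gbk_bytes = to_local_bytes(raw, target_encoding="gbk")
ascii_bytes = to_local_bytes(raw, target_encoding="ascii")
```

Passing `errors="replace"` is a design choice for this sketch: it trades fidelity for robustness when the local encoding cannot represent every character on the page.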

 

From: http://linux.chinaitlab.com/Python/878184.html
