Straight to the practical stuff!
Environment: Python 2.7.5 on Windows.
Open http://www.apple.com/cn/itunes/charts/free-apps/
Looking at the page source, you can see that it is encoded in UTF-8.
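One way to confirm the page encoding without reading the source by hand is to look for the charset in a `<meta>` tag. This is only a minimal sketch: `sniff_charset` and the sample HTML are made up for illustration, and real pages may declare their charset in the HTTP `Content-Type` header instead.

```python
import re

def sniff_charset(html_bytes, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found.

    Hypothetical helper for illustration; it only inspects the raw bytes.
    """
    # Meta tags live in <head>, so searching the first 2 KB is usually enough.
    head = html_bytes[:2048]
    match = re.search(br'<meta[^>]+charset=["\']?([\w-]+)', head, re.IGNORECASE)
    if match:
        return match.group(1).decode("ascii").lower()
    return default

# Made-up sample markup, not the real Apple page.
sample = b'<html><head><meta charset="UTF-8"><title>x</title></head></html>'
print(sniff_charset(sample))
```

The same function could be given the bytes returned by `read()` on the response object.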
After some back-and-forth, I wrote the following code (criticism is welcome, just go easy on me):
# coding=utf-8
import urllib2
import urllib
import re
import thread
import time

# ----------- APP Store leaderboard -----------
class Spider_Model:
    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False

    def getcon(self):
        myurl = "http://www.apple.com/cn/itunes/charts/free-apps/"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myurl, headers=headers)
        myresponse = urllib2.urlopen(req)
        mypage = myresponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into a unicode string
        print mypage

mymodel = Spider_Model()
mymodel.getcon()
The scraped page and the Python source file both use UTF-8, so in theory there should be no problem.
Printed result:
Time to bring out the ultimate weapon: www.baidu.com
The cause turned out to be explained here:
http://blog.csdn.net/lf8289/article/details/2465196
http://www.crifan.com/unicodeencodeerror_gbk_codec_can_not_encode_character_in_position_illegal_multibyte_sequence/
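The second link's title names the error: the GBK codec cannot encode a character. A small offline sketch reproduces it, assuming the Windows console expects GBK and the page contains a character GBK cannot represent. The sample text and the `try_encode` helper are made up for illustration, and U+00A0 (no-break space) is simply one such character.

```python
# Simulated page bytes: UTF-8 text containing a no-break space (U+00A0).
# \u6392\u884c\u699c is the Chinese word for "leaderboard".
page_bytes = u"App\u00a0Store \u6392\u884c\u699c".encode("utf-8")

# Bytes fetched from the network must be decoded to unicode text first.
text = page_bytes.decode("utf-8")

def try_encode(s, codec):
    """Return the error message, or None if `s` encodes cleanly (hypothetical helper)."""
    try:
        s.encode(codec)
        return None
    except UnicodeEncodeError as exc:
        return str(exc)

print(try_encode(text, "utf-8"))  # UTF-8 can represent every character
print(try_encode(text, "gbk"))    # GBK has no mapping for U+00A0
```

So the page and the source file can both be UTF-8 and the print can still fail, because the problem is in the conversion to the console's encoding.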
After all sorts of frantic changes ....
# coding=GBK
# The source file encoding has been changed to GBK
import urllib2
import urllib
import re
import thread
import time

# ----------- APP Store leaderboard -----------
class Spider_Model:
    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False

    def getcon(self):
        myurl = "http://www.apple.com/cn/itunes/charts/free-apps/"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myurl, headers=headers)
        myresponse = urllib2.urlopen(req)
        mypage = myresponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into a unicode string
        # The scraped page is UTF-8; convert it to GBK ('ignore' skips characters GBK cannot represent)
        unicodepage = mypage.decode('utf-8').encode('GBK', 'ignore')
        print unicodepage

mymodel = Spider_Model()
mymodel.getcon()
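Note that 'ignore' silently drops every character GBK cannot represent. The sketch below compares it with the standard 'replace' error handler, which leaves a `?` where a character was lost. `to_console_bytes` and the sample text are made up for illustration and are not part of the spider above.

```python
def to_console_bytes(page_bytes, source="utf-8", target="gbk", errors="ignore"):
    """Transcode page bytes for a console that expects `target` (hypothetical helper)."""
    return page_bytes.decode(source).encode(target, errors)

# Made-up sample: U+00A0 has no GBK mapping, the three Chinese characters do.
page_bytes = u"Top\u00a0Free \u6392\u884c\u699c".encode("utf-8")

ignored = to_console_bytes(page_bytes, errors="ignore")    # unmappable character removed
replaced = to_console_bytes(page_bytes, errors="replace")  # unmappable character becomes '?'

# Printing the byte reprs keeps the output ASCII-safe on any console.
print(repr(ignored))
print(repr(replaced))
```

Which handler is better depends on whether a visible marker for lost characters is useful. For a quick look at the page on the console, either one avoids the exception.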
Result of running it: