Python encoding conversion and Chinese processing
Unicode handling in Python 2 is confusing and hard to get right. Unicode, GBK, and GB2312 are coded character sets; UTF-8 is one encoding (implementation) of Unicode.
decode parses an ordinary byte string (str) according to the encoding named in its argument and produces the corresponding unicode object; encode goes the other way, turning a unicode object into a byte string in the named encoding.
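As a minimal sketch of the two directions, the snippet below encodes one piece of text with two codecs and decodes it back. The variable names and the sample text are assumptions for illustration, not taken from the script; the text is written with escapes so the snippet does not depend on the file's own encoding, and it should run on both Python 2 and Python 3.

```python
# Sample text (an assumption for illustration): two Chinese characters
text = u"\u5929\u6c14"

# encode: unicode text -> byte string in the named encoding
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# The same characters produce different byte sequences per codec
utf8_len = len(utf8_bytes)
gbk_len = len(gbk_bytes)

# decode: byte string -> unicode text, using the codec that produced the bytes
from_utf8 = utf8_bytes.decode("utf-8")
from_gbk = gbk_bytes.decode("gbk")
```

The same two characters take 6 bytes in UTF-8 and 4 bytes in GBK, which is why the codec passed to decode has to match the one that produced the bytes.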
A Chinese encoding problem I ran into while writing a Python script:
➜ /test sudo vim test.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
def weather():
    import time
    import re
    import urllib2
    import itchat
    # Simulate a browser
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36")
    url = "https://tianqi.moji.com/weather/china/guangdong/shantou"  # URL of the weather forecast page to scrape
    par = '(<meta name="description" content=")(.*?)(">)'  # regex that captures the page's description text
    # Create an opener object and install it globally
    opener = urllib2.build_opener()
    opener.addheaders = [headers]
    urllib2.install_opener(opener)
    # Fetch the page and decode its bytes into a unicode object
    html = urllib2.urlopen(url).read().decode("utf-8")
    # Extract the content we want
    data = re.search(par, html).group(2)
    print type(data)
    data.encode('gb2312')  # the result is discarded, so data stays unicode
    b = '天气预报'  # "weather forecast": a str holding UTF-8 bytes
    print type(b)
    c = b + '\n' + data
    print c
weather()
➜ /test sudo python test.py
<type 'unicode'>
<type 'str'>
Traceback (most recent call last):
  File "test.py", line A, in <module>
    weather()
  File "test.py", line +, in weather
    c = b + '\n' + data
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
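The failure can be reproduced in isolation. When a str is concatenated with a unicode object, Python 2 first decodes the str with the default codec, which is ASCII. The sketch below performs that decode explicitly; `try_decode` and the sample bytes are hypothetical, chosen so that the first byte is 0xe5 as in the traceback. It should run on both Python 2 and Python 3.

```python
# UTF-8 bytes of a Chinese label (an assumption for illustration)
raw = u"\u5929\u6c14\u9884\u62a5".encode("utf-8")


def try_decode(data, codec):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return exc


# ASCII cannot represent byte 0xe5, so this yields the exception object
ascii_result = try_decode(raw, "ascii")

# Naming the codec that actually produced the bytes succeeds
utf8_result = try_decode(raw, "utf-8")
```

Printing `ascii_result` shows the same message as the last line of the traceback, while `utf8_result` holds the decoded text.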
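Workaround: change the interpreter's default encoding from ASCII to UTF-8, so that the implicit conversion of b to unicode succeeds: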
➜ /test sudo vim test.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
reload(sys)
# Python 2.5 deletes sys.setdefaultencoding after initialization, so sys must be reloaded to get it back
sys.setdefaultencoding('utf-8')
def weather():
    import time
    import re
    import urllib2
    import itchat
    # Simulate a browser
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36")
    url = "https://tianqi.moji.com/weather/china/guangdong/shantou"  # URL of the weather forecast page to scrape
    par = '(<meta name="description" content=")(.*?)(">)'  # regex that captures the page's description text
    # Create an opener object and install it globally
    opener = urllib2.build_opener()
    opener.addheaders = [headers]
    urllib2.install_opener(opener)
    # Fetch the page and decode its bytes into a unicode object
    html = urllib2.urlopen(url).read().decode("utf-8")
    # Extract the content we want
    data = re.search(par, html).group(2)
    print type(data)
    data.encode('gb2312')  # the result is discarded, so data stays unicode
    b = '天气预报'  # "weather forecast": a str holding UTF-8 bytes
    print type(b)
    c = b + '\n' + data  # b is now decoded as UTF-8 instead of ASCII
    print c
weather()
After the change, the script runs without the error (the program's Chinese output is shown translated):
➜ /test sudo python test.py
<type 'unicode'>
<type 'str'>
Weather forecast
Shantou live weather today: 20 degrees, cloudy, humidity 57%, east wind force 2. Daytime: 20 degrees, cloudy. Night: clear, 13 degrees. The weather is cold; Moji Weather suggests wearing a thicker coat or a warm sweater, and the elderly and frail may choose a warm polar fleece jacket.
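The workaround changes a process-wide default. A narrower alternative, sketched here as an assumption rather than as the author's method, is to decode the byte string explicitly before concatenating, so that both operands are already unicode. `join_label` and the sample values are hypothetical; the snippet should run on both Python 2 and Python 3.

```python
def join_label(label_bytes, data, encoding="utf-8"):
    """Decode the byte-string label explicitly, then join text with text."""
    label = label_bytes.decode(encoding)
    return label + u"\n" + data


# Stand-ins for the script's values: a UTF-8 byte string and scraped unicode text
label_bytes = u"\u5929\u6c14\u9884\u62a5".encode("utf-8")
data = u"\u5929\u6c14"

c = join_label(label_bytes, data)
```

Because the codec is named at the point of use, the result does not depend on the interpreter's default encoding.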