Garbled Chinese characters in Python

Last Update:2014-05-24 Source: Internet

Author: User


If you run into this problem and search Google for it, you will find little that helps. I hope the following notes are useful to you.

This document describes how to fetch a page from a website and search it for Chinese characters. The experiments use Python 2.7.

Python 2.7 takes strings at face value: whatever encoded bytes the operating system hands it, it keeps exactly as they are. When the system's default encoding differs from the encoding you use in your script, all sorts of errors follow.

Solution:

1. Find out which encoding the system uses.

2. Decode the fetched page using the system encoding, then encode it as UTF-8.

3. Use UTF-8 encoding in all your scripts.

4. When processing is complete, encode your strings as UTF-8.

That is the whole process.

[Python]

# -*- coding: utf-8 -*-
import sys
import urllib2

# Get the encoding used by the system
sysCharType = sys.getfilesystemencoding()

# Fetch the page
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request("http://www.baidu.com/", headers=headers)
html = urllib2.urlopen(req).read()

# Decode with the system encoding, then re-encode as UTF-8
sysHtml = html.decode(sysCharType).encode('utf-8')

s = '百度一下，你就知道'
if html.find(s) != -1:
    print 'matched without conversion'
else:
    print 'no match without conversion'
if sysHtml.find(s) != -1:
    print 'matched after conversion'
else:
    print 'no match after conversion'
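
The decode-then-encode step in the script above can be tried without a network connection. The following is a minimal sketch written in Python 3 syntax, where `bytes` plays the role of Python 2's `str`. The helper name `transcode_to_utf8` and the sample page bytes are assumptions made for illustration; they are not part of the original script.

```python
def transcode_to_utf8(raw, source_encoding):
    """Decode raw bytes with the given encoding, then re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode('utf-8')

# Hypothetical stand-in for a downloaded page whose bytes are GBK-encoded.
page_gbk = '<title>百度一下，你就知道</title>'.encode('gbk')

# The search string as UTF-8 bytes, as it would appear in a UTF-8 script.
needle = '百度一下，你就知道'.encode('utf-8')

# Searching the raw GBK bytes for UTF-8 bytes cannot succeed.
found_raw = page_gbk.find(needle) != -1

# After transcoding, both sides use the same encoding and the search works.
found_converted = transcode_to_utf8(page_gbk, 'gbk').find(needle) != -1

print(found_raw, found_converted)
```

The point of the sketch is only that both the page and the search string must be in the same encoding before a byte-level comparison can match.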

I also learned a lot from reading kiki113's original post; the notes below draw on it.

In Python 2, a str is a byte array: just a stream of bytes with no further meaning attached. A unicode object, by contrast, represents characters. A unicode string can always be turned into a byte stream by encoding it, but a byte stream cannot be read as a unicode string unless you know the encoding that produced it. This is the direct cause of most problems in character encoding conversion.

Encodings of '您好':

Encoding    Encoding result
GBK         \xc4\xfa\xba\xc3
unicode     \u60a8\u597d
UTF-8       \xe6\x82\xa8\xe5\xa5\xbd
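
The byte values listed above can be checked directly. This is a small sketch in Python 3 syntax (an assumption; the article itself uses Python 2.7), using the two-character Chinese greeting 您好 (U+60A8 U+597D) as the sample text.

```python
text = '您好'  # U+60A8 U+597D

# Encode the same characters with two different codecs.
gbk_bytes = text.encode('gbk')
utf8_bytes = text.encode('utf-8')

print('GBK    ', gbk_bytes)
print('unicode', ascii(text))
print('UTF-8  ', utf8_bytes)

# The bytes only become characters again once an encoding is named.
print(gbk_bytes.decode('gbk') == text)
print(utf8_bytes.decode('utf-8') == text)
```

The same two characters produce four bytes in GBK and six in UTF-8, which is why bytes from one encoding never match bytes from the other.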

In Python 2, if you call encode directly on a str to convert it to another encoding such as UTF-8, the system first decodes the str into unicode and then encodes that unicode into the target encoding (UTF-8). For the str-to-unicode step, the system assumes the str is in the default encoding, which is normally ASCII. If the original bytes are not ASCII, an error occurs.
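
The implicit two-step conversion described above can be imitated explicitly. The sketch below uses Python 3 syntax, where nothing is decoded implicitly, so the hidden step has to be written out. The function name `py2_style_encode` is hypothetical and only mimics the behaviour described; it is not a real library call.

```python
def py2_style_encode(raw, target, default='ascii'):
    """Mimic Python 2's str.encode: decode with the default codec, then encode."""
    return raw.decode(default).encode(target)

gbk_bytes = b'\xc4\xfa\xba\xc3'  # GBK bytes for the two Chinese characters

try:
    py2_style_encode(gbk_bytes, 'utf-8')
    failed = False
except UnicodeDecodeError as exc:
    # The hidden ASCII decode step is what raises the error.
    failed = True
    print(exc)

# Naming the real source encoding makes the conversion succeed.
utf8_bytes = gbk_bytes.decode('gbk').encode('utf-8')
print(utf8_bytes)
```

Pure ASCII input passes through the hypothetical helper unchanged, which is why the problem only shows up once non-ASCII text is involved.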

The following are examples:
1. View the default encoding

[Python]

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> s = '您好'
>>> s
'\xc4\xfa\xba\xc3'  # This is a byte stream in GBK encoding.
>>> s1 = u'您好'  # The u prefix marks the string that follows as unicode.
>>> s1
u'\xc4\xfa\xba\xc3'
>>> type(s)  # Check the object types.
<type 'str'>
>>> type(s1)
<type 'unicode'>  # s1 is of the unicode type, but its content is the GBK bytes.

2. Convert to UTF-8

[Python]

>>> su = s.encode('utf-8')
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    su = s.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
# Direct conversion fails.
>>> su = s.decode('gbk').encode('utf-8')
>>> su
'\xe6\x82\xa8\xe5\xa5\xbd'
>>> print su
鎮ㄥソ
>>> sun = su.decode('utf-8')
>>> sun
u'\u60a8\u597d'
>>> print sun
您好
>>> s.decode('gbk')
u'\u60a8\u597d'
>>> s1.decode('gbk')
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    s1.decode('gbk')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
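
The last error in the session above arises because s1 holds four code points (U+00C4, U+00FA, U+00BA, U+00C3) rather than two Chinese characters, and Python 2 first tries to encode a unicode object with the default ASCII codec before decoding it. The sketch below, in Python 3 syntax, reproduces the failing step and shows one possible repair via Latin-1, which maps code points 0 to 255 straight back to the same byte values. The repair is offered as an illustration, not as something the original article does.

```python
# A unicode string whose code points are really GBK byte values.
s1 = '\xc4\xfa\xba\xc3'

try:
    s1.encode('ascii')  # the implicit step Python 2 performs before decoding
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Latin-1 turns each code point below 256 back into the identical byte,
# after which the bytes can be decoded with their true encoding.
repaired = s1.encode('latin-1').decode('gbk')

print(encode_failed, repaired)
```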

The role of the file encoding format and the encoding declaration

Save '您好' in an ASCII format file, a unicode format file, and a UTF-8 format file, then use a hexadecimal viewer to inspect the contents of each file.

As you can see, the same characters are stored as different bytes under different encoding formats. The encoding declaration states explicitly which character encoding the file uses and tells the system to interpret all the characters that follow in the declared format. For example, take this script file:

[Python]

s = '您好'
print repr(s)

A. If the file format is UTF-8, the str value is '\xe6\x82\xa8\xe5\xa5\xbd'
B. If the file format is ASCII, the str value is '\xc4\xfa\xba\xc3'
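
The effect of the file's encoding on the stored bytes can be observed without a hex viewer. The following is a minimal sketch in Python 3 syntax that writes the same two characters to temporary files in GBK and UTF-8 and reads the raw bytes back. The file names are made up for the example, and GBK is used here on the assumption that the "ASCII" file in the article is in practice stored as GBK.

```python
import os
import tempfile

text = '您好'
stored = {}

with tempfile.TemporaryDirectory() as tmp:
    for enc in ('gbk', 'utf-8'):
        # Hypothetical file name, one file per encoding.
        path = os.path.join(tmp, 'sample_' + enc + '.txt')
        with open(path, 'w', encoding=enc) as f:
            f.write(text)
        # Read the file back in binary mode to see the bytes on disk.
        with open(path, 'rb') as f:
            stored[enc] = f.read()
        print(enc, stored[enc].hex())
```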

What if the encoding format does not match the declaration?

The file is saved in UTF-8 format and declared as GBK

[Python]

# -*- coding: GBK -*-
s = '您好'
print repr(s)
print s

Running result

[Python]

'\xe6\x82\xa8\xe5\xa5\xbd'
鎮ㄥソ

The file is saved as ASCII and declared as UTF-8 (in practice, the file is stored in GBK format)

[Python]

'\xc4\xfa\xba\xc3'
您好

Given the discussion above, these results are just what one would expect. The next case is more surprising.

The file is saved in UTF-8 format and declared as GBK

[Python]

# -*- coding: GBK -*-
s = u'您好'
print repr(s)
print s

The running result is

[Python]

u'\u93ae\u3125\u30bd'
鎮ㄥソ

Why? When s = u'您好' runs, the whole process can be divided into the following steps:

1) Get the encoded bytes of '您好'. These are determined by the encoding the file is saved in, so they are '\xe6\x82\xa8\xe5\xa5\xbd'.

2) Convert those bytes to unicode. The bytes '\xe6\x82\xa8\xe5\xa5\xbd' are decoded not as UTF-8 but with the encoding named in the file's encoding declaration, GBK, which yields the string '鎮ㄥソ'. The unicode code points of these three characters are u'\u93ae\u3125\u30bd', so print repr(s) outputs u'\u93ae\u3125\u30bd'.
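
The two steps above can be replayed by hand. This is a short sketch in Python 3 syntax (an assumption; the article's interpreter is Python 2.7) that encodes the two Chinese characters 您好 as UTF-8, as the file stores them, and then decodes those bytes as GBK, as the declaration instructs.

```python
# Step 1: the bytes as stored in a file saved as UTF-8.
source_bytes = '您好'.encode('utf-8')

# Step 2: decode those bytes with the declared encoding, GBK.
decoded = source_bytes.decode('gbk')

# Inspect the resulting code points.
code_points = [hex(ord(ch)) for ch in decoded]

print(source_bytes)
print(ascii(decoded))
print(code_points)
```

Six UTF-8 bytes are read as three two-byte GBK characters, which is why two characters go in and three different ones come out.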

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
