Garbled Chinese characters in Python

Source: Internet
Author: User

If you run into this problem and search Google for it, you will find little that helps. I hope the following notes are useful to you.

This document describes how to fetch a page from a website and find Chinese characters on that page. The platform used in the experiment is Python 2.7.

Python 2.7 is very literal-minded: whatever encoded byte string the operating system hands it, it accepts unchanged. Its default encoding is often different from the encoding your data actually uses, and that mismatch is what causes the various errors.


Solution:

1. Find out which encoding the system uses.

2. Decode the fetched page using the system encoding, then encode it as UTF-8.

3. Use UTF-8 encoding in all your own scripts.

4. When processing is complete, encode your output strings as UTF-8.

That is the whole process.



    # -*- coding: utf-8 -*-
    import sys
    import urllib2

    # Get the encoding reported by the system
    sysCharType = sys.getfilesystemencoding()

    # Fetch the page
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request("http://www.baidu.com/", headers=headers)
    html = urllib2.urlopen(req).read()

    # Decode from the system encoding, then re-encode as UTF-8
    sysHtml = html.decode(sysCharType).encode('utf-8')

    # The Chinese text to look for ("Baidu, you will know")
    s = '百度一下，你就知道'

    if html.find(s) != -1:
        print 'unconverted page: match'
    else:
        print 'unconverted page: no match'

    if sysHtml.find(s) != -1:
        print 'converted page: match'
    else:
        print 'converted page: no match'
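
The script above needs Python 2.7 and network access. As a rough offline sketch of the same decode-then-encode steps, the Python 3 snippet below substitutes a hard-coded GBK byte string for the downloaded page. The names `to_utf8`, `fake_page` and `page_encoding` are illustrative assumptions and do not come from the original script. In Python 3, `bytes` plays the role that `str` plays in Python 2.7.

```python
# Python 3 sketch; `bytes` here corresponds to Python 2.7's `str`.

def to_utf8(raw, source_encoding):
    """Decode raw bytes from source_encoding, then re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode('utf-8')

# Stand-in for urlopen(...).read(): a page fragment encoded as GBK (assumption).
page_encoding = 'gbk'
fake_page = '<title>百度一下，你就知道</title>'.encode(page_encoding)

# The search string as UTF-8 bytes, as it would be stored in a UTF-8 script file.
needle = '百度一下，你就知道'.encode('utf-8')

converted = to_utf8(fake_page, page_encoding)

print(fake_page.find(needle) != -1)   # False: the GBK bytes do not contain the UTF-8 needle
print(converted.find(needle) != -1)   # True: both sides are UTF-8 now
```

The search only succeeds once the page bytes and the search string share one encoding, which is the point of steps 2 and 3.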


In addition, I learned a lot from reading kiki113's article (see kiki113's original post).

In Python 2, str is a byte array. It is only a stream of bytes and carries no other meaning. unicode, by contrast, holds characters in Unicode form. A unicode string can always be viewed as a byte stream once it is encoded, but a byte stream cannot be interpreted as a unicode string unless you know which encoding produced it. This is the direct cause of the frequent problems in character encoding conversion.


Encodings of '您好':

    Encoding    Encoding result
    GBK         \xc4\xfa\xba\xc3
    unicode     \u60a8\u597d
    UTF-8       \xe6\x82\xa8\xe5\xa5\xbd
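
The values in this table can be reproduced with a few lines of code. This is a minimal sketch in Python 3 (not the article's Python 2.7), where `str` holds code points and `encode` returns `bytes`. The variable names are illustrative.

```python
# Python 3 sketch: encode the same two characters in two byte encodings.
text = '\u60a8\u597d'  # the two characters 您好, written as code point escapes

gbk_bytes = text.encode('gbk')
utf8_bytes = text.encode('utf-8')
code_points = [hex(ord(ch)) for ch in text]

print(gbk_bytes)     # b'\xc4\xfa\xba\xc3'
print(utf8_bytes)    # b'\xe6\x82\xa8\xe5\xa5\xbd'
print(code_points)   # ['0x60a8', '0x597d']
```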


In Python 2, if a str is encoded directly into another encoding such as UTF-8, the system first decodes the str into unicode and then encodes that unicode object into the target encoding (UTF-8). For the str-to-unicode step it assumes the str is in the default encoding, which is normally ASCII. If the original bytes are not ASCII, an error occurs.

The following are examples:
1. View the default encoding

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> s = '您好'
    >>> s
    '\xc4\xfa\xba\xc3'      # this is a byte stream in GBK encoding
    >>> s1 = u'您好'
    >>> s1
    u'\xc4\xfa\xba\xc3'     # the u prefix marks the literal that follows as unicode
    >>> type(s)             # check the object types
    <type 'str'>
    >>> type(s1)
    <type 'unicode'>        # s1 is of the unicode type, but its content is the GBK bytes


Convert to UTF-8:

    >>> su = s.encode('utf-8')
    Traceback (most recent call last):
      File "<pyshell#9>", line 1, in <module>
        su = s.encode('utf-8')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
    # direct conversion fails
    >>> su = s.decode('gbk').encode('utf-8')
    >>> su
    '\xe6\x82\xa8\xe5\xa5\xbd'
    >>> print su
    鎮ㄥソ
    >>> sun = su.decode('utf-8')
    >>> sun
    u'\u60a8\u597d'
    >>> print sun
    您好
    >>> s.decode('gbk')
    u'\u60a8\u597d'
    >>> s1.decode('gbk')
    Traceback (most recent call last):
      File "<pyshell#17>", line 1, in <module>
        s1.decode('gbk')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
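
The first error in the session above comes from the hidden ASCII decode that Python 2.7 performs before encoding a str. The sketch below imitates that behavior in Python 3, which has no such implicit step. The helper `py2_style_encode` is a hypothetical name invented for this illustration, not a real library function.

```python
# Python 3 sketch of what Python 2.7 does for str.encode(target).

def py2_style_encode(raw, target, default_encoding='ascii'):
    """Imitate Python 2.7's str.encode: implicit decode first, then encode."""
    return raw.decode(default_encoding).encode(target)

gbk_bytes = b'\xc4\xfa\xba\xc3'  # the characters 您好 encoded as GBK

error_message = None
try:
    py2_style_encode(gbk_bytes, 'utf-8')
except UnicodeDecodeError as exc:
    # The implicit ASCII decode fails on the first non-ASCII byte.
    error_message = str(exc)
    print(error_message)

# Naming the real source encoding explicitly avoids the implicit ASCII step.
utf8_bytes = gbk_bytes.decode('gbk').encode('utf-8')
print(utf8_bytes)  # b'\xe6\x82\xa8\xe5\xa5\xbd'
```

Passing `default_encoding='gbk'` to the helper makes the call succeed, which mirrors writing `s.decode('gbk').encode('utf-8')` in the session.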


The role of the file encoding and the encoding declaration

Save '您好' in an ASCII-format file (in practice ANSI, which is GBK on a Chinese-locale system), a Unicode-format file, and a UTF-8-format file, then use a hex viewer to look at the content of each file.
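
The experiment can also be approximated in code. The Python 3 sketch below writes the two characters to temporary files and prints a hex dump of each. It assumes that the "ASCII" format means GBK and that the "Unicode" format means UTF-16 little-endian with a byte order mark, as in Windows Notepad. The UTF-8 file is written without a byte order mark. All names are illustrative.

```python
import codecs
import os
import tempfile

text = '\u60a8\u597d'  # the two characters 您好

# Assumed meaning of the three save formats (Windows Notepad conventions).
encoded = {
    'ANSI (GBK)': text.encode('gbk'),
    'Unicode (UTF-16 LE + BOM)': codecs.BOM_UTF16_LE + text.encode('utf-16-le'),
    'UTF-8': text.encode('utf-8'),
}

def hex_dump(data):
    """Return the bytes as space-separated two-digit hex values."""
    return ' '.join(format(byte, '02x') for byte in data)

dumps = {}
with tempfile.TemporaryDirectory() as folder:
    for index, (label, data) in enumerate(encoded.items()):
        path = os.path.join(folder, 'sample%d.txt' % index)
        with open(path, 'wb') as handle:
            handle.write(data)
        # Read the file back as raw bytes, as a hex viewer would.
        with open(path, 'rb') as handle:
            dumps[label] = hex_dump(handle.read())

for label, dump in dumps.items():
    print('%-26s %s' % (label, dump))
```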


The same characters are stored as different bytes depending on the encoding used. The encoding declaration states explicitly which encoding the characters in the file use, and tells the interpreter to interpret the characters it meets from then on in the declared encoding. For example, take this script file:

    s = '您好'
    print repr(s)

A. If the file is saved as UTF-8, the value of the str is '\xe6\x82\xa8\xe5\xa5\xbd'.
B. If the file is saved as ASCII (ANSI, that is, GBK), the value of the str is '\xc4\xfa\xba\xc3'.


What if the file encoding does not match the declaration?

The file is saved as UTF-8 and declared as GBK:

    # -*- coding: GBK -*-
    s = '您好'
    print repr(s)
    print s

Running result:

    '\xe6\x82\xa8\xe5\xa5\xbd'
    鎮ㄥソ

The file is saved as ASCII and declared as UTF-8 (we know the file is actually saved as GBK). Running result:

    '\xc4\xfa\xba\xc3'
    您好

Given the discussion above, this looks like exactly what one would expect. The next case is more surprising.


The file is saved as UTF-8 and declared as GBK, this time with a unicode literal:

    # -*- coding: GBK -*-
    s = u'您好'
    print repr(s)
    print s

Running result:

    u'\u93ae\u3125\u30bd'
    鎮ㄥソ



Why? When s = u'您好' runs, the whole process can be divided into the following steps:

1) Get the encoded bytes of '您好'. These are determined by the file's encoding, so they are '\xe6\x82\xa8\xe5\xa5\xbd'.

2) Convert to unicode. The bytes '\xe6\x82\xa8\xe5\xa5\xbd' are decoded according to the encoding declared in the file, not as UTF-8. The declared GBK encoding is used, which yields the string '鎮ㄥソ'. The unicode code points of these three characters are u'\u93ae\u3125\u30bd', so print repr(s) outputs u'\u93ae\u3125\u30bd'.
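
The two steps can be replayed by hand. The following Python 3 sketch (again not the article's Python 2.7) decodes the UTF-8 bytes with the wrong codec and then undoes the damage. The variable names are illustrative.

```python
# Python 3 sketch: decode UTF-8 bytes with the wrong (declared) encoding.
utf8_bytes = '\u60a8\u597d'.encode('utf-8')  # step 1: the bytes stored in the UTF-8 file

wrong = utf8_bytes.decode('gbk')     # step 2: decoded with the declared GBK encoding
right = utf8_bytes.decode('utf-8')   # what a matching declaration would have produced

print(ascii(wrong))   # '\u93ae\u3125\u30bd'
print(ascii(right))   # '\u60a8\u597d'

# Re-encoding with the wrong codec restores the original bytes,
# which can then be decoded correctly.
recovered = wrong.encode('gbk').decode('utf-8')
print(recovered == right)   # True
```

The round trip works here because every byte pair happened to be valid GBK. In general, a mismatched decode can fail or lose data, so it is safer to make the declaration match the file encoding.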

