Python Character Encoding and Chinese Text Processing

Source: Internet
Author: User
Python 2 has two types of strings: byte strings (str) and Unicode strings (unicode).

byteString = "hello world! (in my default locale)"
unicodeString = u"hello Unicode world!"

Mutual conversion

s = "hello normal string"
u = unicode(s, "UTF-8")
backToBytes = u.encode("UTF-8")
backToUtf8 = backToBytes.decode("utf-8")  # same effect as unicode(s, "UTF-8") above
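The round trip above can be sketched with a non-ASCII value, which makes the difference between bytes and characters visible. This is a minimal sketch, not the author's code: it uses the b"..." literal and the decode/encode methods so that it runs on both Python 2.7 and Python 3, and the variable names are illustrative.

```python
# Round trip between a byte string and a Unicode string.
raw = b"\xe4\xb8\xad\xe6\x96\x87"     # UTF-8 bytes of U+4E2D U+6587 (two Chinese characters)
text = raw.decode("utf-8")            # bytes -> Unicode text
back_to_bytes = text.encode("utf-8")  # Unicode text -> bytes

print(len(raw))              # length in bytes
print(len(text))             # length in characters
print(back_to_bytes == raw)  # the round trip is lossless
```

The byte string is six bytes long while the decoded text is two characters long, which is why the two types must not be mixed up when slicing or measuring Chinese text.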

How to tell them apart

isinstance(s, str)         # False for Unicode strings
isinstance(s, unicode)     # True for Unicode strings
isinstance(s, basestring)  # True for both kinds of string
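The checks above use names that exist only in Python 2. As a hedged sketch, the same test can be written so that it also runs on Python 3, where the text type is str and the byte type is bytes. The names text_type, binary_type and describe are hypothetical and chosen only for this illustration.

```python
import sys

# Pick the text and byte types for the running interpreter.
if sys.version_info[0] >= 3:
    text_type = str
    binary_type = bytes
else:
    text_type = unicode  # only evaluated on Python 2
    binary_type = str


def describe(value):
    """Return 'text', 'bytes' or 'other' for the given value (hypothetical helper)."""
    if isinstance(value, text_type):
        return "text"
    if isinstance(value, binary_type):
        return "bytes"
    return "other"


print(describe(u"abc"))
print(describe(b"abc"))
```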

A quick test

# -*- coding: utf-8 -*-
import sys

print 'default encoding: ', sys.getdefaultencoding()
print 'file system encoding: ', sys.getfilesystemencoding()
print 'stdout encoding: ', sys.stdout.encoding
print u'u"中文" is unicode: ', isinstance(u'中文', unicode)
print u'"中文" is unicode: ', isinstance('中文', unicode)

Check the output and note the following facts:

Python's default encoding is ASCII. Python uses this default whenever it converts between byte strings and Unicode strings implicitly. Two examples:

1. a = "abc" + u"bcd": Python first evaluates "abc".decode(sys.getdefaultencoding()) and then concatenates the two Unicode strings.

2. print unicode('中文'): this statement fails with "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 ...", because Python tries the default encoding and the string is not ASCII. State the encoding explicitly instead. If the source file is saved as UTF-8, write: print unicode('中文', 'utf-8')
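The failure in the second example can be reproduced without relying on the default encoding by decoding with the ASCII codec explicitly, which is what the implicit conversion amounts to. The following is a small sketch under that assumption. The byte values are the UTF-8 encoding of the two characters used above, and the variable names are illustrative.

```python
raw = b"\xe4\xb8\xad\xe6\x96\x87"  # UTF-8 bytes; the first byte is 0xe4

try:
    raw.decode("ascii")            # equivalent to the implicit ASCII conversion
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    error_message = str(exc)
    print(error_message)

text = raw.decode("utf-8")         # naming the real encoding succeeds
print(len(text))
```

Naming the encoding at every bytes-to-text boundary avoids depending on the interpreter default at all.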

On Windows, getfilesystemencoding() returns mbcs (multi-byte character set, which Windows also calls ANSI). The actual encoding depends on the language of the Windows installation; on Chinese Windows it is one of the GB-series encodings.

On Chinese Windows, the console encoding is cp936. When you print a Unicode string to the console, Python converts it automatically. This leads to an interesting problem. Try this simple example, test.py:

# -*- coding: UTF-8 -*-
s = u'中文'
print s

Run python test.py and python test.py > 1.txt on the console.

The second command fails. When printing to the console, Python automatically encodes the string to sys.stdout.encoding, but when the output is redirected to a file, sys.stdout.encoding is None, so Python falls back to the default ASCII codec and raises UnicodeEncodeError. This issue is described in more detail in PrintFails.
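One common way around the redirect problem is to encode the text yourself before writing, so the result does not depend on sys.stdout.encoding. The sketch below is an illustration, not the fix recommended by the source: write_text is a hypothetical helper, UTF-8 is an assumed target encoding, and io.BytesIO stands in for a file or pipe opened in binary mode.

```python
import io


def write_text(stream, text, encoding="utf-8"):
    """Encode text explicitly and write the bytes to a binary stream (hypothetical helper)."""
    stream.write(text.encode(encoding))


buf = io.BytesIO()                  # stands in for a redirected output file
write_text(buf, u"\u4e2d\u6587\n")  # the same two characters as in test.py
data = buf.getvalue()
print(len(data))
```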

Reading UTF-8 encoded files

import codecs
fileObj = codecs.open("someFile", "r", "UTF-8")
u = fileObj.read()  # Returns a Unicode string decoded from the UTF-8 bytes in the file
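For a self-contained version of the read above, the sketch below first creates a small UTF-8 file and then reads it back as text. It swaps in io.open, which accepts an encoding argument on both Python 2.7 and Python 3, in place of the codecs.open call shown in the article. The temporary file and the variable names are assumptions made for the example.

```python
import io
import os
import tempfile

# Create a throwaway file containing UTF-8 bytes.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xe4\xb8\xad\xe6\x96\x87\n")

# Read it back; the file object decodes the bytes while reading.
with io.open(path, "r", encoding="utf-8") as file_obj:
    u = file_obj.read()

os.remove(path)
print(len(u))  # characters, not bytes
```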

Writing the BOM yourself

out = file("someFile", "w")
out.write(codecs.BOM_UTF8)
out.write(unicodeString.encode("UTF-8"))
out.close()
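The same steps can be checked end to end by writing to a temporary file in binary mode and inspecting the first bytes. This is a minimal sketch: the temporary path and the sample text are assumptions, and binary mode is used so that the snippet behaves the same on Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

text = u"\u4e2d\u6587"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:     # binary mode: the bytes are written as-is
    out.write(codecs.BOM_UTF8)       # the three bytes \xef\xbb\xbf
    out.write(text.encode("utf-8"))

with open(path, "rb") as f:
    raw = f.read()
os.remove(path)

print(raw[:3] == codecs.BOM_UTF8)
```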

Removing the BOM yourself

For UTF-16, Python decodes the BOM to an empty string. For UTF-8, however, the BOM is decoded as a character, for example:

>>> codecs.BOM_UTF16.decode("utf16")
u''
>>> codecs.BOM_UTF8.decode("utf8")
u'\ufeff'

I don't know why the two behave differently, but it means you need to remove the BOM yourself when reading a UTF-8 file:

import codecs

# s is a byte string and u is a Unicode string
if s.startswith(codecs.BOM_UTF8):
    # The byte string s begins with the BOM.
    # Do something, for example decode the string as UTF-8.
    u = s.decode("utf8")

if u[0] == unicode(codecs.BOM_UTF8, "utf8"):
    # The Unicode string begins with the BOM.
    # Do something, for example remove the character.
    u = u[1:]

# Strip the BOM from the beginning of the Unicode string, if it exists
u = u.lstrip(unicode(codecs.BOM_UTF8, "utf8"))
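A compact way to compare the options is sketched below. It strips the BOM at the byte level with a hypothetical helper, strip_utf8_bom, and also decodes with the standard "utf-8-sig" codec, which skips a leading BOM if one is present. The sample data is made up for the illustration.

```python
import codecs


def strip_utf8_bom(raw):
    """Return raw without a leading UTF-8 BOM (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


with_bom = codecs.BOM_UTF8 + b"abc"

manual = strip_utf8_bom(with_bom).decode("utf-8")  # strip first, then decode
via_codec = with_bom.decode("utf-8-sig")           # the codec drops the BOM
plain = with_bom.decode("utf-8")                   # keeps U+FEFF as first character

print(manual == via_codec)
print(len(plain))
```

Stripping at the byte level before decoding avoids having to special-case the U+FEFF character afterwards.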
Encoding of source code files

PEP 263 specifies how Python determines the encoding of a source file. In summary:

  1. Python uses ASCII encoding by default.

  2. You can declare the file's encoding in a comment on the first or second line of the source file. The file must actually be saved in the declared encoding. For example:

     # -*- coding: UTF-8 -*-

  3. On platforms such as Windows, a BOM (the three bytes \xef\xbb\xbf at the start of the file) may be used to mark the file as UTF-8 encoded. In this case:
  • If the file does not contain an encoding declaration, Python treats it as UTF-8.
  • If the file declares an encoding other than UTF-8, Python reports an error.
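The rules above can be sketched as a small function that inspects the raw bytes of a source file. This is an illustration only, not how the interpreter is implemented: declared_encoding is a hypothetical helper, the regular expression is adapted from the one given in PEP 263, and encoding names are compared in a simplified way (only "utf-8" and "utf8" are treated as UTF-8).

```python
import codecs
import re

# Adapted from the pattern in PEP 263; checked against the first two lines only.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_bytes):
    """Guess a source file's encoding from its BOM and declaration (hypothetical helper)."""
    has_bom = source_bytes.startswith(codecs.BOM_UTF8)
    if has_bom:
        source_bytes = source_bytes[len(codecs.BOM_UTF8):]

    declared = None
    for line in source_bytes.split(b"\n")[:2]:
        match = CODING_RE.match(line)
        if match:
            declared = match.group(1).decode("ascii").lower()
            break

    if has_bom:
        if declared not in (None, "utf-8", "utf8"):
            raise SyntaxError("encoding declaration conflicts with the UTF-8 BOM")
        return "utf-8"
    return declared or "ascii"  # ASCII is the Python 2 default described above


print(declared_encoding(b"# -*- coding: UTF-8 -*-\nprint(1)\n"))
print(declared_encoding(b"print(1)\n"))
```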