Python character encoding and Chinese Processing

Last Update: 2018-12-08 Source: Internet

Author: User


Python 2 has two string types: byte strings (str) and Unicode strings (unicode).

byteString = "hello world! (in my default locale)"

unicodeString = u"hello Unicode world!"

Converting between the two

s = "hello normal string"

u = unicode(s, "UTF-8")

backToBytes = u.encode("UTF-8")

backToUtf8 = backToBytes.decode('utf-8')  # same effect as the second line
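The round trip above can be checked with a short sketch. It is written to run unchanged on Python 2.7 and Python 3, which is an assumption beyond the original (the article targets Python 2 only), so it uses the u"" literal prefix rather than the unicode() built-in. The sample text is an illustrative choice.

```python
# -*- coding: utf-8 -*-
# Round trip between a Unicode string and its UTF-8 bytes.
# The sample text (U+4E2D U+6587) is an illustrative choice.
text = u"\u4e2d\u6587"

encoded = text.encode("utf-8")      # Unicode string -> byte string
decoded = encoded.decode("utf-8")   # byte string -> Unicode string

print(repr(encoded))
print(decoded == text)
```

Two characters become six bytes here, which is why the length of a byte string and the length of the text it holds should not be confused.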

How to tell them apart

isinstance(s, str)  # False for a Unicode string

isinstance(s, unicode)  # True for a Unicode string

isinstance(s, basestring)  # True for both string types
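A common use of these checks is to normalise whatever arrives into a Unicode string. The helper below is a minimal sketch, not part of the original article: the name to_text and the UTF-8 default are assumptions, and it avoids the Python 2 only names unicode and basestring so that it also runs on Python 3.

```python
# Hypothetical helper: normalise either string type to a Unicode string.
text_type = type(u"")   # unicode on Python 2, str on Python 3


def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings explicitly."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):   # bytes is an alias of str on Python 2
        return value.decode(encoding)
    raise TypeError("expected a byte or Unicode string, got %r" % type(value))


print(to_text(b"abc") == u"abc")
```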

A quick test

# -*- coding: utf-8 -*-
import sys
print 'default encoding: ', sys.getdefaultencoding()
print 'file system encoding: ', sys.getfilesystemencoding()
print 'stdout encoding: ', sys.stdout.encoding
print u'u"中文" is unicode: ', isinstance(u'中文', unicode)
print u'"中文" is unicode: ', isinstance('中文', unicode)

Check the output and note the following:

In Python 2 the system default encoding is ASCII. Python uses it whenever it converts implicitly between byte strings and Unicode strings. Here are two examples:

1. a = "abc" + u"bcd": Python first decodes the byte string, as in "abc".decode(sys.getdefaultencoding()), and then concatenates the two Unicode strings.

2. print unicode('中文'): this statement fails with "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 ..." because Python tries the default encoding and the string is not ASCII. Name the encoding explicitly instead. If the source file is saved as UTF-8, write: print unicode('中文', 'utf-8')
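The failure in the second example can be reproduced without relying on the default encoding. The sketch below decodes the same kind of bytes with the ASCII codec by hand, which is what the implicit conversion amounts to. It runs on Python 2.7 and Python 3, and the variable names are illustrative.

```python
# UTF-8 bytes of U+4E2D; the first byte is 0xe4, as in the error message.
data = b"\xe4\xb8\xad"

try:
    data.decode("ascii")           # what the implicit conversion amounts to
    failure = None
except UnicodeDecodeError as err:
    failure = err                  # keep a reference for inspection

explicit = data.decode("utf-8")    # naming the codec avoids the error

print(failure.encoding, failure.start)
```

The exception records which codec failed and at which byte offset, which helps when the offending string comes from outside the program.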

On Windows, getfilesystemencoding() returns mbcs (multi-byte character set, which Windows calls ANSI). The actual encoding depends on the language version of Windows; on Chinese Windows it is one of the GB family of encodings.

On Chinese Windows the console encoding is cp936. When you print a Unicode string to the console, Python converts it automatically. This leads to an interesting problem. Try this simple example, test.py:

# -*- coding: UTF-8 -*-
s = u'中文'
print s

Run python test.py and python test.py > 1.txt from the console.

The second command reports an error. When printing to the console, Python encodes the string to sys.stdout.encoding. When the output is redirected to a file, sys.stdout.encoding is None, so Python 2 falls back to the default ASCII encoding, which cannot represent the text. The PrintFails page on the Python wiki describes this issue in more detail.
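One way around the failure described above is to encode the text yourself before writing it. The following is a minimal sketch under stated assumptions: encode_for_stream is a hypothetical helper, the UTF-8 fallback is an arbitrary choice, and an in-memory binary buffer stands in for the redirected output.

```python
import io


def encode_for_stream(text, stream_encoding, fallback="utf-8"):
    """Encode text with the stream's encoding, or a fallback if it is None."""
    return text.encode(stream_encoding or fallback)


text = u"\u4e2d\u6587"

# A binary buffer stands in for redirected output that reports no encoding.
redirected = io.BytesIO()
redirected.write(encode_for_stream(text, None))

# A console reporting a GBK-family encoding would receive different bytes.
console_bytes = encode_for_stream(text, "gbk")

print(repr(redirected.getvalue()))
print(repr(console_bytes))
```

In a real script the second argument would typically be sys.stdout.encoding, so the same call works both on the console and under redirection.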

Reading and writing UTF-8 files

import codecs

fileObj = codecs.open("someFile", "r", "UTF-8")

u = fileObj.read()  # returns a Unicode string decoded from the UTF-8 in the file
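For a self-contained check of reading and writing, the sketch below uses io.open, which accepts an encoding argument on both Python 2.7 and Python 3. This is a swapped-in alternative to the codecs.open call shown above, not the article's own method, and the scratch file created with tempfile is only there to make the example runnable.

```python
import io
import os
import tempfile

# Create a scratch file so the example is self-contained.
handle, path = tempfile.mkstemp(suffix=".txt")
os.close(handle)

try:
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(u"\u4e2d\u6587")        # encoded to UTF-8 on write

    with io.open(path, "r", encoding="utf-8") as src:
        content = src.read()               # decoded to Unicode on read

    with io.open(path, "rb") as raw:
        raw_bytes = raw.read()             # the bytes actually on disk
finally:
    os.remove(path)

print(content == u"\u4e2d\u6587")
```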

Writing the BOM header yourself

out = file("someFile", "w")

out.write(codecs.BOM_UTF8)

out.write(unicodeString.encode("UTF-8"))

out.close()
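The bytes produced by the steps above can be inspected in memory. The sketch below writes the BOM by hand into a binary buffer and compares the result with the standard "utf-8-sig" codec, which prepends the same three bytes when encoding. The codec is mentioned here as an alternative and is not used in the original article.

```python
import codecs
import io

text = u"hello"

# A binary buffer stands in for a file opened for writing.
buffer = io.BytesIO()
buffer.write(codecs.BOM_UTF8)
buffer.write(text.encode("utf-8"))
manual = buffer.getvalue()

# The "utf-8-sig" codec prepends the same three bytes when encoding.
automatic = text.encode("utf-8-sig")

print(manual == automatic)
```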

Removing the BOM header yourself

For UTF-16, Python decodes the BOM to an empty string. For UTF-8, however, the BOM is decoded to a character, for example:

>>> codecs.BOM_UTF16.decode("utf16")
u''
>>> codecs.BOM_UTF8.decode("utf8")
u'\ufeff'

I don't know why the two differ, but it means you need to remove the BOM yourself when reading a UTF-8 file:

import codecs

if s.startswith(codecs.BOM_UTF8):
    # The byte string s begins with the BOM: do something,
    # for example, decode the string as UTF-8.
    u = s.decode("utf8")

if u[0] == unicode(codecs.BOM_UTF8, "utf8"):
    # The Unicode string begins with the BOM: do something,
    # for example, remove the character.
    u = u[1:]

# Strip the BOM from the beginning of the Unicode string, if it exists
u = u.lstrip(unicode(codecs.BOM_UTF8, "utf8"))
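As an alternative to stripping the BOM by hand, the standard "utf-8-sig" codec drops a leading BOM while decoding and leaves input without a BOM untouched. The sketch below is not from the original article. It runs on Python 2.7 and Python 3, and the sample text is illustrative.

```python
import codecs

with_bom = codecs.BOM_UTF8 + u"\u4e2d\u6587".encode("utf-8")
without_bom = u"\u4e2d\u6587".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
kept = with_bom.decode("utf-8")

# "utf-8-sig" drops a leading BOM and is harmless when none is present.
stripped = with_bom.decode("utf-8-sig")
unchanged = without_bom.decode("utf-8-sig")

print(len(kept), len(stripped), len(unchanged))
```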

Encoding of source code files

PEP 263 specifies how Python determines the encoding of a source file. The main points are:

Python 2 assumes ASCII by default.

You can declare the file's encoding in a comment on the first or second line of the source file, for example:

# -*- coding: UTF-8 -*-

Note that the file must actually be saved in the declared encoding.

On platforms such as Windows, a file may be marked as UTF-8 with a BOM (the three bytes \xef\xbb\xbf at the start of the file). In this case:

If the file contains no encoding declaration, Python treats it as UTF-8.
If the declared encoding is not UTF-8, Python reports an error.
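The rules above can be approximated in a few lines. The sketch below is a simplification, not Python's actual implementation: declared_encoding is a hypothetical helper, the pattern is the one PEP 263 gives for the declaration comment, and the error case for a BOM combined with a conflicting declaration is not modelled.

```python
import codecs
import re

# Pattern given in PEP 263 for the encoding declaration comment.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_bytes):
    """Return the declared source encoding, 'utf-8' for a BOM, or None."""
    if source_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8"
    for line in source_bytes.splitlines()[:2]:   # only lines 1 and 2 count
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None   # no declaration: the interpreter's default applies


print(declared_encoding(b"# -*- coding: UTF-8 -*-\nx = 1\n"))
```

A return value of None means the default applies, which is ASCII on Python 2 as stated above.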

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
