Python Learning Note--day5 (reprint)

Last Update:2016-08-27 Source: Internet

Author: User



Python Encoding Conversion

This article introduces Python's encoding mechanism and the conversion between Unicode, UTF-8, UTF-16, GBK, GB2312, ISO-8859-1, and other encodings. The examples use Python 2, where str is a byte string and unicode is a separate type.

Common encoding conversions fall into the following scenarios:

Automatically identify string encodings

You can use the chardet module to automatically detect the character encoding of a string.

How to use chardet
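As a rough illustration of the detection step, the sketch below wraps chardet.detect, which takes bytes and returns a dictionary that includes an "encoding" guess. It is written for Python 3 (an assumption; the article's own snippets use Python 2 syntax). The helper name detect_encoding and the sample text are hypothetical, and the import is guarded because chardet is a third-party package that may not be installed.

```python
try:
    import chardet  # third-party package; assumed to be installed separately
except ImportError:
    chardet = None


def detect_encoding(data):
    """Return chardet's best guess for the encoding of `data` (bytes).

    Returns None when chardet is unavailable or cannot make a guess.
    """
    if chardet is None:
        return None
    return chardet.detect(data)["encoding"]


# Hypothetical sample: a run of Chinese text encoded as UTF-8.
sample = ("\u4e2d\u6587\u7f16\u7801\u8f6c\u6362" * 20).encode("utf-8")
guess = detect_encoding(sample)
print(guess)
```

Detection is statistical, so the result is a guess rather than a guarantee; short inputs in particular can be misidentified.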

Unicode conversion to other encodings (GBK, GB2312, etc.)

For example, if a is a Unicode string to be converted to GB2312, call a.encode('gb2312'):

    # -*- coding: gb2312 -*-
    # (the source file itself must be saved as GB2312)
    a = u"中文"
    a_gb2312 = a.encode('gb2312')
    print a_gb2312

The difference between GBK and GB2312

GB code, formally GB2312-80, "Code of Chinese Graphic Character Set for Information Interchange, Primary Set", was published in 1980 and is the national standard for Chinese text processing. It is the mandatory Chinese encoding in mainland China and in overseas regions that use Simplified Chinese (such as Singapore). P-Windows 3.2 and Apple OS use GB2312 as their basic Chinese character encoding; Windows 95/98 use GBK as their basic Chinese character encoding but remain compatible with GB2312. GB code contains 6,763 Simplified Chinese characters and 682 symbols. The Chinese characters comprise 3,755 level-1 characters, sorted by pinyin, and 3,008 level-2 characters, sorted by radical. The formulation and application of this standard played a very important role in the development of Chinese information processing.

GBK is a newer extension of the Chinese national standard encoding, developed in mainland China, with a character repertoire aligned with UCS. The GBK Working Group was formed in October 1995 and completed the GBK specification in December of the same year. The encoding is compatible with GB2312, contains 21,003 Chinese characters and 883 symbols, provides 1,894 user-defined code points, and holds simplified and traditional characters in a single set.

GBK includes every character in GB2312. Characters that GB2312 lacks must be encoded with GBK.
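The relationship between the two encodings can be checked directly. The sketch below is written for Python 3 (an assumption; the article's own snippets use Python 2 syntax), and the helper can_encode and the two sample characters are chosen here purely for illustration. It encodes a simplified character that both encodings contain and a traditional character that only GBK contains.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


common = "\u4e2d"  # simplified character, present in GB2312 and GBK
extra = "\u9f8d"   # traditional character, present in GBK only

for ch in (common, extra):
    print("U+%04X" % ord(ch),
          "gb2312:", can_encode(ch, "gb2312"),
          "gbk:", can_encode(ch, "gbk"))

# Characters shared by both encodings get identical bytes.
same_bytes = common.encode("gb2312") == common.encode("gbk")
print(same_bytes)
```

Because the shared characters encode to the same bytes, decoding GB2312 data with the GBK codec is safe, while the reverse can fail.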

Reprinted from: The difference between GBK, GB2312, Big5, Unicode, UTF-8, and UTF-16

Other encodings (UTF-8, GBK) converted to Unicode

For example, if a is GB2312-encoded and needs to be converted to Unicode, use unicode(a, 'gb2312') or a.decode('gb2312'):

    # -*- coding: gb2312 -*-
    a = u"中文"
    a_gb2312 = a.encode('gb2312')
    print a_gb2312
    a_unicode = a_gb2312.decode('gb2312')
    assert a_unicode == a
    a_utf_8 = a_unicode.encode('utf-8')
    print a_utf_8

Conversion between non-Unicode encodings

Encoding 1 (GBK, GB2312) to encoding 2 (UTF-8, UTF-16, ISO-8859-1)

Decode to Unicode first, then encode to encoding 2.

For example, GB2312 to UTF-8:

    # -*- coding: gb2312 -*-
    a = u"中文"
    a_gb2312 = a.encode('gb2312')
    print a_gb2312
    a_unicode = a_gb2312.decode('gb2312')
    assert a_unicode == a
    a_utf_8 = a_unicode.encode('utf-8')
    print a_utf_8
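The two-step conversion described above (decode to Unicode, then encode to the target) can be wrapped in a small helper. This is a minimal sketch for Python 3 (an assumption; the article's own snippets use Python 2 syntax), where bytes plays the role of the encoded string and str the role of Unicode. The function name transcode is hypothetical.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert encoded bytes from one encoding to another via Unicode."""
    text = data.decode(source_encoding)   # step 1: bytes -> Unicode text
    return text.encode(target_encoding)   # step 2: Unicode text -> bytes


gb2312_bytes = b"\xd6\xd0\xce\xc4"        # two Chinese characters in GB2312
utf8_bytes = transcode(gb2312_bytes, "gb2312", "utf-8")
print(utf8_bytes)
```

Going through Unicode means each codec only needs to know how to map to and from one common representation, rather than to every other encoding.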

Determining the encoding of a string

isinstance(s, str) checks whether s is an ordinary (byte) string.
isinstance(s, unicode) checks whether s is a Unicode string.
If a string is already Unicode, converting it to Unicode again sometimes raises an error (though not always). For example, unicode(s, encoding) raises TypeError when s is already a unicode object.

The following code converts any string to Unicode:

    def u(s, encoding):
        if isinstance(s, unicode):
            return s
        else:
            return unicode(s, encoding)
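For readers on Python 3, where the unicode type no longer exists, the same idea can be sketched with str and bytes. This is an illustrative equivalent under that assumption, not the article's code; the name to_text and the default encoding are hypothetical choices.

```python
def to_text(s, encoding="utf-8"):
    """Return `s` as text, decoding only when it is still encoded bytes."""
    if isinstance(s, str):
        return s                           # already Unicode text: leave as is
    if isinstance(s, (bytes, bytearray)):
        return bytes(s).decode(encoding)   # encoded bytes: decode once
    raise TypeError("expected str or bytes, got %s" % type(s).__name__)


from_bytes = to_text(b"\xd6\xd0\xce\xc4", "gb2312")
from_text = to_text("\u4e2d\u6587", "gb2312")
print(from_bytes == from_text)
```

Checking the type first is what avoids the double-conversion error mentioned above.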

The difference between Unicode and other encodings

Why aren't all files simply Unicode? Why are GBK, UTF-8, and other encodings still used?

Unicode can be regarded as an abstract encoding: it is an internal representation and cannot be stored directly.
To save text to disk, you must convert it to a concrete encoding such as UTF-8 or UTF-16.
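To make the distinction concrete, the sketch below takes one Unicode string and shows the different byte sequences it becomes under three encodings. It is written for Python 3 (an assumption; the article's own snippets use Python 2 syntax), and the sample text is chosen here for illustration.

```python
text = "\u4e2d\u6587"  # one abstract Unicode string (two characters)

# The same text becomes different bytes depending on the encoding used.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-le", "gbk")}

for name, data in encoded.items():
    print(name, len(data), data.hex())
```

The code points stay the same in every case; only the on-disk byte layout differs, which is why the encoding must be known to read the file back.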

Other methods

In addition to the methods above, you can use codecs.open to convert between Unicode and the file's encoding automatically when reading and writing files.
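A minimal sketch of converting while reading and writing follows. It targets Python 3 (an assumption) and deliberately swaps in the built-in open with its encoding parameter, which plays the same role there as codecs.open does in Python 2. The helper names write_text and read_text are hypothetical, and a temporary file is used so the example cleans up after itself.

```python
import os
import tempfile


def write_text(path, text, encoding):
    """Encode `text` with `encoding` while writing it to `path`."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding):
    """Decode the contents of `path` from `encoding` while reading."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


text = "\u4e2d\u6587"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    write_text(path, text, "gbk")
    with open(path, "rb") as f:
        raw = f.read()            # the bytes actually stored on disk
    restored = read_text(path, "gbk")
finally:
    os.remove(path)

print(raw, restored == text)
```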

Command-line default encoding detection and setup

You can detect and set the command line's default encoding with the locale module that ships with Python.

    import locale
    # get the default locale and encoding
    locale.getdefaultlocale()  # ('zh_CN', 'cp936')
    # set the locale
    locale.setlocale(...)
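On Python 3 a similar check can be sketched with locale.getpreferredencoding, used here in place of getdefaultlocale (an assumption made for this example, since the article's snippet is Python 2). The helper names preferred_encoding and use_environment_locale are hypothetical. Passing an empty string to setlocale asks for the locale configured in the user's environment, which can fail on some systems, so the call is guarded.

```python
import codecs
import locale


def preferred_encoding():
    """Return the locale's preferred text encoding as a codec name."""
    name = locale.getpreferredencoding(False)
    try:
        return codecs.lookup(name).name   # normalised, e.g. 'utf-8' or 'cp936'
    except LookupError:
        return name                       # unknown to Python: return as reported


def use_environment_locale():
    """Switch to the user's configured locale; return its name or None."""
    try:
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return None


print(preferred_encoding())
print(use_environment_locale())
```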

Chinese character to Unicode encoding

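    # pd_name is assumed to hold a UTF-8 encoded byte string
    pd_name = pd_name.decode('utf-8')
    print pd_name
    nname = u''
    for c in pd_name:
        print ord(c)  # Unicode code point of the character
        nname += c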
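The same character-to-code-point lookup can be sketched in Python 3 (an assumption; the snippet above is Python 2). The function name code_points and the sample bytes are hypothetical and chosen for illustration.

```python
def code_points(text):
    """Return the Unicode code point of each character as 'U+XXXX'."""
    return ["U+%04X" % ord(ch) for ch in text]


# Decode UTF-8 bytes to text first, then inspect each character.
name = b"\xe4\xb8\xad\xe6\x96\x87".decode("utf-8")
points = code_points(name)
print(points)
```

ord works on decoded text, so the bytes must be decoded with the correct encoding before the code points are meaningful.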


This article is an English version of an article originally published in Chinese on aliyun.com.
