Python Learning Note--day5 (reprint)

Source: Internet
Author: User
Tags: assert, locale

Python encoding conversion

This note introduces Python's encoding mechanism and the conversions between Unicode, UTF-8, UTF-16, GBK, GB2312, ISO-8859-1, and other encodings.

Common encoding conversions are divided into the following scenarios:

Automatically identify string encodings

You can use the chardet module to automatically detect the character encoding of a string.

How to use chardet
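As a rough illustration of encoding detection, the sketch below is written for Python 3, where byte strings are `bytes` objects. It calls `chardet.detect`, which returns a dictionary with an `encoding` key, when the third-party chardet package is installed. The helper names `try_decodings` and `guess_encoding` and the fallback candidate list are assumptions made for this example, not part of chardet.

```python
def try_decodings(data, candidates=("utf-8", "gb2312", "gbk")):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def guess_encoding(data):
    """Guess the encoding of a bytes object (hypothetical helper)."""
    try:
        import chardet  # third-party package; may not be installed
    except ImportError:
        # Fallback: trial decoding is only a heuristic, not real detection.
        return try_decodings(data)
    return chardet.detect(data)["encoding"]


sample = "\u4e2d\u6587".encode("gbk")  # the two characters "中文" as GBK bytes
print(guess_encoding(sample))
```

Detection is statistical, so short inputs can be misidentified; where the encoding is known in advance, passing it explicitly is more reliable.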

Unicode conversion to other encodings (GBK, GB2312, etc.)

For example, if a is a Unicode string that needs to be converted to GB2312, call a.encode('gb2312'):

    # -*- coding: gb2312 -*-
    a = u"中文"
    a_gb2312 = a.encode('gb2312')
    print a_gb2312

The difference between GBK and GB2312

GB code, in full GB2312-80, "Chinese Character Coded Character Set for Information Interchange: Basic Set", was published in 1980 and is the national standard for Chinese text processing. It is the only Chinese encoding whose use is mandatory in mainland China and in overseas regions that use Simplified Chinese (such as Singapore). P-Windows 3.2 and Apple OS use GB2312 as their basic Chinese character encoding; Windows 95/98 use GBK as the basic Chinese character encoding while remaining compatible with GB2312. The GB code contains 6,763 Simplified Chinese characters and 682 symbols. The Chinese characters are split into 3,755 level-1 characters, sorted by pinyin, and 3,008 level-2 characters, sorted by radical. The formulation and adoption of this standard played a very important role in Chinese informatization.

GBK is a newer national-standard extension of the Chinese encoding, developed in mainland China and described as equivalent to UCS. The GBK Working Group worked on the specification from October 1995 and completed it in December of the same year. The standard is compatible with GB2312. It contains 21,003 Chinese characters and 883 symbols, provides 1,894 code points for user-defined characters, and combines simplified and traditional characters in a single character set.

GBK covers every character that GB2312 encodes. Characters that GB2312 does not include must be encoded with GBK.
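The relationship between the two character sets can be checked directly. The following is a small Python 3 sketch (not Python 2), in which `can_encode` is a helper name made up for this example. It uses a Traditional Chinese character, which GB2312 does not contain, to show a case that needs GBK.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


simplified = "\u4e2d\u6587"  # "中文", common Simplified Chinese characters
traditional = "\u9ad4"       # "體", a Traditional Chinese character

print(can_encode(simplified, "gb2312"))   # present in GB2312
print(can_encode(traditional, "gb2312"))  # missing from GB2312
print(can_encode(traditional, "gbk"))     # present in GBK

# Characters shared by both sets get identical bytes, which is what
# "GBK is compatible with GB2312" means in practice.
same_bytes = simplified.encode("gb2312") == simplified.encode("gbk")
print(same_bytes)
```

When the input may contain traditional or rare characters, encoding with GBK rather than GB2312 avoids the UnicodeEncodeError.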

Reprinted from: The differences between GBK, GB2312, Big5, Unicode, UTF-8, and UTF-16

Other encodings (UTF-8, GBK) converted to Unicode

For example, if a is GB2312-encoded and needs to be converted to Unicode, use unicode(a, 'gb2312') or a.decode('gb2312'):

    # -*- coding: gb2312 -*-
    a = u"中文"
    a_gb2312 = a.encode('gb2312')
    print a_gb2312

    # Decode the GB2312 bytes back to Unicode, then encode as UTF-8
    a_unicode = a_gb2312.decode('gb2312')
    assert(a_unicode == a)
    a_utf_8 = a_unicode.encode('utf-8')
    print a_utf_8

Conversion between non-Unicode encodings

Encoding 1 (GBK, GB2312) to encoding 2 (UTF-8, UTF-16, ISO-8859-1)

Convert to Unicode first, and then from Unicode to encoding 2.

For example, GB2312 to UTF-8:

    # -*- coding: gb2312 -*-
    a = u"中文"
    a_gb2312 = a.encode('gb2312')
    print a_gb2312

    # Step 1: GB2312 bytes to Unicode; step 2: Unicode to UTF-8 bytes
    a_unicode = a_gb2312.decode('gb2312')
    assert(a_unicode == a)
    a_utf_8 = a_unicode.encode('utf-8')
    print a_utf_8

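The two-step route through Unicode can be wrapped in one function. This is a minimal sketch for Python 3, where the byte string is a `bytes` object and the intermediate Unicode value is a `str`. The name `transcode` is an assumption introduced for illustration.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another via Unicode text."""
    text = data.decode(source_encoding)   # encoding 1 -> Unicode
    return text.encode(target_encoding)   # Unicode -> encoding 2


gb2312_bytes = "\u4e2d\u6587".encode("gb2312")  # "中文" as GB2312 bytes
utf8_bytes = transcode(gb2312_bytes, "gb2312", "utf-8")
print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'

# The target must be able to represent the text: ISO-8859-1 has no
# Chinese characters, so this conversion raises UnicodeEncodeError.
try:
    transcode(gb2312_bytes, "gb2312", "iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```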
Determining the encoding of a string

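isinstance(s, str) tests whether s is an ordinary (byte) string.
isinstance(s, unicode) tests whether s is a Unicode string.
If a string is already Unicode, converting it to Unicode again sometimes raises an error (though not always). Calling decode() on a Unicode string makes Python 2 first encode it with the default codec (normally ASCII), so the call fails only when the string contains non-ASCII characters.

The following code converts any string to Unicode:

    def u(s, encoding):
        # Already Unicode: return it unchanged instead of decoding again
        if isinstance(s, unicode):
            return s
        else:
            return unicode(s, encoding)
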
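For readers on Python 3, the same idea might look like the sketch below. It assumes the Python 3 type names, where `str` is the Unicode string type and `bytes` is the byte string type; the function name `to_text` is hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding only when it is bytes."""
    if isinstance(value, str):
        # Already Unicode: decoding again would be an error, so pass through.
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


print(to_text("\u4e2d\u6587", "gbk"))                # text stays as it is
print(to_text("\u4e2d\u6587".encode("gbk"), "gbk"))  # bytes are decoded
```

Checking the type first makes the helper safe to call more than once on the same value.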
The difference between Unicode and other encodings

Why aren't all files simply stored as Unicode? Why are GBK, UTF-8, and other encodings still in use?

Unicode can be called an abstract encoding: it is an internal representation and cannot be stored directly.
When you save text to disk, you need to convert it to a concrete encoding, such as UTF-8 or UTF-16.
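To make the distinction concrete, the short Python 3 sketch below encodes the same Unicode text in three ways. The variable names are illustrative only. The text is identical in each case, while the stored bytes differ in both content and length.

```python
text = "\u4e2d\u6587"  # "中文": two abstract Unicode characters

# One concrete byte sequence per encoding
stored = {name: text.encode(name) for name in ("utf-8", "utf-16", "gbk")}

for name in ("utf-8", "utf-16", "gbk"):
    data = stored[name]
    # Every encoding decodes back to the same Unicode text
    print(name, len(data), data.decode(name) == text)
```

Here UTF-8 needs 6 bytes, GBK needs 4, and UTF-16 needs 6 (a 2-byte byte order mark plus 2 bytes per character).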

Other methods

In addition to the methods above, you can use the open method of the codecs module, which converts to and from the file's encoding automatically when reading and writing.
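A sketch of file-level conversion follows, written for Python 3. Note that it uses `io.open` rather than `codecs.open`: both accept an `encoding` argument, and `io.open` is the same function as the built-in `open` in Python 3. The function name `convert_file` and the temporary file names are assumptions for this example.

```python
import io
import os
import tempfile


def convert_file(src_path, src_encoding, dst_path, dst_encoding):
    """Re-save a text file in a different encoding and return its text."""
    with io.open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()   # bytes are decoded to text while reading
    with io.open(dst_path, "w", encoding=dst_encoding) as dst:
        dst.write(text)     # text is encoded to bytes while writing
    return text


workdir = tempfile.mkdtemp()
gbk_path = os.path.join(workdir, "sample_gbk.txt")
utf8_path = os.path.join(workdir, "sample_utf8.txt")

# Create a GBK-encoded input file containing "中文"
with io.open(gbk_path, "wb") as f:
    f.write("\u4e2d\u6587".encode("gbk"))

text = convert_file(gbk_path, "gbk", utf8_path, "utf-8")

with io.open(utf8_path, "rb") as f:
    utf8_bytes = f.read()
print(utf8_bytes == text.encode("utf-8"))
```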

Command-line default encoding: detection and setup

You can detect the default command-line encoding, and set the command-line encoding, with the locale module that ships with Python.

    import locale
    # get
    locale.getdefaultlocale()  # ('zh_CN', 'cp936')
    # set
    locale.setlocale(...)

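On Python 3 a comparable check might look like the sketch below. It swaps in `locale.getpreferredencoding` because `locale.getdefaultlocale` is deprecated in recent Python 3 releases. The function name `describe_encodings` is hypothetical, and the values printed depend entirely on the machine it runs on.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for the current environment."""
    try:
        # Adopt the user's environment settings; minimal systems may
        # not have the requested locale installed.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass
    return {
        "preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


info = describe_encodings()
for name in sorted(info):
    print(name, info[name])
```

If the terminal encoding differs from the encoding of the bytes being printed, the output appears garbled, so it helps to check these values before debugging the conversion code itself.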
Chinese characters to Unicode code points

    pd_name = pd_name.decode('utf-8')
    print pd_name
    nname = u''
    for c in pd_name:
        print ord(c)  # Unicode code point of the character
        nname += c
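The same lookup in Python 3 could be sketched as follows. The helper name `code_points` and the sample bytes are assumptions for illustration. `ord` returns the integer code point of a single character, formatted here in the usual `U+XXXX` notation.

```python
def code_points(text):
    """Return the Unicode code point of each character as 'U+XXXX'."""
    return ["U+%04X" % ord(ch) for ch in text]


raw = b"\xe4\xb8\xad\xe6\x96\x87"  # UTF-8 bytes for "中文"
name = raw.decode("utf-8")         # decode to Unicode text first
print(code_points(name))           # ['U+4E2D', 'U+6587']
```

Decoding has to come first: iterating over the raw bytes would yield individual byte values, not characters.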
