Python character encoding

Source: Internet
Author: User
Tags: coding standards

What is character encoding

A character here means a symbol that humans can recognize and easily remember. A computer, however, stores everything in binary, that is, as numbers, so a character must first be converted to a number before it can be stored.
Storing: characters -> numbers. Reading: numbers -> characters.
In both conversions there is a one-to-one correspondence between characters and numbers: each character maps to one specific number. This correspondence is what we call a character encoding.
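As a small illustration of this one-to-one correspondence, Python's built-in ord() and chr() convert between a character and its number. The sample characters are arbitrary choices, not taken from the original text.

```python
# ord() maps a character to its number; chr() maps the number back.
for ch in ["a", "A", "人"]:
    number = ord(ch)
    print(ch, "->", number, "->", chr(number))

# Storing: characters -> numbers. Reading: numbers -> characters.
stored = [ord(ch) for ch in "Hi"]
restored = "".join(chr(n) for n in stored)
print(stored, restored)
```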

Character encoding issues

After computers appeared, the Americans developed a standard called ASCII. Each character in the ASCII table is represented by one byte. A byte has 8 bits and can therefore represent up to 256 values (2**8 = 256). The table contains English letters, digits, and some special characters, each mapped one-to-one to a number, for example 97 for lowercase a and 65 for uppercase A.

In fact, ASCII uses only 7 bits of each byte, for 128 characters in total (codes 0 to 127); the upper 128 values are known as extended ASCII. ASCII was designed for English, so it cannot represent other languages. China therefore defined the GB2312 encoding, which maps Chinese characters to numbers, Japan developed Shift_JIS, Korea developed EUC-KR, and so on: each country had its own standard.
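The following sketch shows the limitation described above: each national standard can encode text in its own language, while plain ASCII cannot. The sample strings are assumptions chosen for illustration.

```python
samples = {
    "gb2312": "中文",        # Chinese
    "shift_jis": "日本語",   # Japanese
    "euc_kr": "한국어",      # Korean
}

for encoding, text in samples.items():
    # Each national standard assigns numbers to its own characters...
    encoded = text.encode(encoding)
    print(encoding, encoded)
    # ...but ASCII has no number assigned to any of them.
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        print("ascii failed:", exc.reason)
```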

This works as long as a document contains only one language. If a document mixes several languages, however, some of the text will be garbled no matter which of these standards is used. Unicode was created to solve this: it covers the characters of all languages in a single standard, which avoids the garbling problem.

Introduction to Unicode

Unicode assigns every character a number (a code point). When those numbers are stored directly, a character typically takes 2 bytes (16 bits), and uncommon characters take 4 bytes, as in UTF-16. Unicode is compatible with ASCII. For example, lowercase x is 0111 1000 in ASCII and 0000 0000 0111 1000 in Unicode: the values are the same, but the Unicode form takes 2 bytes where ASCII takes 1, so the storage space doubles.
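To make the size comparison concrete, the sketch below assumes that the 2-byte form corresponds to Python's "utf-16-be" codec. That codec name is an assumption made for this illustration; the original text does not name a specific storage format.

```python
ascii_bytes = "x".encode("ascii")
two_byte = "x".encode("utf-16-be")   # assumed stand-in for the 2-byte form

print(format(ascii_bytes[0], "08b"))                   # bits of the single ASCII byte
print(" ".join(format(b, "08b") for b in two_byte))    # bits of the two-byte form
print(len(ascii_bytes), len(two_byte))

# A character outside the 16-bit range takes 4 bytes in this form.
uncommon = "\U0001F600"
print(len(uncommon.encode("utf-16-be")))
```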

Unicode has mapping tables to the other encodings, which is how it can be compatible with the characters of all languages. Put simply, text in Unicode can be converted to other encodings such as GBK or Shift_JIS, and text in those encodings can be converted to Unicode, using the mappings defined for each of them.

The following example illustrates such a mapping between Unicode and GBK (taken from an excerpt of a Unicode mapping table): the Unicode code of the Chinese character '人' (person) is '4eba', and the corresponding GBK code listed in the table is '484b'.

Python3 Environment

    >>> x = '\u4eba'                  # Unicode code point
    >>> x
    '人'
    >>> x.encode('gbk')               # encode to GBK
    b'\xc8\xcb'
    >>> b'\xc8\xcb'.decode('gbk')     # decode GBK back to Unicode
    '人'

Inside single or double quotes, a Unicode code starts with \u followed by a 4-digit hexadecimal number for each character. The rule is: take the high 8 bits and the low 8 bits of the character separately, convert each to hexadecimal, pad with a leading 0 if the result has fewer than 2 digits, then join the high and low hexadecimal strings and put "\u" in front.
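The rule can be sketched as a small helper. The function name to_u_escape is hypothetical, and the sketch only covers characters that fit in 16 bits.

```python
def to_u_escape(ch):
    """Build the \\uXXXX escape text for a character in the 16-bit range."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("character needs more than 16 bits")
    high, low = code >> 8, code & 0xFF      # high 8 bits, low 8 bits
    # Each half becomes two hex digits, left-padded with 0 when needed.
    return "\\u" + format(high, "02x") + format(low, "02x")

print(to_u_escape("人"))   # \u4eba
print(to_u_escape("x"))    # \u0078
```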

When '人' is converted from Unicode to GBK, the GBK bytes are displayed as 'C8CB', which does not match the '484b' shown in the table. The reason is that GBK is designed to be compatible with ASCII: an English character takes 1 byte and a Chinese character takes 2 bytes. If the first (leftmost) bit of a byte is 0, the byte is ASCII; if the first bit of each of 2 consecutive bytes is 1, those 2 bytes together represent one Chinese character. Clearing that leading 1 in each byte of 'C8CB' gives '484B'.
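The relationship between 'C8CB' and '484B' can be checked with a short sketch that clears the leftmost bit of each GBK byte. The variable names are illustrative.

```python
gbk_bytes = "人".encode("gbk")
print(gbk_bytes.hex())                      # c8cb

# Both bytes have the leftmost bit set, marking a 2-byte Chinese character.
print([format(b, "08b") for b in gbk_bytes])

# Clearing the leftmost bit of each byte gives the value from the mapping table.
cleared = bytes(b & 0x7F for b in gbk_bytes)
print(cleared.hex())                        # 484b

# An ASCII letter stays a single byte whose leftmost bit is 0.
print("a".encode("gbk"), format(ord("a"), "08b"))
```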

Unicode and UTF-8

ASCII uses 1 byte per character while Unicode needs 2, so English text takes twice the storage space. UTF-8 (Unicode Transformation Format, 8-bit) addresses this with variable-length storage. It can represent simplified and traditional Chinese as well as other languages such as English, Japanese, and Korean. In UTF-8, English characters take only 1 byte, Chinese characters usually take 3 bytes, and other uncommon characters take more.
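A quick way to see the variable length is to encode a few characters and count the bytes. The sample characters below are chosen for illustration only.

```python
samples = ["a", "\u00e9", "人", "\U0001F600"]   # English letter, accented letter, Chinese, emoji
for ch in samples:
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), encoded)

sizes = [len(ch.encode("utf-8")) for ch in samples]
print(sizes)
```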

Characters in memory are uniformly held as Unicode, which avoids garbling. When data needs to be stored on disk or sent over a network, the Unicode is converted to another encoding standard (in most cases UTF-8, which is also the recommended choice), because this saves space and reduces network transmission load.

When the data is read back into memory, it must be decoded to Unicode using the same encoding standard that was used when it was written to disk. The process looks roughly like this (with UTF-8):

    Unicode (memory) -----> encode -----> UTF-8 (disk)
    UTF-8 (disk) -----> decode -----> Unicode (memory)
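The round trip can be sketched in Python 3 as follows. The file name demo.txt and the sample text are hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

text = "你好, hello"                      # str: Unicode in memory

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")   # hypothetical scratch file

    # Memory -> disk: encode with UTF-8.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Disk -> memory: read the raw bytes back.
    with open(path, "rb") as f:
        raw = f.read()

# Decode with the same standard that was used for encoding.
restored = raw.decode("utf-8")
print(raw)
print(restored == text)
```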

You may ask why memory does not simply use UTF-8 directly, since UTF-8 also covers all languages. The reason is that much software still uses national encoding standards (Shift_JIS, GBK, EUC-KR, and so on). UTF-8 has no direct mapping to these encodings, whereas Unicode does: converting between a national encoding and UTF-8 goes through Unicode. Keeping everything in memory as Unicode therefore avoids garbling. The main purpose of UTF-8 is to reduce storage space; if all software used UTF-8, data read into memory would not need to be converted to Unicode.
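As a minimal sketch of this point, converting GBK data to UTF-8 in Python 3 takes two steps with Unicode (str) in the middle. The variable names are illustrative.

```python
gbk_data = "人".encode("gbk")            # bytes as a GBK-based program would store them

# Step 1: GBK bytes -> Unicode (str). Step 2: Unicode -> UTF-8 bytes.
as_unicode = gbk_data.decode("gbk")
utf8_data = as_unicode.encode("utf-8")
print(gbk_data, "->", repr(as_unicode), "->", utf8_data)

# Reading the GBK bytes as if they were UTF-8 skips the Unicode step and fails.
try:
    gbk_data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("direct reinterpretation failed:", exc.reason)
```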

Common coding problems

Encoding problems generally fall into 2 cases:

--The first case
The error occurs when the data is stored. For example, the text contains both Chinese and Korean, but it is stored with the EUC-KR encoding standard.

When the file is reopened after saving, the Chinese text is lost.

The problem here occurred while storing with EUC-KR. Encoding the Korean text (Unicode -> EUC-KR) works, but the same step cannot be completed for the Chinese text, so the encoding fails and the data cannot be recovered.
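The first case can be sketched as follows. The helper try_encode and the sample text are hypothetical. Whether a particular Chinese character has an EUC-KR mapping depends on the character, so the second result is printed rather than assumed.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if a character has no mapping."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

korean = "안녕하세요"
mixed = korean + " 这是中文"        # hypothetical mixed Korean/Chinese text

print(try_encode(korean, "euc_kr") is not None)   # Korean alone
print(try_encode(mixed, "euc_kr") is not None)    # Korean plus Chinese
print(try_encode(mixed, "utf-8") is not None)     # UTF-8 covers both

# Forcing the save replaces every unmappable character with "?",
# and the original characters cannot be recovered from the stored bytes.
lossy = mixed.encode("euc_kr", errors="replace")
print(lossy.decode("euc_kr"))
```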

--The second case
The data is encoded and stored correctly, but the wrong encoding is used when reading it.
For example, the text contains only Korean and is saved with EUC-KR.

After reopening, it is decoded with a different encoding standard:

The text is garbled. Nothing went wrong when the text was encoded and stored; the wrong decoding method was used when opening the file. Simply switching to the correct decoding fixes the display, and no data is lost.

Set the encoding back to EUC-KR.
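The second case can be sketched in a few lines. Latin-1 is used as the wrong decoder only because it accepts any byte sequence; it is an arbitrary choice for this illustration.

```python
text = "안녕하세요"
stored = text.encode("euc_kr")          # saved correctly

# Opening with the wrong standard shows garbled characters,
# but the stored bytes themselves are untouched.
garbled = stored.decode("latin-1")
print(garbled)

# Switching the decoder back to EUC-KR recovers the text.
recovered = stored.decode("euc_kr")
print(recovered)
```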

Summary:
1, characters in memory are held as Unicode; when writing to disk or sending over a network, the Unicode is converted to another encoding standard
2, whatever encoding standard is used to encode when writing to disk or sending over a network, the same standard must be used to decode
3, UTF-8 is the recommended encoding standard, since text in multiple languages can coexist in one document

Encoding of .py files in Python

To execute a Python program, the Python interpreter starts first and then reads the contents of the .py file into memory, using the encoding declared in the first 2 lines of the file. The encoding is commonly declared in one of these forms:

    1) # coding=<encoding name>
    2) # -*- coding: <encoding name> -*-

Such a declaration must be placed on the first or second line of the .py file.
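One way to observe this rule in Python 3 is the standard library function tokenize.detect_encoding, which reads at most two lines looking for a declaration. The helper name declared_encoding and the sample sources are hypothetical.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source text."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

first_line = b"# -*- coding: gbk -*-\nprint('hi')\n"
second_line = b"#!/usr/bin/env python\n# coding=gbk\nprint('hi')\n"
third_line = b"#!/usr/bin/env python\n\n# coding=gbk\nprint('hi')\n"
no_declaration = b"print('hi')\n"

for src in (first_line, second_line, third_line, no_declaration):
    print(declared_encoding(src))
```

A declaration on the third line is ignored, so the last two sources fall back to the Python 3 default.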

If no encoding is declared in the .py file, Python2 uses ASCII by default and Python3 uses UTF-8. This can be checked with sys.getdefaultencoding()

Python2 Environment

    luyideMacBook-Pro:~ baby$ python
    Python 2.7.10 (default, Oct  6 2017, 22:29:07)
    ...
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

Python3 Environment

    C:\Users\Baby>python
    Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40)
    ...
    >>> import sys
    >>> sys.getdefaultencoding()
    'utf-8'

After the Python interpreter loads the code into memory, the code is held in Unicode. When the interpreter reaches a statement that defines a string, for example my_str = 'Hello Kitty', it allocates memory, encodes the string in the encoding declared at the top of the .py file, and stores it. That is how Python2 stores strings. In Python3, strings are uniformly stored as Unicode, which avoids a lot of unnecessary trouble.

str and unicode types in Python2

--str type
As mentioned above, strings in Python2 are encoded in the encoding declared at the top of the .py file and then stored

Python2 Environment

    # -*- coding: utf-8 -*-
    my_str = '你好'
    print type(my_str)
    print (my_str,)

    Output:
    <type 'str'>
    ('\xe4\xbd\xa0\xe5\xa5\xbd',)

Tip: to see how a string is actually stored in memory, print it inside a tuple or list; printing the string directly converts the encoding automatically.

--unicode type
Python2 Environment

    # -*- coding: utf-8 -*-
    my_str = u'你好'
    print type(my_str)
    print (my_str,)

    Output:
    <type 'unicode'>
    (u'\u4f60\u597d',)

Tip: a u prefix before the string means the string is stored as Unicode.

A unicode string can be converted to other encodings with encode:

    # -*- coding: utf-8 -*-
    my_str = u'你好'
    print (my_str.encode('utf-8'),)

    Output:
    ('\xe4\xbd\xa0\xe5\xa5\xbd',)

str and bytes types in Python3

In Python3, the interpreter stores a string in newly allocated memory as Unicode by default.

    x = '你好'                 # stored in memory as Unicode by default, no u prefix needed
    print(type(x))             # <class 'str'>
    y = x.encode('utf-8')
    print(y)                   # b'\xe4\xbd\xa0\xe5\xa5\xbd'
    print(type(y))             # <class 'bytes'>

Declaring the bytes type in Python3:

    x = bytes('abc'.encode('utf-8'))
    y = b'abc'
    print(type(x))
    print(type(y))

TIP:
1) A string in Python3 is stored as Unicode by default, like the statement x = u'你好' in Python2
2) In Python3, encoding the string x = '你好' with utf-8 gives b'\xe4\xbd\xa0\xe5\xa5\xbd', which matches the output of my_str = '你好'; print((my_str,)) in Python2 (with # -*- coding: utf-8 -*-). The str type in Python2 corresponds to the bytes type in Python3.
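The difference between the two Python3 types can also be seen in how they measure and index their contents. This is a small sketch with illustrative variable names.

```python
text = "你好"                      # str: a sequence of Unicode characters
data = text.encode("utf-8")        # bytes: a sequence of integers from 0 to 255

print(len(text), len(data))        # number of characters vs. number of bytes
print(text[0], data[0])            # indexing str gives a character, bytes gives an int

# Three equivalent ways to build the same bytes object.
from_constructor = bytes("abc", "utf-8")
from_encode = "abc".encode("utf-8")
from_literal = b"abc"
print(from_constructor == from_encode == from_literal)
```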

The bytes source code in Python2 shows that bytes exists in Python2 for compatibility with the Python3 notation, and that it is simply the str type. So:
Python2 has 3 string types: unicode, str, and bytes, where bytes and str are the same type
Python3 has 2 string types: str and bytes; str is the unicode of Python2, and bytes is the str of Python2

getdefaultencoding and setdefaultencoding in the sys module

Use getdefaultencoding in the sys module to get Python's default encoding:

    # python2
    import sys
    print sys.getdefaultencoding()

    Output:
    ascii

    # python3
    import sys
    print(sys.getdefaultencoding())

    Output:
    utf-8

As shown, the default encoding is ASCII in Python2 and UTF-8 in Python3. When a string is encoded or decoded (converted to Unicode, or from Unicode to another encoding) without specifying an encoding, the one reported by getdefaultencoding is used. This matters more in Python2; in Python3 strings are stored as Unicode, so it comes up less often.
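In Python 3 the effect of the default can be checked directly: calling encode() or decode() without an argument behaves the same as passing 'utf-8'. A minimal sketch:

```python
import sys

print(sys.getdefaultencoding())

text = "你好"
# With no argument, encode() and decode() use UTF-8 in Python 3.
implicit = text.encode()
explicit = text.encode("utf-8")
print(implicit == explicit)
print(implicit.decode() == text)
```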

In Python3, combining a str (Unicode) with bytes raises an error directly:

    x = '你好,'                        # str
    y = '贝贝'.encode('utf-8')          # bytes
    print(x + y)

    Error:
    TypeError: must be str, not bytes
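A sketch of how the two values can be combined in Python 3 once both have the same type. Which option fits depends on whether text or raw bytes is needed afterwards.

```python
x = "你好,"                        # str
y = "贝贝".encode("utf-8")          # bytes

try:
    x + y
except TypeError as exc:
    print("TypeError:", exc)

# Option 1: decode the bytes, then join two str values.
as_text = x + y.decode("utf-8")
# Option 2: encode the str, then join two bytes values.
as_bytes = x.encode("utf-8") + y

print(as_text)
print(as_bytes == as_text.encode("utf-8"))
```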

In Python2, however, the operation is allowed: the interpreter converts the str to unicode and then concatenates, and the result is also unicode. Because it uses the default encoding (ASCII) to convert the str, the following error occurs:

    # -*- coding: utf-8 -*-
    x = u'你好,'
    y = '贝贝'
    print(x + y)

    Error:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 0: ordinal not in range(128)

After changing the default encoding, the output is normal:

    # -*- coding: utf-8 -*-
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    x = u'你好,'
    y = '贝贝'
    print(x + y)

    Output:
    你好,贝贝

Printing strings to the terminal

In Python2, when a string is printed directly to the terminal, the encoding of the string (the standard it is stored with in memory) must match the terminal's encoding (for example, the Windows terminal uses GBK and the PyCharm terminal uses UTF-8) to avoid garbled output.

In Python3, strings are stored in memory as Unicode by default, so they are not garbled whichever terminal they are printed to. If the string has been converted to another encoding (bytes), the terminal does not turn it back into characters; it is printed as is.

    x = '你好'
    y = x.encode('utf-8')
    print(type(x))
    print(x)
    print(type(y))
    print(y)

    PyCharm output:
    <class 'str'>
    你好
    <class 'bytes'>
    b'\xe4\xbd\xa0\xe5\xa5\xbd'

The output is the same in both terminals.
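The Python3 behaviour can be imitated without a real terminal by printing into an in-memory stream that has a chosen encoding. The helper terminal_bytes is hypothetical and only approximates what a terminal receives.

```python
import io

def terminal_bytes(value, terminal_encoding):
    """Return the bytes print() would send to a stream using this encoding."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=terminal_encoding, newline="\n")
    print(value, file=stream)
    stream.flush()
    return raw.getvalue()

x = "你好"
y = x.encode("utf-8")

# A str is encoded to whatever the terminal expects.
print(terminal_bytes(x, "gbk"))
print(terminal_bytes(x, "utf-8"))

# A bytes object is printed as its b'...' representation, unchanged.
print(terminal_bytes(y, "gbk"))
```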