Tags: python, character encoding

Having recently come into contact with Python, I found that I did not understand its encoding system, so I searched for information and put together this summary.

Character encoding
Strings are a data type too, but what makes strings special is that they come with an encoding problem.
We know that inside a computer, all information is ultimately a binary value. Each bit has two states, 0 and 1, so eight bits can combine into 256 states, which is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol; that is 256 symbols, from 00000000 to 11111111.
In the 1960s, the United States developed a character encoding that made uniform provisions for the relationship between English characters and bits. This is known as ASCII, and it is still in use today. ASCII specifies a total of 128 characters; for example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control characters that cannot be printed) occupy only the last 7 bits of a byte, with the first bit always 0.
Asian countries use far more characters; the number of Chinese characters alone is around 100,000. One byte can represent only 256 symbols, which is certainly not enough, so a symbol must be expressed with multiple bytes. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character and can therefore theoretically represent at most 256×256 = 65536 symbols. As you can imagine, with hundreds of languages around the world (Japan encodes Japanese as Shift_JIS, South Korea encodes Korean as EUC-KR), each country having its own national standard inevitably leads to conflicts, and the result is that mixed multi-language text displays as garbled characters.
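Python ships a gb2312 codec, so the two-byte claim is easy to check. A minimal sketch; the round trip simply confirms the codec's byte width:

```python
# GB2312 uses two bytes per Chinese character; Python's bundled codec
# lets us confirm the byte width and round-trip the text.
encoded = '中'.encode('gb2312')
print(len(encoded))              # 2
print(encoded.decode('gb2312'))  # 中
```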
As a result, Unicode emerged. Unicode unifies all languages into a single encoding, so the garbling problem disappears. Unicode is a large collection that has now grown to hold more than a million symbols. Each symbol has a different encoding; for example, U+0639 represents the Arabic letter ain, U+0041 represents the English uppercase letter A, and U+4E25 represents the Chinese character 严. For the full symbol tables, you can consult unicode.org. Today, modern operating systems and most programming languages support Unicode directly.
Now the difference between ASCII encoding and Unicode encoding becomes clear: ASCII uses 1 byte per character, while Unicode commonly uses 2 bytes per character.
The letter A in ASCII is decimal 65, binary 01000001. The character '0' in ASCII is decimal 48, binary 00110000; note that the character '0' and the integer 0 are different. The Chinese character 中 is beyond the range of ASCII; its Unicode encoding is decimal 20013, binary 01001110 00101101.
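The three values above can be verified with Python's built-in ord(), which returns a character's code point:

```python
# ord() maps a character to its ASCII/Unicode code point.
print(ord('A'))   # 65
print(ord('0'))   # 48  (the character '0', not the integer 0)
print(ord('中'))  # 20013
```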
You can guess that if you encode an ASCII-encoded A in Unicode, you only need to pad zeros in front, so the Unicode encoding of A is 00000000 01000001.
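The zero padding is visible by formatting the same code point at two widths; a small illustration, since format() simply zero-extends the number:

```python
# The code point of 'A' is 65 in both ASCII and Unicode; the 2-byte
# Unicode form just zero-extends the 8-bit ASCII form.
cp = ord('A')
print(format(cp, '08b'))   # 01000001
print(format(cp, '016b'))  # 0000000001000001
```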
Two serious problems arise here. The first question: how can you distinguish Unicode from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? The second problem: we already know that one byte is enough for English letters, so if Unicode uniformly required three or four bytes per symbol, every letter would carry two or three bytes of zeros, which is a great waste for storage; the size of a text file would grow two or three times, which is unacceptable.
These problems resulted in:
1) Multiple storage schemes for Unicode appeared, that is, many different binary formats that can represent Unicode. 2) Unicode could not be popularized for a long time, until the advent of the Internet.
The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters in two or four bytes) and UTF-32 (characters in four bytes), but they are rarely used on the Internet. To repeat the relationship here: UTF-8 is one of the ways Unicode is implemented.
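The difference between the three implementations can be seen by encoding the same character with each codec (the big-endian variants are chosen here to avoid a byte-order mark in the output):

```python
# The same code point, serialized by three Unicode encoding forms.
c = '严'  # U+4E25
print(len(c.encode('utf-8')))      # 3 bytes (variable length, 1 to 4)
print(len(c.encode('utf-16-be')))  # 2 bytes (2 or 4 per character)
print(len(c.encode('utf-32-be')))  # 4 bytes (always 4 per character)
```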
One of the biggest features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the byte length varying according to the symbol. The encoding rules of UTF-8 are simple; there are only two:
1) For single-byte symbols, the first bit of the byte is set to 0, and the following 7 bits are the symbol's Unicode code. For English letters, UTF-8 is therefore identical to ASCII. 2) For an n-byte symbol (n > 1), the first n bits of the first byte are all set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All of the remaining, unmentioned bits are the symbol's Unicode code.
The following table summarizes the encoding rules, and the letter x represents the bits that are available for encoding.
Unicode symbol range (hex) | UTF-8 encoding (binary)
---------------------------+---------------------------------------------
0000 0000 - 0000 007F      | 0xxxxxxx
0000 0080 - 0000 07FF      | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF      | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF      | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to the table above, interpreting UTF-8 is very simple. If the first bit of a byte is 0, the byte by itself is one character; if the first bit is 1, the number of consecutive leading 1s tells how many bytes the current character occupies.
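The two rules can be turned into a small manual encoder and checked against Python's built-in codec. This is only a sketch of the table above: the function name utf8_encode_codepoint is ours, and no validation of surrogates or out-of-range inputs is attempted.

```python
def utf8_encode_codepoint(cp):
    """Encode one Unicode code point following the four table rows."""
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    elif cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                 # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

# Cross-check against Python's built-in encoder.
for ch in ('A', 'é', '严', '😀'):
    assert utf8_encode_codepoint(ord(ch)) == ch.encode('utf-8')
```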
Below, taking the Chinese character 严 as an example, we demonstrate how UTF-8 encoding is carried out.
严's Unicode code point is 4E25 (binary 100 1110 0010 0101). Looking at the table above, 4E25 falls within the range on the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary digit of 严, fill the x positions in the format from back to front, and pad the leftover leading positions with 0. This gives the UTF-8 encoding of 严 as 11100100 10111000 10100101, or E4B8A5 in hexadecimal.

The default encoding used when reading Python source code files:

In python2.x, script source files are read and written using ASCII by default; since ASCII does not support Chinese, a script containing Chinese raises an error. So when Chinese appears in our script source, we usually add # -*- coding: utf-8 -*- to solve the problem, declaring that the file should be read with UTF-8 encoding. In python3.x, script source files are read and written using UTF-8 by default, which is much friendlier to Chinese.

The default encoding used for the str type when the interpreter executes:

python2.x: strings in Python are of type str. When python2.x's interpreter executes, str uses ASCII encoding by default, which can be changed with sys.setdefaultencoding('utf-8'):

>>> import sys
>>> '离离原上草，一岁一枯荣'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>> '离离原上草，一岁一枯荣'.encode('utf-8')
'\xe7\xa6\xbb\xe7\xa6\xbb\xe5\x8e\x9f\xe4\xb8\x8a\xe8\x8d\x89\xef\xbc\x8c\xe4\xb8\x80\xe5\xb2\x81\xe4\xb8\x80\xe6\x9e\xaf\xe8\x8d\xa3'

python3.x: in python3.x, the default encoding for str is UTF-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> '离离原上草，一岁一枯荣'.encode('utf-8')
b'\xe7\xa6\xbb\xe7\xa6\xbb\xe5\x8e\x9f\xe4\xb8\x8a\xe8\x8d\x89\xef\xbc\x8c\xe4\xb8\x80\xe5\xb2\x81\xe4\xb8\x80\xe6\x9e\xaf\xe8\x8d\xa3'
We notice the b in front of the Python 3 output, marking the output as a bytes type. This is because one of the most important new features of Python 3 is the clear distinction between strings and binary data. Text is always Unicode, represented by the str type, and binary data is represented by the bytes type. Python 3's string type is str, represented in memory as Unicode, where one character corresponds to several bytes. If you want to transmit a string over the network, or save it to disk, you need to turn the str into bytes.
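The disk case can be sketched as follows (the file name demo.txt is arbitrary): a str must be encoded before writing and decoded after reading when the file is opened in binary mode:

```python
import os
import tempfile

# str -> bytes on the way to disk, bytes -> str on the way back.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'wb') as f:
    f.write('中文'.encode('utf-8'))  # only bytes may be written in 'wb' mode
with open(path, 'rb') as f:
    data = f.read()                  # bytes come back
print(data)                  # b'\xe4\xb8\xad\xe6\x96\x87'
print(data.decode('utf-8'))  # 中文
```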
Python 3 represents bytes-type data with single or double quotes prefixed by b:
x = b'ABC'
Be careful to distinguish 'ABC' from b'ABC': the former is str; the latter, although its content displays the same as the former, is bytes, where each character occupies exactly one byte.
A str, represented in Unicode, can be encoded into bytes of a specified encoding via the encode() method, for example:
>>> 'ABC'.encode('ascii')
b'ABC'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
A pure-English str can be encoded into bytes with ASCII, and the content is the same; a str containing Chinese can be encoded into bytes with UTF-8. A str containing Chinese cannot be encoded with ASCII, because the range of Chinese characters exceeds the range of ASCII, and Python raises an error. In bytes, a byte that cannot be displayed as an ASCII character is shown as \x##.
Conversely, if we read a byte stream from the network or from disk, the data read is bytes. To turn bytes into str, use the decode() method:
>>> b'ABC'.decode('ascii')
'ABC'
>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
If bytes contains bytes that cannot be decoded, the decode() method raises an error:
>>> b'\xe4\xb8\xad\xff'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte

If only a small part of bytes is invalid, you can pass errors='ignore' to skip the bad bytes:

>>> b'\xe4\xb8\xad\xff'.decode('utf-8', errors='ignore')
'中'
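Besides errors='ignore', the standard codec machinery also accepts errors='replace', which substitutes the Unicode replacement character U+FFFD instead of dropping bad bytes:

```python
# 'replace' keeps the position of the bad byte visible as U+FFFD.
print(b'\xe4\xb8\xad\xff'.decode('utf-8', errors='replace'))  # 中�
```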
To count how many characters a str contains, use the len() function:
>>> len('ABC')
3
>>> len('中文')
2
The len() function counts the number of characters for a str; if given bytes instead, len() counts the number of bytes:
>>> len(b'ABC')
3
>>> len(b'\xe4\xb8\xad\xe6\x96\x87')
6
>>> len('中文'.encode('utf-8'))
6
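The two behaviors combine naturally for mixed strings; a quick check (the sample string is our own):

```python
# Characters vs. bytes for a string mixing Chinese and ASCII.
s = '中文ABC'
print(len(s))                  # 5 characters
print(len(s.encode('utf-8')))  # 9 bytes: 2 x 3 + 3 x 1
```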
As you can see, one Chinese character encoded in UTF-8 typically takes 3 bytes, while one English character takes only 1 byte.

Reference
Python character encoding