Python3 encoding problems
An ongoing collection of Python3 encoding questions.
Source: Peng Cheng's Sina blog (reposted) http://blog.sina.com.cn/s/blog_6d7cf9e50102vo90.html. Peng Cheng's post on Python3 encoding is particularly clear, so it is reproduced here for your reference.
1. Starting from the byte
One byte contains eight bits, each of which is 0 or 1, so one byte can represent 2^8 = 256 values, from 00000000 to 11111111. An ASCII code occupies one byte, but the highest bit is set aside as a parity bit, so ASCII actually uses seven bits per character and can hold 2^7 = 128 characters. For example, in ASCII, 01000001 (65 in decimal) represents 'A', and 01100001 (97 in decimal) represents 'a'. Now open Python and call the chr and ord functions, and you can see Python converting between characters and their ASCII codes.
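The chr and ord calls mentioned above can be tried with a short sketch like the following. The variable names are only illustrative.

```python
# ord() maps a character to its integer code; chr() does the reverse.
code_upper = ord('A')
code_lower = ord('a')

print(code_upper, format(code_upper, '08b'))  # 65 01000001
print(code_lower, format(code_lower, '08b'))  # 97 01100001
print(chr(65), chr(97))                       # A a

# Upper and lower case letters differ by 32, i.e. the single bit 00100000.
case_gap = code_lower - code_upper
print(case_gap, format(case_gap, '08b'))      # 32 00100000
```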
The first code, 00000000, represents the null character, so ASCII really only holds 127 usable characters: letters, digits, punctuation marks, and special characters. ASCII was born in the United States, and for English, where words are built from a small alphabet, that is enough. Chinese, Japanese, Korean, and other languages are not so easily satisfied. In Chinese every character is a word in itself, so even using all 256 values of a byte would not come close.
Therefore, Unicode was later introduced. A Unicode character usually takes two bytes, giving 256*256 = 65,536 possible characters; this is the so-called UCS-2. Some rare characters use four bytes, the so-called UCS-4. In other words, the Unicode standard is still developing. UCS-4 shows up less often, so for now just remember: the original ASCII encoding uses a single byte per character, but because the world's languages have so many characters, two bytes came into use, and Unicode, which covers many languages, appeared.
In Unicode, the characters from the original ASCII only need an extra byte of all zeros in front. For example, the character 'a' mentioned earlier, 01100001, becomes 00000000 01100001 in Unicode. Soon the Americans were unhappy: English text that used to take one byte per character now took two, wasting storage space and slowing transmission.
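The zero-byte padding can be observed from Python. Python has no codec named UCS-2, so this sketch uses 'utf-16-be' as a stand-in, which is an assumption of the example; the two agree for characters that fit in two bytes.

```python
# 'utf-16-be' stands in for UCS-2 here (assumption: identical for
# characters that fit in two bytes, big-endian byte order).
ascii_bytes = 'a'.encode('ascii')
two_byte = 'a'.encode('utf-16-be')

print(ascii_bytes)  # b'a'
print(two_byte)     # b'\x00a'

# Show the two bytes as bits: a zero byte followed by the ASCII byte.
bits = ' '.join(format(b, '08b') for b in two_byte)
print(bits)         # 00000000 01100001
```

The same English text takes twice the space, which is the waste the paragraph above describes.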
People put their ingenuity to work, and so the UTF-8 encoding emerged. To address the wasted space, UTF-8 is variable-length: from one byte for an English letter, to the usual three bytes for a Chinese character, up to four bytes for some uncommon characters (the original design allowed up to six). Besides solving the space problem, UTF-8 has a magical extra feature: it is compatible with big brother ASCII. Some old-fashioned software can therefore keep working with UTF-8.
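A quick way to see the variable length is to count the bytes each character encodes to. The sample characters below are chosen only for illustration.

```python
# Sample characters picked for this sketch: an English letter, an accented
# Latin letter, a common Chinese character, and an emoji.
samples = ['A', 'é', '中', '😀']

utf8_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}
for ch, n in utf8_lengths.items():
    print(repr(ch), n)

# ASCII compatibility: pure ASCII text gives identical bytes either way.
ascii_compatible = 'hello'.encode('ascii') == 'hello'.encode('utf-8')
print(ascii_compatible)
```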
Note that, apart from English letters, a character usually has different bytes in Unicode and in UTF-8. For example, the Chinese character '中' is 01001110 00101101 in Unicode and 11100100 10111000 10101101 in UTF-8.
China naturally has its own set of standards as well, namely GB2312 and GBK. They are seen less often now, though; typically UTF-8 is used directly.
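The two bit patterns can be reproduced with a small sketch. The helper name to_bits is hypothetical and exists only for this example.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, separated by spaces."""
    return ' '.join(format(b, '08b') for b in data)

ch = '中'

# The Unicode code point written out as two bytes (big-endian).
code_point_bits = to_bits(ord(ch).to_bytes(2, 'big'))
# The same character encoded as UTF-8.
utf8_bits = to_bits(ch.encode('utf-8'))

print(code_point_bits)  # 01001110 00101101
print(utf8_bits)        # 11100100 10111000 10101101
```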
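To see how GBK differs from UTF-8 in practice, here is a minimal sketch that encodes the same character both ways. The character and the variable names are just examples.

```python
ch = '中'

gbk_bytes = ch.encode('gbk')
utf8_bytes = ch.encode('utf-8')
print(gbk_bytes)   # b'\xd6\xd0'
print(utf8_bytes)  # b'\xe4\xb8\xad'

# Bytes must be decoded with the codec that produced them.
try:
    gbk_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # True

round_trip = gbk_bytes.decode('gbk')
```

This mismatch is a common cause of garbled Chinese text when a GBK file is read as UTF-8.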
2. Default encoding in Python3
In Python3, the default encoding is UTF-8. With the following code:
import sys
sys.getdefaultencoding()
you can view the default encoding of Python3.
3. Encode and decode in Python3
In Python3, the decode and encode methods are often used when dealing with character encodings, and they are especially useful when scraping webpages. encode turns the characters we can read into the byte form used inside the computer. decode does the opposite: it turns bytes back into characters that people can read and understand.
\x indicates hexadecimal. \xe4\xb8\xad stands for the binary 11100100 10111000 10101101. That is to say, encoding the Chinese character '中' gives the bytes 11100100 10111000 10101101. Likewise, taking 11100100 10111000 10101101, that is, \xe4\xb8\xad, and decoding it gives back the Chinese character '中'. The complete form is b'\xe4\xb8\xad'. In Python3, a bytes literal must be prefixed with b, that is, written in the b'xxxx' form above.
The default encoding of Python3 is UTF-8, so Python handles these characters as UTF-8. That is why, even if we call encode('utf-8') to encode the character explicitly as UTF-8, the result is still the same: b'\xe4\xb8\xad'.
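A minimal round trip, assuming a standard Python3 interpreter where str.encode() and bytes.decode() default to UTF-8:

```python
text = '中'

default_bytes = text.encode()          # no argument: UTF-8 is used
explicit_bytes = text.encode('utf-8')  # same thing, spelled out
print(default_bytes)                   # b'\xe4\xb8\xad'
print(default_bytes == explicit_bytes) # True

# decode() turns the bytes back into a readable str.
round_trip = default_bytes.decode()
print(round_trip)

# The two objects have different types: str for text, bytes for raw data.
type_names = (type(text).__name__, type(default_bytes).__name__)
print(type_names)                      # ('str', 'bytes')
```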
With that understood, and knowing that UTF-8 is compatible with ASCII, we can guess whether the 'A' that we recited so often at university, 65 in ASCII, decodes correctly here. Convert 65 in decimal to 41 in hexadecimal, and let's try:
b'\x41'.decode()
The result is the character 'A'.
4. Encoding conversion in Python3
It is said that all characters in computer memory are Unicode encoded, and that they are only converted to UTF-8 when they are written to a file, stored on a hard disk, or sent from a server to a client (for example, the front-end code of a webpage). What I am more interested in, though, is how to display these characters as Unicode bytes and reveal their true form in memory. Here is the demon-revealing mirror:
xxxx.encode/decode('unicode-escape')
Whether you write b'\\u4e2d' or b'\u4e2d', one backslash or two does not seem to make a difference. In the shell window you can also see that entering '\u4e2d' directly and entering b'\u4e2d'.decode('unicode-escape') give the same result, and both print the Chinese character '中'. On the other hand, '\u4e2d'.decode('unicode-escape') raises an error, because a str has no decode method. This shows that Python3 not only supports Unicode, but also recognizes a Unicode character written in the '\uxxxx' format and treats it as an ordinary str.
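The behaviour described above can be checked with a short sketch. It uses the double-backslash spelling for the bytes literal, since newer Python versions warn about unrecognized escapes such as \u inside a bytes literal.

```python
# Encoding with 'unicode-escape' yields the ASCII text \u4e2d as bytes.
escaped = '中'.encode('unicode-escape')
print(escaped)       # b'\\u4e2d'
print(len(escaped))  # 6: backslash, u, 4, e, 2, d

# Decoding those bytes gives the character back.
restored = escaped.decode('unicode-escape')
print(restored)

# In a str literal, '\u4e2d' is already the character itself.
literal_equal = '\u4e2d' == '中'

# A str has no decode method, which is why calling it raises an error.
str_has_decode = hasattr('\u4e2d', 'decode')
print(literal_equal, str_has_decode)  # True False
```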
If we have the Unicode escape bytes of a character, how do we turn them into UTF-8 bytes? With the above in mind, the idea is simple: decode first, then encode. The code is as follows:
xxx.decode('unicode-escape').encode()
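Filling in the xxx placeholder with a concrete value, the conversion might look like this. The sample bytes are an assumption made for the example.

```python
# Sample input for this sketch: the six ASCII bytes \ u 4 e 2 d.
unicode_escaped = b'\\u4e2d'

# Step 1: decode the escape sequence into a str.
# Step 2: encode that str as UTF-8 (the default for encode()).
utf8_bytes = unicode_escaped.decode('unicode-escape').encode()
print(utf8_bytes)  # b'\xe4\xb8\xad'

# The reverse direction: UTF-8 bytes back to the escaped form.
back = utf8_bytes.decode().encode('unicode-escape')
print(back)        # b'\\u4e2d'
```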
Final Extension
Do you still remember ord? Times have changed and big brother ASCII has been absorbed, but ord is still useful. Try ord('中') and the output is 20013. What is 20013? Let's try hex(ord('中')). The output is '0x4e2d', that is, 20013 is the decimal value of the 4e2d that we have met countless times above. Here, hex is a function that converts a number to hexadecimal. Anyone who has studied microcontrollers will certainly be familiar with hex.
The final extension is a problem someone else ran into on the Internet. When we type something like '\u4e2d' ourselves, Python3 knows what we want to express. But when Python reads a file and '\u4e2d' appears in it, does the computer not recognize it? Someone gave the following answer:
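The relationship between the character, 20013, and 4e2d can be confirmed with a few built-in calls:

```python
code_point = ord('中')
print(code_point)              # 20013

hex_string = hex(code_point)
print(hex_string)              # 0x4e2d

# Going the other way: from the hexadecimal value back to the character.
from_hex = chr(0x4e2d)
print(from_hex)

# int() with base 16 parses the hexadecimal digits back into 20013.
parsed = int('4e2d', 16)
print(parsed)                  # 20013
```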
import codecs
# Open a.txt so that literal \uXXXX sequences are decoded into characters
file = codecs.open("a.txt", "r", "unicode-escape")
u = file.read()
file.close()
print(u)
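To try this without an existing a.txt, here is a self-contained variation. It is a sketch rather than the answer quoted above: it creates a temporary file (the file contents are made up for the example) and decodes the raw bytes directly instead of going through codecs.open.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'a.txt')

    # Write the six literal ASCII characters \u4e2d, not the character 中.
    with open(path, 'w', encoding='ascii') as f:
        f.write('\\u4e2d')

    # Read as ordinary text: the escape sequence is left untouched.
    with open(path, 'r', encoding='utf-8') as f:
        raw = f.read()

    # Read as bytes and decode with 'unicode-escape' to get the character.
    with open(path, 'rb') as f:
        decoded = f.read().decode('unicode-escape')

print(raw)      # \u4e2d
print(decoded)
```

Note that 'unicode-escape' treats any non-escaped byte as Latin-1, so this approach is only safe when the file is plain ASCII with \uXXXX escapes in it.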