A detailed description of the character encoding problems that often appear in Python

Source: Internet
Author: User

Encoding errors often occur when Python processes strings or reads a file through the open() function, for example:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

This error occurs when the codec Python uses for encoding or decoding does not match the actual encoding of the data. It is therefore worth having a clear understanding of character encoding. There is no need to study the internals of the codecs in depth; understanding the relevant encoding rules is enough to handle such problems yourself when they come up.
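As a minimal sketch of how this error arises, the snippet below decodes UTF-8 bytes with the wrong codec and then with the right one. It assumes Python 3, and the byte string is an illustrative value (the UTF-8 encoding of the Chinese character U+6211), not data from the original article.

```python
# UTF-8 bytes of one Chinese character (U+6211); the first byte is 0xe6.
raw = b"\xe6\x88\x91"

# Decoding with the ASCII codec fails, because 0xe6 is outside range(128).
try:
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec that matches the data succeeds.
text = raw.decode("utf-8")

print(error_message)
print(text)
```

The fix is not to suppress the error but to name the encoding the bytes were actually written in.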

First of all, we need to be clear that a computer stores everything internally as 0s and 1s, not in the form of the characters we see, such as letters or Chinese characters. The computer does not know letters or Chinese characters; it only knows 0 and 1. So how do these characters appear? We assign a specific code to each character (encoding), and then decode that code so that the terminal or file displays the familiar letters or Chinese characters rather than 0s and 1s.
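The round trip from characters to bits and back can be sketched as follows. This assumes Python 3; the helper name bits_of is hypothetical and introduced only for illustration.

```python
def bits_of(data: bytes) -> str:
    """Render each byte as an 8-digit binary string."""
    return " ".join(format(byte, "08b") for byte in data)

stored = "Hi".encode("ascii")      # characters -> bytes (encoding)
restored = stored.decode("ascii")  # bytes -> characters (decoding)

print(bits_of(stored))  # 01001000 01101001
print(restored)         # Hi
```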

1. Common character encodings: ASCII, Unicode, and UTF-8

1.1 ASCII encoding

ASCII is an encoding format developed in the United States for English text. It defines 128 characters in total: digits, uppercase and lowercase letters, punctuation and operator symbols, plus 33 non-printing control characters such as tab and line feed. One byte is more than enough to represent these 128 characters: a byte has 8 bits, each bit is 0 or 1, so a byte can represent 2^8 = 256 states, and each state can stand for one character. ASCII uses only the lower 7 bits for its characters, and the highest bit is always 0. For English, ASCII was sufficient and sped up the exchange of information between computers.
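These properties of ASCII can be checked with a short sketch, assuming Python 3. The variable names are illustrative only.

```python
# All 128 ASCII characters, code points 0 through 127.
all_ascii = [chr(code) for code in range(128)]

# Control characters are codes 0-31 plus 127 (DEL).
control_chars = [c for c in all_ascii if ord(c) < 32 or ord(c) == 127]
printable_chars = [c for c in all_ascii if c not in control_chars]

# Every ASCII byte fits in the lower 7 bits, so the highest bit is 0.
encoded = "".join(all_ascii).encode("ascii")
high_bit_clear = all((byte & 0x80) == 0 for byte in encoded)

# A character outside ASCII (here U+00E9) cannot be encoded.
try:
    "\u00e9".encode("ascii")
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(len(control_chars), len(printable_chars), high_bit_clear, rejected)
```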

1.2 Unicode encoding

ASCII only covers English characters, but the world has many languages that ASCII cannot handle. Chinese, for example, cannot be encoded in ASCII, so China adopted GB2312, which was followed by many other standards. With so many languages and incompatible standards, people wanted a single unified scheme in which every character in every language has a unique code. Thus Unicode was born: a standard for encoding all the world's text.

Unicode assigns each character a unique number, called a code point. Fixed-width formats such as UCS-2 and UCS-4 store each code point in a fixed number of bytes (two and four bytes respectively).
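To illustrate what "fixed number of bytes" means, the sketch below uses Python 3's "utf-32-be" codec, which stores every code point in four bytes, and compares it with the variable-length UTF-8 codec. The sample characters are arbitrary choices for illustration.

```python
samples = ["A", "\u6211", "\U0001F600"]

# For each character: (bytes in UTF-32, bytes in UTF-8).
widths = [
    (len(ch.encode("utf-32-be")), len(ch.encode("utf-8")))
    for ch in samples
]

for ch, (fixed, variable) in zip(samples, widths):
    print(hex(ord(ch)), fixed, variable)
```

The fixed-width column is always 4, while the UTF-8 column grows with the size of the code point.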

1.3 UTF-8 Encoding

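UTF stands for Unicode Transformation Format. As the full name suggests, UTF-8 is a way of encoding Unicode code points using 8-bit (single-byte) units, with each character taking between one and four bytes. One point to note is that UTF-8 and Unicode are not the same thing: Unicode defines the number assigned to each character, while UTF-8 is one way of turning those numbers into bytes.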

The following table shows how Unicode code point ranges map to UTF-8 byte sequences:

Unicode code point (hexadecimal)    UTF-8 byte sequence (binary)
000000–00007F                       0xxxxxxx
000080–0007FF                       110xxxxx 10xxxxxx
000800–00FFFF                       1110xxxx 10xxxxxx 10xxxxxx
010000–10FFFF                       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
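The table can be turned into code directly. The function below is a minimal sketch, assuming Python 3; utf8_encode is a hypothetical helper written for illustration, and it deliberately skips the surrogate range check that Python's built-in codec performs.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by following the UTF-8 bit layout table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Compare against the built-in codec, one code point per table row.
for cp in (0x41, 0xE9, 0x6211, 0x1F600):
    print(hex(cp), utf8_encode(cp).hex(), chr(cp).encode("utf-8").hex())
```

In real programs the built-in str.encode("utf-8") should be used; the manual version only shows where each bit of the code point ends up.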

2. Python Encoding Processing

Python can be compared to a pool with an entrance and an exit. Inside the pool, strings are processed as Unicode. At the entrance, incoming data must therefore be decoded with the decode() function, passing the encoding that the data (for example, the file being read) actually uses. The string can then be processed with the Unicode-aware library functions. At the exit, the data must be encoded into the format we want to store, using encode() with the desired target encoding as its argument. Characters that are represented identically in ASCII and in the target encoding, such as digits and English letters, need no conversion and can be stored directly.
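The pool analogy can be sketched as a decode, process, encode pipeline. This assumes Python 3, where bytes objects have decode() and str objects have encode(); the input text and the choice of Latin-1 as the incoming encoding are made up for illustration.

```python
# Bytes arriving from outside, here assumed to be Latin-1 encoded.
incoming = "h\u00e9llo w\u00f6rld".encode("latin-1")

# Entrance: decode bytes into Unicode text using the data's real encoding.
text = incoming.decode("latin-1")

# Inside the pool: work on Unicode text.
processed = text.upper()

# Exit: encode the result into the format we want to store.
outgoing = processed.encode("utf-8")

# Digits and English letters produce the same bytes in ASCII and UTF-8.
same_bytes = "abc123".encode("ascii") == "abc123".encode("utf-8")

print(len(incoming), len(outgoing), same_bytes)
```

The two accented letters take one byte each in Latin-1 and two bytes each in UTF-8, which is why the output is two bytes longer than the input.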

3. The Python source file encoding format

This refers to the encoding of the source file itself, xxx.py. In Python 2 a source file is assumed to be ASCII-encoded by default (Python 3 assumes UTF-8). To change the encoding of the source file, add an encoding declaration such as # -*- coding: utf-8 -*- on the first or second line of the file. Separately, the sys module's getdefaultencoding() function returns the default encoding Python uses when converting between byte strings and Unicode strings. In Python 2 this default can be changed with sys.setdefaultencoding(), which is only available after reload(sys); note that it affects those conversions, not how the source file is read.
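A short sketch of both mechanisms, assuming Python 3 (where the default is already UTF-8 and setdefaultencoding() does not exist). In a real file the coding declaration must appear on the first or second line to take effect.

```python
# -*- coding: utf-8 -*-
import sys

# The default encoding used for str/bytes conversions.
default_encoding = sys.getdefaultencoding()

# sys.setdefaultencoding() is absent in Python 3; check rather than call it.
has_setter = hasattr(sys, "setdefaultencoding")

print(default_encoding, has_setter)
```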

