Solving Python's Chinese-character encoding error



I recently started learning Python from scratch. After installing the development environment, I planned to write a "Hello world" script to verify that the configuration was successful.


Finding the problem

Python is an interpreted scripting language; the interpreter executes the source code directly. I wrote a line like the following (the original script also contained Chinese characters, which is what matters below):


print "Hello python! hi python! "


Running the script did not produce the expected output; instead, it reported a syntax error:


SyntaxError: Non-ASCII character '\xe4' in file G:\Eclipse\workspace\MyPy1\src\Test1\__init__.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

What happened? Such a simple script, and it still failed!

To get to the bottom of it, I started troubleshooting from the error message. It turned out the problem was the Chinese characters in the script. To understand why, we have to start with character sets and character encodings.

Analyzing the cause

Before introducing character sets, let's first understand why they exist. What we see on the screen is rendered text, while what is actually stored on the computer's storage media is a binary bit stream. The conversion between the two needs a unified standard; otherwise, a document becomes garbled when we plug our USB drive into another computer, or a file a friend sends over QQ turns to gibberish when opened locally. Character set standards exist to define that conversion. Simply put, a character set is a set of rules prescribing which character (after decoding) a given string of binary values (the encoding) corresponds to.

So why are there so many character set standards? Mostly because many of these norms were drafted without anyone realizing they might one day need to be universal, or because the organization behind them had interests that conflicted with existing standards. The result is a pile of standards that serve the same purpose but are incompatible with each other.

For example, here are the hexadecimal and binary encodings of the Chinese character 编 under different character encodings:

Encoding    Hexadecimal   Binary
UTF-8       E7 BC 96      11100111 10111100 10010110
ANSI (GBK)  B1 E0         10110001 11100000
Unicode     7F 16         01111111 00010110

A character set is the name for a set of such rules; in real life, it corresponds to the name of a writing system, such as English, Chinese, or Japanese. Encoding a character correctly takes three key elements: the character repertoire, the coded character set, and the character encoding form. The character repertoire is essentially a database of all readable or displayable characters; it determines the range of characters the entire character set can represent. The coded character set assigns each character in the repertoire a numeric position, its code point. The character encoding form defines how code points map to the values actually stored. Often the code point is stored directly as the encoded value: in ASCII, for example, 'A' is character number 65 in the table, so 'A' is encoded as 0100 0001, the binary form of decimal 65.
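These layers can be observed directly in Python 3, where `ord` returns a character's code point and `str.encode` returns the bytes actually stored. A small sketch:

```python
# Code point (position in the coded character set) vs. stored bytes
# (the result of the character encoding step).
ch = "A"
print(ord(ch))             # code point: 65
print(bin(ord(ch)))        # 0b1000001
print(ch.encode("ascii"))  # stored as the single byte 0x41, i.e. b'A'
```

For ASCII the code point and the stored byte are numerically identical, which is exactly the "code point stored directly" case described above.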

The repertoire and the coded character set seem sufficient on their own: every character already has its own serial number, so why not just store that? Why bother converting the serial number into a different storage format via a character encoding?

A unified repertoire aims to cover every character in the world, but in practice any given text uses only a tiny fraction of it. Chinese-language programs rarely need Japanese characters, for instance, and for some English-speaking regions plain ASCII already covers basic needs. If every character were stored as its raw repertoire serial number, each would need 3 bytes (taking a Unicode repertoire as an example), which is pure overhead for an English-speaking region whose text previously needed only one byte per ASCII character: storage triples for no benefit. More concretely: on the same hard disk, ASCII could store 1,500 articles where 3-byte serial-number storage could fit only 500. Hence variable-length encodings such as UTF-8: an ASCII character that needs one byte still occupies only one byte in UTF-8, while characters such as Chinese and Japanese need three bytes.
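The variable-length behaviour is easy to verify in Python 3 (the byte counts below assume UTF-8):

```python
# UTF-8 spends one byte on an ASCII character and three on most CJK characters.
for ch in ("A", "\u7f16"):  # "A" and the Chinese character U+7F16
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded)
# "A" encodes to 1 byte; U+7F16 encodes to 3 bytes (e7 bc 96)
```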

Solving the problem

Garbled text appears when encoding and decoding use different, incompatible character sets. A real-life analogy: an Englishman writes "bless" on paper to express a blessing (the encoding step). A Frenchman picks up the paper, and since the French "blesser" means "to injure", he concludes the writer meant injury (the decoding step). That is everyday garbling. In computing, the same thing happens when a character is encoded with UTF-8 and decoded with GBK: the two character sets have different tables, the same character sits at a different position in each, and garbled output results.
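The mismatch can be reproduced in Python 3 by encoding a string as UTF-8 and then (wrongly) decoding the bytes as GBK. A sketch, using the word 编码 ("encoding") as the sample text:

```python
# Encoding with one character set and decoding with another produces mojibake:
# the six UTF-8 bytes pair up into three unrelated GBK characters.
raw = "\u7f16\u7801".encode("utf-8")  # the word "编码": b'\xe7\xbc\x96\xe7\xa0\x81'
garbled = raw.decode("gbk")           # wrong decoder -> garbled text
print(garbled)

# The bytes themselves are intact, so reversing the mistake recovers the text.
print(garbled.encode("gbk").decode("utf-8"))
```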

By default, Python 2 assumes source files are ASCII-encoded. ASCII is a 7-bit encoding covering only 128 characters and contains no Chinese, so decoding the source file fails and the error above is raised.

The workaround is simple: declare, at the top of the file, the character encoding used in the source code.


# -*- coding: utf-8 -*-

With this declaration added, the script runs correctly!

Going further

Take UTF-8 as an example: it is the most widely used Unicode encoding on the Internet. Other implementations of Unicode include UTF-16 and UTF-32, but they are rarely used on the Web. UTF-8, then, is just one of several ways Unicode can be implemented.

One of UTF-8's defining features is that it is a variable-length encoding: it represents a symbol using 1 to 4 bytes, with the length varying by symbol.

UTF-8's encoding rules are simple:

For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For the English alphabet, therefore, UTF-8 and ASCII encodings are identical.
For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each of the following bytes are set to 10. All remaining bits, not mentioned above, are filled with the symbol's Unicode code point.
The following table summarizes the encoding rules; the letter x marks the bits available for the code point.

Unicode symbol range (hexadecimal) UTF-8 encoding (binary)

0000 0000-0000 007F 0xxxxxxx
0000 0080-0000 07FF 110xxxxx 10xxxxxx
0000 0800-0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Take the Chinese character 严 (yán, "strict") as an example of how to apply these rules to produce a UTF-8 encoding.

The Unicode code point of 严 is 4E25 (binary 100111000100101). From the table above, 4E25 falls in the range of the third row (0000 0800–0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Starting from the last bit of 严's code point, fill the x positions from right to left, padding any leftover x positions with 0. The result: the UTF-8 encoding of 严 is 11100100 10111000 10100101, or E4B8A5 in hexadecimal.
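Python 3 agrees with the hand computation; a quick check:

```python
# Verify the worked example: code point, its 15 significant bits,
# and the resulting three-byte UTF-8 encoding of U+4E25.
ch = "\u4e25"
assert ord(ch) == 0x4E25                      # code point 4E25
assert bin(ord(ch)) == "0b100111000100101"    # 15 significant bits
assert ch.encode("utf-8") == b"\xe4\xb8\xa5"  # 11100100 10111000 10100101
print("all checks pass")
```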

You may wonder: why does the UTF-8 encoding take an extra byte compared with the raw Unicode code point? Why not just store the code point directly?

Storing raw Unicode code points directly would create two serious problems:

How would a computer distinguish Unicode from ASCII? How would it know that three bytes represent one symbol rather than three separate symbols?
We already know a single byte is enough for an English letter. If Unicode mandated that every symbol be stored in three or four bytes, every English letter would be preceded by two or three zero bytes, a huge waste of storage: text files would grow to two or three times their size, which is unacceptable.
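The size penalty is easy to quantify: UTF-32 stores every character in four bytes, while UTF-8 uses one byte per ASCII character. A sketch comparing the two (`utf-32-be` is used here so no byte-order mark is prepended):

```python
# Fixed-width storage of code points vs. variable-length UTF-8,
# measured on 500 characters of plain ASCII text.
text = "hello" * 100
utf8_size = len(text.encode("utf-8"))      # 1 byte per ASCII character
utf32_size = len(text.encode("utf-32-be")) # 4 bytes per character, no BOM
print(utf8_size, utf32_size)               # 500 vs. 2000: a fourfold difference
```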
