Encoding Problems in Python

Source: Internet
Author: User

When programming, we often run into inexplicable garbled characters. Many people search online and copy someone else's fix, which is a fine way to get past the problem quickly. But unless we understand where garbled text actually comes from, and use that understanding to find the root fix, we will never escape the shadow of encoding problems. Let's walk through the ins and outs of character encoding.

ASCII

As we all know, all data in a computer, whether text, images, video, or audio, is ultimately stored in binary, as sequences of bits like 01010101. Characters cannot be stored directly, so each one must be mapped to a number. Because computers were first developed in the United States, the earliest character encoding was the American Standard Code for Information Interchange (ASCII). ASCII defines 128 characters in total. For example, the uppercase letter A is 65 (binary 01000001) and the symbol @ is 64 (binary 01000000). Of the 128 codes, 0-31 and 127 (33 in total) are control or communication characters, and 32-126 are the printable characters found on a keyboard. An ASCII character uses only the low seven bits of a byte; the highest bit is always 0.
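The mapping between characters and ASCII codes is easy to check. This is a minimal sketch, assuming a Python 3 interpreter (the interactive sessions later in this article are Python 2); the variable names are only illustrative.

```python
# ASCII maps each character to a number that fits in seven bits.
for ch in ("A", "@"):
    code = ord(ch)                       # character -> integer code
    print(ch, code, format(code, "08b"))

# Every ASCII code is below 128, so the highest bit of each byte is 0.
ascii_bytes = "Hello, world!".encode("ascii")
high_bits_clear = all(b < 0x80 for b in ascii_bytes)
print(high_bits_clear)
```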

Later, extended ASCII encodings were introduced to represent the non-English letters used in Europe. An extended ASCII encoding keeps the original 128 characters and adds another 128, for 256 in total. The added characters all have the highest bit set to 1, so the encoding remains fully compatible with ASCII. In one such encoding (IBM code page 437), for example, 145 (binary 10010001) is the letter æ and 130 (binary 10000010) is the French letter é.

Extended ASCII can represent phonetic symbols and most non-English European letters, but it is not a single international standard. Different countries assign different characters to the codes 128-255, which produced many incompatible variants. For example, ISO 8859-1, also known as Latin-1, adds the characters common in Western Europe, including the accented letters of German and French; ISO 8859-2, also known as Latin-2, covers Eastern European characters; ISO 8859-3, also known as Latin-3, covers Southern European characters; and so on.
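The incompatibility between extended ASCII variants can be seen by decoding one byte value under several of them. The sketch below assumes Python 3 and its built-in codecs. The choice of byte 0xA3 is mine, and reading 145 and 130 through IBM code page 437 is also my assumption, since the text does not name the encoding those two values come from.

```python
# The same byte value means different characters in different extended sets.
raw = bytes([0xA3])
as_latin1 = raw.decode("iso8859-1")   # Western European set
as_latin2 = raw.decode("iso8859-2")   # Eastern European set
print(as_latin1, as_latin2)

# Codes below 128 decode identically everywhere: that is the ASCII compatibility.
same_ascii = b"A".decode("iso8859-1") == b"A".decode("iso8859-2") == "A"
print(same_ascii)

# Codes 145 and 130, the two values given in the text, read as code page 437.
cp437_chars = bytes([145, 130]).decode("cp437")
print(cp437_chars)
```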

Is this enough? Clearly not. Chinese characters, for example, cannot be represented in ASCII, and the 128 extra codes of extended ASCII are nowhere near enough for them.

GBK

Chinese engineers put a great deal of effort into making computers usable in Chinese, and GB2312 is the result. The standard was published in 1980 and took effect on May 1, 1981, an important step for computing in China. GB2312 contains 6763 Chinese characters and is compatible with ASCII. It basically meets the needs of everyday Chinese text processing: its characters cover 99.75% of usage frequency in mainland China. It cannot, however, represent some characters from classical Chinese or traditional Chinese characters. GBK was later created as an extension of GB2312 and officially released in 1995. GBK includes every Chinese and non-Chinese character in GB2312 and adds the Chinese characters used in Japanese and Korean. For example, 乭, the last character in the name of the famous Korean Go player Lee Sedol (李世乭), is encoded in GBK as 0x8168 (0x indicates hexadecimal).
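The difference in coverage between GB2312 and GBK can be probed with Python's built-in codecs. This is a sketch under the assumption of Python 3; `can_encode` is a hypothetical helper written for this example, and the 龙/龍 pair is my own choice of a simplified character and its traditional form. The last line prints the GBK bytes of 乭 (U+4E6D), the character from Lee Sedol's name, for comparison with the 0x8168 quoted in the text.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

simplified = "\u9f99"    # 龙, simplified form
traditional = "\u9f8d"   # 龍, traditional form of the same character
sedol_char = "\u4e6d"    # 乭, from the name Lee Sedol

for ch in (simplified, traditional, sedol_char):
    print(ch, can_encode(ch, "gb2312"), can_encode(ch, "gbk"))

# Hex of the GBK bytes, for comparison with the value quoted in the text.
sedol_gbk_hex = sedol_char.encode("gbk").hex()
print(sedol_gbk_hex)
```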

GBK generally uses two bytes per character. An English letter takes one byte, identical to its ASCII encoding. GBK is therefore compatible with ASCII, but not with any extended ASCII encoding, as its code ranges show.

A two-byte GBK character falls in the range 0x8140-0xFEFE (binary 1000000101000000-1111111011111110). The first byte is in the range 0x81-0xFE and the second byte is in the range 0x40-0xFE (0x7F is not used). The highest bit of the first byte is therefore always 1. When reading a byte stream, a byte whose highest bit is 0 is parsed as a single ASCII character; a byte whose highest bit is 1 is the first byte of a two-byte character and is read together with the byte that follows it.

Unicode

There are many languages in the world. Is there an encoding that can cover the characters of all of them? Yes: Unicode was designed to do exactly that. Unicode is a very large character set with room for more than 1 million characters, each assigned a unique number called its code point. Representing that many characters takes more than two bytes, so the straightforward fixed-width representation uses four bytes per character. For example, U+0639 is the Arabic letter Ain, U+0041 is the uppercase English letter A, and U+4E6D is the Chinese character 乭. The code charts at unicode.org list the code point of every character.
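Code points can be inspected from Python with `chr`, `ord`, and the standard `unicodedata` module. This sketch assumes Python 3 and uses the three code points mentioned above; the variable names are illustrative.

```python
import unicodedata

code_points = (0x0639, 0x0041, 0x4E6D)

for cp in code_points:
    ch = chr(cp)                        # code point -> character
    print(f"U+{cp:04X}", ch, unicodedata.name(ch))

# ord() is the inverse of chr().
round_trip_ok = all(ord(chr(cp)) == cp for cp in code_points)

# The Unicode code space ends at U+10FFFF, a little over one million values.
code_space_size = 0x10FFFF + 1
print(round_trip_ok, code_space_size)
```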

Using four bytes for every character is clearly wasteful, because English letters need only one byte each. This is the problem UTF-8 was created to solve.

Unicode only assigns a code point to each character; it does not specify how code points are stored or transmitted. UTF-8 is one implementation of Unicode: a variable-length encoding that uses 1 to 4 bytes per character, depending on the character. English letters take 1 byte, and most Chinese characters take 3 bytes.
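The variable length is easy to observe by encoding a few characters and counting bytes. A small sketch, assuming Python 3; the four sample characters are my own choice, one for each possible length.

```python
samples = [
    "A",            # ASCII letter
    "\u00e9",       # é, a Latin letter outside ASCII
    "\u6211",       # 我, a common Chinese character
    "\U0001F600",   # an emoji outside the 16-bit range
]

# Number of UTF-8 bytes needed for each sample character.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(list(zip(samples, lengths)))
```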

This raises a question. A string in memory is just a continuous run of bits, so how can a character's Unicode code point be stored so that the computer can tell where one character ends and the next begins, for instance that a given byte is an English letter on its own rather than part of a two- or three-byte character together with its neighbors? The designers of UTF-8 solved this neatly.

Characters that ASCII can represent take a single byte in UTF-8, identical to their ASCII encoding. For a multi-byte character of n bytes, the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits are filled with the binary digits of the character's Unicode code point.

Unicode code point range | UTF-8 encoding
(hexadecimal)            | (binary)
-------------------------+---------------------------------------------
0000 0000 ~ 0000 007F    | 0xxxxxxx
0000 0080 ~ 0000 07FF    | 110xxxxx 10xxxxxx
0000 0800 ~ 0000 FFFF    | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 ~ 0010 FFFF    | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The scheme is easy to follow. If the first bit of a character's first byte is 0, that byte is a complete character on its own. If the first bit is 1, the number of consecutive 1 bits at the start of the byte gives the number of bytes the character occupies. For example, the Unicode code point of 我 ("I") is U+6211, or 110001000010001 in binary. It falls in the third row of the table (0000 0800 ~ 0000 FFFF), so 我 needs three bytes in the format 1110xxxx 10xxxxxx 10xxxxxx. Starting from the last bit of the code point, fill in the x positions from right to left and pad any leftover x positions with 0. The resulting UTF-8 encoding of 我 is 11100110 10001000 10010001, or E68891 in hexadecimal, and these are the bytes actually stored in the computer.
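The fill-in procedure can be written out by hand and compared with Python's built-in encoder. This is a sketch, assuming Python 3; `utf8_encode` and `utf8_length` are hypothetical helper names, and the sketch does not reject surrogate code points the way a complete encoder would.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point to UTF-8 bytes by following the table."""
    if code_point <= 0x7F:
        return bytes([code_point])                         # 0xxxxxxx
    if code_point <= 0x7FF:
        n, lead = 2, 0b11000000                            # 110xxxxx
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b11100000                            # 1110xxxx
    elif code_point <= 0x10FFFF:
        n, lead = 4, 0b11110000                            # 11110xxx
    else:
        raise ValueError("code point out of range")

    out = []
    # Fill the continuation bytes from the back, six bits at a time.
    for _ in range(n - 1):
        out.append(0b10000000 | (code_point & 0b111111))   # 10xxxxxx
        code_point >>= 6
    # The bits that remain go into the first byte, after the length prefix.
    out.append(lead | code_point)
    return bytes(reversed(out))


def utf8_length(first_byte):
    """Return how many bytes a character occupies, judged from its first byte."""
    if first_byte < 0b10000000:
        return 1
    # Count the consecutive 1 bits at the top of the byte.
    count = 0
    while first_byte & 0b10000000:
        count += 1
        first_byte = (first_byte << 1) & 0xFF
    return count


encoded = utf8_encode(0x6211)
print(encoded.hex(), [format(b, "08b") for b in encoded])
print(utf8_length(encoded[0]))
```

Comparing `utf8_encode(ord(ch))` with `ch.encode("utf-8")` for a few characters is a quick way to check the hand-written version against the built-in one.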

One common misconception is worth pointing out. Many online "UTF-8 conversion" tools claim to convert Chinese characters to UTF-8, but most of them only convert each character to its Unicode code point, which is not the UTF-8 byte sequence used for storage and transmission.
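The difference between the two values is easy to see side by side. A minimal sketch, assuming Python 3, using the character 我 (U+6211) from the worked example; the variable names are illustrative.

```python
ch = "\u6211"   # 我

code_point_hex = format(ord(ch), "04X")        # the Unicode code point
utf8_hex = ch.encode("utf-8").hex().upper()    # the bytes actually stored

print("code point:  U+" + code_point_hex)
print("UTF-8 bytes: " + utf8_hex)
```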

Besides UTF-8, the Unicode implementations include UTF-16 and UTF-32. UTF-16 uses 2 or 4 bytes per character, and UTF-32 always uses 4 bytes, which map one-to-one to the Unicode code point. Whichever form is used, a given character always has the same Unicode code point; only the way the code point is converted to bytes for storage and transmission differs.
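Encoding the same characters in all three forms shows how the bytes differ while the code point stays fixed. This sketch assumes Python 3; the big-endian codec variants (`utf-16-be`, `utf-32-be`) are used so that no byte order mark is added to the output, and the emoji is my own example of a character that needs 4 bytes in UTF-16.

```python
bmp_char = "\u6211"          # 我, fits in 16 bits
astral_char = "\U0001F600"   # an emoji above U+FFFF

results = {}
for ch in (bmp_char, astral_char):
    results[ch] = {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16": ch.encode("utf-16-be").hex(),
        "utf-32": ch.encode("utf-32-be").hex(),
    }
    print(f"U+{ord(ch):04X}", results[ch])
```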

Python character encoding

The following describes the encoding in Python.

In Python 2, the default encoding is ASCII, which reflects when the language was born. Work on Python began in December 1989, while version 1.0 of the Unicode standard was not published until October 1991. Unicode was not available when Python was designed, so ASCII was the only choice. Unicode support was added later to make the language suitable for non-English users.

Unless told otherwise, Python 2 reads all source code, comments included, as ASCII.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
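The session above is from Python 2. For comparison, here is the same check as a small script, assuming it is run on Python 3, where the default is UTF-8 rather than ASCII.

```python
import sys

# On Python 3 this reports 'utf-8' instead of 'ascii'.
default_encoding = sys.getdefaultencoding()
print(sys.version_info.major, default_encoding)
```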

If Chinese characters are entered at the interactive prompt, Python 2 stores them as bytes in the system's default encoding, which on Simplified Chinese Windows is GBK.

>>> s = '你好'
>>> s
'\xc4\xe3\xba\xc3'
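The four bytes in this session are the GBK encoding of 你好, and they also show how garbled text arises: the bytes are fine, but they are decoded with the wrong encoding. The sketch below assumes Python 3, where text and bytes are separate types; the variable names are illustrative.

```python
text = "\u4f60\u597d"              # 你好
gbk_bytes = text.encode("gbk")     # the same bytes as in the session above
print(gbk_bytes)

# Decoding with the encoding that produced the bytes restores the text.
restored = gbk_bytes.decode("gbk")

# Decoding with a different single-byte encoding "succeeds" but yields garbage.
garbled = gbk_bytes.decode("iso8859-1")
print(restored, garbled)

# Decoding as UTF-8 fails outright because the bytes are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```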

However, a source file is still read as ASCII when it is compiled, so a script containing Chinese text fails.

# stringtest.py
print '你好'

Running it produces an error like this:

  File "string.py", line 1
SyntaxError: Non-ASCII character '\xe6' in file D:/MyGit/taobaoSpider/string.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

To use Chinese in your code, you must declare the file's encoding at the top of the file (on the first or second line), for example UTF-8:

# coding=utf-8

or

#!/usr/bin/python
# -*- coding: utf-8 -*-

In this way, you can use Chinese characters in the code.
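How the interpreter reads such a declaration can be checked with the standard `tokenize` module. This sketch assumes Python 3, whose `tokenize.detect_encoding` applies the PEP 263 rule to the first two lines of a file; `declared_encoding` and the three sample sources are hypothetical names and data made up for this example. Note that Python 3 falls back to UTF-8 when no declaration is present, unlike the ASCII default described above for Python 2.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source file."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

first_line_form = b"# coding=gbk\nprint('hi')\n"
second_line_form = b"#!/usr/bin/python\n# -*- coding: utf-8 -*-\nprint('hi')\n"
no_declaration = b"print('hi')\n"

detected = [
    declared_encoding(src)
    for src in (first_line_form, second_line_form, no_declaration)
]
print(detected)
```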

(End)
