Python character encoding

Last Update: 2018-01-10 Source: Internet

Author: User

When Python raises a UnicodeDecodeError or UnicodeEncodeError at runtime, how should you understand and fix it? A computer works in binary, so it must convert text to numbers before it can process it.

String encoding

(1) General-purpose Unicode
(2) Unicode converted to a specific encoding

Bytes

The basic unit of data storage. One byte equals 8 bits, so one byte can represent 256 states.
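The 256 states of a byte can be checked directly in Python. This is a minimal sketch; the variable names are chosen only for illustration.

```python
# 8 bits give 2 ** 8 distinct states, numbered 0 through 255.
states = 2 ** 8

# A bytes object can hold every value from 0 to 255 ...
all_values = bytes(range(256))

# ... but a value of 256 does not fit in one byte and is rejected.
try:
    bytes([256])
    overflow_rejected = False
except ValueError:
    overflow_rejected = True

print(states, len(all_values), overflow_rejected)
```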

Character

A character is a unit of information: a general term for letters, glyphs, and symbols. An English letter is a character, a Chinese character is a character, and a punctuation mark is also a character.

Character set

A collection of characters within a defined range; different character sets specify different numbers of characters. The ASCII character set has 128 characters in total, including English letters, Arabic numerals, punctuation marks, and control characters. The GB2312 character set defines 7,445 characters and covers most commonly used Chinese characters.

Character encoding

A character encoding is a concrete scheme for mapping the code of each character in a character set to a byte stream. Common character encodings are ASCII, UTF-8, and GBK.
A character set and a character encoding correspond to each other.
For example, the ASCII character set corresponds to the ASCII encoding. The ASCII encoding specifies that every character is encoded in the low 7 bits of a single byte. For example, the code of "A" is 65, whose single-byte representation is 0x41, so the bits written to the storage device are 01000001.
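The three notations for "A" can be reproduced with Python built-ins. The sketch below is only an illustration; the variable names are not from the original article.

```python
# Look at the code of "A" in decimal, hexadecimal, and binary.
code = ord("A")                  # integer code of the character
hex_form = hex(code)             # hexadecimal notation
bit_form = format(code, "08b")   # the 8 bits stored in a single byte
ascii_bytes = "A".encode("ascii")

print(code, hex_form, bit_form, ascii_bytes)
# prints: 65 0x41 01000001 b'A'
```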

Encode, decode

Encoding is the process of converting characters into a byte stream; decoding is the process of parsing a byte stream back into characters.
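A round trip makes the two directions concrete. This sketch assumes Python 3 and uses UTF-8 as the encoding; the sample text is arbitrary.

```python
text = "Python 编码"

data = text.encode("utf-8")       # encode: characters -> byte stream
restored = data.decode("utf-8")   # decode: byte stream -> characters

# The character count and the byte count differ because each
# Chinese character takes more than one byte in UTF-8.
print(type(data).__name__, len(text), len(data))
print(restored == text)
```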

ASCII encoding

A 128-character encoding that maps English characters to binary values.

GB2312

A common encoding for Chinese. It uses two bytes per Chinese character, so in theory two bytes can represent 256 x 256 = 65,536 symbols.
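The two-byte width can be observed with Python's built-in "gb2312" codec. This is a small sketch; the sample character is an arbitrary choice.

```python
ch = "中"
gb = ch.encode("gb2312")

# One Chinese character occupies two bytes in GB2312.
print(len(gb), gb.hex())

# ASCII characters still occupy a single byte in GB2312.
ascii_in_gb = "A".encode("gb2312")
print(len(ascii_in_gb))

# Theoretical upper bound on what two bytes can distinguish.
print(256 * 256)
```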

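Unicode (also translated as unified code, universal code, or single code)

Unicode makes it possible to convert and process text across languages and platforms.

A fixed-width two-byte Unicode encoding (such as UCS-2) uses two bytes per character; ASCII uses one byte.

The ASCII encoding of A is 01000001; its two-byte Unicode encoding is 0000000001000001. For text that is mostly English this wastes storage compared with ASCII, so to save space the variable-length transformation formats UTF-8 and UTF-16 are used.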
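The bit patterns quoted for "A" can be reproduced in Python. The sketch uses the "utf-16-be" codec as a stand-in for a two-byte Unicode encoding, which is an assumption made for illustration; for a character such as "A" it yields the same two bytes.

```python
one_byte = "A".encode("ascii")
two_byte = "A".encode("utf-16-be")   # big-endian, no byte-order mark

def to_bits(data: bytes) -> str:
    """Render each byte as 8 binary digits (hypothetical helper)."""
    return "".join(format(b, "08b") for b in data)

print(to_bits(one_byte))   # prints: 01000001
print(to_bits(two_byte))   # prints: 0000000001000001

# UTF-8 stores the same letter in one byte, avoiding the extra zero byte.
print(len("A".encode("utf-8")))
```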

UTF-8

A variable-length character encoding that uses 1 to 4 bytes per character. An English letter is encoded as one byte, and a Chinese character usually as 3 bytes.

Coding rules for UTF-8:

A. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the Unicode code of the symbol. For English letters, UTF-8 and ASCII are therefore identical.
B. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits hold the Unicode code of the symbol.
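Rules A and B can be sketched as a small hand-written encoder and checked against Python's built-in codec. The function name `utf8_encode_code_point` is hypothetical and the code is illustrative only; unlike the built-in codec it does not reject surrogate code points.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by following rules A and B."""
    if cp < 0x80:
        # Rule A: leading 0 bit followed by the 7-bit code.
        return bytes([cp])
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    elif cp <= 0x10FFFF:
        n = 4
    else:
        raise ValueError("code point out of Unicode range")

    # Rule B, trailing bytes: prefix 10 plus 6 payload bits each,
    # collected from the lowest bits upward.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (cp & 0b00111111))
        cp >>= 6

    # Rule B, first byte: n one-bits, a zero bit, then the remaining bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([prefix | cp] + tail[::-1])


for ch in ["A", "é", "你", "😀"]:
    manual = utf8_encode_code_point(ord(ch))
    print(ch, manual.hex(), manual == ch.encode("utf-8"))
```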

Code comparison case:
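One way to compare the encodings discussed so far is to count the bytes each produces for the same character. The sketch below is an illustration, not the article's original comparison; `byte_lengths` is a hypothetical helper, and "utf-16-be" stands in for a two-byte Unicode encoding.

```python
ENCODINGS = ["ascii", "gb2312", "utf-16-be", "utf-8"]

def byte_lengths(ch: str) -> dict:
    """Map each encoding name to the encoded length, or None if unencodable."""
    result = {}
    for enc in ENCODINGS:
        try:
            result[enc] = len(ch.encode(enc))
        except UnicodeEncodeError:
            result[enc] = None   # the character is not in this character set
    return result

for ch in ["A", "中"]:
    print(ch, byte_lengths(ch))
```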

String encoding in Python 3 uses str and bytes
(1) str string: stored as Unicode
(2) bytes string: Unicode converted to a specific encoding such as UTF-8
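The difference between the two types shows up in length and indexing. A minimal Python 3 sketch, with an arbitrary sample string:

```python
s = "你好"                 # str: a sequence of Unicode characters
b = s.encode("utf-8")      # bytes: the same text in a specific encoding

# len() counts characters for str and bytes for bytes.
print(type(s).__name__, len(s), s[0])
print(type(b).__name__, len(b), b[0])   # indexing bytes yields an integer
```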

Python 3 strings are Unicode by default: encode and decode

encode converts Unicode into the specified character encoding.
decode converts another character encoding into Unicode.
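In Python 3, calling `encode()` or `decode()` without an argument uses UTF-8, and any other encoding has to be named explicitly. The sketch below shows both cases; the sample text and variable names are illustrative.

```python
import sys

text = "你好"

default_bytes = text.encode()      # no argument: UTF-8 in Python 3
gbk_bytes = text.encode("gbk")     # explicitly named encoding

print(sys.getdefaultencoding())
print(len(default_bytes), len(gbk_bytes))

# Decoding must use the same encoding that produced the bytes.
print(default_bytes.decode() == text)
print(gbk_bytes.decode("gbk") == text)
```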

Causes of UnicodeEncodeError and UnicodeDecodeError

The root cause of UnicodeEncodeError and UnicodeDecodeError is that Python 2 uses the ASCII encoding by default for decode and encode operations.

>>> s = '你好'
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

When s is converted to a Unicode string, the decode method uses the ASCII encoding by default. The ASCII character set does not contain the Chinese characters "你好", so a UnicodeDecodeError is raised. The correct approach is to specify the UTF-8 character encoding.

>>> s.decode('utf-8')
u'\u4f60\u597d'

For the encode operation, when a Unicode string is converted to a string of type str, the conversion uses the ASCII encoding by default. The ASCII character set does not contain the Chinese characters "你好", so a UnicodeEncodeError is raised.

>>> a = u'你好'
>>> a.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

When a str string is mixed with a Unicode string, Python 2 implicitly converts the str string to Unicode. If the str string contains Chinese characters, a UnicodeDecodeError occurs, because Python 2 uses the ASCII encoding by default for the decode operation.

>>> s = '你好'  # str type
>>> y = u'python'  # unicode type
>>> s + y  # implicit conversion, i.e. s.decode('ascii') + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The correct way to do this is to decode with the UTF-8 character encoding specified explicitly.

>>> s.decode('utf-8') + y
u'\u4f60\u597dpython'
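The sessions above are Python 2. As a rough Python 3 counterpart, the same two errors can be reproduced by naming the ASCII codec explicitly, since Python 3 no longer defaults to it. This is a sketch under that assumption; note that Python 3 also refuses to mix str and bytes implicitly and raises a TypeError instead.

```python
raw = "你好".encode("utf-8")   # the bytes a Python 2 str would have held

decode_error = encode_error = mix_error = None

try:
    raw.decode("ascii")        # what Python 2 did by default
except UnicodeDecodeError as exc:
    decode_error = str(exc)

try:
    "你好".encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = str(exc)

try:
    "python" + raw             # no implicit conversion in Python 3
except TypeError as exc:
    mix_error = str(exc)

# Decoding with the encoding that actually produced the bytes works.
fixed = raw.decode("utf-8") + "python"

print(decode_error)
print(encode_error)
print(mix_error)
print(fixed)
```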

All garbled text can be traced to one cause: characters that were encoded in one encoding were decoded using a different, inconsistent encoding.
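Garbled output of this kind is easy to reproduce. The sketch below encodes text as UTF-8 and decodes it with Latin-1, a codec picked for illustration because it accepts every byte value and so never raises.

```python
original = "你好"
data = original.encode("utf-8")

# Decoding with the wrong codec succeeds but yields garbled characters.
garbled = data.decode("latin-1")
print(repr(garbled))

# Reversing the wrong step and decoding correctly recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

The repair works here only because Latin-1 maps every byte to a character without loss; with other codec pairs, bytes may be replaced or rejected and the original text may not be recoverable.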

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
