Understanding general encoding principles through Python 2 and Python 3 encoding problems

Last Update:2017-12-08 Source: Internet

Author: User



Today I ran into an exception while working with encodings in Python 2: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef

It turned out to be an encoding problem, one I almost never run into in Python 3, so I looked into it. Python 3 and Python 2 treat strings differently: in Python 3, strings are Unicode by default.
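The error above can be reproduced in Python 3 by decoding non-ASCII bytes with the ASCII codec explicitly, which is roughly what Python 2 does implicitly when it mixes byte strings with Unicode text. This is a minimal sketch; the sample bytes are an assumption chosen only because they start with 0xEF.

```python
# UTF-8 bytes of the full-width comma U+FF0C (sample data chosen for illustration).
data = b"\xef\xbc\x8c"

try:
    data.decode("ascii")          # ASCII only covers bytes 0x00-0x7F
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec that actually produced the bytes succeeds.
text = data.decode("utf-8")

print(error_message)
print(text)
```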

I. How Python 2 and Python 3 handle text

In Python 3, the text string type (which stores Unicode data) is named str, and the byte string type is named bytes. In general, instantiating a string gives you a str object.

If you want bytes, prefix the literal with b, or call encode on the string.

So the str object has an encode method, and the bytes object has a decode method.

The str object in Python 3 is called unicode in Python 2, and the bytes object in Python 3 is called str in Python 2.

To use Chinese strings in a Python 2 source file, declare the encoding at the top of the file: # -*- coding: utf-8 -*-
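The relationship between str and bytes in Python 3 can be sketched as follows. The sample strings are illustrative assumptions, not taken from the original article.

```python
text = "编码"                       # a str object: Unicode text
raw = b"abc"                        # a bytes object: literal with the b prefix (ASCII only)

encoded = text.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")   # bytes -> str

print(type(text), type(raw))
print(encoded, len(encoded))
print(decoded == text)
```

Note that the two methods are not interchangeable: in Python 3, str has no decode method and bytes has no encode method.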

II. Common encodings

While I was at it, I looked up material on character sets and encodings and found that many people struggle with the topic, so I will try to make it clear here. This article does not distinguish between character sets and encodings; both are referred to simply as encodings.

Whatever the encoding, data is stored in the computer as binary bytes, and one byte has 8 bits.

1. ASCII: 128 characters in total, represented by the low 7 bits of one byte.

2. ISO-8859-1: 128 characters are obviously not enough, so ISO developed new standards that extend ASCII, namely ISO-8859-1 through ISO-8859-15. ISO-8859-1 covers most Western European languages, so it is the most widely used. It is also a single-byte encoding and represents 256 characters in total. When characters turn into "?", the usual cause is that ISO-8859-1 was used: a character outside its range is replaced with 0x3F, which is "?".
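The "?" effect can be seen with Python's built-in codec. This is a sketch that assumes the encoder is asked to substitute unknown characters (errors="replace"); many systems behave this way by default, whereas Python raises an error unless told otherwise.

```python
# A character inside ISO-8859-1 maps to a single byte.
latin = "é".encode("iso-8859-1")

# Characters outside the 256-character range are replaced with 0x3F ("?").
lost = "中文".encode("iso-8859-1", errors="replace")

print(latin)
print(lost)
print(lost.decode("iso-8859-1"))    # the original text cannot be recovered
```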

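3. GB2312: a double-byte encoding whose first byte ranges from A1 to F7. A1-A9 is the symbol area, containing 682 symbols; B0-F7 is the Chinese character area, containing 6,763 Chinese characters. Each Chinese character is represented by two bytes.

4. GBK: an extension of GB2312 with an encoding range of 8140 to FEFE and 23,940 code points in total. Chinese characters encoded with GB2312 can be decoded with GBK.

5. UTF-16: characters in the range U+0000 to U+FFFF take two bytes, that is, 16 bits, hence the name UTF-16; for these characters the two bytes are simply the Unicode ordinal. Characters beyond that range take four bytes (a surrogate pair). UTF-16 wastes space for text that is mostly ASCII.

6. UTF-8: a variable-length encoding in which different ranges of Unicode ordinals use different numbers of bytes. The rules below cover the one- to three-byte forms (a four-byte form, whose first byte starts with 11110, also exists):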
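The compatibility between GB2312 and GBK can be checked with Python's built-in codecs. This is a minimal sketch; the sample characters are assumptions, with the traditional character 龍 chosen as one that GBK covers but GB2312 does not.

```python
text = "中文"
gb2312_bytes = text.encode("gb2312")
decoded_with_gbk = gb2312_bytes.decode("gbk")   # GBK decodes GB2312 data unchanged

traditional = "龍"
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False                           # not part of GB2312
gbk_bytes = traditional.encode("gbk")           # but GBK can encode it

print(decoded_with_gbk, in_gb2312, len(gbk_bytes))
```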
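As a quick check of UTF-16 sizes, the sketch below encodes one character from the U+0000 to U+FFFF range and one beyond it. The utf-16-le codec is used so that no byte order mark is added; the sample characters are illustrative assumptions.

```python
bmp_char = "中"                   # U+4E2D, inside U+0000..U+FFFF
beyond_bmp = "\U0001F601"         # U+1F601, outside that range

bmp_bytes = bmp_char.encode("utf-16-le")
pair_bytes = beyond_bmp.encode("utf-16-le")

print(len(bmp_bytes))             # one 16-bit unit
print(len(pair_bytes))            # two 16-bit units (a surrogate pair)
```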

(1) If the first bit of a byte is 0, the character is a single-byte character and occupies one byte. The 7 bits after the 0 are its ordinal number in Unicode.
(2) If a byte starts with 110, the character is a two-byte character and occupies 2 bytes. The 5 bits after 110 are part of the ordinal number in Unicode, and the second byte starts with 10.
(3) If a byte starts with 1110, the character is a three-byte character and occupies 3 bytes. The 4 bits after 1110 are part of the ordinal number in Unicode, and the second and third bytes start with 10.
(4) If a byte starts with 10, it is a continuation byte (the second or a later byte) of a multibyte character. The 6 bits after 10 are part of the ordinal number in Unicode.

The table below shows the layout of each byte. x marks the ordinal bits; concatenating the x bits of every byte gives the character's ordinal number in Unicode.

Byte 1	Byte 2	Byte 3
0xxx xxxx
110x xxxx	10xx xxxx
1110 xxxx	10xx xxxx	10xx xxxx

Let's look at three UTF-8 encoding examples from one byte to three bytes, respectively:

Character	Unicode ordinal (hex)	Unicode ordinal (binary)	UTF-8 encoding (binary)	UTF-8 encoding (hex)
$	0024	010 0100	0010 0100	24
¢	00A2	000 1010 0010	1100 0010 1010 0010	C2 A2
€	20AC	0010 0000 1010 1100	1110 0010 1000 0010 1010 1100	E2 82 AC
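The encoding rules can be sketched by hand and compared with Python's built-in encoder. The function name utf8_encode_bmp is hypothetical, and the sketch only covers ordinals up to U+FFFF (the one- to three-byte forms described here); it does not reject surrogate values the way the built-in encoder does.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode an ordinal in U+0000..U+FFFF using the 1- to 3-byte UTF-8 forms."""
    if code_point < 0x80:                       # 0xxx xxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110x xxxx, 10xx xxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:                    # 1110 xxxx, 10xx xxxx, 10xx xxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0x3F),
            0b10000000 | (code_point & 0x3F),
        ])
    raise ValueError("ordinals above U+FFFF need the four-byte form")


for ch in "$¢€":
    manual = utf8_encode_bmp(ord(ch))
    print(ch, manual.hex(" ").upper(), manual == ch.encode("utf-8"))
```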

From this brief introduction, a careful reader can easily derive the following rules:

A 3-byte UTF-8 encoding in hex must begin with E
A 2-byte UTF-8 encoding in hex must begin with C or D
A 1-byte UTF-8 encoding in hex must begin with a digit smaller than 8
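These first-byte rules translate directly into a small classifier. The function name utf8_sequence_length is hypothetical; the sketch checks only the leading bit pattern and also handles the four-byte form (first hex digit F), without validating the rest of the sequence or excluding the few lead bytes that strict UTF-8 forbids.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged from its first byte."""
    if first_byte < 0x80:                 # 0xxx xxxx: first hex digit 0-7
        return 1
    if 0xC0 <= first_byte <= 0xDF:        # 110x xxxx: first hex digit C or D
        return 2
    if 0xE0 <= first_byte <= 0xEF:        # 1110 xxxx: first hex digit E
        return 3
    if 0xF0 <= first_byte <= 0xF7:        # 1111 0xxx: first hex digit F
        return 4
    raise ValueError("not a lead byte (continuation byte or invalid value)")


for sample in ("$", "¢", "€"):
    lead = sample.encode("utf-8")[0]
    print(sample, hex(lead), utf8_sequence_length(lead))
```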

III. A common problem: handling emoji

Emoji here means characters in the U+1F601 to U+1F64F range of Unicode. This is clearly beyond the range U+0000 to U+FFFF that the one- to three-byte UTF-8 forms can encode, so each of these characters needs four bytes in UTF-8. With the popularity of iOS and its support for emoji, they are becoming more and more common.
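As a quick check, the first character of that range can be encoded with Python's built-in UTF-8 codec. This is a minimal sketch; the variable names are illustrative.

```python
emoji = "\U0001F601"                 # first character of the range mentioned above
encoded = emoji.encode("utf-8")

print(hex(ord(emoji)))               # the ordinal is above 0xFFFF
print(encoded.hex(" ").upper())      # four bytes, the first one starting with F
print(len(encoded))
```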

So how do emoji affect everyday development and operations? The most common problem appears when you store them in a MySQL database. In general, the default character set of a MySQL database is configured as utf8 (at most three bytes per character). utf8mb4 has been supported since version 5.5, but few DBAs actively change the system default character set to utf8mb4. The problem then shows up when we store a character that needs a four-byte UTF-8 encoding: ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column. If you have read the explanation above carefully, this error is not hard to understand. We tried to insert a string of bytes into a column, and the first byte of that string, \xF0, marks a four-byte UTF-8 encoding. When the MySQL table and column character set are configured as utf8, such characters cannot be stored, so the error is reported. Configuring the table and column as utf8mb4 allows them to be stored.

[Appendix] A well-known programmers' rhyme built from classic mojibake strings: "Wielding two 锟斤拷 in hand, crying 烫烫烫 aloud; treading on a thousand 屯屯屯, laughing at all things 锘锘锘."
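One way to catch such characters before they reach a three-byte utf8 column is to look for ordinals above U+FFFF. The helper name find_four_byte_chars is hypothetical, and this is only a sketch of an application-side check, not a MySQL API.

```python
def find_four_byte_chars(text: str) -> list:
    """Return the characters in text that need four bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# The bytes quoted in the error message decode to a single character above U+FFFF.
rejected = b"\xF0\x9D\x8C\x86".decode("utf-8")

print(hex(ord(rejected)))
print(find_four_byte_chars("ok " + rejected))
print(find_four_byte_chars("中文€"))   # three-byte characters pass the check
```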


This article is an English version of an article originally written in Chinese on aliyun.com.
