Understanding general encoding principles through Python 2 and Python 3 encoding problems

Source: Internet
Author: User
Tags: hex, code

Today I ran into an exception while handling encoded text in Python 2: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef

It turned out to be an encoding problem, one I almost never see in Python 3. So I looked into it and found that Python 3 and Python 2 treat strings differently: in Python 3, strings are Unicode by default.
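
The same error can be reproduced in Python 3 by decoding explicitly as ASCII. This is a minimal sketch; the sample input (a full-width exclamation mark whose UTF-8 form starts with 0xef) is an assumption, since the original data is not shown.

```python
# Hypothetical input: UTF-8 bytes whose first byte is 0xef.
data = "\uff01".encode("utf-8")   # b'\xef\xbc\x81'

try:
    # ASCII only covers bytes 0x00-0x7F, so 0xef cannot be decoded.
    data.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the encoding the bytes were actually written in succeeds.
text = data.decode("utf-8")
print(text == "\uff01")
```

In Python 2 the ASCII decode typically happens implicitly, for example when a byte string is mixed with a unicode string, which is why the error can appear without any explicit decode call.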

I. How Python 2 and Python 3 handle text

In Python 3, the text string type (which stores Unicode data) is named str, and the byte string type is named bytes. In general, creating a string from a plain literal gives you a str object.

To get bytes instead, prefix the literal with b, or call encode() on a str.

So the str object has an encode method, and the bytes object has a decode method.
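
A short Python 3 sketch of the two types and the conversions between them; the variable names and sample text are only illustrative.

```python
# A plain literal is text (str); a b-prefixed literal is bytes.
text = "中文"
raw = b"abc"
print(type(text).__name__, type(raw).__name__)

# str -> bytes with encode(), bytes -> str with decode().
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

print(encoded)           # six bytes: three per Chinese character
print(decoded == text)   # the round trip restores the original text
```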

What Python 3 calls str is called unicode in Python 2, and what Python 3 calls bytes is called str in Python 2.

To use Chinese string literals in a Python 2 source file, declare the encoding at the top of the file: # -*- coding: utf-8 -*-
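
A minimal sketch of such a file header. The u prefix is an addition not mentioned above; it is used here so the same lines behave as text in both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# The declaration above must appear on the first or second line of the file.

greeting = u"你好"                 # text: unicode in Python 2, str in Python 3
data = greeting.encode("utf-8")    # explicit bytes, e.g. for writing to a file

print(len(greeting))   # counts characters
print(len(data))       # counts bytes
```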

II. Common encodings

While I was at it, I read up on character sets and encodings and found that many people have trouble understanding them, so I will try to make them clear here. This article does not distinguish between character sets and encodings; both are simply called encodings.

Whatever the encoding, data is stored in the computer as binary bytes, and one byte has 8 bits.

1. ASCII: 128 characters in total, represented by the low 7 bits of one byte.

2. ISO-8859-1: 128 characters are obviously not enough, so ISO developed new standards that extend ASCII, namely the ISO-8859 family (ISO-8859-1 through ISO-8859-16). ISO-8859-1 covers most Western European languages, so it is the most widely used. It is also a single-byte encoding and can represent 256 values in total. If you find characters turning into ?, ISO-8859-1 is usually involved: characters outside its range are replaced with 0x3F, which is ?.

3. GB2312: a double-byte encoding whose lead byte ranges over A1-F7. A1-A9 is the symbol area, containing 682 symbols; B0-F7 is the Chinese character area, containing 6,763 Chinese characters. Each Chinese character is represented by two bytes.

4. GBK: an extension of GB2312 with the encoding range 8140-FEFE, 23,940 code points in total. Chinese characters encoded with GB2312 can be decoded with GBK.

5. UTF-16: characters in the range U+0000 to U+FFFF take two bytes, that is 16 bits, hence the name UTF-16; characters above that range take four bytes. This wastes space for text that is mostly ASCII.
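
The behaviour of the encodings above can be checked with Python's built-in codecs. This is a small sketch; the sample strings are arbitrary choices.

```python
text = "中文"

# ASCII only covers 128 characters, so Chinese text cannot be encoded.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("ascii can encode:", ascii_ok)

# ISO-8859-1 is single-byte; with errors="replace", every character
# outside its range becomes 0x3F, which is "?".
latin1_bytes = "café 中文".encode("iso-8859-1", errors="replace")
print(latin1_bytes)

# GB2312 uses two bytes per Chinese character, and GBK can decode them.
gb2312_bytes = text.encode("gb2312")
print(gb2312_bytes, len(gb2312_bytes))
print(gb2312_bytes.decode("gbk") == text)

# UTF-16: two bytes up to U+FFFF, four bytes above it.
# "utf-16-le" is used so that no byte order mark is added to the output.
bmp_len = len("中".encode("utf-16-le"))
emoji_len = len("\U0001F601".encode("utf-16-le"))
print(bmp_len, emoji_len)
```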

6. UTF-8: a variable-length encoding in which each code range has its own length. The forms described here take up to three bytes (characters above U+FFFF take four), with the following encoding rules:

(1) If the first bit of a byte is 0, the current character is a single-byte character and occupies one byte. The 7 bits after the 0 represent its ordinal number (code point) in Unicode.
(2) If a byte starts with 110, the current character is a two-byte character and occupies 2 bytes. The 5 bits after the 110 are part of its ordinal number in Unicode, and the second byte starts with 10.
(3) If a byte starts with 1110, the current character is a three-byte character and occupies 3 bytes. The 4 bits after the 1110 are part of its ordinal number in Unicode, and the second and third bytes start with 10.
(4) If a byte starts with 10, the current byte is a continuation byte (the second or a later byte) of a multibyte character. The 6 bits after the 10 are part of the ordinal number in Unicode.

The layout of each byte is shown in the following table, where x marks the bits that carry the ordinal number. Concatenating the x bits of all bytes gives the character's ordinal number in Unicode.

Byte 1       Byte 2       Byte 3
0xxx xxxx
110x xxxx    10xx xxxx
1110 xxxx    10xx xxxx    10xx xxxx
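
The three patterns in the table can be turned into a small encoder. This is an illustrative sketch, not how Python implements UTF-8 internally; the function name utf8_encode_bmp is made up for this example, and it does not reject the surrogate range U+D800 to U+DFFF.

```python
def utf8_encode_bmp(code_point):
    """Encode a code point up to U+FFFF using the three table patterns."""
    if code_point < 0x80:
        # 0xxx xxxx: 7 payload bits
        return bytes([code_point])
    if code_point < 0x800:
        # 110x xxxx 10xx xxxx: 5 + 6 payload bits
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # 1110 xxxx 10xx xxxx 10xx xxxx: 4 + 6 + 6 payload bits
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("code points above U+FFFF need the four-byte form")


# Compare the hand-written encoder with the built-in codec.
for ch in "$\u00a2\u20ac中":
    print(ch, utf8_encode_bmp(ord(ch)) == ch.encode("utf-8"))
```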

Let's look at three UTF-8 encoding examples, using one, two and three bytes respectively:

Character   Code point (hex)   Code point (binary)    UTF-8 (binary)                  UTF-8 (hex)
$           0024               010 0100               0010 0100                       24
¢           00A2               000 1010 0010          1100 0010 1010 0010             C2 A2
€           20AC               0010 0000 1010 1100    1110 0010 1000 0010 1010 1100   E2 82 AC
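
The rows of this table can be regenerated with the built-in codec. This is a small sketch; table_row is a hypothetical helper name.

```python
def table_row(ch):
    """Return (code point hex, UTF-8 bits, UTF-8 hex) for one character."""
    encoded = ch.encode("utf-8")
    code_point_hex = "%04X" % ord(ch)
    utf8_bits = " ".join(format(byte, "08b") for byte in encoded)
    utf8_hex = " ".join("%02X" % byte for byte in encoded)
    return code_point_hex, utf8_bits, utf8_hex


for ch in ["$", "\u00a2", "\u20ac"]:
    print(ch, *table_row(ch), sep="   ")
```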

From the brief introduction above, a careful reader can easily derive the following rules:

    • The hex form of a 3-byte UTF-8 sequence must begin with E
    • The hex form of a 2-byte UTF-8 sequence must begin with C or D
    • The hex form of a 1-byte UTF-8 sequence must begin with a digit smaller than 8
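
These rules amount to reading the sequence length from the first byte. The sketch below classifies by bit pattern only and also covers the four-byte form (first hex digit F); it does not check for lead bytes that are never valid in practice, such as C0 or C1. The function name is made up for this example.

```python
def utf8_sequence_length(first_byte):
    """Guess how many bytes a UTF-8 sequence has from its first byte."""
    if first_byte < 0x80:            # first hex digit 0-7
        return 1
    if 0xC0 <= first_byte <= 0xDF:   # first hex digit C or D
        return 2
    if 0xE0 <= first_byte <= 0xEF:   # first hex digit E
        return 3
    if 0xF0 <= first_byte <= 0xF7:   # first hex digit F
        return 4
    # 0x80-0xBF are continuation bytes; 0xF8-0xFF are not used by UTF-8.
    raise ValueError("not a valid first byte: 0x%02X" % first_byte)


for ch in ["$", "\u00a2", "\u20ac", "\U0001F601"]:
    first = ch.encode("utf-8")[0]
    print("0x%02X" % first, utf8_sequence_length(first))
```
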
III. Handling a common problem: emoji

The emoji discussed here are characters in the U+1F601 to U+1F64F range of Unicode. This is clearly beyond the U+0000 to U+FFFF range that UTF-8 sequences of up to three bytes can encode. With the popularity and support of iOS, emoji are becoming more and more common.
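
A quick check in Python 3 shows that a character from this range needs four bytes in UTF-8; U+1F601 is used as the sample.

```python
emoji = "\U0001F601"             # first code point of the range above
encoded = emoji.encode("utf-8")

print(hex(ord(emoji)))           # the code point is above 0xFFFF
print(encoded, len(encoded))     # four bytes, the first one is 0xF0
```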

So what effect do emoji have on everyday development and operations? The most common problem appears when you store them in a MySQL database. In general, the default character set of a MySQL database is configured as utf8 (at most three bytes per character). utf8mb4 has been supported since MySQL 5.5, but few DBAs proactively change the system default character set to utf8mb4. So when we store a character that takes four bytes in UTF-8, the database reports: ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column . If you read the explanation above carefully, this error is not hard to understand. We tried to insert a sequence of bytes into a column, and the first byte, \xF0, marks a four-byte UTF-8 sequence. When the MySQL table and column character set is utf8, such characters cannot be stored, so the error is reported.
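
On the application side, one possible precaution is to detect such characters before writing to a utf8 column. This is a minimal sketch in plain Python with no database access; needs_utf8mb4 and strip_non_bmp are hypothetical helper names, and replacing characters is only one option (switching the column to utf8mb4 avoids the data loss).

```python
def needs_utf8mb4(text):
    """True if any character lies above U+FFFF (four bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def strip_non_bmp(text, replacement="\ufffd"):
    """Replace every character above U+FFFF with a placeholder."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


# The bytes from the error message decode to a single character above U+FFFF.
rejected = b"\xF0\x9D\x8C\x86".decode("utf-8")
print(hex(ord(rejected)))

print(needs_utf8mb4("中文"))             # three-byte characters are fine
print(needs_utf8mb4("hi " + rejected))   # this one would trigger the error
print(strip_non_bmp("hi " + rejected))
```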

[Appendix] "Holding two 锟斤拷 in hand, crying 烫烫烫 aloud; treading on a thousand 屯屯屯, laughing at all things 锘锘锘." (A well-known joke built from characters that commonly appear as mojibake.)

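The characters in this joke can be reproduced by decoding bytes with the wrong encoding. This is a small sketch: the first case is the Unicode replacement character U+FFFD, written as UTF-8 and then read as GBK; the other two are runs of the bytes 0xCC and 0xCD, which are commonly attributed to fill patterns for uninitialized memory in some debug builds.

```python
# U+FFFD twice, encoded as UTF-8, is EF BF BD EF BF BD; read as GBK,
# those six bytes pair up into three Chinese characters.
replacement_mojibake = "\ufffd\ufffd".encode("utf-8").decode("gbk")
print(replacement_mojibake)

# Pairs of 0xCC or 0xCD bytes read as GBK.
cc_mojibake = b"\xcc\xcc".decode("gbk")
cd_mojibake = b"\xcd\xcd".decode("gbk")
print(cc_mojibake, cd_mojibake)
```
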