Unicode encoding and UTF-8 encoding in Python

This afternoon I read the strings and encoding section of Liao Xuefeng's Python 2.7 tutorial. It made a few things click, so I am writing them down here, together with what I picked up from Cui Qingcai's Python blog:

ASCII: a 7-bit code that defines 128 characters (values 0-127), each stored in one byte (8 bits, which could hold 0-255). It covers uppercase and lowercase letters, digits, and some symbols, and it is mainly used for modern English.

This is a problem for Chinese, which needs at least two bytes per character, so China developed GB2312.
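A minimal sketch of that limitation, assuming Python 3 (where the u prefix is still accepted) and using the built-in ascii and gb2312 codecs. The variable names are only for illustration.

```python
letter = u'A'
chinese = u'中'

ascii_bytes = letter.encode('ascii')   # a single byte
code_point = ord(letter)               # 65, inside the 0-127 ASCII range

# ASCII has no slot for a Chinese character, so encoding fails.
try:
    chinese.encode('ascii')
    ascii_can_encode_chinese = True
except UnicodeEncodeError:
    ascii_can_encode_chinese = False

gb_bytes = chinese.encode('gb2312')    # GB2312 spends two bytes on it

print(len(ascii_bytes), code_point, ascii_can_encode_chinese, len(gb_bytes))
# prints: 1 65 False 2
```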

Other countries developed their own national standards as well: Japan developed Shift_JIS, South Korea developed EUC-KR, and so on. With so many incompatible encodings in use, garbled text was the result.
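Garbling of this kind is easy to reproduce. The sketch below, assuming Python 3, encodes a Chinese string with GB2312 and then reads the same bytes back as if they were Shift_JIS. The sample string is only an illustration.

```python
text = u'中文'

# Bytes as a program using GB2312 would write them.
raw = text.encode('gb2312')

# A program that assumes Shift_JIS reads the same bytes and gets nonsense.
# errors='replace' keeps the demo from raising on undecodable bytes.
garbled = raw.decode('shift_jis', errors='replace')

# Decoding with the encoding that was actually used recovers the text.
restored = raw.decode('gb2312')

print(repr(garbled))
print(restored == text)
```

The bytes themselves never change. Only the assumption about which encoding they are in differs, and that is where the garbling comes from.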

Unicode was created to unify all of this. It brings the characters of all languages into a single character set, which solves the garbled-text problem. But it introduces a new one: inefficient storage and transmission.

ASCII uses 1 byte per character, while the common fixed-width Unicode encodings (UCS-2/UTF-16) use 2 bytes for most characters. One byte is enough for an English letter, but in such an encoding it takes two, and one of them is simply 0.
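The extra zero byte can be seen directly. This sketch assumes Python 3 and uses the utf-16-be codec as a stand-in for the two-byte Unicode encoding described above. The text does not name a specific encoding, so that choice is an assumption.

```python
letter = u'A'

as_ascii = letter.encode('ascii')      # b'A'
as_utf16 = letter.encode('utf-16-be')  # b'\x00A': a zero byte, then 0x41

print(len(as_ascii), len(as_utf16))
# prints: 1 2

# For mostly-English text the two-byte form doubles the size.
english = u'hello world'
size_ascii = len(english.encode('ascii'))
size_utf16 = len(english.encode('utf-16-be'))
```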

To save space, UTF-8 appeared: a variable-length encoding of Unicode. UTF-8 encodes each Unicode character into 1 to 4 bytes depending on its code point (the original design allowed up to 6 bytes, but the current standard limits it to 4). Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters take 4 bytes. If the text you want to transfer is mostly English, UTF-8 saves space. ASCII can also be seen as a subset of UTF-8, so a lot of legacy software that only supports ASCII keeps working with UTF-8 data.
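The variable length can be checked with a few sample characters. This is a small sketch assuming Python 3. The samples are arbitrary picks, one for each byte length that current UTF-8 (limited to 4 bytes per character) can produce.

```python
# One sample per UTF-8 length: Latin letter, accented letter,
# common Chinese character, rare CJK Extension B character (U+20000).
samples = [u'A', u'\u00e9', u'中', u'\U00020000']
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(utf8_lengths)
# prints: [1, 2, 3, 4]

# Pure ASCII text produces identical bytes under ASCII and UTF-8,
# which is why ASCII-only software can still read it.
english = u'hello'
ascii_compatible = english.encode('ascii') == english.encode('utf-8')
print(ascii_compatible)
# prints: True
```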

Now suppose I edit a Python script in Notepad. When I open the file, its contents are loaded into memory, where the text I am working on is held temporarily. In memory, text is uniformly represented as Unicode.

So when I write a Chinese string, I put a u in front of it to mark it as a Unicode string.

The same prefix also appears in the code on Cui Qingcai's blog.

But why do we sometimes need to call decode('utf-8')? Cui Qingcai's blog helps here as well.

The response that the Qiushibaike server sends to the client (that is, the browser) is encoded as UTF-8.

To edit or read the text, it has to be Unicode in memory, so decode('utf-8') is used to convert the UTF-8 bytes into a Unicode string. In the same way, encode('utf-8') converts a Unicode string into UTF-8 bytes.
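The round trip can be sketched as follows, assuming Python 3, where bytes and str play the roles that str and unicode play in Python 2.7. The response_body value is a hypothetical stand-in for the bytes a server would send. No real request is made.

```python
# Hypothetical response body: UTF-8 bytes as they would arrive over the network.
response_body = u'你好, Python'.encode('utf-8')

# decode: UTF-8 bytes -> Unicode string, ready for searching or editing.
page_text = response_body.decode('utf-8')

# encode: Unicode string -> UTF-8 bytes, ready to be saved or sent.
encoded_again = page_text.encode('utf-8')

print(len(response_body), len(page_text))
# prints: 14 10
print(encoded_again == response_body)
# prints: True
```

The byte count is larger than the character count because each of the two Chinese characters takes three bytes in UTF-8.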

When text is saved to disk or transmitted, it is converted to UTF-8. For the same reason, a Python script that contains non-ASCII characters should declare its encoding at the top of the file: # -*- coding: utf-8 -*-
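A minimal sketch of both points: the declaration on the first line, and an explicit UTF-8 conversion when writing to disk. The declaration only tells the interpreter how the source file itself is encoded (Python 2 assumes ASCII without it, Python 3 assumes UTF-8). The helper names save_text and load_text and the file name demo.txt are made up for this example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def save_text(path, text):
    """Write a Unicode string to disk as UTF-8 bytes."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)


def load_text(path):
    """Read UTF-8 bytes from disk back into a Unicode string."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()


path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
save_text(path, u'中文 text')
loaded = load_text(path)
size_on_disk = os.path.getsize(path)

print(loaded == u'中文 text', size_on_disk)
# prints: True 11
```

io.open is used because it accepts an encoding argument in both Python 2.7 and Python 3, so the conversion between Unicode in memory and UTF-8 on disk is spelled out rather than left to a platform default.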

Photo source

Liao Xuefeng's official website: https://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001386819196283586a37629844456ca7e5a7faa9b94ee8000

Cui Qingcai's personal blog: http://cuiqingcai.com/990.html
