Python basics: character encoding and transcoding

Source: Internet
Author: User

Character encoding

When the Python interpreter loads the code in a .py file, it decodes the file's bytes using a source encoding. The default is ASCII in Python 2 and UTF-8 in Python 3.
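As a minimal sketch of how a source file can state its own encoding, the snippet below uses the declaration comment from PEP 263. The variable name `greeting` is only illustrative. In Python 3 the declaration is optional when the file is saved as UTF-8; in Python 2 a file containing non-ASCII text needs it.

```python
# -*- coding: utf-8 -*-
# The comment above must be on the first or second line of the file.
# It tells the interpreter which encoding to use when decoding the source bytes.

greeting = "你好"          # non-ASCII text in the source file
print(len(greeting))       # counts characters, not bytes
```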

ASCII code

ASCII (American Standard Code for Information Interchange) is a computer character encoding based on the Latin alphabet, mainly used to display modern English and, through 8-bit extensions, other Western European languages. Standard ASCII uses 7 bits and defines 128 characters; even with all 8 bits of a byte, a single-byte encoding can represent at most 256 characters.
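The limits of ASCII can be checked directly in Python. The following is a small sketch (variable names are illustrative) that looks up a character's code with `ord()` and shows that Chinese text cannot be encoded as ASCII.

```python
# Standard ASCII assigns the codes 0..127; ord() returns a character's code.
code_of_a = ord("A")
fits_in_7_bits = code_of_a < 128

# Text outside the ASCII range cannot be encoded as ASCII.
try:
    "中文".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(code_of_a, fits_in_7_bits, ascii_ok)
```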

Handling of Chinese

GB 2312 is a character encoding designed for information exchange between Chinese character processing and Chinese communication systems. It is used in mainland China and Singapore, and almost all Chinese-language systems and internationalized software in mainland China support it.

The basic set contains 6,763 Chinese characters and 682 non-Chinese graphic characters. The entire character set is divided into 94 rows (区), each of which has 94 cells (位). Each cell holds exactly one character, so a character can be identified by its row and cell numbers; this pair is called the location code (区位码).

Converting the row and cell numbers of the location code to hexadecimal and adding 2020H gives the GB code (national standard code). Adding 8080H to the GB code gives the internal code commonly used in computers. The Chinese Internal Code Extension Specification (GBK) was issued in 1995. GBK is compatible with the GB 2312-1980 national standard and also supports all Chinese, Japanese, and Korean (CJK) characters of ISO/IEC 10646-1 and GB 13000-1, 20,902 characters in total.
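The arithmetic from location code to internal code can be followed with one worked example. This sketch assumes the character "啊", which sits at row 16, cell 01 of GB 2312; the helper name `location_to_internal` is hypothetical and only for illustration.

```python
def location_to_internal(row, cell):
    """Convert a GB 2312 location code (row, cell, each 1..94) to (GB code, internal code)."""
    location_hex = (row << 8) | cell      # row and cell each written as one hex byte
    gb_code = location_hex + 0x2020       # location code + 2020H
    internal_code = gb_code + 0x8080      # GB code + 8080H
    return gb_code, internal_code

# "啊" is at row 16, cell 01 (location code 1601, i.e. 1001H).
gb_code, internal_code = location_to_internal(16, 1)

# Python's gb2312 codec produces the same two bytes as the internal code.
encoded = "啊".encode("gb2312")
print(hex(gb_code), hex(internal_code), encoded)
```

Here 1001H + 2020H = 3021H, and 3021H + 8080H = B0A1H, which matches the bytes the codec returns.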

The appearance of Unicode

Unicode (also called unified code, universal code, or single code) is an industry standard in computer science that covers a character set, encoding schemes, and more. Unicode was created to overcome the limitations of traditional character encoding schemes: it assigns a uniform and unique code to every character in every language, to meet the requirements of cross-language, cross-platform text conversion and processing.
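To see the unique number that Unicode assigns to a character, Python's built-in `ord()` and `chr()` can be used. The sketch below uses the character "中" as an arbitrary example.

```python
ch = "中"

# ord() gives the Unicode code point as an integer.
code_point = ord(ch)
print(f"U+{code_point:04X}")

# chr() maps a code point back to its character.
same_char = chr(0x4E2D)
print(same_char == ch)
```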

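UTF-8 encoding format

UTF-8 is a variable-length encoding of Unicode. In UTF-8, Chinese characters generally occupy three bytes each (the rare characters outside the Basic Multilingual Plane take four).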
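The byte counts can be verified by encoding a character and measuring the result. This is a small sketch using "A" and "中" as sample characters.

```python
# An ASCII character takes one byte in UTF-8.
ascii_len = len("A".encode("utf-8"))

# A common Chinese character takes three bytes in UTF-8.
chinese_bytes = "中".encode("utf-8")
chinese_len = len(chinese_bytes)

print(ascii_len, chinese_bytes, chinese_len)
```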

How do I get the default encoding used by the current Python interpreter?

import sys
print(sys.getdefaultencoding())

Character transcoding

In Python 3, all strings (str) are Unicode, so a string only needs to be encoded; there is no need to decode it to Unicode first.

    • To convert the string to GBK encoding:

      s = "unicode字符串"
      s_gbk = s.encode("gbk")

    • To convert the string to UTF-8 encoding:

      s_utf8 = s.encode("utf-8")

    • To convert GBK-encoded data to UTF-8, first decode it from GBK to Unicode, then encode the Unicode string as UTF-8:

      gbk_to_utf8 = s_gbk.decode("gbk").encode("utf-8")

It is important to note that the result of encode() is of type bytes, not str.
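The conversions above can be combined into one runnable sketch that also checks the resulting types. It reuses the variable names from the list; the extra name `round_trip` is only illustrative.

```python
s = "unicode字符串"

s_gbk = s.encode("gbk")        # bytes in GBK
s_utf8 = s.encode("utf-8")     # bytes in UTF-8

# GBK -> Unicode (str) -> UTF-8
gbk_to_utf8 = s_gbk.decode("gbk").encode("utf-8")

# Decoding the bytes with the matching codec recovers the original str.
round_trip = s_utf8.decode("utf-8")

print(type(s).__name__, type(s_gbk).__name__)
print(len(s_gbk), len(s_utf8))
```

The two byte strings differ in length because each Chinese character here takes two bytes in GBK and three in UTF-8.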
