Character encoding
When the Python interpreter loads the code in a .py file, it decodes the file's contents using a source encoding (ASCII by default in Python 2, UTF-8 by default in Python 3).
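To illustrate how the interpreter picks a source encoding, here is a minimal sketch that compiles a tiny program supplied as GBK bytes, with an encoding declaration comment on its first line. The sample program and the names `source`, `namespace`, and `text` are made up for this example.

```python
# A two-line program stored as GBK bytes. The first line declares the
# source encoding so the interpreter knows how to decode the rest.
source = '# -*- coding: gbk -*-\ntext = "中文"\n'.encode("gbk")

namespace = {}
# compile() accepts bytes and honors the encoding declaration.
exec(compile(source, "<example>", "exec"), namespace)

print(namespace["text"])
```

Without the declaration, Python 3 would try to decode the bytes as UTF-8 and reject them.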
ASCII code
ASCII (American Standard Code for Information Interchange) is a computer character encoding based on the Latin alphabet, used mainly to represent modern English and other Western European languages. Each character occupies at most 8 bits (one byte), so a single byte can distinguish at most 256 values; standard ASCII uses only 7 of those bits and defines 128 characters.
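A short sketch of what the ASCII limit means in practice, assuming Python 3. The variable names are chosen only for this illustration.

```python
# "A" is inside the 7-bit ASCII range, so it encodes to a single byte.
code = ord("A")
ascii_bytes = "A".encode("ascii")
print(code, ascii_bytes)

# A Chinese character is outside the ASCII range and cannot be encoded.
try:
    "中".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```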
Handling of Chinese
GB 2312 is a coding standard for information exchange in Chinese character processing and Chinese communication systems. It is used in mainland China and Singapore, and almost all Chinese-language systems and internationalized software in mainland China support it.
The basic set contains 6,763 Chinese characters and 682 non-Chinese graphic characters. The whole character set is divided into 94 zones, each with 94 positions. Each position holds exactly one character, so a character can be encoded by its zone number and position number; this is called the zone-position code.
Converting the zone-position code to hexadecimal and adding 2020H gives the GB (national standard) code. Adding 8080H to the GB code gives the internal code commonly used in computers. The Chinese Character Internal Code Extension Specification (GBK) was issued in 1995. GBK is compatible with the GB 2312-1980 national standard and supports, at the character-repertoire level, all the Chinese, Japanese, and Korean (CJK) ideographs of ISO/IEC 10646-1 and GB 13000-1, 20,902 characters in total.
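The arithmetic above can be sketched as follows. The helper `quwei_to_internal` is a hypothetical name for this illustration, and the example character "啊" (zone 16, position 1) is chosen here, not taken from the text.

```python
def quwei_to_internal(zone, position):
    """Turn a zone-position code into the GB code and the internal code."""
    # Add 20H to each byte (2020H overall) to get the GB code.
    gb_code = ((zone + 0x20) << 8) | (position + 0x20)
    # Add 8080H to get the internal code actually stored in files.
    internal_code = gb_code + 0x8080
    return gb_code, internal_code

gb_code, internal_code = quwei_to_internal(16, 1)
print(hex(gb_code))        # 0x3021
print(hex(internal_code))  # 0xb0a1

# The internal code matches the bytes Python's gb2312 codec produces.
decoded = internal_code.to_bytes(2, "big").decode("gb2312")
print(decoded)             # 啊
```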
The emergence of Unicode
Unicode (also called unified code, universal code, or single code) is an industry standard in computing that covers a character set, encoding schemes, and more. Unicode was created to overcome the limitations of traditional character encodings: it assigns a uniform, unique code point to every character in every language, to meet the needs of cross-language, cross-platform text conversion and processing.
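As a small illustration of "one unique number per character", the sketch below looks up the code point of a sample character with Python 3's built-in functions. The character "中" is just an example chosen here.

```python
import unicodedata

ch = "中"
code_point = ord(ch)         # character -> code point
print(hex(code_point))       # 0x4e2d
round_trip = chr(code_point) # code point -> character
print(round_trip == ch)      # True
name = unicodedata.name(ch)  # official Unicode character name
print(name)                  # CJK UNIFIED IDEOGRAPH-4E2D
```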
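UTF-8 encoding format
In UTF-8, each commonly used Chinese character takes three bytes (rarer characters outside the Basic Multilingual Plane take four).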
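The byte counts can be checked directly in Python 3. The sample characters below are picked for this sketch only: an ASCII letter, an accented Latin letter, a common Chinese character, and a rarer ideograph written as an escape.

```python
samples = ["A", "é", "中", "\U00020000"]

# Map each character to the number of bytes it needs in UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s)")
```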
How do I get the default encoding on the current system?
import sys
print(sys.getdefaultencoding())
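Python exposes more than one "default" encoding, and they can differ from each other. A hedged sketch for comparing them, assuming Python 3:

```python
import locale
import sys

# Default used when converting between str and bytes.
default_encoding = sys.getdefaultencoding()
# Encoding used for file names on this system.
filesystem_encoding = sys.getfilesystemencoding()
# Locale-based encoding, which depends on the system settings.
locale_encoding = locale.getpreferredencoding(False)

print(default_encoding, filesystem_encoding, locale_encoding)
```

Only the first value is fixed by Python 3 itself; the other two depend on the platform and its configuration.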
Character transcoding
All strings in Python 3 are Unicode, so a string only needs to be encoded; it never has to be decoded to Unicode first.
To convert the string to GBK encoding:
s = "unicode字符串"
s_gbk = s.encode("gbk")
To convert the string to UTF-8 encoding:
s_utf8 = s.encode("utf-8")
To convert GBK-encoded data to UTF-8, first decode the GBK bytes to Unicode, then encode the Unicode string as UTF-8:
gbk_to_utf8 = s_gbk.decode("gbk").encode("utf-8")
Note that the result of encode is a bytes object, not a string.
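Putting the transcoding steps together, here is a minimal sketch (Python 3 assumed) that checks the types and the round trip. The `wrong_codec_failed` flag is introduced here only to show what happens when bytes are decoded with the wrong codec.

```python
s = "unicode字符串"

s_gbk = s.encode("gbk")      # str -> GBK bytes
s_utf8 = s.encode("utf-8")   # str -> UTF-8 bytes
print(type(s_gbk))           # <class 'bytes'>
print(len(s_gbk), len(s_utf8))  # 13 16

# GBK bytes -> Unicode str -> UTF-8 bytes
gbk_to_utf8 = s_gbk.decode("gbk").encode("utf-8")
print(gbk_to_utf8 == s_utf8)    # True

# Decoding with the wrong codec fails instead of silently converting.
try:
    s_gbk.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)       # True
```

The two byte strings differ in length because GBK stores each Chinese character in two bytes while UTF-8 uses three.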