1 String encoding (encode) formats
GB2312 GBK GB18030 UTF-8 ASCII
Commonly used encoding formats:
GB series: GB18030 (a superset of GBK, which is a superset of GB2312) (common on Windows)
International standard: UNICODE16 <---> UTF-8 (common on Linux/Mac OS X/iOS/Android)
How many bytes does one Chinese character take?
2 (GBK) / 3 (UTF-8)
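The byte counts above can be checked directly in Python. This is a minimal sketch; the sample character 中 is only an illustration and is not taken from the original notes.

```python
# Encode one common Chinese character with both encodings and compare sizes.
ch = "中"

gbk_bytes = ch.encode("gbk")      # GB-series encoding
utf8_bytes = ch.encode("utf-8")   # UTF-8 encoding

print(len(gbk_bytes), gbk_bytes.hex())    # 2 d6d0
print(len(utf8_bytes), utf8_bytes.hex())  # 3 e4b8ad
```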
1.1 GB2312-80 Encoding
Released in 1980
Two-byte encoding; encoding range A1A1-FEFE (first byte 0xA1-0xFE, second byte 0xA1-0xFE); contains 6,763 Chinese characters and 682 other symbols
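As a small illustration of the two-byte range, the sketch below encodes one character with Python's built-in gb2312 codec and checks that both bytes fall in 0xA1-0xFE. The helper name in_gb2312_range is hypothetical and exists only for this example.

```python
def in_gb2312_range(data: bytes) -> bool:
    """Return True if data is a single two-byte code with both bytes in 0xA1-0xFE."""
    return len(data) == 2 and all(0xA1 <= b <= 0xFE for b in data)


# "啊" is the first Chinese character in the GB2312 table.
encoded = "啊".encode("gb2312")

print(encoded.hex())             # b0a1
print(in_gb2312_range(encoded))  # True
```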
1.2 GBK Encoding
Developed in 1995
Two-byte encoding, range 8140~FEFE
Fully compatible with GB2312; contains a total of 21,003 Chinese characters (Chinese, Mongolian, etc.)
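The compatibility claim can be sketched in Python as follows. The sample characters are assumptions chosen for illustration: 中 is in GB2312, while the traditional form 龍 is expected to be missing from GB2312 but present in GBK.

```python
simplified = "中"   # present in GB2312
traditional = "龍"  # traditional form; not part of GB2312, but part of GBK

# A GB2312 character keeps exactly the same bytes under GBK.
same_bytes = simplified.encode("gb2312") == simplified.encode("gbk")

# A GBK-only character cannot be encoded with the gb2312 codec.
try:
    traditional.encode("gb2312")
    gb2312_ok = True
except UnicodeEncodeError:
    gb2312_ok = False

gbk_len = len(traditional.encode("gbk"))

print(same_bytes, gb2312_ok, gbk_len)
```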
1.3 GB18030-2005 Encoding
Released in 2005; contains a total of 27,533 Chinese characters. Characters are encoded with two bytes or four bytes (ASCII remains one byte); the two-byte codes are the same as in GBK.
Four bytes: omitted here
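A rough Python sketch of the two-byte versus four-byte split in GB18030. The rare character U+20000 is an assumption picked for illustration because it lies outside GBK; it does not appear in the original notes.

```python
common = "中"
rare = "\U00020000"  # U+20000, a rare character outside the GBK repertoire

two_byte = common.encode("gb18030")
four_byte = rare.encode("gb18030")

# The two-byte code is identical to the GBK code.
print(two_byte == common.encode("gbk"))  # True
# Characters outside GBK fall back to the four-byte form.
print(len(four_byte))                    # 4
# Decoding restores the original character.
print(four_byte.decode("gb18030") == rare)  # True
```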
1.4 UNICODE16 Encoding (0x0000 ~ 0xFFFF)
UTF-8 Encoding (8-bit Unicode Transformation Format)
UNICODE <<---->> UTF-8
0000~007F    one byte (ASCII)
0080~07FF    two bytes
0800~FFFF    three bytes (Chinese characters fall in this range)
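The range-to-length mapping can be expressed as a small function and compared with what Python's utf-8 codec produces. This is only a sketch: utf8_length is a hypothetical helper, it covers just the 0x0000~0xFFFF ranges listed here, and the sample characters are illustrative.

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 bytes for a code point in 0x0000~0xFFFF."""
    if code_point <= 0x007F:
        return 1  # ASCII
    if code_point <= 0x07FF:
        return 2
    if code_point <= 0xFFFF:
        return 3  # Chinese characters fall in this range
    raise ValueError("code point above 0xFFFF is not covered by this table")


for ch in ("A", "é", "中"):
    predicted = utf8_length(ord(ch))
    actual = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} predicted={predicted} actual={actual}")
```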
Unicode encodings: UNICODE16 (two bytes per character in the 0x0000 ~ 0xFFFF range, i.e. UTF-16) and UNICODE32 (four bytes per character, i.e. UTF-32).
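For comparison, the fixed-width forms can be inspected in Python with the utf-16-be and utf-32-be codecs. This is a minimal sketch; the big-endian variants are used on the assumption that the reader wants to see the raw code point without a byte-order mark, which the plain utf-16 and utf-32 codecs would prepend.

```python
ch = "中"  # code point U+4E2D

utf16 = ch.encode("utf-16-be")  # two bytes, big-endian, no byte-order mark
utf32 = ch.encode("utf-32-be")  # four bytes, big-endian, no byte-order mark

print(utf16.hex())  # 4e2d
print(utf32.hex())  # 00004e2d
```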
For further reading, see:
Python text and byte sequences
Python byte and byte array-pytips 0x08
Python Learning note 015--Chinese character coding