It has been a month since I opened this blog and I have not published a single article. Here I summarize what I have learned recently. If you find any mistakes, please correct me!
Sohu Weibo
MSG decoding flow (mixed Chinese and English):
1. Encode the Unicode content as GB18030 bytes.
2. Decode those GB18030 bytes as UTF-8.
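One possible reading of the two steps above is a mojibake repair: UTF-8 bytes were mistakenly decoded as GB18030, so re-encoding to GB18030 recovers the original bytes, which can then be decoded as UTF-8. The sketch below assumes that reading; the function name repair_msg is hypothetical and this is not a confirmed description of how Sohu Weibo works.

```python
def repair_msg(garbled: str) -> str:
    """Hypothetical helper: undo 'UTF-8 bytes read as GB18030'."""
    # Step 1: turn the (garbled) Unicode string back into GB18030 bytes.
    raw = garbled.encode("gb18030")
    # Step 2: interpret those bytes as UTF-8.
    return raw.decode("utf-8")


# Simulate the damage, then repair it.
original = "你好"
garbled = original.encode("utf-8").decode("gb18030")
print(repair_msg(garbled) == original)
```

The round trip only works when every byte pair of the UTF-8 data happens to be a valid GB18030 sequence, so real input may need error handling.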
MSG decoding flow (English only):
URL-decode the content directly.
Ordinary Weibo
MSG decoding flow (all types):
Decode the content directly with urldecode.
urlencode encodes content as follows:
1. Digits and letters remain unchanged (as do a few characters such as "-", "_", and ".").
2. A space is changed to "+".
3. Every other byte is encoded as "%" followed by its two-digit hexadecimal value.
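The rules above can be tried out with Python's standard urllib.parse module, whose quote_plus and unquote_plus follow the same scheme. This is only an illustration; the sample strings are made up, and the charset used before percent-encoding (UTF-8 or GB18030 here) is an assumption that depends on the site.

```python
from urllib.parse import quote_plus, unquote_plus

# Letters and digits are kept, the space becomes "+",
# and "&" / "=" become "%" plus their hexadecimal value.
encoded = quote_plus("a b&c=1")
print(encoded)                # a+b%26c%3D1
print(unquote_plus(encoded))  # a b&c=1

# Non-ASCII text is first turned into bytes, then each byte is %XX-encoded.
utf8_form = quote_plus("中", encoding="utf-8")
gb_form = quote_plus("中", encoding="gb18030")
print(utf8_form)  # %E4%B8%AD
print(gb_form)    # %D6%D0

# Decoding must use the same charset that was used for encoding.
print(unquote_plus(gb_form, encoding="gb18030"))
```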
Escape encoding:
All spaces, punctuation marks, special characters, and other non-ASCII characters are converted to the form %XX, where XX is the hexadecimal value of the character in the character set table. Characters with a value above 255 are written as %uXXXX instead.
The character set used is ISO Latin-1.
The ISO Latin-1 character set is a subset of the Unicode character set. It corresponds to the first 256 entries of the Unicode character table (IE4+).
Unicode characters were originally double-byte (16 bits) and can represent symbols from any language, while the Latin-1 character set is single-byte (8 bits) and can only represent English and Western European characters.
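A minimal Python sketch of this escape scheme, modeled on JavaScript's escape() function (an assumption, since the text does not name the function). The name js_escape is hypothetical. Values up to 0xFF become %XX and larger 16-bit values become %uXXXX; the set of characters left untouched is the one JavaScript uses.

```python
import string

# Characters that JavaScript's escape() leaves unchanged.
SAFE = set(string.ascii_letters + string.digits + "@*_+-./")


def js_escape(text: str) -> str:
    """Hypothetical re-implementation of escape() for illustration."""
    out = []
    # Work on 16-bit units, as the original double-byte Unicode model does.
    data = text.encode("utf-16-be")
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        ch = chr(unit)
        if ch in SAFE:
            out.append(ch)
        elif unit < 256:
            # Latin-1 range: single byte, two hex digits.
            out.append("%{:02X}".format(unit))
        else:
            # Outside Latin-1: four hex digits with a "u" prefix.
            out.append("%u{:04X}".format(unit))
    return "".join(out)


print(js_escape("a b"))  # a%20b
print(js_escape("é"))    # %E9
print(js_escape("中"))   # %u4E2D
```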
How many bytes does a Chinese character occupy?
Unicode
The original Unicode encoding was fixed length: 16 bits, that is, two bytes per character.
Unicode is only an encoding specification. In practice there are three main implementations of it: UTF-8, UCS-2, and UTF-16.
These three forms can be converted to one another according to the specification.
UTF-8
3 bytes
Chinese characters are double-byte in the GB family but take 3 bytes in UTF-8, where the encoding pattern is 1110xxxx 10xxxxxx 10xxxxxx.
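The three-byte pattern can be checked by printing the bits of an encoded character. The helper name utf8_bits below is made up for this illustration.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated 8-bit binary strings."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


# "中" is U+4E2D; its 16 bits are spread over the x positions of
# 1110xxxx 10xxxxxx 10xxxxxx.
print(utf8_bits("中"))  # 11100100 10111000 10101101
print(utf8_bits("A"))   # 01000001 (ASCII stays one byte)
```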
GB2312
2 bytes
In this character set, the internal code of a Chinese character is two bytes long.
GB18030
2 bytes (4 bytes for some rare characters)
Common Chinese characters take two bytes, as elsewhere in the GB family. Characters outside the two-byte range are encoded in four bytes.
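Strictly speaking, GB18030 is a variable-width encoding: ASCII takes one byte, common Chinese characters take two, and characters outside the two-byte area take four. A short check, using U+20000 as an arbitrarily chosen rare character:

```python
common = "中"          # everyday character, inside the two-byte area
rare = "\U00020000"    # CJK Extension B character, outside the two-byte area

common_len = len(common.encode("gb18030"))
rare_len = len(rare.encode("gb18030"))
ascii_len = len("a".encode("gb18030"))

print(ascii_len, common_len, rare_len)  # 1 2 4
```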
In addition, the byte counts of English letters and Chinese characters are summarized below.
Number of bytes of English letters and Chinese characters in different character encodings
English letters:
Number of bytes: 1; encoding: GB2312
Number of bytes: 1; encoding: GBK
Number of bytes: 1; encoding: GB18030
Number of bytes: 1; encoding: ISO-8859-1
Number of bytes: 1; encoding: UTF-8
Number of bytes: 4; encoding: UTF-16 (2-byte byte order mark plus 2 bytes for the letter)
Number of bytes: 2; encoding: UTF-16BE
Number of bytes: 2; encoding: UTF-16LE
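Chinese characters:
Number of bytes: 2; encoding: GB2312
Number of bytes: 2; encoding: GBK
Number of bytes: 2; encoding: GB18030
Number of bytes: 1; encoding: ISO-8859-1 (the character cannot be represented and is replaced by "?")
Number of bytes: 3; encoding: UTF-8
Number of bytes: 4; encoding: UTF-16 (2-byte byte order mark plus 2 bytes for the character)
Number of bytes: 2; encoding: UTF-16BE
Number of bytes: 2; encoding: UTF-16LE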
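The figures above can be reproduced with a few lines of Python. This is only a sketch: the helper name byte_count and the sample characters "a" and "中" are my own choices, and errors="replace" is used so that ISO-8859-1 substitutes "?" for the Chinese character instead of raising an error.

```python
ENCODINGS = [
    "gb2312", "gbk", "gb18030", "iso-8859-1",
    "utf-8", "utf-16", "utf-16-be", "utf-16-le",
]


def byte_count(text: str, encoding: str) -> int:
    """Number of bytes text occupies in the given encoding."""
    # Unrepresentable characters are replaced by "?" rather than failing.
    return len(text.encode(encoding, errors="replace"))


for label, sample in (("English letters", "a"), ("Chinese characters", "中")):
    print(label + ":")
    for enc in ENCODINGS:
        print(f"Number of bytes: {byte_count(sample, enc)}; encoding: {enc}")
```

Plain "utf-16" reports 4 bytes for a single character because Python, like Java, prepends a 2-byte byte order mark; the BE and LE variants do not.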
That is what I have learned about Sohu Weibo's decoding methods and the common character sets.