Author: [Xiao Yunfeng]
MIME (Multipurpose Internet Mail Extensions) does not define Unicode as a permitted character set, nor does it specify how Unicode should be encoded. Other encodings such as UTF-8 have been used in messages, but they represent Unicode characters with byte values between 128 and 255, which works poorly in environments that only handle 7-bit US-ASCII reliably.
Many mail gateways and systems cannot correctly transmit 8-bit codes, so characters that rely on the high bit (the extended range beyond US-ASCII) may be lost or corrupted. UTF-7 uses only 7 bits per byte and leaves the highest bit unused, so UTF-7 encoded text can pass through these systems intact.
For some US-ASCII characters and for all characters outside US-ASCII, UTF-7 uses a shifted encoding, with reserved US-ASCII characters serving as shift characters. The UTF-7 encoding and decoding rules are described below.
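The 7-bit property can be checked with a short Python sketch. It relies on Python's built-in "utf-8" and "utf-7" codecs; the helper name is_seven_bit and the sample string are illustrative assumptions, not part of the original article.

```python
# Sample text: LATIN CAPITAL A, NOT EQUAL TO, GREEK CAPITAL ALPHA.
text = "A\u2260\u0391"

utf8_bytes = text.encode("utf-8")
utf7_bytes = text.encode("utf-7")


def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte has its highest bit clear (value below 128)."""
    return all(b < 0x80 for b in data)


# UTF-8 uses byte values of 128 and above for the non-ASCII characters.
print("UTF-8:", [hex(b) for b in utf8_bytes], "7-bit clean:", is_seven_bit(utf8_bytes))
# UTF-7 stays entirely inside the 7-bit US-ASCII range.
print("UTF-7:", utf7_bytes, "7-bit clean:", is_seven_bit(utf7_bytes))
```

The same text survives a round trip through the "utf-7" codec, which is what makes the encoding usable on 7-bit transports.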
UTF-7 divides Unicode characters into three types for processing:
1. Directly encoded characters, which are represented by their US-ASCII values. These are the uppercase and lowercase letters, the digits, and the following nine characters (note that "+" is not included):
' ( ) , - . / : ?
2. Optional direct characters, which may be encoded directly (note that "\" and "~" are not included):
! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
3. All other Unicode characters, that is, those not covered by types 1 and 2.
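The three types can be expressed as a small lookup. This is a minimal sketch; the names DIRECT, OPTIONAL_DIRECT, and classify are hypothetical and chosen only for illustration.

```python
import string

# Type 1: letters, digits, and the nine punctuation characters ' ( ) , - . / : ?
DIRECT = set(string.ascii_letters + string.digits + "'(),-./:?")

# Type 2: optional direct characters (backslash and tilde are deliberately absent).
OPTIONAL_DIRECT = set('!"#$%&*;<=>@[]^_`{|}')


def classify(ch: str) -> int:
    """Return 1, 2, or 3 according to the three character types."""
    if ch in DIRECT:
        return 1
    if ch in OPTIONAL_DIRECT:
        return 2
    # Everything else, including "+", "\", "~", and all non-ASCII characters.
    return 3


for ch in ["A", "?", "!", "+", "~", "\u2260"]:
    print(repr(ch), "-> type", classify(ch))
```

Note that "+" lands in type 3 here only because it belongs to neither list; the encoding rules treat it as a special case.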
Encoding Rules for UTF-7
1. (Direct encoding) Characters of the first type are encoded directly as US-ASCII. Characters of the second type may be encoded either directly as US-ASCII or in the shifted form. Note, however, that in mail headers some gateways may not read second-type characters correctly if they are encoded directly.
2. (Unicode shifted encoding) All characters other than "+" and the first and second types must be encoded in the shifted form. "+" marks the start of a shifted sequence. The sequence ends at "-", at a carriage return or line feed (more generally, at any character outside the Base64 alphabet), or at the end of the text. The characters between "+" and "-" are the modified Base64 encoding of the Unicode data.
For example, the string "A≠Α" (Unicode: 0041 2260 0391) is encoded as A+ImADkQ- (ASCII: 41 2B 49 6D 41 44 6B 51 2D).
3. The special character "+" is encoded as "+-" (hex 2B 2D). When the decoder meets "+-", the "-" is absorbed and ignored, so "+-" decodes to "+" rather than "+-". Likewise, "+--" (hex 2B 2D 2D) decodes to "+-".
4. Space (decimal 32), tab (decimal 9), carriage return (decimal 13), and line feed (decimal 10) are encoded directly as US-ASCII.
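The four rules can be put together in a short encoder. This is a sketch under stated assumptions rather than a reference implementation: the function name utf7_encode and the direct_optional flag are hypothetical, every shifted sequence is closed with an explicit "-" (simpler than the minimum the rules require), and consecutive shifted characters are grouped into a single sequence.

```python
import base64
import string

DIRECT = set(string.ascii_letters + string.digits + "'(),-./:?")  # rule 1, type 1
OPTIONAL_DIRECT = set('!"#$%&*;<=>@[]^_`{|}')                      # rule 1, type 2
WHITESPACE = set(" \t\r\n")                                        # rule 4


def utf7_encode(text: str, direct_optional: bool = True) -> str:
    """Encode text as UTF-7, always terminating shifted sequences with '-'."""
    plain = DIRECT | WHITESPACE
    if direct_optional:
        plain |= OPTIONAL_DIRECT

    out = []
    pending = []  # characters waiting to go into one shifted sequence

    def flush() -> None:
        # Rule 2: "+" opens the sequence, "-" closes it, and the body is
        # Base64 of the big-endian 16-bit units with "=" padding removed.
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            body = base64.b64encode(raw).decode("ascii").rstrip("=")
            out.append("+" + body + "-")
            pending.clear()

    for ch in text:
        if ch == "+":
            flush()
            out.append("+-")  # rule 3
        elif ch in plain:
            flush()
            out.append(ch)    # rules 1 and 4
        else:
            pending.append(ch)
    flush()
    return "".join(out)


print(utf7_encode("A\u2260\u0391"))  # the example string from rule 2
print(utf7_encode("1+1"))
```

Passing direct_optional=False sends type 2 characters through the shifted form instead, which is the safer choice for mail headers according to rule 1. The output can be cross-checked by decoding it with Python's built-in "utf-7" codec.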
Modified Base64 Encoding Calculation
Each character is written as two bytes in Unicode big-endian order, and the Base64 encoding of the resulting byte sequence is computed, except that "=" is not used for padding.
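The calculation can be followed bit by bit. The sketch below spells out the 6-bit grouping by hand instead of calling a library routine; the names B64_ALPHABET and modified_base64 are illustrative assumptions.

```python
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def modified_base64(text: str) -> str:
    """Base64 of the big-endian 16-bit form of text, without '=' padding."""
    data = text.encode("utf-16-be")                 # two bytes per 16-bit unit
    bits = "".join(f"{byte:08b}" for byte in data)  # one long bit string
    bits += "0" * (-len(bits) % 6)                  # zero-fill the last 6-bit group
    groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]
    return "".join(B64_ALPHABET[int(group, 2)] for group in groups)


# U+2260 U+0391 -> bytes 22 60 03 91 -> 32 bits, padded to 36 -> six symbols.
print(modified_base64("\u2260\u0391"))
```

For the two characters 2260 and 0391 this yields the six symbols ImADkQ, which is the body of the shifted sequence in A+ImADkQ-.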