UTF-7 encoding and decoding rules

Author: [Xiao Yunfeng]

MIME (Multipurpose Internet Mail Extensions) does not define Unicode as a permitted character set, nor does it specify how Unicode should be encoded. Other encodings, such as UTF-8, have been used in messages, but they represent non-ASCII Unicode characters with byte values between 128 and 255, which makes them a poor fit for mail systems built around 7-bit US-ASCII.
Because many mail gateways and systems cannot correctly transmit eight-bit data, characters encoded with byte values above 127 may be corrupted or lost. UTF-7 uses only 7-bit values and never sets the highest bit, so UTF-7 encoded text can pass through these systems intact.
For some US-ASCII characters and for all characters outside US-ASCII, UTF-7 uses a shifted encoding, and it uses a reserved US-ASCII character as the shift character. The UTF-7 encoding and decoding rules are described below.

UTF-7 divides Unicode characters into three types for processing:

1. Directly encoded characters, that is, characters represented by their own US-ASCII codes. These are the uppercase and lowercase letters, the digits, and the following nine characters. (Note that "+" is not included.)
' ( ) , - . / : ?
2. Optional direct characters, which may be encoded directly. (Note that the characters \ and ~ are not included.)
! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
3. All Unicode characters other than those in types 1 and 2.
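
The three types above can be sketched as a small lookup. This is only an illustration: the names DIRECT, OPTIONAL_DIRECT, and classify are hypothetical and not part of any library. Note that "+" and the whitespace characters fall into the third type here; they are handled by separate rules.

```python
import string

# Type 1: directly encoded characters (letters, digits, and ' ( ) , - . / : ?).
DIRECT = set(string.ascii_letters + string.digits + "'(),-./:?")

# Type 2: optional direct characters (backslash and tilde are deliberately absent).
OPTIONAL_DIRECT = set('!"#$%&*;<=>@[]^_`{|}')


def classify(ch: str) -> str:
    """Return which of the three UTF-7 character types `ch` belongs to."""
    if ch in DIRECT:
        return "direct"
    if ch in OPTIONAL_DIRECT:
        return "optional"
    return "other"  # everything else, including "+", "\\", "~", and non-ASCII
```

For instance, classify("A") gives "direct", classify("!") gives "optional", and classify("+") gives "other".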

Encoding Rules for UTF-7

1. (Direct encoding) Characters of the first type are encoded directly as US-ASCII. Characters of the second type may be encoded either directly as US-ASCII or with the shifted encoding of rule 2. Note, however, that in mail headers some gateways may not handle second-type characters correctly if they are encoded directly.
2. (Unicode shifted encoding) Characters that belong to neither the first nor the second type, other than those covered by rules 3 and 4, must be encoded with the shifted encoding. "+" marks the start of a shifted sequence. The sequence ends at "-", or at any character that is not part of the Base64 alphabet, such as a carriage return, a line feed, or the end of the text; a terminating "-" is consumed by the decoder. The data between "+" and "-" is written in modified Base64 (a variant of Base64 without "=" padding).
For example, the string "A≠Α" (Unicode: 0041 2260 0391) is encoded as A+ImADkQ- (ASCII: 41 2B 49 6D 41 44 6B 51 2D).
3. The special character "+" is encoded as "+-" (hex 2B 2D). When a decoder meets "+-", the "-" (hex 2D) is absorbed and ignored, so "+-" decodes to "+" rather than "+-". Likewise, "+--" (hex 2B 2D 2D) decodes to "+-".
4. Space (decimal 32), tab (decimal 9), carriage return (decimal 13), and line feed (decimal 10) are encoded directly as US-ASCII.
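
The four rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name utf7_encode and the direct_optional flag are hypothetical, and the sketch always closes a shifted sequence with "-", which is always permitted even where it could be omitted.

```python
import base64
import string

DIRECT = set(string.ascii_letters + string.digits + "'(),-./:?")
OPTIONAL_DIRECT = set('!"#$%&*;<=>@[]^_`{|}')
WHITESPACE = set(" \t\r\n")


def _modified_base64(text: str) -> str:
    """Base64 of the UTF-16 big-endian bytes, with the '=' padding removed."""
    raw = text.encode("utf-16-be")
    return base64.b64encode(raw).decode("ascii").rstrip("=")


def utf7_encode(text: str, direct_optional: bool = True) -> str:
    """Encode `text` following rules 1 to 4 (hypothetical helper)."""
    plain = DIRECT | WHITESPACE            # rules 1 and 4
    if direct_optional:
        plain |= OPTIONAL_DIRECT           # rule 1, optional part
    out = []
    pending = []                           # characters awaiting one shifted run

    def flush() -> None:
        # Rule 2: emit the collected characters as "+<modified Base64>-".
        if pending:
            out.append("+" + _modified_base64("".join(pending)) + "-")
            pending.clear()

    for ch in text:
        if ch == "+":                      # rule 3: "+" becomes "+-"
            flush()
            out.append("+-")
        elif ch in plain:
            flush()
            out.append(ch)
        else:
            pending.append(ch)
    flush()
    return "".join(out)
```

With these definitions, utf7_encode("A\u2260\u0391") reproduces the A+ImADkQ- example, and utf7_encode("+") yields "+-". Python's built-in "utf-7" codec can be used to decode the result as a cross-check.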

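Modified Base64 encoding calculation

Each character is first represented in big-endian Unicode (UTF-16) form, two bytes per 16-bit unit. The resulting byte sequence is then Base64 encoded, except that "=" is not used for padding; instead, any leftover bits are filled with zeros up to the next 6-bit boundary.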
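
As a worked case, the characters U+2260 and U+0391 give the bytes 22 60 03 91 (hex), which is 32 bits. Splitting into 6-bit groups gives five full groups plus two leftover bits, which are filled with four zero bits; the six groups map to "ImADkQ". A minimal Python sketch of the calculation follows; the function names are hypothetical, and the standard base64 module does the bit grouping.

```python
import base64


def modified_base64_encode(text: str) -> str:
    """Encode `text` as UTF-16 big-endian, then Base64 without '=' padding."""
    raw = text.encode("utf-16-be")              # two bytes per 16-bit unit
    return base64.b64encode(raw).decode("ascii").rstrip("=")


def modified_base64_decode(data: str) -> str:
    """Reverse of modified_base64_encode for well-formed input."""
    padded = data + "=" * (-len(data) % 4)      # restore padding for b64decode
    raw = base64.b64decode(padded)
    raw = raw[: len(raw) - len(raw) % 2]        # ignore a stray odd trailing byte
    return raw.decode("utf-16-be")
```

Here modified_base64_encode("\u2260\u0391") returns "ImADkQ", and modified_base64_decode("ImADkQ") returns the original two characters.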

 
