Tian haili @ csdn
2012-07-31
MIME (Multipurpose Internet Mail Extensions) is used when email is sent or transmitted over the network. Mail transport can only carry US-ASCII characters, so any other characters in a message must be encoded before transmission. For emails whose subject and/or attachment names contain Chinese characters, some email systems display garbled text because the encoding information (character encoding and transfer encoding) is missing. This article analyzes the two transfer encodings used by the email system in Android: Base64 and quoted-printable.
The subject and attachment names of an email indicate their transfer encoding and character encoding in a short format. The character encoding can be UTF-8, GB2312, and so on; the commonly used transfer encodings are Base64 and quoted-printable. This article focuses on the transfer encoding. For the Unicode side of character encoding, refer to "Unicode encoding and its implementation: UTF-16, UTF-8, and more".
I. Base64 encoding
Base64 encoding is widely used in network transmission. Base64 converts the content to be encoded into printable characters: 'A'~'Z', 'a'~'z', '0'~'9', '+' and '/', 64 in total, plus '=' as the padding character.
Index table (64 characters, so an index needs only 6 bits; the maximum index is 0x3F):
Index | Character | Index | Character | Index | Character | Index | Character
0     | A         | 17    | R         | 34    | i         | 51    | z
1     | B         | 18    | S         | 35    | j         | 52    | 0
2     | C         | 19    | T         | 36    | k         | 53    | 1
3     | D         | 20    | U         | 37    | l         | 54    | 2
4     | E         | 21    | V         | 38    | m         | 55    | 3
5     | F         | 22    | W         | 39    | n         | 56    | 4
6     | G         | 23    | X         | 40    | o         | 57    | 5
7     | H         | 24    | Y         | 41    | p         | 58    | 6
8     | I         | 25    | Z         | 42    | q         | 59    | 7
9     | J         | 26    | a         | 43    | r         | 60    | 8
10    | K         | 27    | b         | 44    | s         | 61    | 9
11    | L         | 28    | c         | 45    | t         | 62    | +
12    | M         | 29    | d         | 46    | u         | 63    | /
13    | N         | 30    | e         | 47    | v         |       |
14    | O         | 31    | f         | 48    | w         |       |
15    | P         | 32    | g         | 49    | x         |       |
16    | Q         | 33    | h         | 50    | y         |       |
The specific conversion rules are as follows:
1. Every 3 bytes are converted to 4 characters.
The three 8-bit bytes contain 24 bits; every 6 bits form an index into the Base64 table, and the encoded character is looked up by that index.
That is, a7...a0 b7...b0 c7...c0 -> a7...a2 a1a0b7...b4 b3...b0c7c6 c5...c0
a7..a2 indexes the first character in the table;
a1a0b7..b4 indexes the second character in the table;
b3..b0c7c6 indexes the third character in the table;
c5..c0 indexes the fourth character in the table.
2. A line break is added to the encoded output after every 76 characters.
3. A final group of fewer than 3 bytes needs special handling.
3.1 If two bytes remain unprocessed:
The two remaining bytes and 0x00 form one group, from which three character indexes are obtained. The last character is '='.
That is, a7...a0 b7...b0 0...0 -> a7...a2 a1a0b7...b4 b3...b0 00
a7..a2 indexes the first character in the table;
a1a0b7..b4 indexes the second character in the table;
b3..b0 00 indexes the third character in the table;
the fourth character is '='.
3.2 If one byte remains unprocessed:
The remaining byte and 0x0000 form one group, from which two character indexes are obtained. The last two characters are both '='.
That is, a7...a0 0...0 0...0 -> a7...a2 a1a0 0000
a7..a2 indexes the first character in the table;
a1a0 0000 indexes the second character in the table;
the third and fourth characters are '=' and '='.
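The rules above can be sketched in a few lines of Python. This is only an illustration of rule 1 (3 bytes to 4 characters) and rule 3 (padding with '='); the 76-character line break of rule 2 is left out, and the names ALPHABET and base64_encode are made up for this sketch rather than taken from Android or mime4j.

```python
# Base64 alphabet in index order: 'A'-'Z', 'a'-'z', '0'-'9', '+', '/'
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def base64_encode(data: bytes) -> str:
    """Hypothetical helper: encode bytes with the 3-to-4 rule and '=' padding."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)  # 0, 1 or 2 missing bytes in the last group
        # Fill the group up to 24 bits with zero bytes (0x00 or 0x0000)
        n = int.from_bytes(chunk + b"\x00" * pad, "big")
        # Split the 24 bits into four 6-bit indexes, most significant first
        indexes = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        chars = [ALPHABET[k] for k in indexes]
        # Characters that would come only from the zero fill become '='
        if pad:
            chars[4 - pad:] = "=" * pad
        out.append("".join(chars))
    return "".join(out)
```

For instance, base64_encode(b"Man") gives "TWFu", while the shorter inputs b"Ma" and b"M" give "TWE=" and "TQ==", which matches cases 3.1 and 3.2.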
II. Quoted-printable encoding
Quoted-printable encoding is relatively simple. Scan the content to be encoded and process each byte:
- If it is a space character (0x20), replace it with '_';
- If it is in [33, 127) and is not one of the special characters { = _ ? " # $ % & ' ( ) , . : ; < > @ [ \ ] ^ ` { | } ~ }, output the original character without any processing;
- For any other byte, output '=' followed by the byte's two-digit hexadecimal code.
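A minimal Python sketch of this per-byte scan follows. It assumes that a space is written as '_' (as in the "Q" encoding used for header fields) and that the special-character set is the one listed above; SPECIAL and q_encode are names invented for the sketch, not mime4j identifiers.

```python
# Characters in [33, 127) that are still escaped (assumed set, see lead-in)
SPECIAL = set("=_?\"#$%&'(),.:;<>@[\\]^`{|}~")


def q_encode(data: bytes) -> str:
    """Hypothetical helper: apply the three per-byte rules to `data`."""
    out = []
    for b in data:
        if b == 0x20:
            # Rule 1: a space becomes an underscore
            out.append("_")
        elif 33 <= b < 127 and chr(b) not in SPECIAL:
            # Rule 2: safe printable characters pass through unchanged
            out.append(chr(b))
        else:
            # Rule 3: everything else becomes '=' plus two uppercase hex digits
            out.append("=%02X" % b)
    return "".join(out)
```

With these rules q_encode(b"a b") yields "a_b", and the three UTF-8 bytes E5 90 95 yield "=E5=90=95".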
III. Format of email subject and attachment names
Besides the Base64 and quoted-printable encodings themselves, a format is needed to state which transfer encoding was used and which character encoding the encoded text is in.
The format of the subject and attachment names of an email: <prefix><charset>?<encodemode>?<encodedcontent><suffix>
Where,
- <prefix> is fixed as "=?";
- <charset> is the character encoding;
- <encodemode> is the transfer encoding: B represents Base64; Q represents quoted-printable;
- <encodedcontent> is the string, in the charset character encoding, encoded with encodemode;
- <suffix> is fixed as "?=".
For example, the name of the subject or attachment must be transmitted by email. The encoding process is as follows:
3.1.utf-8 encoding
E59095 e699b6 e699b6 6a6a392e6a7067 Lu Jing JJ 9. J P G
3.2 Base64 encoding
E59095 E699B6 E699B6 6A6A39 2E6A70 67    3-byte groups
E59095 -> 111001011001000010010101       binary
       -> 111001 011001 000010 010101    6-bit groups (binary)
       -> 57 25 2 21                     index (decimal)
       -> '5' 'Z' 'C' 'V'                encoded characters
E699B6 -> 111001101001100110110110       binary
       -> 111001 101001 100110 110110    6-bit groups (binary)
       -> 57 41 38 54                    index (decimal)
       -> '5' 'p' 'm' '2'                encoded characters
E699B6 -> 111001101001100110110110       binary
       -> 111001 101001 100110 110110    6-bit groups (binary)
       -> 57 41 38 54                    index (decimal)
       -> '5' 'p' 'm' '2'                encoded characters
6A6A39 -> 011010100110101000111001       binary
       -> 011010 100110 101000 111001    6-bit groups (binary)
       -> 26 38 40 57                    index (decimal)
       -> 'a' 'm' 'o' '5'                encoded characters
2E6A70 -> 001011100110101001110000       binary
       -> 001011 100110 101001 110000    6-bit groups (binary)
       -> 11 38 41 48                    index (decimal)
       -> 'L' 'm' 'p' 'w'                encoded characters
670000 -> 011001110000000000000000       binary
       -> 011001 110000 000000 000000    6-bit groups (binary)
       -> 25 48                          index (decimal)
       -> 'Z' 'w' '=' '='                encoded characters
Encoding Process:
- The content of the main content ( jj9.jpg) according to the three bytes of A group [line #1];
- Split each 6bits to obtain the index [line #3 & 4; line #7 & 8; line #11 & 12; line #15 & 16; line #19 & 20];
- The encoded characters [line #5; line #9; line #13; line #7; line #21] are obtained through the index query table.
- Process the last byte [line #22 ~ #25].
Therefore, we get base64 encoding [line #5; line #9; line #13; line #7; line #21]:
5zcv5pm25pm2amo5lmpwzw =
3.3 Final Base64 encoding result
Add the prefix, character encoding, transfer encoding, and suffix according to the format:
=?UTF-8?B?5ZCV5pm25pm2amo5LmpwZw==?=
3.4 Quoted-printable encoding result
If quoted-printable encoding is used for transmission, the result is:
=?UTF-8?Q?=E5=90=95=E6=99=B6=E6=99=B6jj9.jpg?=
The encoding process is relatively simple. You can work through it yourself by following the quoted-printable rules in the second part.
4. Email-related implementation in Android
Android's native email implementation uses the third-party open-source package mime4j to encode and decode Base64 and quoted-printable. Specifically, every Base64/quoted-printable encoded field can be decoded, but when an email is sent, only the subject is encoded; the attachment name is not encoded. This is what causes garbled Chinese attachment names.
Both transfer encoding and decoding go through com.android.email.mail.internet.MimeUtility, which calls org.apache.james.mime4j.decoder.DecoderUtil or org.apache.james.mime4j.codec.EncoderUtil.
4.1 Decoding
com.android.email.mail.internet.MimeUtility has several static methods related to decoding:
public static String unfoldAndDecode(String s);
public static String unfold(String s);
public static String decode(String s);
unfoldAndDecode combines the unfold and decode operations. unfold removes the CRLF from the encoded content; decode is the actual decoding implementation.
decode calls org.apache.james.mime4j.decoder.DecoderUtil#decodeEncodedWords().
decodeEncodedWords() determines the transfer encoding and then either performs Base64 decoding through decodeB() or performs quoted-printable decoding through decodeQ().
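The dispatch described here can be sketched in Python as follows. This is an illustration of the idea only, not a port of mime4j: the regular expression and the functions decode_b, decode_q and decode_encoded_words are written for this sketch, and they assume well-formed input in the "=?charset?B-or-Q?content?=" format.

```python
import base64
import re

# =?<charset>?<B or Q>?<encoded content>?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([bBqQ])\?([^?]*)\?=")


def decode_b(content: str) -> bytes:
    """Base64 transfer decoding."""
    return base64.b64decode(content)


def decode_q(content: str) -> bytes:
    """Quoted-printable (header form): '_' is a space, '=XX' is one byte."""
    raw = content.replace("_", " ").encode("ascii")
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), raw)


def decode_encoded_words(text: str) -> str:
    """Replace every encoded word in `text` with its decoded form."""
    def replace(match):
        charset, mode, content = match.groups()
        # Pick the transfer decoding from the B/Q marker
        raw = decode_b(content) if mode.upper() == "B" else decode_q(content)
        # Then undo the character encoding
        return raw.decode(charset)
    return ENCODED_WORD.sub(replace, text)
```

Text that contains no encoded word passes through unchanged, which mirrors how a plain ASCII subject needs no decoding.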
4.2 Encoding
com.android.email.mail.internet.MimeUtility has several static methods related to encoding:
public static String foldAndEncode(String s);
public static String foldAndEncode2(String s, int usedCharacters);
public static String fold(String s, int usedCharacters);
foldAndEncode performs no operation; foldAndEncode2 is where encoding is actually implemented. foldAndEncode2 uses org.apache.james.mime4j.codec.EncoderUtil#encodeIfNecessary.
4.2.1 Whether encoding is required
Encoding makes the string longer, so it is skipped when it is not needed. EncoderUtil#hasToBeEncoded() analyzes the original string to determine whether encoding is required.
- If the string contains only printable characters, no encoding is required;
- If the string contains a control character or a character greater than 127, it must be encoded.
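As a rough illustration of this check (a simplified sketch of the two rules as stated here, not the mime4j source), a Python version could look like this. The function name has_to_be_encoded is chosen for the sketch, and DEL (127) is treated as a control character.

```python
def has_to_be_encoded(text: str) -> bool:
    """Return True if `text` contains a control or non-ASCII character."""
    for ch in text:
        code = ord(ch)
        # Below 32: control characters; 127 and above: DEL and non-ASCII
        if code < 32 or code >= 127:
            return True
    return False
```

A plain name such as "jj9.jpg" returns False, while any name containing Chinese characters returns True.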
4.2.2 Encoding selection
Encoding selection covers both the character encoding and the transfer encoding.
The character encoding is chosen by EncoderUtil#determineCharset:
- If any character in the string to be encoded has a Unicode code point greater than 0xFF, UTF-8 is used;
- Otherwise, if any character has a code point greater than 0x7F, ISO-8859-1 is used;
- Otherwise, US-ASCII is used.
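The three-way choice can be written as a short Python sketch. It follows the rules as listed and returns the charset name as a string; determine_charset is a name picked for the sketch.

```python
def determine_charset(text: str) -> str:
    """Pick the narrowest charset that can represent every character."""
    # Highest code point in the string (0 for an empty string)
    highest = max((ord(ch) for ch in text), default=0)
    if highest > 0xFF:
        return "UTF-8"
    if highest > 0x7F:
        return "ISO-8859-1"
    return "US-ASCII"
```

So a Chinese file name selects UTF-8, a name such as "café" selects ISO-8859-1, and "jj9.jpg" stays US-ASCII.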
The transfer encoding is chosen by EncoderUtil#determineEncoding.
determineEncoding looks at the proportion of characters in the string that would have to be escaped under quoted-printable encoding. Quoted-printable is used only when that proportion is lower than 30%; otherwise Base64 is used.
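A hedged Python sketch of this selection is shown below. It assumes "needs escaping" means the byte is neither a space nor a safe printable character under the quoted-printable rules of part II, and it applies the threshold exactly as worded here (below 30% picks Q); how mime4j treats the boundary value itself is not covered by this sketch. All names are invented for the illustration.

```python
# Printable characters that quoted-printable still escapes (assumed set)
SPECIAL = set("=_?\"#$%&'(),.:;<>@[\\]^`{|}~")


def needs_escape(b: int) -> bool:
    """True if byte `b` would be written as '=XX' in quoted-printable."""
    if b == 0x20:
        return False  # a space becomes '_', which is not an escape
    return not (33 <= b < 127) or chr(b) in SPECIAL


def determine_encoding(data: bytes) -> str:
    """Return 'Q' when few bytes need escaping, otherwise 'B'."""
    if not data:
        return "Q"
    escaped = sum(1 for b in data if needs_escape(b))
    return "Q" if escaped / len(data) < 0.30 else "B"
```

Mostly-ASCII text therefore stays readable as quoted-printable, while text dominated by multi-byte characters goes to Base64, which is more compact in that case.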
4.2.3 Encoding implementation
Base64 encoding is performed by encodeB(); quoted-printable encoding is performed by encodeQ().
4.3 Solving the problem by adding encoding information
In the Android email implementation:
- The subject, attachment names, and other fields of a received email are decoded;
- When an email is sent or saved, only the subject is encoded; the attachment name is not encoded.
As a result, recipients see garbled attachment names when they receive emails with Chinese attachment names sent from the Android email client. The solution is to encode the attachment name, as described in the previous sections, when sending or saving the email.
5. Pending issues
The solution of 4.4 can solve the problem of sending new emails, but their attachment names are garbled for existing emails. In addition, if an unencoded email is received by another mail client (such as outlook), the name of the attachment can be correctly parsed. This means that even if no encoding is performed or the specified encoding format is specified, the client can also be decoded. I did not understand how to implicitly encode or decode the Code through the experiment. If you know how to implement it, I hope you will not be enlightened!
The following is the attachment name sent by the android emailclient, called jj9.jpg. I don't know how to compile/decode the name of the received attachment?
The name of the UTF-8 to send
E59095 e699b6 e699b6 6a6a392e6a7067 Lu Jing JJ 9. J P G
Received name (what kind of code is this? The following hexadecimal code is captured from the attachment name of the received email. Who knows the encoding principle !)
C3a5c290c295 c3a6c299c2b6 c3a6c299c2b6 6a6a392e6a7067 Lu Jingjing J 9. J P G