MIME is short for Multipurpose Internet Mail Extensions. Before MIME, mail encoding methods such as uuencode were used; because the MIME encodings are simple and easy to extend, they have become the mainstream way to encode mail. MIME is used not only to transmit 8-bit characters but also binary files, such as images and audio in email attachments, and many other applications have been built on it.
When a text or HTML part is sent by email, the content is first converted to a byte string using a specified character encoding; that byte string is then converted to another byte string using a specified Content-Transfer-Encoding. For example, open the source of an email and you will see something like this:
Content-Type: text/plain; charset="gb2312"
Content-Transfer-Encoding: base64
SBG+qcrquqo17cf4yee74bgjz9w7+b3wudza7dbq0mqnc1_kvpkzxqo6uqo17cnnsapw0ndedqoncg=
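The two-step conversion described above can be sketched in Python. This is a minimal illustration, not a full MIME implementation: the function names `encode_text_part` and `decode_text_part` are hypothetical, and the sample text "中" is chosen only because its GB2312 bytes (D6 D0) are short enough to check by hand.

```python
import base64


def encode_text_part(text: str, charset: str = "gb2312") -> str:
    """Hypothetical helper: text -> bytes (charset) -> base64 ASCII."""
    # Step 1: the character encoding turns the text into a byte string.
    raw = text.encode(charset)
    # Step 2: the transfer encoding turns those bytes into 7-bit-safe ASCII.
    return base64.b64encode(raw).decode("ascii")


def decode_text_part(body: str, charset: str = "gb2312") -> str:
    """Hypothetical helper: reverse the two steps in the opposite order."""
    raw = base64.b64decode(body)
    return raw.decode(charset)


body = encode_text_part("中")
print(body)                    # 1tA=
print(decode_text_part(body))  # 中
```

A receiver needs both header values to undo the conversion: Content-Transfer-Encoding to recover the bytes, and charset to turn the bytes back into text.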
The most commonly used content transfer encodings are base64 and quoted-printable (QP); another is BinHex, which is used mostly on the Mac. For binary files or Chinese text, base64 produces a shorter byte string than quoted-printable; for English text, quoted-printable produces the shorter one. Base64 re-encodes every byte (including ASCII), while QP escapes only non-ASCII bytes and a few special characters such as "=".
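The size trade-off can be checked with Python's standard `base64` and `quopri` modules. The helper name `encoded_sizes` and the sample strings are assumptions made for this sketch; the inputs are kept short so that line wrapping does not affect the counts.

```python
import base64
import quopri


def encoded_sizes(raw: bytes) -> tuple:
    """Return (base64 length, quoted-printable length) for the given bytes."""
    b64 = base64.b64encode(raw)
    qp = quopri.encodestring(raw)
    return len(b64), len(qp)


english = "Hello, MIME world".encode("ascii")  # 17 ASCII bytes
chinese = "中文".encode("gb2312")               # 4 non-ASCII bytes

print(encoded_sizes(english))  # (24, 17): QP leaves plain ASCII unchanged
print(encoded_sizes(chinese))  # (8, 12): QP spends 3 characters per byte
```

Base64 always costs about 4 characters per 3 bytes, while QP costs 1 character for an unescaped byte and 3 for an escaped one, which is why the better choice depends on the content.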
For mail headers such as the subject, MIME uses a more compact format to mark the character encoding and the transfer encoding. For example, if the subject is "中" (the Chinese character for "middle"), it appears in the mail source as:
Subject: =?gb2312?B?1tA=?=
Where:
The part between the first "=?" and the next "?" specifies the character encoding. In this example, gb2312 is specified.
The "B" between the following two "?" marks stands for base64. A "Q" would indicate quoted-printable.
The part between the last "?" and "?=" is the header content after it has been converted to a byte string with gb2312 and then encoded with base64.
If the transfer encoding is changed to quoted-printable, the same subject "中" appears in the mail source as:
Subject: =?gb2312?Q?=D6=D0?=
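The header format above can be reproduced with a short Python sketch. The helpers `encoded_word` and `decode_subject` are hypothetical names introduced here; decoding relies on the standard library's `email.header.decode_header`. The "Q" branch is deliberately simplified and escapes every byte, whereas real encoders may leave safe ASCII characters unescaped.

```python
import base64
from email.header import decode_header


def encoded_word(text: str, charset: str, encoding: str) -> str:
    """Build =?charset?encoding?payload?= for encoding "B" or "Q"."""
    raw = text.encode(charset)
    if encoding == "B":
        payload = base64.b64encode(raw).decode("ascii")
    elif encoding == "Q":
        # Simplification: escape every byte as =XX.
        payload = "".join(f"={byte:02X}" for byte in raw)
    else:
        raise ValueError("encoding must be 'B' or 'Q'")
    return f"=?{charset}?{encoding}?{payload}?="


def decode_subject(value: str) -> str:
    """Decode a header value that may contain encoded words."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


print(encoded_word("中", "gb2312", "B"))      # =?gb2312?B?1tA=?=
print(encoded_word("中", "gb2312", "Q"))      # =?gb2312?Q?=D6=D0?=
print(decode_subject("=?gb2312?B?1tA=?="))   # 中
```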
The following describes the two transfer encodings, base64 and QP.
Base64
RFC 2045 defines base64 as the Base64 Content-Transfer-Encoding. It is designed to represent arbitrary sequences of 8-bit bytes in a form that need not be readable by humans.
Why use base64?
In designing this encoding, I think the designers had three main questions in mind:
1. Should the content be obscured?
2. How complex and how efficient should the algorithm be?
3. How will the data survive transmission?
The content is indeed obscured, but the goal is not to let users send truly secure email. As the saying goes, this kind of protection "guards against the gentleman, not the villain": base64 uses no key, so anyone can decode it, and it only ensures that the content cannot be read at a glance.
For that purpose, the algorithm should be neither complex nor slow. As with the previous point, MIME and the other mail protocols solve the problem of how to send and receive mail, not how to send and receive it securely. The algorithm should therefore be simple and efficient; otherwise sending mail would consume a disproportionate amount of resources, which would defeat the purpose.
If those two points were the only concerns, the simplest Caesar cipher would do. Why, then, is base64 more complex than Caesar? Because, for historical reasons, mail transport was only guaranteed to carry ASCII characters, that is, the low 7 bits of each 8-bit byte. If you send mail containing non-ASCII bytes (bytes whose most significant bit is 1), it may run into trouble at a legacy gateway, which may reset that bit to 0 and corrupt the content. To deliver mail reliably, this has to be taken into account, so schemes such as Caesar, which merely shift letters around, will not work. For more information, see RFC 2046.
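The gateway problem can be simulated in a few lines of Python. The function `legacy_gateway` is a hypothetical model of a 7-bit mail relay, written only for this illustration; it simply clears the top bit of every byte.

```python
import base64


def legacy_gateway(data: bytes) -> bytes:
    """Hypothetical 7-bit relay: clears the most significant bit of each byte."""
    return bytes(byte & 0x7F for byte in data)


raw = "中".encode("gb2312")        # b'\xd6\xd0', both bytes have the top bit set
print(legacy_gateway(raw))         # b'VP': the original bytes are lost

safe = base64.b64encode(raw)       # b'1tA=', plain ASCII
survived = legacy_gateway(safe)    # unchanged, since every byte is below 0x80
print(base64.b64decode(survived) == raw)  # True
```

Because base64 output uses only ASCII characters, clearing the top bit cannot alter it, and the receiver can recover the original bytes exactly.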
Base64 encoding was designed with these considerations in mind.
Base64 encoding converts three 8-bit bytes (3*8 = 24 bits) into four 6-bit groups (4*6 = 24 bits), then adds two zero bits in front of each 6-bit group to form an 8-bit byte. The converted output is therefore one third longer than the original. Example:
Before conversion: aaaaaabb ccccdddd eeffffff
After conversion:  00aaaaaa 00bbcccc 00ddddee 00ffffff
The three bytes on the first line are the original data, and the four bytes on the second line are the converted values, each with its top two bits set to 0. The encoding is not finished yet: each converted value must then be looked up in a code table to obtain the final base64 character. The encoding table (from RFC 2045) is as follows:
Value | Encoding | Value | Encoding | Value | Encoding | Value | Encoding
0 | A | 17 | R | 34 | i | 51 | z
1 | B | 18 | S | 35 | j | 52 | 0
2 | C | 19 | T | 36 | k | 53 | 1
3 | D | 20 | U | 37 | l | 54 | 2
4 | E | 21 | V | 38 | m | 55 | 3
5 | F | 22 | W | 39 | n | 56 | 4
6 | G | 23 | X | 40 | o | 57 | 5
7 | H | 24 | Y | 41 | p | 58 | 6
8 | I | 25 | Z | 42 | q | 59 | 7
9 | J | 26 | a | 43 | r | 60 | 8
10 | K | 27 | b | 44 | s | 61 | 9
11 | L | 28 | c | 45 | t | 62 | +
12 | M | 29 | d | 46 | u | 63 | /
13 | N | 30 | e | 47 | v | |
14 | O | 31 | f | 48 | w | (pad) | =
15 | P | 32 | g | 49 | x | |
16 | Q | 33 | h | 50 | y | |
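Note: the pad character "=" at the end of the code table is used to fill out the end of the encoded output, because not every input is a whole multiple of three bytes long. When one byte is left over, the final group is written as two characters followed by "=="; when two bytes are left over, it is written as three characters followed by "=".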
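Putting the bit regrouping, the code table, and the padding rule together, a from-scratch encoder might look like the following Python sketch. The names `ALPHABET` and `b64_encode` are introduced here for illustration, and the sketch omits the line wrapping that mail bodies normally apply; its output is compared against the standard library's `base64.b64encode` as a sanity check.

```python
import base64

# The 64 characters of the code table, indexed by 6-bit value.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_encode(data: bytes) -> str:
    """Illustrative base64 encoder (no line wrapping)."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 bytes short of a full group
        # Pack the group into 24 bits, zero-filling any missing bytes.
        n = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Split into four 6-bit values, most significant first.
        sextets = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        chars = [ALPHABET[s] for s in sextets]
        # Characters that come purely from fill bytes become "=".
        if missing:
            chars[4 - missing:] = ["="] * missing
        out.extend(chars)
    return "".join(out)


sample = "中".encode("gb2312")   # two bytes, so one "=" of padding
print(b64_encode(sample))        # 1tA=
print(b64_encode(sample) == base64.b64encode(sample).decode("ascii"))  # True
```

The output length is always 4 characters for every started group of 3 input bytes, which is where the one-third growth comes from.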
QP (Quoted-printable)
The principle is to represent an 8-bit byte as "=" followed by its two-digit hexadecimal value. QP-encoded content therefore usually looks like this: =B3=C2=BF=A1=C7=E5=A3=AC=C4=FA=BA=C3=A3=A1.
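The escaping rule can be sketched in Python as follows. The function `qp_encode` is a hypothetical, simplified encoder: it keeps printable ASCII (except "=") as-is and escapes everything else, but it ignores the line-length limit and trailing-whitespace rules that a complete implementation has to handle. The standard library's `quopri` module is used only to check that the result decodes back to the original bytes.

```python
import quopri


def qp_encode(data: bytes) -> str:
    """Simplified quoted-printable: no line wrapping, no trailing-space rule."""
    out = []
    for byte in data:
        if 32 <= byte <= 126 and byte != ord("="):
            out.append(chr(byte))        # printable ASCII passes through
        else:
            out.append(f"={byte:02X}")   # "=" plus two uppercase hex digits
    return "".join(out)


raw = "Hi ".encode("ascii") + "中".encode("gb2312")
encoded = qp_encode(raw)
print(encoded)  # Hi =D6=D0
print(quopri.decodestring(encoded.encode("ascii")) == raw)  # True
```

The "=" character itself must be escaped (as =3D), since otherwise a decoder could not tell a literal "=" from the start of an escape sequence.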
Compared with base64, QP hides less of the content because it does not re-encode ASCII characters: for ordinary English documents it is almost equivalent to no encoding at all, and the text remains fully readable. This also brings an advantage: for documents that are mostly English characters, the file size barely increases after encoding.