JavaScript: Base64 encoding and decoding in detail

Source: Internet
Author: User
Keywords: encoding, byte

Base64 is one of the most commonly used encodings: it is used, for example, to pass parameters during development, and modern browsers can render an image directly from a Base64 string placed in an <img/> tag. Base64 is defined in RFC 2045, which describes it as a content-transfer-encoding designed to represent arbitrary sequences of 8-bit bytes in a form that need not be directly readable by people.
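As a rough illustration of the <img/> usage (the variable base64Data below is a placeholder for a real Base64-encoded image, which is omitted here):

<script type="text/javascript">
  // base64Data stands in for an actual Base64-encoded PNG; it is not defined here.
  var img = document.createElement('img');
  img.src = 'data:image/png;base64,' + base64Data;
  document.body.appendChild(img);
</script>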
We know that all data is stored on a computer in binary form. A byte is 8 bits, and one character is stored as one or more bytes: English letters, digits, and English punctuation marks each occupy one byte (usually called ASCII), while Simplified Chinese, Traditional Chinese, Japanese, and Korean characters occupy multiple bytes (often called multibyte characters). Because Base64 operates on the byte representation of a string, the same string produces different Base64 results under different character encodings, so we first need some basic knowledge of character encodings.

Character encoding basics
Computers originally supported only ASCII: one character per byte, using only the low 7 bits with the highest bit always 0, giving 128 ASCII codes in the range 0~127. Later, to support the languages of different regions, large organizations and IT vendors invented their own encoding schemes to make up for the shortcomings of ASCII, such as GB2312, GBK, and BIG5. But each of these encodings only covers a local region or a few languages; none of them can represent every language. There is also no relationship between these encodings, and converting between them requires lookup tables.
To improve how computers process and exchange information, so that all of the world's languages can be handled by computers, the ISO began in 1984 to develop a new standard: the Universal Multiple-Octet Coded Character Set (UCS), standard number ISO 10646. This standard assigns a single, unified internal code to the characters of the world's major written languages (including simplified and traditional Chinese characters) and to additional symbols.
Unicode (the "universal code") is a character encoding system developed by another organization, the Unicode Consortium. Unicode and the ISO 10646 international standard are kept consistent in content. For details, see: Unicode.
ANSI
ANSI does not denote one specific encoding; it refers to the local system encoding. On Simplified Chinese Windows it means GB2312, on Traditional Chinese Windows it means BIG5, and on a Japanese operating system it means JIS. So if you create a new text file and save it as "ANSI", you now know that the file is saved in the local encoding.
Unicode
Unicode maps code points to characters. For example, 56DE represents the Chinese character '回'; this mapping is fixed. Put simply, a Unicode code point is a coordinate into the character chart: through 56DE you can find the character '回'. Implementations of Unicode include UTF-8, UTF-16, UTF-32, and so on.
Unicode itself defines the numeric value of each character, that is, a mapping between characters and natural numbers, while UTF-8, UTF-16, and even UTF-32 define how those values are represented and delimited in a byte stream; the latter is a concept from the computing domain.


UTF-8 is a variable-length encoding that uses 1~6 bytes per character. The number of bytes can be determined from the range of the Unicode code point, and each byte of a UTF-8 character follows a regular, recognizable pattern (see the conversion comparison table further below). This article only discusses the UTF-8 and UTF-16 encodings.
UTF-16
UTF-16 stores each character in a fixed 2 bytes. Because it is a multibyte storage format, there are two possible byte orders: big-endian and little-endian. UTF-16 is the most direct way to implement Unicode; when we create a new text file on Windows and save it as "Unicode", it is actually saved as UTF-16. On Windows, UTF-16 is stored in little-endian order. To test this, I created a text file containing only the single character '回', saved it as "Unicode", then opened it in EditPlus and switched to hex view, as shown:




We see 4 bytes: the first two, FF FE, are the file header (byte order mark), marking this as a little-endian UTF-16 file, and DE 56 is the UTF-16 encoding of '回' in hex, stored low byte first. The JavaScript we use every day also uses UTF-16 internally, but the value it exposes reads in big-endian order, as the following example shows:


<script type="text/javascript">
  console.group('Test Unicode:');
  // '回'.charCodeAt(0) is 0x56DE; print it as uppercase hex
  console.log(('回'.charCodeAt(0)).toString(16).toUpperCase()); // "56DE"
  console.groupEnd();
</script>


The output, 56DE, is clearly different from what EditPlus just showed: the order is reversed because the byte order differs. For details, see: UTF-16.
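To see the two byte orders side by side, here is a small sketch (assuming an environment that provides ArrayBuffer/DataView, i.e. any modern browser or Node.js):

<script type="text/javascript">
  // Write the code unit 0x56DE into a 2-byte buffer in both byte orders.
  var buf = new ArrayBuffer(2);
  var view = new DataView(buf);
  view.setUint16(0, 0x56DE, true);   // little-endian, as in the file on disk
  console.log(view.getUint8(0).toString(16), view.getUint8(1).toString(16)); // "de" "56"
  view.setUint16(0, 0x56DE, false);  // big-endian, as charCodeAt() reads
  console.log(view.getUint8(0).toString(16), view.getUint8(1).toString(16)); // "56" "de"
</script>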


UTF-8


UTF-8 is a variable-length encoding that uses 1~6 bytes per character, but in practice we usually only deal with the one- to three-byte forms, because the other cases are rare. UTF-8 represents a character as a combination of bytes; that is how computers process UTF-8. It has no byte-order issue, and each byte follows the regular pattern described above, so it is not detailed further here.
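In an environment that provides the standard TextEncoder API (modern browsers, Node.js), you can inspect the UTF-8 bytes of a string directly. A quick check for '回':

<script type="text/javascript">
  // TextEncoder always encodes to UTF-8 and returns the bytes as a Uint8Array.
  var bytes = new TextEncoder().encode('回');
  console.log(bytes); // Uint8Array [229, 155, 158], i.e. E5 9B 9E — three bytes
</script>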

Conversion between UTF-16 and UTF-8


UTF-16 to UTF-8

Conversion between UTF-16 and UTF-8 can be done with the conversion comparison table: determine the number of UTF-8 bytes from the range of the Unicode code point, then shift the bits into place. We use the Chinese character '回' as an example.


We already know that the Unicode code point of '回' is 0x56DE, which falls in the range U+00000800–U+0000FFFF, so it is represented in three bytes.


So we need to turn the two-byte value 0x56DE into three bytes. Note that the x positions in the pattern 1110xxxx 10xxxxxx 10xxxxxx are where the bits of 0x56DE go; if you count the x's, you will find exactly 16 of them.


Conversion approach


Take the high 4 bits of 0x56DE and place them in the low 4 bits of the first byte, combined with the binary prefix 1110; that is the first byte. Take the next 6 bits of 0x56DE and combine them with the prefix 10; that is the second byte. The third byte is built the same way from the low 6 bits.
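Concretely, for '回' (0x56DE = 01010110 11011110), the 16 bits split into 0101 / 011011 / 011110, which fill the x positions to give 11100101 10011011 10011110, i.e. the UTF-8 bytes E5 9B 9E.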


Code Implementation


To make this easier to follow, the code below only implements the conversion for the character '回':


<script type="text/javascript">
  /**
   * Conversion comparison table
   * U+00000000 – U+0000007F   0xxxxxxx
   * U+00000080 – U+000007FF   110xxxxx 10xxxxxx
   * U+00000800 – U+0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
   * U+00010000 – U+001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   * U+00200000 – U+03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   * U+04000000 – U+7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   */
  /*
   * The Unicode code point of '回' is 0x56DE, which falls in U+00000800 – U+0000FFFF,
   * so it occupies three bytes:
   * U+00000800 – U+0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
   */
  var ucode = 0x56DE;
  // 1110xxxx — prefix 1110 plus the high 4 bits of the code point
  var byte1 = 0xE0 | ((ucode >> 12) & 0x0F);
  // 10xxxxxx — prefix 10 plus the middle 6 bits
  var byte2 = 0x80 | ((ucode >> 6) & 0x3F);
  // 10xxxxxx — prefix 10 plus the low 6 bits
  var byte3 = 0x80 | (ucode & 0x3F);
  var utf8 = String.fromCharCode(byte1)
           + String.fromCharCode(byte2)
           + String.fromCharCode(byte3);
  console.group('Test Utf16ToUtf8:');
  console.log(utf8);
  console.groupEnd();
</script>


The output looks like garbled text, because JavaScript does not know that the string now holds UTF-8 bytes and cannot display them as the original character. You might ask what use such a "wrong-looking" conversion is; the point is that the conversion is usually done for transmission, or because an API requires it.
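One such use: the byte-per-character string produced above is exactly the form that the browser's btoa() function accepts, so it can be Base64-encoded for transmission. A minimal sketch, assuming the code above has already run in a browser:

<script type="text/javascript">
  // utf8 holds one byte per character (E5 9B 9E), which is what btoa() requires.
  console.log(btoa(utf8)); // "5Zue" — the Base64 of the UTF-8 bytes of '回'
</script>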


UTF-8 to UTF-16


This is the inverse of the UTF-16 to UTF-8 conversion, and it is likewise implemented against the conversion comparison table. Continuing the example: we already have the three-byte UTF-8 encoding of '回'; we just need to convert it back to two bytes according to the table, keeping all of the x bits.

The code is as follows:


<script type="text/javascript">
  /**
   * Conversion comparison table
   * U+00000000 – U+0000007F   0xxxxxxx
   * U+00000080 – U+000007FF   110xxxxx 10xxxxxx
   * U+00000800 – U+0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
   * U+00010000 – U+001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   * U+00200000 – U+03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   * U+04000000 – U+7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   */
  /*
   * The Unicode code point of '回' is 0x56DE, which falls in U+00000800 – U+0000FFFF,
   * so it occupies three bytes:
   * U+00000800 – U+0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
   */
  var ucode = 0x56DE;
  // 1110xxxx
  var byte1 = 0xE0 | ((ucode >> 12) & 0x0F);
  // 10xxxxxx
  var byte2 = 0x80 | ((ucode >> 6) & 0x3F);
  // 10xxxxxx
  var byte3 = 0x80 | (ucode & 0x3F);
  var utf8 = String.fromCharCode(byte1)
           + String.fromCharCode(byte2)
           + String.fromCharCode(byte3);
  console.group('Test Utf16ToUtf8:');
  console.log(utf8);
  console.groupEnd();

  /** ------------------------------------------------------------------------------------*/

  // The UTF-8 form consists of three bytes, so take each one out
  var c1 = utf8.charCodeAt(0);
  var c2 = utf8.charCodeAt(1);
  var c3 = utf8.charCodeAt(2);
  /*
   * Normally the byte count would be determined from the first byte, but here we already
   * know it is three bytes, so we skip that check and simply collect all the x bits to
   * rebuild the 16-bit value:
   * U+00000800 – U+0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
   */
  // Drop the 1110 prefix of the first byte and combine its low 4 bits
  // with the high 4 payload bits of the second byte — the high byte of the result
  var b1 = ((c1 & 0x0F) << 4) | ((c2 >> 2) & 0x0F);
  // Combine the low 2 payload bits of the second byte with the 6 payload bits
  // of the third byte — the low byte of the result
  var b2 = ((c2 & 0x03) << 6) | (c3 & 0x3F);
  // b1 and b2 together form the 16-bit code unit
  var ucode = ((b1 & 0xFF) << 8) | b2;
  console.group('Test Utf8ToUtf16:');
  console.log(ucode.toString(16).toUpperCase(), String.fromCharCode(ucode)); // "56DE" "回"
  console.groupEnd();
</script>


Once you know the conversion rules, the implementation is easy.
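As a rough generalization of the same idea (the function names and structure below are my own, not from the original article; only the one- to three-byte rows of the table are handled, i.e. code points up to U+FFFF):

<script type="text/javascript">
  // Sketch: convert a UTF-16 string to a "byte string" of UTF-8 bytes, and back.
  function utf16ToUtf8(str) {
    var out = '';
    for (var i = 0; i < str.length; i++) {
      var c = str.charCodeAt(i);
      if (c < 0x80) {                       // 0xxxxxxx
        out += String.fromCharCode(c);
      } else if (c < 0x800) {               // 110xxxxx 10xxxxxx
        out += String.fromCharCode(0xC0 | (c >> 6),
                                   0x80 | (c & 0x3F));
      } else {                              // 1110xxxx 10xxxxxx 10xxxxxx
        out += String.fromCharCode(0xE0 | ((c >> 12) & 0x0F),
                                   0x80 | ((c >> 6) & 0x3F),
                                   0x80 | (c & 0x3F));
      }
    }
    return out;
  }
  function utf8ToUtf16(str) {
    var out = '';
    for (var i = 0; i < str.length; ) {
      var c = str.charCodeAt(i);
      if (c < 0x80) {                       // 1 byte
        out += String.fromCharCode(c);
        i += 1;
      } else if (c < 0xE0) {                // 2 bytes
        out += String.fromCharCode(((c & 0x1F) << 6) |
                                   (str.charCodeAt(i + 1) & 0x3F));
        i += 2;
      } else {                              // 3 bytes
        out += String.fromCharCode(((c & 0x0F) << 12) |
                                   ((str.charCodeAt(i + 1) & 0x3F) << 6) |
                                   (str.charCodeAt(i + 2) & 0x3F));
        i += 3;
      }
    }
    return out;
  }
  console.log(utf8ToUtf16(utf16ToUtf8('回'))); // "回" — round-trips correctly
</script>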


Base64 encoding

Base64 encoding takes 3 bytes of 8 bits (3*8=24 bits) and converts them into 4 units of 6 bits (4*6=24 bits); each 6-bit unit is then padded with two zero bits in front to form an 8-bit byte. Because 2 to the 6th power is 64, each 6-bit unit corresponds to one printable character. When the length of the original data is not a multiple of 3: if 2 input bytes remain at the end, one "=" is appended to the encoded result; if 1 input byte remains, two "=" are appended; if nothing remains, nothing is appended. This guarantees that the data can be decoded correctly.
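For example, the ASCII string "Man" (bytes 0x4D 0x61 0x6E, i.e. 01001101 01100001 01101110) regroups into the 6-bit units 010011 010110 000101 101110, i.e. the values 19, 22, 5, 46, which the transcoding table maps to "TWFu".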


Transcoding comparison table


Each 6-bit unit, padded with two zeros in its high bits, has a value between 0 and 63, which is looked up in the transcoding table to find the corresponding printable character; "=" is used only for padding. The table maps 0–25 to 'A'–'Z', 26–51 to 'a'–'z', 52–61 to '0'–'9', 62 to '+', and 63 to '/'.
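To tie the pieces together, here is a minimal Base64-encoder sketch (my own illustration of the grouping and padding rules above, not a full or optimized implementation). It expects a "byte string" in which every character code is 0–255, such as the utf8 string built earlier:

<script type="text/javascript">
  // Minimal Base64 encoder sketch following the 3-bytes-to-4-characters rule.
  var TABLE = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
  function base64Encode(bytes) {
    var out = '';
    for (var i = 0; i < bytes.length; i += 3) {
      var b1 = bytes.charCodeAt(i);
      var b2 = i + 1 < bytes.length ? bytes.charCodeAt(i + 1) : NaN;
      var b3 = i + 2 < bytes.length ? bytes.charCodeAt(i + 2) : NaN;
      out += TABLE.charAt(b1 >> 2);                                                    // high 6 bits of byte 1
      out += TABLE.charAt(((b1 & 0x03) << 4) | (isNaN(b2) ? 0 : b2 >> 4));              // low 2 of byte 1 + high 4 of byte 2
      out += isNaN(b2) ? '=' : TABLE.charAt(((b2 & 0x0F) << 2) | (isNaN(b3) ? 0 : b3 >> 6)); // low 4 of byte 2 + high 2 of byte 3
      out += isNaN(b3) ? '=' : TABLE.charAt(b3 & 0x3F);                                 // low 6 bits of byte 3
    }
    return out;
  }
  console.log(base64Encode('Man')); // "TWFu"
  console.log(base64Encode('Ma'));  // "TWE=" — one byte missing, one '='
  console.log(base64Encode('M'));   // "TQ==" — two bytes missing, two '='
</script>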
