JavaScript: Base64 encoding and decoding (1)

Source: Internet
Author: User
Tags printable characters

Base64 is one of the most commonly used encodings. For example, in web development, the <img> tag in modern browsers can render an image directly from a Base64 string, and Base64 is also used in email. Base64 encoding is defined in RFC 2045 as the Base64 Content-Transfer-Encoding. It is designed to represent arbitrary sequences of 8-bit bytes in a form that need not be humanly readable.

We know that all data is stored in binary form in a computer. One byte is 8 bits, and one character is stored as one or more bytes. For example, English letters, digits, and punctuation marks are each stored in one byte, whose value is usually called an ASCII code. Simplified Chinese, Traditional Chinese, Japanese, and Korean characters are stored in several bytes and are usually called multi-byte characters. Base64 works on the bytes of an encoded string, so the same text produces different Base64 results under different character encodings. For that reason we first need some basic knowledge of character encoding.
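The effect of the character encoding on the Base64 result can be illustrated with a small sketch. It uses the character U+56DE (the example character used later in this article) and assumes an environment that provides the standard `TextEncoder` and `btoa` globals (modern browsers, recent Node.js). The helper name `bytesToBase64` is made up for this illustration.

```javascript
// Base64-encode an array of byte values (0-255) with the built-in btoa().
// btoa() expects a string in which every character stands for one byte.
function bytesToBase64(bytes) {
  var binary = '';
  for (var i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

var ch = '\u56DE'; // the Chinese character 回

// UTF-8 bytes of the character: E5 9B 9E
var utf8Bytes = Array.from(new TextEncoder().encode(ch));

// UTF-16 little-endian bytes of the same character: DE 56
var code = ch.charCodeAt(0);
var utf16leBytes = [code & 0xFF, code >> 8];

var base64FromUtf8 = bytesToBase64(utf8Bytes);
var base64FromUtf16le = bytesToBase64(utf16leBytes);

console.log(base64FromUtf8);    // 5Zue
console.log(base64FromUtf16le); // 3lY=
```

The same character yields two different Base64 strings, so the sender and the receiver have to agree on the character encoding as well as on Base64 itself.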

Character encoding Basics

In the beginning, computers supported only ASCII. A single character is represented in one byte that uses only the lower 7 bits; the highest bit is always 0. There are therefore 128 ASCII codes in total, in the range 0 to 127. Later, to support regional languages, various organizations and IT vendors invented their own encoding schemes to make up for the shortcomings of ASCII, such as GB2312, GBK, and Big5. However, each of these encodings covers only local characters or a few languages, and none of them can express all languages. In addition, these encodings are unrelated to one another, so converting between them requires lookup tables.
 

To improve information processing and exchange, so that all the scripts of the world can be handled by computers, ISO has been working since 1984 on a new standard: the Universal Multiple-Octet Coded Character Set (that is, a multi-byte coded character set), UCS for short. The standard number is ISO 10646. It assigns a unified internal code to the characters of the world's major languages (including Simplified and Traditional Chinese characters) and to additional symbols.

Unicode is a character encoding system developed by another organization, the Unicode Consortium. Unicode and the international standard ISO 10646 are kept consistent in content. For more information, see Unicode.

ANSI

ANSI does not refer to one specific encoding; it means the local (system default) encoding. For example, on Simplified Chinese Windows it means GB2312, on Traditional Chinese Windows it means Big5, and on Japanese operating systems it means a JIS encoding. So if you create a new text file and save it with the ANSI encoding, the file is saved in the local encoding of your system.
 

Unicode

Unicode is a one-to-one mapping between characters and the positions of a two-dimensional table. For example, 56DE represents the Chinese character '回', and this mapping never changes. In layman's terms, a Unicode code is a coordinate in that table: with 56DE, we can find the Chinese character '回'. The Unicode encoding forms include UTF8, UTF16, and UTF32.

Unicode itself defines the value of each character, that is, the mapping between characters and natural numbers. UTF-8, UTF-16, and UTF-32 define how those values are split into a byte stream, which is a matter of computer representation.

We know that UTF-8 is a variable-length encoding. It was originally defined to take 1 to 6 bytes per character, although the current standard (RFC 3629) limits it to 1 to 4 bytes. The number of bytes is determined by the range of the Unicode code value, and every byte of a UTF8 character follows a regular pattern. This article only discusses UTF8 and UTF16 encoding.
 

UTF16

UTF16 stores each character of the Basic Multilingual Plane (U+0000 to U+FFFF) in 2 bytes; characters beyond that range take 4 bytes as a surrogate pair. Because each unit spans more than one byte, there are two storage orders: big-endian and little-endian. UTF16 is the most direct implementation of Unicode. When we create a text file on Windows and save it with the "Unicode" encoding, it is actually saved as UTF16, and Windows stores it in little-endian order. To test this, I created a new text file containing only the Chinese character '回', saved it as Unicode, then opened it in EditPlus and switched to hexadecimal mode. The file contains the bytes FF FE DE 56.

We can see that there are four bytes. The first two bytes, FF FE, are the byte order mark (BOM) at the start of the file, indicating that this is a little-endian UTF16 file, and DE 56 is the UTF16 encoding of '回' with the low byte first. JavaScript, which we use often, also represents strings internally as UTF16 code units. It exposes each code unit as a number, so the value prints with the high byte first, as in big-endian order. Let's look at an example:

 
 
  <script type="text/javascript">
  console.group('Test Unicode:');
  // charCodeAt(0) returns the UTF16 code unit as a number; this prints 56DE
  console.log('回'.charCodeAt(0).toString(16).toUpperCase());
  console.groupEnd();
  </script>

The output is 56DE, the reverse of the DE 56 shown in EditPlus, because the file stores the low byte first while the number is printed with the high byte first. For details, see: UTF-16.
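The two byte orders can be reproduced with a short sketch based on the standard `DataView` API. The function name `utf16Bytes` is made up for this illustration; it only handles a single 16-bit code unit.

```javascript
// Write one 16-bit code unit into a two-byte buffer in the requested byte
// order and return the two bytes as uppercase, zero-padded hex strings.
function utf16Bytes(codeUnit, littleEndian) {
  var view = new DataView(new ArrayBuffer(2));
  view.setUint16(0, codeUnit, littleEndian);
  var out = [];
  for (var i = 0; i < 2; i++) {
    var hex = view.getUint8(i).toString(16).toUpperCase();
    out.push(hex.length < 2 ? '0' + hex : hex);
  }
  return out;
}

var littleEndianBytes = utf16Bytes(0x56DE, true);  // order used in the Windows file
var bigEndianBytes = utf16Bytes(0x56DE, false);

console.log(littleEndianBytes.join(' ')); // DE 56
console.log(bigEndianBytes.join(' '));    // 56 DE
```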

UTF8

UTF8 is a variable-length encoding. Its original definition allows 1 to 6 bytes per character, but in practice we mostly meet the single-byte form (ASCII) and the three-byte form (most Chinese characters); the other cases are less common. A multi-byte UTF8 character is a combination of a lead byte and continuation bytes, and this is how computers process UTF8. UTF8 has no byte-order problem, and every byte follows a regular pattern. We will not go into further detail here.
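How the byte count follows from the range of the code value can be sketched as below. The function name `utf8ByteLength` is made up for this illustration, and the sketch assumes the current definition of UTF-8 (RFC 3629), in which code points stop at U+10FFFF and four bytes is the maximum; the five- and six-byte forms belong to the older definition.

```javascript
// Return how many bytes UTF-8 needs for one Unicode code point.
function utf8ByteLength(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  if (codePoint <= 0x7F) return 1;   // 0xxxxxxx
  if (codePoint <= 0x7FF) return 2;  // 110xxxxx 10xxxxxx
  if (codePoint <= 0xFFFF) return 3; // 1110xxxx 10xxxxxx 10xxxxxx
  return 4;                          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

console.log(utf8ByteLength(0x41));    // 1 (the letter A)
console.log(utf8ByteLength(0xE9));    // 2 (the letter é)
console.log(utf8ByteLength(0x56DE));  // 3 (the Chinese character 回)
console.log(utf8ByteLength(0x1F600)); // 4 (an emoji)
```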
 

Mutual conversion between UTF16 and UTF8 

UTF16 to UTF8

The conversion from UTF16 to UTF8 can be done with the help of the conversion table. By checking which range the Unicode code falls in, we know how many bytes the character needs, and the bytes themselves are then produced by bit shifting. We use the Chinese character '回' as an example.

We already know that the Unicode code of the Chinese character '回' is 0x56DE, which lies between U+00000800 and U+0000FFFF, so it is represented in three bytes.

Therefore, we need to spread the value 0x56DE over the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx. Note that the x positions hold all the bits of 0x56DE: if you count the x's, you will find exactly 16 of them.

Conversion ideas

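Take the highest four bits of 0x56DE and combine them with the binary prefix 1110; this is the first byte. Take the next six bits of 0x56DE and combine them with the binary prefix 10; this is the second byte. The third byte is built the same way from the lowest six bits. In binary, 0x56DE is 0101 011011 011110, so the three bytes are 11100101, 10011011, and 10011110, that is, E5 9B 9E.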
 

Code Implementation

For better understanding, the following code only converts the Chinese character '回'. The code is as follows:

 
 
  <script type="text/javascript">
  /**
   * Conversion table
   * U+00000000 - U+0000007F  0xxxxxxx
   * U+00000080 - U+000007FF  110xxxxx 10xxxxxx
   * U+00000800 - U+0000FFFF  1110xxxx 10xxxxxx 10xxxxxx
   * U+00010000 - U+001FFFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   * U+00200000 - U+03FFFFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   * U+04000000 - U+7FFFFFFF  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   */
  /*
   * The Unicode code of '回' is 0x56DE, which lies between U+00000800 and U+0000FFFF, so it occupies three bytes.
   * U+00000800 - U+0000FFFF  1110xxxx 10xxxxxx 10xxxxxx
   */
  var ucode = 0x56DE;
  // 1110xxxx: the highest four bits
  var byte1 = 0xE0 | ((ucode >> 12) & 0x0F);
  // 10xxxxxx: the middle six bits
  var byte2 = 0x80 | ((ucode >> 6) & 0x3F);
  // 10xxxxxx: the lowest six bits
  var byte3 = 0x80 | (ucode & 0x3F);
  // Store each UTF8 byte as one character of a string
  var utf8 = String.fromCharCode(byte1)
      + String.fromCharCode(byte2)
      + String.fromCharCode(byte3);

  console.group('Test UTF16ToUTF8:');
  console.log(utf8);
  console.groupEnd();
  </script>

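The output looks garbled because the string now holds three separate code units, 0xE5, 0x9B, and 0x9E, each standing for one UTF8 byte, and JavaScript displays each of them as its own character (å followed by two control characters) instead of decoding them as UTF8. This does not mean the conversion failed: a byte-per-character string like this is usually meant for transmission or for an API that expects UTF8 bytes, not for display.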
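The single-character example can be generalized to whole strings. The following is a minimal sketch rather than a complete encoder: the names `encodeCodePointToUtf8` and `stringToUtf8Bytes` are made up for this illustration, only the one- to four-byte rows of the conversion table are covered, and unpaired surrogates are not rejected. It returns an array of byte values instead of a string.

```javascript
// Encode one Unicode code point as an array of UTF-8 byte values.
function encodeCodePointToUtf8(cp) {
  if (cp <= 0x7F) {
    return [cp];                                  // 0xxxxxxx
  }
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6),                     // 110xxxxx
            0x80 | (cp & 0x3F)];                  // 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12),                    // 1110xxxx
            0x80 | ((cp >> 6) & 0x3F),            // 10xxxxxx
            0x80 | (cp & 0x3F)];                  // 10xxxxxx
  }
  return [0xF0 | (cp >> 18),                      // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),             // 10xxxxxx
          0x80 | ((cp >> 6) & 0x3F),              // 10xxxxxx
          0x80 | (cp & 0x3F)];                    // 10xxxxxx
}

// Encode a JavaScript (UTF-16) string as an array of UTF-8 byte values.
function stringToUtf8Bytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var cp = str.codePointAt(i);
    if (cp > 0xFFFF) {
      i++; // the code point used two UTF-16 code units (a surrogate pair)
    }
    bytes = bytes.concat(encodeCodePointToUtf8(cp));
  }
  return bytes;
}

var hex = stringToUtf8Bytes('\u56DE').map(function (b) {
  return b.toString(16).toUpperCase();
}).join(' ');
console.log(hex); // E5 9B 9E
```

Where it is available, the built-in `TextEncoder` does the same job and is the safer choice for real code.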

UTF8 to UTF16

This is the inverse of the UTF16-to-UTF8 conversion and is also implemented against the conversion table. In the following example we already have the UTF8 encoding of the Chinese character '回', which is three bytes. We only need to strip the marker bits and reassemble all the x bits into a two-byte value according to the conversion table.
 

The code is as follows:

 
 
  <script type="text/javascript">
  /**
   * Conversion table
   * U+00000000 - U+0000007F  0xxxxxxx
   * U+00000080 - U+000007FF  110xxxxx 10xxxxxx
   * U+00000800 - U+0000FFFF  1110xxxx 10xxxxxx 10xxxxxx
   * U+00010000 - U+001FFFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   * U+00200000 - U+03FFFFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   * U+04000000 - U+7FFFFFFF  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   */
  /*
   * The Unicode code of '回' is 0x56DE, which lies between U+00000800 and U+0000FFFF, so it occupies three bytes.
   * U+00000800 - U+0000FFFF  1110xxxx 10xxxxxx 10xxxxxx
   */
  var ucode = 0x56DE;
  // 1110xxxx: the highest four bits
  var byte1 = 0xE0 | ((ucode >> 12) & 0x0F);
  // 10xxxxxx: the middle six bits
  var byte2 = 0x80 | ((ucode >> 6) & 0x3F);
  // 10xxxxxx: the lowest six bits
  var byte3 = 0x80 | (ucode & 0x3F);
  // Store each UTF8 byte as one character of a string
  var utf8 = String.fromCharCode(byte1)
      + String.fromCharCode(byte2)
      + String.fromCharCode(byte3);

  console.group('Test UTF16ToUTF8:');
  console.log(utf8);
  console.groupEnd();
  /* ------------------------------------------------------------------ */
  // The UTF8 form consists of three bytes, one per character of the string.
  var c1 = utf8.charCodeAt(0);
  var c2 = utf8.charCodeAt(1);
  var c3 = utf8.charCodeAt(2);
  /*
   * A real decoder must inspect the first byte to decide how many bytes follow.
   * Here we know there are three, so we skip that check and collect all the x bits into a 16-bit value.
   * U+00000800 - U+0000FFFF  1110xxxx 10xxxxxx 10xxxxxx
   */
  // Shift the first byte left by four bits and append the upper four x bits of the second byte
  // (the 1110 marker bits are masked off when b1 and b2 are combined)
  var b1 = (c1 << 4) | ((c2 >> 2) & 0x0F);
  // Likewise, combine the lower two x bits of the second byte with the six x bits of the third byte
  var b2 = ((c2 & 0x03) << 6) | (c3 & 0x3F);
  // Combine b1 and b2 into 16 bits
  var decoded = ((b1 & 0x00FF) << 8) | b2;
  console.group('Test UTF8ToUTF16:');
  // Prints 56DE and the character 回
  console.log(decoded.toString(16).toUpperCase(), String.fromCharCode(decoded));
  console.groupEnd();
  </script>

As we can see, the conversion rules are easy to implement.
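The example above skips the step of deciding how many bytes a character occupies. A minimal sketch of a decoder that makes that decision from the first byte is shown below. The name `utf8BytesToString` is made up for this illustration; the sketch takes an array of byte values, covers only the one- to four-byte forms, and does not reject overlong encodings, so it is not a validating decoder.

```javascript
// Decode an array of UTF-8 byte values into a JavaScript string.
function utf8BytesToString(bytes) {
  var out = '';
  var i = 0;
  while (i < bytes.length) {
    var b0 = bytes[i];
    var cp;    // code point being assembled
    var extra; // number of continuation bytes that follow the lead byte
    if (b0 < 0x80) {
      cp = b0; extra = 0;               // 0xxxxxxx
    } else if ((b0 & 0xE0) === 0xC0) {
      cp = b0 & 0x1F; extra = 1;        // 110xxxxx
    } else if ((b0 & 0xF0) === 0xE0) {
      cp = b0 & 0x0F; extra = 2;        // 1110xxxx
    } else if ((b0 & 0xF8) === 0xF0) {
      cp = b0 & 0x07; extra = 3;        // 11110xxx
    } else {
      throw new Error('Invalid UTF-8 lead byte at index ' + i);
    }
    if (i + extra >= bytes.length) {
      throw new Error('Truncated UTF-8 sequence at index ' + i);
    }
    for (var k = 1; k <= extra; k++) {
      var b = bytes[i + k];
      if ((b & 0xC0) !== 0x80) {
        throw new Error('Invalid continuation byte at index ' + (i + k));
      }
      cp = (cp << 6) | (b & 0x3F);      // append six x bits
    }
    out += String.fromCodePoint(cp);
    i += extra + 1;
  }
  return out;
}

var decodedChar = utf8BytesToString([0xE5, 0x9B, 0x9E]);
console.log(decodedChar.charCodeAt(0).toString(16).toUpperCase()); // 56DE
```

In environments that provide it, the built-in `TextDecoder` performs this conversion with full validation.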

Base64 encoding

Base64 encoding converts every three 8-bit bytes (3 * 8 = 24 bits) into four 6-bit groups (4 * 6 = 24 bits). Each 6-bit group is then padded with two leading zero bits to form a full byte. Because 2 to the power of 6 is 64, each 6-bit group takes one of 64 values and corresponds to one printable character. When the length of the original data is not a multiple of 3: if two bytes are left over, one "=" is appended to the encoded result; if one byte is left over, two "=" are appended; if nothing is left over, nothing is appended. This ensures that the data can be restored correctly.
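The grouping and padding rules can be sketched as follows. This is a minimal illustration, not an optimized implementation; the names `BASE64_ALPHABET` and `base64Encode` are made up for it, the input is assumed to be an array of byte values (0 to 255), and the alphabet is the standard one from RFC 2045.

```javascript
var BASE64_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Encode an array of byte values as a Base64 string.
function base64Encode(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i += 3) {
    var hasB1 = i + 1 < bytes.length;
    var hasB2 = i + 2 < bytes.length;
    var b0 = bytes[i];
    var b1 = hasB1 ? bytes[i + 1] : 0; // missing bytes count as zero bits
    var b2 = hasB2 ? bytes[i + 2] : 0;

    // Join three bytes into one 24-bit number, then cut it into four 6-bit groups.
    var triple = (b0 << 16) | (b1 << 8) | b2;
    out += BASE64_ALPHABET.charAt((triple >> 18) & 0x3F);
    out += BASE64_ALPHABET.charAt((triple >> 12) & 0x3F);
    // Groups that contain no input bits at all are replaced by "=".
    out += hasB1 ? BASE64_ALPHABET.charAt((triple >> 6) & 0x3F) : '=';
    out += hasB2 ? BASE64_ALPHABET.charAt(triple & 0x3F) : '=';
  }
  return out;
}

console.log(base64Encode([0x4D, 0x61, 0x6E])); // TWFu  (no bytes left over)
console.log(base64Encode([0x4D, 0x61]));       // TWE=  (two bytes left over)
console.log(base64Encode([0x4D]));             // TQ==  (one byte left over)
```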

Transcoding table

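After the two high-order zero bits are added, each 6-bit group forms a byte with a value between 0 and 63. The corresponding printable character is looked up in the transcoding table, and "=" is used for padding. In the standard table, values 0 to 25 map to A-Z, 26 to 51 map to a-z, 52 to 61 map to 0-9, 62 maps to "+", and 63 maps to "/".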
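The transcoding table can be written as a 64-character string in which the index of each character is its 6-bit value. The sketch below uses such a string to look up values and to reverse the encoding. The names `BASE64_TABLE` and `base64Decode` are made up for this illustration; the decoder expects padded input without whitespace or line breaks and returns an array of byte values.

```javascript
// Index in this string = 6-bit value, character = printable symbol.
var BASE64_TABLE =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

console.log(BASE64_TABLE.charAt(0));    // A
console.log(BASE64_TABLE.charAt(63));   // /
console.log(BASE64_TABLE.indexOf('u')); // 46

// Decode a padded Base64 string into an array of byte values.
function base64Decode(text) {
  if (text.length % 4 !== 0) {
    throw new Error('Base64 length must be a multiple of 4');
  }
  var bytes = [];
  for (var i = 0; i < text.length; i += 4) {
    var quad = 0;    // 24 bits collected from four characters
    var padding = 0; // number of "=" characters in this group
    for (var k = 0; k < 4; k++) {
      var ch = text.charAt(i + k);
      var value;
      if (ch === '=') {
        value = 0;
        padding++;
      } else {
        value = BASE64_TABLE.indexOf(ch);
        if (value === -1 || padding > 0) {
          throw new Error('Invalid Base64 input at index ' + (i + k));
        }
      }
      quad = (quad << 6) | value;
    }
    if (padding > 2 || (padding > 0 && i + 4 !== text.length)) {
      throw new Error('Invalid Base64 padding');
    }
    // Cut the 24 bits back into bytes, dropping the ones that were only padding.
    bytes.push((quad >> 16) & 0xFF);
    if (padding < 2) bytes.push((quad >> 8) & 0xFF);
    if (padding < 1) bytes.push(quad & 0xFF);
  }
  return bytes;
}

console.log(base64Decode('TWFu')); // [ 77, 97, 110 ]
console.log(base64Decode('TQ==')); // [ 77 ]
```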
 

For details, refer to: Base64.

