Character sets and encodings in web development

Source: Internet
Author: User

Introduction

I believe many people are mercilessly tormented by character sets when they first start programming, especially by garbled Chinese text in databases. So how is garbled text produced? We all know that computers store and process data in binary, so how is binary data converted into text? Which character sets are in common use, and how do the common encoding conversions work?

This post is not hardcore technical material, just a short summary of the character sets and encodings we commonly use. After reading it, you should have a general understanding of character sets and the common encoding methods.

ASCII code

ASCII (American Standard Code for Information Interchange) is probably the first encoding we encounter, and it contains the characters most commonly used in programming. It uses 7 bits to represent 128 (2^7) characters; with the highest bit fixed at 0, each character occupies one byte. Of these:

    • 0~31 and 127 (33 in total) are control characters or communication-specific characters (the rest are printable characters), such as TAB (tab), CR (carriage return), DEL (delete) and BS (backspace). The commonly used ASCII values 8, 9, 10 and 13 correspond to backspace, tab, line feed and carriage return.

    • 48~57 are the ten Arabic numerals 0 to 9.

    • 65~90 are the 26 uppercase English letters and 97~122 are the 26 lowercase English letters.

    • 32~47, 58~64, 91~96 and 123~126 are the space and the commonly used punctuation marks and arithmetic symbols (: ' and so on).

We will find that many of these can be found on the keyboard.

Tips

    • In PHP, use ord($char) to get the ASCII code of a character;
    • Use chr($int) to get the character that corresponds to an ASCII value.
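A small sketch of the two functions from the tips. The variable names are only for illustration. Note that ord() looks at the first byte only, so for a multi-byte character it returns the first byte of its encoding, not a character code.

```php
<?php
// Character to ASCII code, and ASCII code back to character.
$code = ord('A');   // 65
$char = chr(97);    // "a"

// Control characters are ordinary codes too: 9 is tab, 10 is line feed.
$isTab = ("\t" === chr(9));

// ord() only inspects the first byte: the UTF-8 form of U+597D starts with 0xE5.
$firstByte = ord("\u{597D}");   // 229

printf("%d %s %s %d\n", $code, $char, var_export($isTab, true), $firstByte);
// 65 a true 229
```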

ANSI Encoding

Americans invented the computer and fit their own most commonly used characters into a single byte, but how can computers represent all the other languages of the world?

To let computers support more languages, different countries and regions developed their own standards, such as GB2312 (Simplified Chinese), BIG5 (Traditional Chinese) and JIS (Japanese). These extended encodings use 1 byte for an English character and 2 bytes for a Chinese (or other local) character, and are collectively called ANSI encodings.

When we save a file on Windows and choose an encoding, we see an option named ANSI. On Windows systems set up for different languages, ANSI stands for a different encoding. The various ANSI encodings are incompatible with each other, so when information is exchanged internationally, text in two languages cannot be stored in the same piece of ANSI-encoded text.
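To make the incompatibility concrete, here is a minimal sketch that encodes the same character under two ANSI code pages. It assumes the mbstring extension is loaded and that it accepts the encoding names 'GBK' and 'BIG-5'. The variable names are only for illustration.

```php
<?php
// Assumes the mbstring extension is available.
$text = "\u{597D}";   // the character 好, as UTF-8

$gbk  = mb_convert_encoding($text, 'GBK', 'UTF-8');
$big5 = mb_convert_encoding($text, 'BIG-5', 'UTF-8');

// Same character, different bytes under each code page.
echo bin2hex($gbk), "\n";    // bac3
echo bin2hex($big5), "\n";

// Reading GBK bytes as if they were BIG5 does not give back the original text.
$wrong = mb_convert_encoding($gbk, 'UTF-8', 'BIG-5');
var_dump($wrong === $text);  // bool(false)
```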

Unicode encoding

ANSI encodings are mutually incompatible and cannot coexist in one text, yet multi-language communication is common on the modern network. When a byte such as '11011011' travels across a multi-language network, which character does it represent?

This is where Unicode comes in: a character map large enough to include every character, each with a unique Unicode value. For example, the Chinese character '好' (good) has the Unicode value 0x597D, which is '0101 1001 0111 1101' in binary, so it needs 16 bits, or two bytes. Some characters need more bytes than that (forgive me for not giving an example).

UCS-4 is a 31-bit character set: 31 bits hold the character, plus a leading bit fixed at 0, for a total of 32 bits, or 4 bytes. That gives room for 2^31 values, more than enough for all the characters in the world. (The current Unicode standard restricts code points to the range U+0000 to U+10FFFF.)
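The Unicode value of a character can be checked from PHP. This is a sketch that assumes PHP 7.2 or later with the mbstring extension, which provides mb_ord() and mb_chr(). The variable names are only for illustration.

```php
<?php
// Assumes PHP 7.2+ with mbstring (mb_ord / mb_chr).
$char = "\u{597D}";                    // 好

$codePoint = mb_ord($char, 'UTF-8');   // 22909
printf("U+%04X\n", $codePoint);        // U+597D
printf("%016b\n", $codePoint);         // 0101100101111101

// And back from the number to the character.
$back = mb_chr(0x597D, 'UTF-8');
var_dump($back === $char);             // bool(true)
```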

Tips

    • In network transmission, Chinese characters are often converted to Unicode escapes. A regular expression range that matches a common Chinese character is [\x{4e00}-\x{9fa5}] (in PHP, use it with the /u modifier).

    • In PHP, to see the Unicode code of a Chinese character, you can use json_encode($str).

    • To make json_encode keep the original Chinese characters rather than converting them to Unicode escapes, add the option constant: json_encode($str, JSON_UNESCAPED_UNICODE).

    • For the various encoding functions in PHP, see my blog post: PHP uses the mb_string function library to handle Windows-related Chinese characters.

    • Garbled text arises when data is encoded one way and decoded another: Windows uses GBK as its ANSI encoding, the database stores the data in some other encoding, and the web browser parses the page with yet another. Using UTF-8 everywhere solves this kind of problem.

Note that Unicode is only a character set; how characters are actually stored is described below.
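The json_encode and regular expression tips can be tried with a few lines. This is only a sketch. The string is written as a "\u{...}" escape so that the example does not depend on the encoding of the source file.

```php
<?php
$str = "\u{597D}";   // 好

$escaped = json_encode($str);
$raw     = json_encode($str, JSON_UNESCAPED_UNICODE);
echo $escaped, "\n";   // "\u597d"
echo $raw, "\n";       // "好"

// The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
$pattern = '/^[\x{4e00}-\x{9fa5}]+$/u';
var_dump(preg_match($pattern, $str));    // int(1)
var_dump(preg_match($pattern, 'abc'));   // int(0)
```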

UTF-8

We know that, under the Unicode standard, a character may need up to 4 bytes. If every character were stored at that width, Europeans and Americans might cry: documents they could store at one byte per character would, for the sake of internationalization, take four times the space. UTF-8 (8-bit Unicode Transformation Format) appeared to solve this problem.

UTF-8 uses a variable-length encoding that uses 1~4 bytes to represent a symbol:

    • For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the Unicode code of the symbol. So for English letters, the UTF-8 encoding is identical to ASCII.
    • For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n+1)th bit is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits hold the Unicode code of the symbol.
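The two rules can be turned into code directly. The function below is a hypothetical helper written for this post, not a PHP built-in, and it is only a sketch: it does not reject invalid input such as surrogate values or numbers above 0x10FFFF.

```php
<?php
/**
 * Encode one Unicode code point as UTF-8 by hand (hypothetical helper).
 */
function utf8Encode(int $cp): string
{
    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8Encode(0x41)), "\n";      // 41
echo bin2hex(utf8Encode(0x597D)), "\n";    // e5a5bd
echo bin2hex(utf8Encode(0x1F600)), "\n";   // f09f9880
```

The first call shows that an ASCII letter keeps its single byte. The other two show the 3-byte and 4-byte forms.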

So, happily, UTF-8 became the most widely used Unicode encoding on the Internet.

Besides UTF-8, Unicode text can also be encoded with UTF-7, Punycode, CESU-8, SCSU, UTF-32, GB18030 and other methods.

utf8mb4

utf8mb4 is not a separate Unicode encoding; it is a character set in MySQL. In recent MySQL versions, utf8mb4 can replace utf8 and has capabilities that utf8 lacks.

mb4 means "most bytes 4". MySQL's utf8 uses at most 3 bytes to store one character and fails when storing a 4-byte character, whereas utf8mb4 uses up to 4 bytes per character. It can therefore store more Unicode characters, including Emoji (Unicode pictographs common on iOS and Android phones), many rarely used Chinese characters, and any newly added Unicode characters.

Because utf8mb4 is a superset of utf8, a MySQL database encoded in utf8 can be migrated smoothly to utf8mb4.
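Before inserting user input into a column that still uses the 3-byte utf8, it can be useful to check whether the text contains any 4-byte characters. The function below is a hypothetical helper, a sketch that assumes the input is valid UTF-8.

```php
<?php
/**
 * Hypothetical helper: true if $s contains a character above U+FFFF,
 * that is, one that takes 4 bytes in UTF-8.
 */
function needsUtf8mb4(string $s): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 1;
}

var_dump(needsUtf8mb4("\u{597D}"));         // bool(false), 3 bytes
var_dump(needsUtf8mb4("hi \u{1F600}"));     // bool(true), the emoji takes 4 bytes
var_dump(strlen("\u{1F600}"));              // int(4)
```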

URL encoding

URL encoding is the encoding most commonly used in web development. Some characters have a special meaning in a URL and are called reserved characters: for example, = assigns a value to a key, ? marks the beginning of the query string, and # identifies an anchor. When we want to pass these characters as ordinary string data, we need URL encoding.

URL encoding, also known as percent-encoding because it replaces special characters with a sequence that starts with %, is the mechanism for encoding information in a Uniform Resource Locator (URL). It is also used to prepare data of the "application/x-www-form-urlencoded" media type, which is how HTML form data is submitted in HTTP requests.

Conversion rules:

First represent the ASCII value of the character as two hexadecimal digits, then place the escape character (%) in front of it and put the result at the corresponding position in the URI. Non-ASCII characters (such as Chinese) are first converted to their UTF-8 byte sequence, and each byte is then represented in the same way.
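The rule can be sketched in a few lines. percentEncode() is a hypothetical helper written for illustration; it leaves only letters, digits and - _ . ~ untouched and escapes every other byte, which is also what PHP's built-in rawurlencode() does.

```php
<?php
/**
 * Hypothetical helper: percent-encode every byte except
 * letters, digits and the four characters - _ . ~
 */
function percentEncode(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $byte = $s[$i];
        if (preg_match('/^[A-Za-z0-9\-_.~]$/', $byte)) {
            $out .= $byte;                          // safe, keep as is
        } else {
            $out .= sprintf('%%%02X', ord($byte));  // % plus two hex digits
        }
    }
    return $out;
}

echo percentEncode('a=b&c'), "\n";       // a%3Db%26c
echo percentEncode("\u{597D}"), "\n";    // %E5%A5%BD  (three UTF-8 bytes)
var_dump(percentEncode('a=b&c') === rawurlencode('a=b&c'));   // bool(true)
```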

The following table lists common characters and their URL-encoded forms:

char  URL    char  URL    char  URL    char  URL    char  URL
!     %21    #     %23    $     %24    &     %26    '     %27
(     %28    )     %29    *     %2A    +     %2B    ,     %2C
/     %2F    :     %3A    ;     %3B    =     %3D    ?     %3F
@     %40    [     %5B    ]     %5D

Tips: In PHP, use urlencode() and urldecode() to encode and decode URLs.
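A short sketch of these functions in use. The value and variable names are made up for the example. rawurlencode() and http_build_query() are also PHP built-ins; they are shown here because urlencode() writes a space as +, while rawurlencode() writes it as %20.

```php
<?php
$value = 'a=1&b=2 #top';   // made-up value containing reserved characters

$encoded = urlencode($value);
echo $encoded, "\n";               // a%3D1%26b%3D2+%23top
echo rawurlencode($value), "\n";   // a%3D1%26b%3D2%20%23top
var_dump(urldecode($encoded) === $value);   // bool(true)

// http_build_query() encodes each key and value and joins the pairs.
$query = http_build_query(['q' => $value, 'page' => 2], '', '&');
echo $query, "\n";                 // q=a%3D1%26b%3D2+%23top&page=2
```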

BASE64 encoding

Base64 is another common encoding in web development. It is a simple reversible encoding (not encryption), and it makes it convenient to transfer binary data between systems as text.

It uses the 64 (2^6) characters A-Z, a-z, 0-9, + and / to represent data. Strictly speaking there is also =, which pads out an incomplete final group and appears only at the end of the encoded string.

Encoding Rules:

A string is divided into groups of three bytes (3 * 8 = 24 bits); the 24 bits are split into four groups of 6 bits each, and the decimal value of each 6-bit group is mapped to a Base64 character.

For example, the process of converting the Chinese character '琪' (three bytes in UTF-8) to Base64 is:

    • Its hexadecimal representation is e790aa;
    • Each hexadecimal digit becomes 4 binary bits: 11100111 10010000 10101010;
    • Split into four 6-bit groups: 111001 111001 000010 101010;
    • The corresponding decimal numbers are 57 57 2 42;
    • The corresponding Base64 string is 55Cq.
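The steps above can be replayed in code. This is only a sketch of the procedure for one complete three-byte group, using the bytes e7 90 aa from the list; the variable names are mine.

```php
<?php
$bytes = "\xE7\x90\xAA";   // the three bytes from the steps above
$alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Bytes to one 24-bit binary string.
$bits = '';
foreach (str_split($bytes) as $b) {
    $bits .= str_pad(decbin(ord($b)), 8, '0', STR_PAD_LEFT);
}

// 6-bit groups, their decimal values, and the mapped characters.
$groups  = str_split($bits, 6);
$indexes = array_map('bindec', $groups);
$encoded = '';
foreach ($indexes as $i) {
    $encoded .= $alphabet[$i];
}

echo implode(' ', $groups), "\n";    // 111001 111001 000010 101010
echo implode(' ', $indexes), "\n";   // 57 57 2 42
echo $encoded, "\n";                 // 55Cq
```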

The mapping from decimal value to Base64 character is: 0~25 map to A-Z, 26~51 to a-z, 52~61 to 0-9, 62 to + and 63 to /.

So what happens when the last group of a string has fewer than three bytes?

    • Two-byte case: the 16 bits of the two bytes are divided into three groups, and the last group has only 4 bits (16 % 6 = 4). Two 0 bits are appended to make it 6 bits, and one = is added at the end of the output to mark the padding so that it can be decoded;
    • One-byte case: the 8 bits of the byte are divided into two groups, and the last group has only 2 bits (8 % 6 = 2). Four 0 bits are appended to make it 6 bits, and == is added at the end of the output to mark the padding so that it can be decoded.

Since every three bytes of input become four characters of output, a Base64-encoded string is usually about 4/3 the length of the original.
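The two padding cases are easy to see with the built-in base64_encode(). A minimal sketch with made-up inputs:

```php
<?php
$three = base64_encode('abc');   // 3 bytes, complete group
$two   = base64_encode('ab');    // 2 bytes, last group has 4 bits
$one   = base64_encode('a');     // 1 byte, last group has 2 bits

echo $three, "\n";   // YWJj
echo $two, "\n";     // YWI=
echo $one, "\n";     // YQ==
```

For 'a' (01100001) the groups are 011000 and 01 plus four 0 bits, which map to Y and Q.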
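With padding, every started group of three bytes produces four characters, so the exact output length for n bytes is 4 * ceil(n / 3). base64Length() below is a hypothetical helper that sketches this arithmetic.

```php
<?php
// Hypothetical helper: length of the padded Base64 output for $n input bytes.
function base64Length(int $n): int
{
    return 4 * intdiv($n + 2, 3);   // intdiv($n + 2, 3) equals ceil($n / 3)
}

foreach ([1, 2, 3, 13, 100] as $n) {
    printf("%d bytes -> %d characters\n", $n, base64Length($n));
}
// 1 bytes -> 4 characters
// 2 bytes -> 4 characters
// 3 bytes -> 4 characters
// 13 bytes -> 20 characters
// 100 bytes -> 136 characters
```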

Here is a Base64 encoding class that I have implemented with PHP to fully understand Base64 encoding (I'm lazy after writing the code ...) ):

<?phpclass Base64 {Private $mapping = [' A ', ' B ', ' C ', ' D ', ' E ', ' F ', ' G ', ' H ', ' I ', ' J ', ' K ', ' L ', ' M ', ' N ', ' O ' ', ' P ', ' Q ', ' R ', ' S ', ' T ', ' U ', ' V ', ' W ', ' X ', ' Y ', ' Z ', ' A ', ' B ', ' C ', ' d ', ' e ', ' f ', ' g ', ' h ', ' I ', ' j ', ' K ', '  L ', ' m ', ' n ', ' o ', ' P ', ' Q ', ' R ', ' s ', ' t ', ' u ', ' V ', ' w ', ' x ', ' y ', ' z ', ' 0 ', ' 1 ', ' 2 ', ' 3 ', ' 4 ', ' 5 ', ' 6 ', ' 7 ',    ' 8 ', ' 9 ', ' + ', '/', ';  /** * Base64 Main Method * * @param $str * * @return String */Public function encode ($STR) {//        Unpack the string into hexadecimal $unpacked = unpack (' h* ', $str);        $hex = Str_split ($unpacked [1]);        $bin _str = $this->hextobin ($hex);    return $this->bintobase64 ($bin _str);  /** * Map binary strings to corresponding base64 strings * * @param $bin _str * * @return String */Private function        BinToBase64 ($bin _str) {$base 64_str = ';        $bin _list = Str_split ($bin _str, 6);            foreach ($bin _list as $bin) {$append = '; Switch (Strlen ($bin)) {//$bin for 6-bit without special handling case 6:break;                    The $bin 4-bit is a two-byte string 2*8%6 = 4 Case 4: $bin = Str_pad ($bin, 6, ' 0 ', str_pad_right);                    $append = ' = ';                Break                    $bin 2 bits is a byte string 1*8%6 = 2 Case 2: $append = ' = = ';                    $bin = Str_pad ($bin, 6, ' 0 ', str_pad_right);            Break            } $order = Base_convert ($bin, 2, 10);            $char = $this->mapping[$order]; $base 64_str. = $char.        
$append;    } return $base 64_str; /** * Converts a hexadecimal string to a binary string * * @param $hex * * @return String */Private function Hextobin ($h        Ex) {$bin _str = '; foreach ($hex as $char) {//hexadecimal to binary string, each hexadecimal character to 4-bit binary, less than 0 supplement $bin = Base_convert ($char, 16, 2)            ;             if (strlen ($bin) < 4) {   $bin = Str_pad ($bin, 4, ' 0 ', str_pad_left);        } $bin _str. = $bin;    } return $bin _str; }} $encoder = new Base64 (); Var_dump ($encoder->encode (' Pillow Book blog '); 5p6v6l655lmmymxvzw==var_dump (Base64_encode (' Pillow Book Blog ')); 5p6v6l655lmmymxvzw==

Tips: In PHP, use base64_encode() and base64_decode() to perform Base64 encoding and decoding.
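A round trip with the two functions, as a sketch. Passing true as the second argument of base64_decode() turns on strict mode, so input that contains characters outside the Base64 alphabet returns false instead of being decoded on a best-effort basis. Anyone can decode Base64, so it should not be used to hide secrets.

```php
<?php
$raw  = "\x00\xFF\x10binary";   // arbitrary bytes, made up for the example
$text = base64_encode($raw);

$decoded = base64_decode($text, true);
var_dump($decoded === $raw);    // bool(true)

// Strict mode rejects characters that are not part of the alphabet.
$bad = base64_decode('not*base64', true);
var_dump($bad);                 // bool(false)
```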

Summary

Character sets and encodings are generally not the focus of web development, but they are interesting to learn about, and understanding them helps you avoid suddenly falling into a pit one day.

If you found this article helpful, you can click recommend or follow me. If there are any mistakes, please point them out. Thank you.
