What is UTF8
UTF8 is not a character code in its own right, but a form of storage and transmission. As described above, each Unicode/UCS character is stored directly in 2 or 4 bytes. Consider the following comparisons:
Take "I am Chinese" as an example:
Stored as ANSI: 12 bytes
Stored as Unicode/UCS2: 24 bytes + 2 bytes (header)
Stored as UCS4: 48 bytes + 4 bytes (header)
Take the same sentence written in Chinese (five characters) as an example:
Stored as ANSI: 10 bytes
Stored as Unicode/UCS2: 10 bytes + 2 bytes (header)
Stored as UCS4: 20 bytes + 4 bytes (header)
This shows that storing Unicode/UCS directly in its raw form wastes a great deal of space and is poorly suited to transmission over the Internet (Chinese comes out a bit more economical ^_^).
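The comparison can be reproduced with a short Python sketch. It rests on a few assumptions that are not in the text above: "ANSI" is taken to mean the GBK code page, Python's "utf-16" and "utf-32" codecs (which prepend a byte order mark) stand in for UCS2 and UCS4 with a header, and the Chinese string is an assumed five-character rendering of "I am Chinese".

```python
# Compare the storage cost of the two sample strings.
# Assumption: "我是中国人" is the five-character Chinese form of "I am Chinese".
SAMPLES = ["I am Chinese", "我是中国人"]


def storage_sizes(text):
    """Return the encoded size in bytes under several storage forms."""
    return {
        "ANSI (GBK)": len(text.encode("gbk")),
        "UCS2 + header": len(text.encode("utf-16")),  # includes a 2-byte BOM
        "UCS4 + header": len(text.encode("utf-32")),  # includes a 4-byte BOM
        "UTF8": len(text.encode("utf-8")),
    }


for sample in SAMPLES:
    print(sample, storage_sizes(sample))
```

For the English sentence the UCS2 and UCS4 forms come out two and four times larger than ANSI (plus the header), while the UTF8 size equals the ANSI size.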
In view of this, a more compact form of Unicode/UCS appeared: UTF8. To quote the first sentence on the official website: "UTF-8 stands for Unicode Transformation Format-8. It is an octet (8-bit) lossless encoding of Unicode characters." Because UTF also applies to UCS encoding, it can also be called "UCS Transformation Formats (UTF)".
UTF8 encodes with a basic unit of 8 bits (1 byte). The basic unit can also be 16 or 32 bits, giving UTF16 and UTF32 respectively, but these are not used much at present, whereas UTF8 is widely used for file storage and network transmission.
Coding Principle
Look at this template first:
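UCS-4 range (hex.)    UTF-8 octet sequence (binary)
0000 0000-0000 007F   0xxxxxxx
0000 0080-0000 07FF   110xxxxx 10xxxxxx
0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
Note: this is the original definition of UTF8. The current standard (RFC 3629) limits UTF8 to code points up to 0010 FFFF, so only the forms of up to 4 octets are used in practice.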
Encoding steps:
1) First determine how many octets (8-bit bytes) are required.
2) Fill the high bits of each octet according to the template above.
3) Fill the character's bits into the x positions, from low to high: the character's lowest bit goes into the last x of the last octet, and its highest bit into the first x of the first octet.
4) Decoding follows the same principle in reverse.
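The steps above can be sketched in Python as follows. This is only an illustration of the template, not a production encoder: the function name utf8_encode is made up for this example, it accepts the whole UCS-4 range shown in the table, and it does not reject surrogates or code points above 10FFFF as a strict modern encoder would.

```python
def utf8_encode(code_point):
    """Encode one UCS-4 code point by filling the template by hand."""
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("code point outside the range covered by the template")
    # Step 1: decide how many octets are needed.
    if code_point <= 0x7F:
        return bytes([code_point])  # 0xxxxxxx, a single octet
    limits = [0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    count = 2 + next(i for i, top in enumerate(limits) if code_point <= top)
    # Step 3, trailing octets: take 6 bits at a time, starting with the
    # lowest bits, and put each group behind the fixed 10 prefix.
    tail = []
    for _ in range(count - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # Step 2, first octet: `count` leading 1 bits followed by a 0 bit,
    # with the remaining high bits of the character in the free positions.
    prefix = (0xFF << (8 - count)) & 0xFF
    return bytes([prefix | code_point] + tail[::-1])


print(utf8_encode(0x8D99).hex())
```

Because the low bits are collected first, the list of trailing octets is reversed before it is appended to the first octet.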
Example:
UCS-4                                UTF-8
Hex        Bin                Bytes  Bin                         Hex       Bytes
0000 000A  00001010           4      00001010                    0A        1
0000 0099  10011001           4      11000010 10011001           C2 99     2
0000 8D99  10001101 10011001  4      11101000 10110110 10011001  E8 B6 99  3
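The rows of this example can be checked against Python's built-in UTF8 codec. The helper name utf8_row is hypothetical and exists only for this sketch.

```python
def utf8_row(code_point):
    """Return (binary string, hex string, byte count) for one code point."""
    encoded = chr(code_point).encode("utf-8")
    binary = " ".join(f"{octet:08b}" for octet in encoded)
    hexed = " ".join(f"{octet:02X}" for octet in encoded)
    return binary, hexed, len(encoded)


for cp in (0x000A, 0x0099, 0x8D99):
    print(f"{cp:08X}", *utf8_row(cp))
```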
I don't know whether that was clear. If not, it doesn't really matter: you never have to do this calculation by hand, because programs handle it entirely.
Files stored in UTF8 format can be identified by the signature bytes EF BB BF at the start of the file (the signature is optional, so not every UTF8 file has it).
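A minimal sketch of checking for that signature in Python, using the standard codecs.BOM_UTF8 constant. The function name strip_utf8_signature is an assumption made for this example.

```python
import codecs


def strip_utf8_signature(data):
    """Return (had_signature, payload) for raw bytes read from a file."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# The "utf-8-sig" codec writes the signature in front of the text.
raw = "UTF8".encode("utf-8-sig")
print(" ".join(f"{octet:02X}" for octet in raw[:3]))
print(strip_utf8_signature(raw))
```

When reading text rather than raw bytes, decoding with "utf-8-sig" removes the signature if it is present and otherwise behaves like plain UTF8.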
Efficiency
The conclusions derived from the above coding principles are:
1. Each English letter or digit takes 1 byte;
2. Letters of the other European languages and of the Slavic alphabets take 2 bytes;
3. Chinese characters take 3 bytes.
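These three conclusions can be checked with a few sample characters. The characters below are picked for this sketch and are not from the original text.

```python
def utf8_length(char):
    """Number of bytes one character occupies in UTF8."""
    return len(char.encode("utf-8"))


# One sample per group: English letter, digit, accented Latin letter,
# Cyrillic letter, Chinese character.
for char in ("A", "7", "é", "Ж", "中"):
    print(char, utf8_length(char))
```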
Benefits of UTF8:
Data display: a web page can show text in any language, as long as the operating system supports Unicode and has the corresponding fonts. When the system encoding under Linux is UTF8, many needless Chinese-text problems go away, for example in MP3 players or GTK2.
Data exchange: no conversion between gb2312 and Big5 is required.
The well-known "Smarty" problem in PHP can also be solved neatly; this will be covered in a later article.
Disadvantages of UTF8:
Few Chinese websites use it, which hampers data exchange.
Each Chinese character takes three bytes, so a varchar column is sometimes not long enough.
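The varchar issue can be illustrated with a small check. This assumes a column whose length limit is counted in bytes, which depends on the database and its version. The function name fits_in_bytes and the sample string are made up for this sketch.

```python
def fits_in_bytes(text, limit):
    """True if text fits in a column limited to `limit` bytes of UTF8."""
    return len(text.encode("utf-8")) <= limit


word = "汉字"  # two Chinese characters
print(len(word.encode("gbk")), len(word.encode("utf-8")))
print(fits_in_bytes(word, 4))
```

A column sized for two-byte gb2312 or GBK text therefore needs about one and a half times the space once the same text is stored as UTF8.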
But overall the advantages outweigh the disadvantages, so keep using it.