The origin of ASCII
After the invention of the computer, people needed a way to represent characters, so they developed an encoding called ASCII. ASCII uses only 7 of the 8 bits in a byte, covering the range 0x00-0x7F, a total of 128 characters.
Later it was found that characters were still missing, for example the box-drawing characters needed to print text in tabular form. The definition of ASCII was therefore extended to use all 8 bits of a byte for a character; this is called extended ASCII, and its range is 0x00-0xFF, a total of 256 characters.
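As a minimal sketch in Python (not part of the original explanation), the 7-bit limit can be checked directly: 'A' sits inside 0x00-0x7F and encodes fine, while an accented letter such as 'é' falls outside that range and cannot be encoded as plain ASCII.

    # A minimal sketch of the 7-bit ASCII range.
    print(hex(ord("A")))           # 0x41 -- inside 0x00-0x7F
    print("A".encode("ascii"))     # b'A' -- one byte
    try:
        "é".encode("ascii")        # 0xE9 is outside the 7-bit range
    except UnicodeEncodeError as err:
        print("not ASCII:", err)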
A detailed introduction to Unicode
1. The ambiguity that "two bytes" easily causes
The first version of Unicode represented all characters with two bytes (16 bits). This is easy to misunderstand: we tend to assume that "two bytes" means a character is always saved on the computer as two bytes, and therefore that any character stored in Unicode occupies two bytes. That is actually wrong.
In fact, Unicode involves two steps. The first is to define a standard that gives every character a unique number; this is a purely mathematical problem and, you could say, has nothing to do with the computer. The second step is how to save each character's number on the computer, and that does depend on how many bytes it actually occupies in the computer's storage.
So we can also put it this way: Unicode uses the numbers 0 to 65535 to represent all characters, and the 128 numbers from 0 to 127 represent exactly the same characters as ASCII (65536 is 2 to the 16th power). That is the first step. The second step is how to turn the numbers 0 to 65535 into strings of 0s and 1s saved on the computer, and there is more than one way to do so. This is where UTF (Unicode Transformation Format) comes in, with UTF-8 and UTF-16.
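A minimal Python sketch of these two steps: ord() gives the number assigned to a character (step one), and encode() turns that number into the bytes actually saved (step two); different UTFs produce different bytes for the same number.

    # Step 1: the character's number. Step 2: the bytes used to store it.
    text = "A"
    print(ord(text))                       # 65 -- the same number as in ASCII
    print(text.encode("utf-8").hex())      # 41   -- one byte under UTF-8
    print(text.encode("utf-16-be").hex())  # 0041 -- two bytes under UTF-16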
2. The difference between UTF-8 and UTF-16
UTF-16 is easy to understand: the number corresponding to any character is saved as two bytes. Our common misconception about Unicode is to equate Unicode with UTF-16. But clearly, if the text is all English letters, this is somewhat wasteful: why use two whole bytes when one byte is enough to express the character?
Then there is UTF-8. The 8 here easily misleads people: it does not mean one byte per character. So does one byte represent one character? Not necessarily. With UTF-8 the length of a character is variable: it may take one byte, or two, or up to three (for the numbers in this 0-65535 range). In other words, UTF-8 represents characters in units of bytes, and the actual size changes dynamically, determined by how large the character's number is.
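A short Python sketch (the sample characters are only illustrations) makes this variable length visible: each character below takes a different number of bytes in UTF-8, while UTF-16 uses two bytes for every one of them.

    # UTF-8 length changes with the character; UTF-16 stays at two bytes here.
    for ch in ["A", "é", "汉"]:
        print(ch, hex(ord(ch)),
              len(ch.encode("utf-8")),      # 1, 2, 3 bytes respectively
              len(ch.encode("utf-16-le")))  # 2 bytes each ("-le" avoids a BOM)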
So the pros and cons of UTF-8 and UTF-16 are easy to see. If the text is all English, or a mixture in which English makes up the vast majority, UTF-8 saves a lot of space compared with UTF-16. If the text is all Chinese (or similar characters), or a mixture in which Chinese makes up the vast majority, UTF-16 has the advantage and can save a lot of space. Of course, there is also the question of fault tolerance.
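As a rough Python sketch of this trade-off (the sample strings are arbitrary): a mostly-English text is smaller in UTF-8, while a Chinese text is smaller in UTF-16.

    # Byte counts for the same text under the two encodings.
    english = "Hello, world"
    chinese = "汉字编码"   # four Chinese characters, chosen only as an example
    for s in (english, chinese):
        print(s, len(s.encode("utf-8")), len(s.encode("utf-16-le")))
    # English: 12 bytes in UTF-8 vs 24 in UTF-16 -- UTF-8 wins
    # Chinese: 12 bytes in UTF-8 vs  8 in UTF-16 -- UTF-16 wins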
For example, the Chinese character "汉" (Han) corresponds to the Unicode number 6C49 (this is hexadecimal; in decimal it is 27721. Why not use decimal? Simply because hexadecimal is shorter; the two are equivalent, just like 60 minutes and 1 hour). You might ask: when a program opens a file, how does it know whether the file is UTF-8 or UTF-16? Naturally there is a marker: a few bytes at the beginning of the file act as a signature.
EF BB BF represents UTF-8
FE FF represents UTF-16 (big-endian; FF FE means little-endian).
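A minimal Python sketch of how a program might use these leading bytes (the file name is only a placeholder, and real tools handle more cases):

    # Guess the encoding from the byte order mark (BOM) at the start of a file.
    with open("example.txt", "rb") as f:   # "example.txt" is a placeholder
        head = f.read(3)
    if head.startswith(b"\xef\xbb\xbf"):
        print("UTF-8 with BOM")
    elif head.startswith(b"\xfe\xff"):
        print("UTF-16, big-endian")
    elif head.startswith(b"\xff\xfe"):
        print("UTF-16, little-endian")
    else:
        print("no BOM -- the encoding has to be guessed some other way")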
Things about Unicode and ASCII