From ASCII to UTF-8: A Big Talk on Encoding
The story goes that the Americans cooked up a code called ASCII that represents one character in 8 bits,
solving the problem of storing human language on a computer.
The group was a little naive back then: they cared only about English, digits, and a few simple symbols,
and never gave a thought to how Chinese, Latin, or Tibetan characters would be stored.
As computers became more and more widespread, the problem grew increasingly acute; you can hardly make
everyone in the world use English. So two organizations, ISO on one side and Unicode on the other,
each came up with a solution...
Unicode's approach was fairly simple. One byte isn't enough? Then use two bytes per character; 65,536 slots should do, right?
That is how Unicode 1.0 was done.
ISO, on the other hand, was the grand one; its decision makers were evidently not troubled by burning a few extra kilobytes of memory.
One byte isn't enough? Then use four; surely that will last another hundred years?
That is the scheme called UCS-4.
As ever stranger scripts kept needing a place in Unicode, the Unicode decision makers started to sweat:
who knew there were so many strange letters? Rather than widen every character, they bolted on 4 extra bits, storing the newcomers as "2 bytes + 4 bits".
Those 4 extra bits of headroom can hold a great many strange characters...
That was the prototype of Unicode 2.0.
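To make the "2 bytes + 4 bits" idea concrete, here is a minimal Python sketch (my own illustration, not part of any spec) that checks which characters fit in Unicode 1.0's 16-bit space; the last one, Old Italic "𐌀", is exactly the kind of strange letter that needed the extra bits:

    # Which characters fit in Unicode 1.0's two-byte (16-bit) space?
    for ch in ("A", "中", "𐌀"):   # ASCII, CJK (in the BMP), Old Italic (beyond U+FFFF)
        cp = ord(ch)
        print(f"U+{cp:04X} fits in 16 bits: {cp <= 0xFFFF}")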
Now there were two different encodings on the table. Which one should everyone use?
So the Unicode organization and the ISO organization reached an agreement, the you-have-me, I-have-you kind:
UCS-4 keeps its 32-bit coding space but actually uses only about 20 bits of it, staying consistent with Unicode, while Unicode itself is left unmodified.
That is the offspring of the shotgun marriage between UCS-4 and Unicode 2.0.
Later, in August 2000, the Unicode staff, as if to prove they were not eating for free,
simply revised the Unicode 2.0 document and published it as Unicode 3.0, without adding a single new character!!!
(In fact, about a dozen living languages and dozens of ancient ones, such as the Yima language, the Linear B script of ancient Greece,
and Old Persian cuneiform, were still unsupported.)
With that, the encoding scheme was unified. Next, it was the programmers' turn to flip out.
And the programmers' anger made sense. Take a 100-character English article: with ASCII
encoding it needs only 100 bytes, but if there is even one strange character in it and you are forced to use UCS-4,
that's 400 bytes! A disaster for early programmers... and even today, with bandwidth still limited,
it is a very real issue.
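A quick sketch of that arithmetic in Python (utf-32-be here stands in for UCS-4 without a byte-order mark):

    # A 100-character English text: ASCII vs UCS-4.
    text = "x" * 100
    print(len(text.encode("ascii")))      # 100 bytes, one per character
    print(len(text.encode("utf-32-be")))  # 400 bytes, four per character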
As a result, the IETF introduced the UTF-8 and UTF-16 schemes (UTF-32 is so rarely used that we will ignore it).
UTF-8 is actually the cleverest encoding of the lot. In short, there are three rules:
(1) ASCII codes stay exactly as they are, represented in 1 byte.
(2) If one byte is not enough, use two bytes.
(3) If two are not enough, use three. What, three still not enough?
No way: three bytes already cover the entire Basic Multilingual Plane, and the handful of characters beyond it take four. Are you an alien?
This brings two benefits:
(1) Platform independence: write a novel in UTF-8 on Windows, and people on UNIX can still read it.
(2) Built-in flag bits: if one character's bytes are damaged, the neighboring characters are unaffected (both properties show up in the sketch below).
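A minimal sketch of the rules and the flag bits, using Python's built-in codec:

    # UTF-8 in action: 1 byte for ASCII, 2 or 3 bytes for other BMP characters.
    for ch in ("A", "é", "中"):
        b = ch.encode("utf-8")
        print(f"{ch} -> {len(b)} byte(s): {b.hex(' ')}")
    # Every continuation byte starts with the bits '10'; that is the flag
    # which lets a decoder resynchronize after one damaged character.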
UTF-16, meanwhile, was prepared for programmers who want things foolproof. In short, there are two rules:
(1) The characters from Unicode 1.0 are copied over verbatim, each using two bytes.
(2) Unicode 2.0 is copied over as well: its 20-bit characters are handled with that "2 bytes + 4 bits" trick, as a pair of 2-byte units.
But doesn't that bring a couple of disadvantages?
(1) Because its code unit is wider than the computer's native word, the 8-bit byte, the byte layout of UTF-16
depends on the CPU and operating system (byte order), which is bad for interchange.
(2) Certain special characters can trip up software that processes text byte by byte.
(3) The two points above alone are enough for a death sentence.
Still, UTF-16 really is the most space-saving option: it is compactly coded, with none of UCS-4's long runs of zero bytes...
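A small sketch of both rules, and of disadvantage (1), again with Python's built-in codecs:

    # UTF-16: a BMP character takes 2 bytes; one beyond U+FFFF becomes a
    # surrogate pair (4 bytes). Note how the bytes flip between LE and BE.
    for ch in ("中", "𐌀"):
        le = ch.encode("utf-16-le")
        be = ch.encode("utf-16-be")
        print(f"U+{ord(ch):04X}: LE = {le.hex(' ')}   BE = {be.hex(' ')}")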
In fact, the IETF preferred UTF-8 as the de facto standard (RFC 2279),
and standardized UTF-16 mostly to save face for ISO and Unicode (RFC 2781).
In practice, thanks to its excellent properties, UTF-8 has been widely recognized and used.
XML, for example: the XML 1.0 specification (Second Edition) makes a point of this.
When an XML document's encoding attribute is not specified,
UTF-8 is used for encoding and decoding.
(Although I strongly recommend that you always specify the encoding attribute.)
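A minimal sketch of that fallback with Python's standard-library parser (the document content here is made up for illustration):

    import xml.etree.ElementTree as ET

    # No encoding attribute in the declaration: per XML 1.0, the parser
    # falls back to UTF-8 (or UTF-16 when a byte-order mark is present).
    doc = '<?xml version="1.0"?><greeting>你好</greeting>'.encode("utf-8")
    print(ET.fromstring(doc).text)  # 你好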
OK, that's it! Feel free to throw your tomatoes and eggs now.