If you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for six months in a submarine.
This vicious threat was first made by Joel Spolsky ten years ago. Unfortunately, many people took it as a joke, so there are still plenty of programmers who don't really understand Unicode, or the difference between Unicode, UTF-8, and UTF-16. That is why I wrote this article.
Let's start with a scenario: one sunny afternoon, you receive an email from a friend you lost touch with after high school, with a .txt (so-called plain text) attachment. The attachment contains the following string of bits:
0100100001000101010011000100110001001111
The body of the email is empty, which makes it even more mysterious. Before you launch your favorite text editor to open the attachment, have you ever wondered how a text editor turns binary into characters? There are two key questions:
1. How are the bytes grouped? (for example, 1-byte characters versus 2-byte characters)
2. How do one or more bytes map to a character?
The answers to these questions are defined by the character encoding. Roughly speaking, an encoding defines two things:
1. How bytes are grouped, for example into units of 8 or 16 bits. These units are called code units.
2. The mapping between code units and characters. For example, in ASCII, the decimal value 65 maps to the letter A, as the short sketch below illustrates.
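To make that mapping concrete, here is a minimal sketch in Python (nothing library-specific; ord and chr are built in):

print(ord('A'))   # 65  -> the character 'A' corresponds to the code 65
print(chr(65))    # A   -> the code 65 corresponds back to the character 'A'
print(ord('~'))   # 126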
Strictly speaking, a character encoding and a character set are slightly different things, but the distinction usually doesn't matter to you unless you are designing a low-level library.
ASCII was one of the most popular encoding systems of the last century, at least in the West. The figure below shows how ASCII code units map to characters.
One common misconception, even among experienced programmers, is that plain text uses ASCII and that every character is 8 bits.
The fact is, there is no such thing as "plain text." If you have a string in memory or on your hard disk and you don't know its encoding, you cannot interpret or display it. There is simply no way around this.
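A small Python sketch makes the point; the four bytes below are made up purely for illustration, and what you get out depends entirely on which encoding you assume when reading them:

data = bytes([0x48, 0x00, 0x49, 0x00])   # four bytes of unknown origin
print(data.decode('utf-16-le'))          # HI -- read as 2-byte code units
print(repr(data.decode('latin-1')))      # 'H\x00I\x00' -- read as 1-byte code units
# Without knowing the encoding, neither reading is more "correct" than the other.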
So when the attachment you just received doesn't specify its encoding, how does the computer interpret it? Does that mean you will never read what your old friend wanted to say to you? Before we find the answer, let's travel back in time, to the age when the biggest hard drive money could buy held 29 MB.
Historical Review
A long time ago, every computer manufacturer had its own way of representing characters. They didn't have to worry about talking to other computers, and each invented its own way of rendering glyphs to the screen. As computers became more common and competition between manufacturers grew fiercer, moving data between different computer systems became so painful that people got tired of the chaos all this custom encoding caused.
Eventually, the manufacturers worked out a standard way to describe characters: each character is defined using the low 7 bits of a byte, and a fixed table maps each 7-bit value to a character, as shown in the illustration above. For example, the letter A is 65, c is 99, ~ is 126, and so on. Thus ASCII was born. The original ASCII standard defined characters from 0 to 127, which is everything seven bits can represent. But this happy state of affairs didn't last long...
Why choose 7 bits instead of 8 to represent a character? It doesn't really matter here. But a byte is 8 bits, which means one bit goes unused: the codes from 128 to 255 are not covered by the ASCII standard at all. The Americans, at that point, simply weren't thinking about the rest of the world.
People in other countries took this opportunity and started using codes 128 to 255 for the characters of their own languages. For example, 144 is گ in the Arabic flavor of extended ASCII, while in the Russian flavor it is ђ. Even in the United States there were all kinds of uses for the unused range: the IBM PC shipped with an "OEM font", also called "extended ASCII", that gave users pretty graphical characters for drawing text boxes and supported some European characters, such as the pound sterling (£) symbol.
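You can still see this today. Here is a quick sketch using code-page tables bundled with Python; Windows-1256 and Windows-1251 stand in for the Arabic and Russian variants, which is an approximation rather than an exact match for the old OEM sets:

b = bytes([144])            # one and the same byte, decimal 144 (hex 0x90)
print(b.decode('cp1256'))   # گ  -- under the Arabic code page
print(b.decode('cp1251'))   # ђ  -- under the Cyrillic code page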
So again, the problem with ASCII is that although everyone agreed on what the numbers 0-127 mean, there were many conflicting interpretations of the numbers 128-255. You have to tell the computer which flavor of extended ASCII you mean for those characters to display correctly.
This wasn't much of a problem in North America and the British Isles, because the Latin alphabet displays the same no matter which flavor of ASCII is used. The British did have the problem that original ASCII contains no pound sign, but never mind that.
At the same time, Asia had far more troubling problems. Asian languages have more characters and glyphs than a single byte can hold, so they started using two bytes per character, a scheme called DBCS (double-byte character set). With DBCS, string manipulation becomes a real pain: should you do str++ or str--?
These problems became a nightmare for system developers. For example, MS-DOS had to support every flavor of extended ASCII, because Microsoft wanted to sell its software in other countries. That's where the concept of "code pages" came from. For example, you tell DOS (using the "CHCP" command) that you want the Bulgarian code page in order to display the Bulgarian alphabet. Switching the code page applies to the whole system, which is a problem for anyone working in multiple languages, because they have to switch back and forth between several code pages all the time.
While code pages were a workable idea, they were not a clean solution; they were just a hack, a quick fix to keep the encoding system limping along.
Into the world of Unicode
In the end, the Americans realized that they needed a standard scheme for representing all the characters of all the world's languages, to ease programmers' pain and to stop a Third World War from breaking out over character encoding. For this purpose, Unicode was born.
The idea behind Unicode is very simple, yet it is widely misunderstood. Unicode is like a telephone book that records the mapping between characters and numbers. Joel calls them "magic numbers", because they may as well be randomly assigned and come with no explanation. The official term is code point, and a code point is always written starting with U+. In theory, every character in every language is assigned a magic number by the Unicode consortium. For example, the first letter of the Hebrew alphabet, א, is U+05D0, and the letter A is U+0041.
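In Python, ord() hands you exactly this code point, independent of any byte representation; a quick check of the two examples above:

print(hex(ord('א')))   # 0x5d0 -> U+05D0, the Hebrew letter alef
print(hex(ord('A')))   # 0x41  -> U+0041, the Latin capital letter A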
Unicode says nothing about how characters are represented in bytes. It only specifies the number corresponding to each character, and that's it.
Other misconceptions about Unicode include: that Unicode supports at most 65,536 characters, and that Unicode characters must occupy two bytes. Whoever told you these things should go have their head examined.
Remember, Unicode is just a standard for mapping characters to numbers. It does not limit the number of characters that can be supported, nor does it require a character to take up two, three, or any particular number of bytes.
How Unicode characters are encoded into bytes in memory is a separate topic, and it is defined by the UTFs (Unicode Transformation Formats).
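For example, here is a minimal Python sketch: one and the same code point, U+4E2D (中), turns into a different byte sequence under each transformation format:

ch = '\u4e2d'                        # the single code point U+4E2D
print(ch.encode('utf-8').hex())      # e4b8ad   -> 3 bytes in UTF-8
print(ch.encode('utf-16-be').hex())  # 4e2d     -> 2 bytes in UTF-16
print(ch.encode('utf-32-be').hex())  # 00004e2d -> 4 bytes in UTF-32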
Unicode encoding
The two most popular Unicode encoding schemes are UTF-8 and UTF-16. Let's take a look at their details.
UTF-8
UTF-8 is a brilliant idea: it beautifully achieves backwards compatibility with ASCII, which is what made Unicode acceptable to the public. The person who invented it deserves at least a Nobel Peace Prize.
In UTF-8, characters 0 through 127 are represented with a single byte, using the same values as US-ASCII. This means documents written in the 1980s open with no problems as UTF-8. Only characters 128 and above are represented with 2, 3, or 4 bytes. Because of this, UTF-8 is called a variable-length encoding.
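A rough Python sketch of that variable width (the characters are arbitrary examples):

for ch in ['A', 'é', '中', '😀']:
    encoded = ch.encode('utf-8')
    print(ch, len(encoded), encoded.hex())
# A  1 41        -> the same single byte as US-ASCII
# é  2 c3a9
# 中 3 e4b8ad
# 😀 4 f09f9880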
Back to the beginning of the article: the byte stream in your old friend's attachment is as follows:
0100100001000101010011000100110001001111
This byte stream represents the same characters in both ASCII and UTF-8: HELLO
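You can verify this yourself; a quick Python sketch that regroups the bits into bytes and decodes them:

bits = '0100100001000101010011000100110001001111'
data = bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))
print(data.decode('ascii'))   # HELLO
print(data.decode('utf-8'))   # HELLO -- identical, because every byte is below 128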
UTF-16
Another popular variable-length encoding scheme is UTF-16, which uses either 2 or 4 bytes to store each character. However, people are increasingly aware that UTF-16 can waste storage space, but that's another topic.
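In concrete terms (a small Python sketch): characters inside the Basic Multilingual Plane take two bytes in UTF-16, and anything outside it takes four, stored as a surrogate pair:

print(len('A'.encode('utf-16-le')))    # 2 -- twice what UTF-8 needs for plain ASCII
print(len('中'.encode('utf-16-le')))   # 2
print(len('😀'.encode('utf-16-le')))   # 4 -- a surrogate pair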
Little-endian and big-endian byte order
Endian is pronounced "end-ian" (like "Indian"). The term comes from Gulliver's Travels. (In the novel, the Lilliputians argue over whether a boiled egg should be cracked open from the big end or the little end, and the two sides of the controversy are called the "Big-Endians" and the "Little-Endians".)
Little-endian and big-endian are simply two conventions for how a group of bytes (called a word) is stored and read in memory. It means that when you ask the computer to put the letter A (two bytes in UTF-16) into memory, the byte-order scheme in use determines whether the first byte is placed before or after the second byte. This is easier to see with an example: if you save your friend's attachment using UTF-16, its second half may look like this on different systems:
00 4C 00 4C 00 4F (big-endian: the high byte of each code unit comes first)
4C 00 4C 00 4F 00 (little-endian: the low byte of each code unit comes first)
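Both layouts are easy to reproduce; a quick Python sketch for the "LLO" half of the attachment:

be = 'LLO'.encode('utf-16-be')
le = 'LLO'.encode('utf-16-le')
print(' '.join(f'{b:02X}' for b in be))   # 00 4C 00 4C 00 4F
print(' '.join(f'{b:02X}' for b in le))   # 4C 00 4C 00 4F 00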
Which byte order to use is simply a matter of preference for microprocessor architects; for example, Intel uses little-endian while Motorola uses big-endian.
Byte order mark (BOM)
If you often move documents between big-endian and little-endian systems and want their byte order to be recognized, there is another odd convention, called a BOM. A BOM is a cleverly chosen character placed at the beginning of a document to tell the reader the document's byte order. In UTF-16 it is the character U+FEFF; depending on the document's byte order, it appears as the bytes FF FE or FE FF, which tells the interpreter unambiguously which order the document uses.
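Python's codecs module exposes these marker bytes directly, so a small sketch can show both BOM values and a decoder making use of one:

import codecs
print(codecs.BOM_UTF16_BE.hex())   # feff -- a big-endian document starts with FE FF
print(codecs.BOM_UTF16_LE.hex())   # fffe -- a little-endian document starts with FF FE
data = codecs.BOM_UTF16_LE + 'A'.encode('utf-16-le')
print(data.decode('utf-16'))       # A -- the decoder reads the BOM, picks the right order, and strips it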
The BOM, although useful, is not terribly elegant, because there is a similar, older concept called the "magic byte" that has been used to indicate a file's format for many years. The relationship between the BOM and magic bytes has never been clearly defined, so some interpreters confuse them.
Congratulations on reading this far; you must be a very patient reader.
Remember the question at the beginning of the article? Since there is no such thing as a "plain text" file, why do your text editor and browser display the content correctly every time? The answer is that the software lies to you, which is why so many people know nothing about encodings. When the software isn't sure of the encoding, it guesses. Most of the time it guesses UTF-8 (which covers ASCII) or ISO-8859-1, and it may try any other character set it can think of. Because the Latin alphabet used in English looks the same in almost every character set, including UTF-8, English text still looks right even when the guess is wrong.
However, if you see garbled symbols while browsing a web page, it means the page's encoding is not the one your browser guessed. You can then use the browser's View -> Character Encoding menu to try different encodings.
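Here is a tiny Python sketch of what such a wrong guess looks like; the string is an arbitrary example:

page = 'café'.encode('utf-8')      # the server actually sent UTF-8
print(page.decode('utf-8'))        # café  -- the right guess
print(page.decode('iso-8859-1'))   # cafÃ© -- the wrong guess: classic mojibake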
Summary
If you didn't have time to read the whole article, or you only skimmed the earlier sections, then please make sure you take away the following:
There is no such thing as plain text. If you want to read a string, you have to know its encoding.
Unicode is a simple standard that maps characters to numbers. The Unicode consortium takes care of all the behind-the-scenes work, including assigning code points to new characters.
Unicode does not tell you how characters are encoded into bytes. That is decided by the encoding scheme, as specified by the UTFs.
And the most important:
Always remember to explicitly declare the encoding of your documents, via a Content-Type header or a meta charset tag. Then the browser doesn't have to guess which encoding you are using, and it will render the document with exactly the encoding you specified.