1 ASCII code
The mapping between English characters and bit patterns is standardized by the ASCII code, which defines 128 characters in total. For example, the space character is 32 (binary 00100000) and capital A is 65 (binary 01000001). These 128 symbols (including the 32 non-printable control codes 0-31) use only the low 7 bits of a byte; the highest bit is always 0.
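The two codes quoted above can be checked with a short Python sketch. The helper name ascii_bits is made up for this illustration; ord and format are Python built-ins.

```python
def ascii_bits(ch: str) -> str:
    """Return the code of an ASCII character as an 8-bit binary string."""
    code = ord(ch)                 # numeric code of the character
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")     # zero-padded to one full byte

for ch in (" ", "A"):
    print(repr(ch), ord(ch), ascii_bits(ch))
```

Because every ASCII code is below 128, the leftmost digit of each printed byte is 0.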
2 Non-ASCII encoding
128 symbols are enough for English but not for other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. As a result, some European countries decided to use the unused highest bit of the byte for new symbols. For example, é is encoded as 130 (binary 10000010) in a French encoding. With all 8 bits in use, these encodings can represent up to 256 symbols.
However, different countries use different letters, so although all of these encodings have 256 symbols, the letters they represent are not the same. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, codes 0-127 represent the same symbols; only the range 128-255 differs.
Asian languages have far more symbols: there are as many as 100,000 Chinese characters, so several bytes are needed to represent one character. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character and can therefore represent at most 256 x 256 = 65536 symbols in theory. Note that although these encodings use multiple bytes per symbol, the GB family of Chinese character encodings is unrelated to the Unicode and UTF-8 described below.
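As a rough illustration, the same byte can be decoded under several legacy code pages in Python. The choice of cp437, cp862 and cp866 as stand-ins for the "French", "Hebrew" and "Russian" encodings is an assumption of this sketch; the text does not name specific code pages.

```python
raw = bytes([130])  # a single byte with value 130 (0x82)

# Assumed stand-ins: cp437 (Western European DOS), cp862 (Hebrew DOS),
# cp866 (Russian DOS). All three are codecs shipped with Python.
decoded = {name: raw.decode(name) for name in ("cp437", "cp862", "cp866")}
for name, symbol in decoded.items():
    print(name, f"U+{ord(symbol):04X}", symbol)

# Bytes below 128 decode to the same symbol in every one of these encodings.
same = {bytes([65]).decode(name) for name in decoded}
print(same)

# GB2312 stores a Chinese character in two bytes.
gb = "中".encode("gb2312")
print(len(gb), gb.hex())
```

The first loop prints three different symbols for the same byte, which is exactly the ambiguity described above.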
3 Unicode
Because there are many encodings in the world, the same binary number can be interpreted as different symbols. This is why e-mail is often garbled: the sender and the recipient use different encodings. Unicode is a single encoding for all symbols: it incorporates every symbol in the world and gives each one a unique code. It is a very large set that can hold more than one million symbols. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 ("strict"). For the full symbol table, see http://www.unicode.org/ or a dedicated Chinese character correspondence table.
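In Python, these codes can be looked up with the built-ins ord and chr. The helper name code_point below is made up for this sketch.

```python
def code_point(ch: str) -> str:
    """Format the Unicode code of a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"

# The three examples from the text: Arabic Ain, capital A, Chinese 严.
for ch in ("\u0639", "A", "\u4e25"):
    print(code_point(ch), ch)

# chr() goes the other way, from code to character.
print(chr(0x4E25))
```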
Problems with Unicode
Unicode is only a character set: it assigns each symbol a binary code, but it does not specify how that code is stored. For example, the Unicode code of the Chinese character 严 ("strict") is hexadecimal 4E25, which is 15 bits in binary (100111000100101), so storing it takes at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes. This raises two problems:
1 How can Unicode be distinguished from ASCII? How does the computer know that 3 bytes represent one symbol rather than 3 separate symbols?
2 Space is easily wasted. An English letter needs only one byte. If Unicode uniformly required every symbol to be stored in 3 or 4 bytes, each English letter would be preceded by 2 or 3 bytes of zeros, which wastes storage.
As a result, multiple ways of implementing Unicode appeared.
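The space problem can be made concrete with a small sketch. The helper min_bytes is a made-up name, and the 4-byte big-endian layout is only one possible fixed-width scheme, chosen here for illustration.

```python
def min_bytes(code: int) -> int:
    """Smallest number of whole bytes that can hold the code."""
    return max(1, (code.bit_length() + 7) // 8)

for ch in ("A", "\u4e25"):          # capital A and the character 严
    code = ord(ch)
    fixed = code.to_bytes(4, "big")  # hypothetical fixed 4-byte storage
    wasted = 4 - min_bytes(code)
    print(f"U+{code:04X}: needs {min_bytes(code)} byte(s), "
          f"fixed form {fixed.hex()}, {wasted} byte(s) of zero padding")
```

For the letter A, three of the four stored bytes are zero, which is the waste described in point 2.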
4 UTF-8
UTF-8 is the most widely used Unicode implementation on the Internet. Other implementations include UTF-16 (2 or 4 bytes per character) and UTF-32 (4 bytes per character). UTF-8 is a variable-length encoding: it uses 1 to 4 bytes per symbol, depending on the symbol.
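The byte counts of the three implementations can be compared with Python's built-in codecs. The helper name encoded_sizes and the sample characters are choices made for this sketch; the "-be" codec variants are used so that no byte order mark is added to the output.

```python
def encoded_sizes(ch: str) -> tuple:
    """Bytes used by one character in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-be", "utf-32-be"))

# Capital A, e with acute accent, the character 严, and an emoji (U+1F600).
samples = ["A", "\u00e9", "\u4e25", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 grows from 1 to 4 bytes across the samples, UTF-16 uses 2 or 4, and UTF-32 always uses 4.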
The UTF-8 encoding rules are:
1 For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits are the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding and the ASCII code are identical.
2 For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode code.
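The rules are summarized in the following table, where the letter x marks the bits available for the code:

Unicode code range (hexadecimal)  | UTF-8 encoding (binary)
----------------------------------+-------------------------------------
0000 0000-0000 007F               | 0xxxxxxx
0000 0080-0000 07FF               | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF               | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF               | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx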
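The range lookup can be sketched as a small function. The name utf8_length is made up for this illustration; the boundaries 7F, 7FF, FFFF and 10FFFF are the standard UTF-8 range limits.

```python
def utf8_length(code: int) -> int:
    """Number of bytes UTF-8 uses for a Unicode code."""
    if code < 0 or code > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code <= 0x7F:
        return 1    # 0xxxxxxx
    if code <= 0x7FF:
        return 2    # 110xxxxx 10xxxxxx
    if code <= 0xFFFF:
        return 3    # 1110xxxx 10xxxxxx 10xxxxxx
    return 4        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

# Compare against Python's own encoder for a few sample codes.
for code in (0x41, 0xE9, 0x4E25, 0x1F600):
    print(f"U+{code:04X}", utf8_length(code), len(chr(code).encode("utf-8")))
```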
Take the Chinese character 严 ("strict") as an example of how UTF-8 encoding is carried out.
The Unicode code of 严 is 4E25 (binary 100111000100101). This falls in the third range of the summary table (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 needs 3 bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last binary digit of 严, fill the x positions of the format from back to front, and pad any leftover x positions with 0. The resulting UTF-8 encoding of 严 is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
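The whole procedure can be written out as a minimal sketch. The function name utf8_encode is made up here, and the sketch skips some checks a production encoder performs (for example, it does not reject the surrogate range D800-DFFF); in practice Python's built-in str.encode("utf-8") should be used instead.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one Unicode code as UTF-8 by filling bits from the back."""
    if code < 0 or code > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code <= 0x7F:
        return bytes([code])              # 0xxxxxxx, same as ASCII
    if code <= 0x7FF:
        n, lead = 2, 0b11000000           # 110xxxxx
    elif code <= 0xFFFF:
        n, lead = 3, 0b11100000           # 1110xxxx
    else:
        n, lead = 4, 0b11110000           # 11110xxx
    out = []
    for _ in range(n - 1):
        # Take the lowest 6 bits and put them into a 10xxxxxx byte.
        out.append(0b10000000 | (code & 0b00111111))
        code >>= 6
    out.append(lead | code)               # remaining bits go into the first byte
    return bytes(reversed(out))           # bytes were built back to front

encoded = utf8_encode(0x4E25)
print(" ".join(format(b, "08b") for b in encoded), encoded.hex())
```

For 4E25 this prints the three bytes worked out above, and the result matches what chr(0x4E25).encode("utf-8") returns.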