-- Unicode must be mentioned separately.
As in China, when computers spread to countries around the world, each designed and implemented encoding schemes similar to GB2312/GBK/GB18030/Big5 to suit its local languages and scripts. This causes no problems in local use. Once the text travels over the network, however, the encodings are incompatible and mutual access produces garbled text.
To solve this problem, a great idea was born: Unicode. The Unicode encoding system is designed to express every character of every language. It represents each letter, symbol, or ideograph as a 4-byte number. Each number represents a unique symbol used in at least one language. (Not all numbers are used, but more than 65535 of them are, so 2 bytes are not enough.) Characters shared by several languages generally get the same number, unless there is a good etymological reason not to. As a result, each character corresponds to exactly one number, and each number corresponds to exactly one character; there is no ambiguity. You no longer need to keep track of "modes". U+0041 always represents 'A', even if your language has no 'A' in it.
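The one-to-one mapping between numbers and characters can be checked in Python 3, whose built-in `ord` and `chr` convert between a character and its number (code point). This is only an illustrative sketch; the sample characters are arbitrary choices, not taken from the text above.

```python
# Each character maps to exactly one number (code point), and back again.
assert ord("A") == 0x0041
assert chr(0x0041) == "A"

# An ideograph has its own number as well (sample character chosen arbitrarily).
zhong = "中"
print(f"U+{ord(zhong):04X}")  # U+4E2D

# Some assigned numbers lie above 65535, so 2 bytes cannot cover them all.
emoji = chr(0x1F600)
print(ord(emoji) > 65535)  # True
```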
In computer science, Unicode (also rendered as Unified Code, Universal Code, or Single Code) is an industry standard that enables computers to represent the text of dozens of the world's writing systems. Unicode is developed on the basis of the Universal Character Set standard and is also published in book form [1]. Unicode keeps expanding, and each new version adds more characters. As of the sixth edition, Unicode contains more than 100,000 characters (the 100,000th character was accepted in 2005). The standard comprises a set of code charts for visual reference, a set of encoding methods, a set of standard character encodings, and an enumeration of character properties such as superscript and subscript. The Unicode Consortium, a non-profit organization, leads the ongoing development of Unicode. Its goal is to replace existing character encoding schemes with Unicode, particularly because the existing schemes offer only limited space and are incompatible with one another in multilingual environments.
(One way to understand this: Unicode is a character set, while UTF-32, UTF-16, and UTF-8 are three character encoding schemes for it.)
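The distinction between the character set and its encoding schemes can be illustrated with Python 3's built-in codecs. The sketch below encodes the same two characters three ways. The sample string is an arbitrary choice, and the big-endian codec names are used only so that no byte order mark is prepended.

```python
text = "A中"  # one ASCII letter and one CJK ideograph (arbitrary sample)

# The character set assigns each character a number (code point).
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Each encoding scheme turns those same numbers into different bytes.
utf32 = text.encode("utf-32-be")
utf16 = text.encode("utf-16-be")
utf8 = text.encode("utf-8")

print(code_points)             # ['U+0041', 'U+4E2D']
print(utf32.hex(), len(utf32))  # 0000004100004e2d 8
print(utf16.hex(), len(utf16))  # 00414e2d 4
print(utf8.hex(), len(utf8))    # 41e4b8ad 4
```

The code points are identical in all three cases; only the byte representation changes.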
3.1. UCS & Unicode
The Universal Character Set (UCS) is the standard character set defined by ISO 10646 (or ISO/IEC 10646), a standard developed by ISO. Historically, two independent organizations tried to create a single character set: the International Organization for Standardization (ISO) and the Unicode Consortium, which was formed by multilingual software manufacturers. The former developed the ISO/IEC 10646 project, and the latter developed the Unicode project. As a result, two different standards were developed at first.
Around 1991, participants in both projects realized that the world does not need two incompatible character sets. They began to merge their results and to work together on a single code table. Since Unicode 2.0, Unicode has used the same character repertoire and code points as ISO 10646-1. ISO has also promised that ISO 10646 will not assign any UCS-4 code value beyond U+10FFFF, so that the two remain consistent. Both projects still exist and publish their standards independently. However, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep their code tables compatible and to coordinate closely on any future extension. In its published code charts, Unicode generally uses the most common typeface for the script in question, whereas ISO 10646 uses the Century typeface wherever possible.
3.2. UTF-32
The scheme described above, in which each letter, symbol, or ideograph is expressed as a 4-byte number and each number represents a unique symbol used in at least one language, is called UTF-32. UTF-32, also called UCS-4, is a Unicode character encoding that uses 4 bytes for every character. In terms of space, it is very inefficient.
This method does have its advantages. The most important is that the Nth character of a string can be located in constant time, because the Nth character starts at byte 4×N. Yet even though every character occupies a fixed number of bytes, UTF-32 is not as widely used as the other Unicode encodings.
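The constant-time lookup can be sketched in Python 3 as follows. The helper name `nth_character` and the sample string are hypothetical, introduced only for illustration. The sketch assumes big-endian UTF-32 data without a byte order mark and counts characters from 0.

```python
import struct

def nth_character(utf32_bytes: bytes, n: int) -> str:
    """Return the n-th character (0-based) of big-endian UTF-32 data."""
    offset = 4 * n  # every character occupies exactly 4 bytes
    # ">I" reads one big-endian unsigned 32-bit integer at that offset.
    (code_point,) = struct.unpack_from(">I", utf32_bytes, offset)
    return chr(code_point)

data = "hello, 世界".encode("utf-32-be")
print(len(data))               # 36 (9 characters × 4 bytes)
print(nth_character(data, 7))  # 世
```

No scanning of the preceding characters is needed: the byte offset is computed directly from the index.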
3.3. UTF-16
Although Unicode contains a great many characters, most people never actually use anything beyond the first 65535. Hence there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes). UTF-16 encodes every character in the range 0-65535 as 2 bytes; if you really need to express the rarely used Unicode characters beyond that range (the "astral plane"), it resorts to some trickery. The most obvious advantage of UTF-16 is that it is twice as space-efficient as UTF-32, because every character needs only 2 bytes of storage (except those outside the 65535 range) instead of the 4 bytes UTF-32 requires. In addition, if a string contains no astral-plane characters, we can still find the Nth character in constant time. That is a fine assumption right up until the moment it no longer holds. The encoding method is: