I had never studied character encoding in any depth before. What prompted me to learn was a customer complaint: when our gateway truncated the customer's SMS content, the message showed up garbled on the terminal. Chasing down that cause, I spent some time learning the common knowledge surrounding character encoding in computers.
Let's start with the 'character', the most basic element in the encoding story.
A computer is, without question, a machine. When we first encounter computers or receive computer education, we are told that a computer only understands binary sequences of 0s and 1s. In the early days, the interaction between people and computers also happened in binary: people fed information to the machine by flipping countless switches on its huge panel, or fed it commands and data through punched cards. The birth of the man-machine interface consisting of a terminal and a keyboard greatly improved the efficiency of interacting with computers. Now, what exactly is a 'character'? Put plainly, a character is a mark that people use, a symbol in the abstract sense. Take the Arabic numeral 1: it is a symbol, and the abstract meaning behind it is a concept of quantity. How that abstract concept of 1 was born is another story; if you are interested, read a popular-science book on the history of numbers.
The marks people use are many and varied: Chinese characters, punctuation marks, graphical symbols, digits, and so on. In the computer field these are collectively called 'characters', and the collection of all characters is called a 'character set'. With the concept of a character in hand, how do we express a character inside a computer? As mentioned above, computers work in binary bits, so characters can only be built on top of bits. How many bits does it take to represent a character properly? That depends on how big our character set is. If the character set contained only eight characters, a combination of three bits would be enough to distinguish them all. I used to imagine the Americans agonizing over this question, but they simply took it for granted that 256 characters would be more than enough for practical use. At the time they actually defined only 128 characters and reserved the other 128 for later, giving an 8-bit character set with room for 256 characters. This is the world-famous American Standard Code for Information Interchange (ASCII). Eight bits happens to be exactly the size of the computer's basic storage unit, the 'byte', so one byte can hold one ASCII character. For example, the bit pattern of the ASCII character 'A' in memory is 0x41.
This brings us to the concept of 'encoding'. The ASCII mentioned above is one of many character encoding specifications, the earliest and most important one. So what exactly is a character encoding? Let's review what ASCII did when it was formulated:
1) It specified that 8 bits are used to represent one ASCII character;
2) It created an ASCII character table, that is, it fixed the bit pattern corresponding to each character in the character set. For example, the bit pattern of the ASCII character 'B' in memory is 0x42, and that of '1' is 0x31.
Seen this way, a character encoding specification does two things:
1) it specifies how many bits (or bytes) represent one character of the character set;
2) it creates a character table for the character set, that is, it fixes the bit pattern corresponding to each character in the set.
Rules 1) and 2) together constitute an encoding. A minimal sketch follows.
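To make rules 1) and 2) concrete, here is a minimal C sketch (my own illustration, not part of any standard) that prints the bit patterns an ASCII-compatible system stores for a few characters:

/* ascii_demo.c - a char is just a small integer whose value is the
   character's position in the ASCII table */
#include <stdio.h>

int main() {
    char chars[] = { 'A', 'B', '1' };
    int i;
    for (i = 0; i < 3; i++) {
        /* %c prints the symbol, %#04x prints the byte's bit pattern */
        printf("'%c' is stored as byte %#04x\n", chars[i], (unsigned char)chars[i]);
    }
    return 0;
}

Running it prints 0x41, 0x42, and 0x31 for 'A', 'B', and '1', exactly the table entries described above.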
As computers spread, they came into use all over the world. For non-English-speaking countries such as China, Japan, and Korea, however, ASCII was nowhere near enough. Chinese civilization goes back five thousand years; how could everything it has accumulated be expressed with those 256 characters (128, to be precise)? So we had to develop our own encodings, and the Japanese and Koreans did the same. Thus localized encoding standards such as GB2312, Big5, and JIS appeared, each serving a particular country or region; these are collectively referred to as ANSI encodings. These ANSI encodings share some common features:
1) Each ANSI encoding, or ANSI character set, only covers the 'characters' required by the language of its own country or region; for example, the Chinese GB2312 encoding contains no Korean text.
2) The ANSI character sets are much larger than ASCII, and one byte is no longer enough; most of them therefore use a multi-byte storage scheme.
3) ANSI encodings are generally compatible with ASCII.
The appearance of ANSI encodings allowed computers to spread to every corner of the world, and every country could use this advanced tool to improve its productivity. Open Windows Notepad: the "Encoding" drop-down box in the "Save As" dialog contains an ANSI option. On a Simplified Chinese system, ANSI means GB2312; on a Japanese system, ANSI means JIS. But with the rise of the Internet, problems emerged, precisely because of the first feature above: when each country or region designed its own ANSI encoding, it did not take the others into account, so the coding spaces overlap. For example, the Chinese character '中' is encoded as the bytes [0xD6, 0xD0] in GB2312. What does that same byte sequence mean in JIS? I don't know and have no desire to look it up, but I am quite sure it is not '中', even though those languages borrowed many characters from Chinese. As a result, garbled text is inevitable when information is exchanged and displayed between systems using different ANSI encodings. A small sketch of this overlap follows.
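A tiny sketch (my own, assuming you can switch the encoding your console expects) makes the overlap concrete: the very same two bytes form a valid GB2312 character, yet mean something else entirely, or nothing at all, under another ANSI encoding:

/* ansi_overlap.c - the byte pair 0xD6 0xD0 is '中' under GB2312; a console
   using a different ANSI code page renders the same bytes as mojibake */
#include <stdio.h>

int main() {
    unsigned char bytes[] = { 0xD6, 0xD0, 0x00 };
    printf("%s\n", (char *)bytes);   /* looks right only on a GB2312 console */
    return 0;
}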
Unicode character encoding was born to make international information exchange easier. Unicode is closely tied to the Universal Multiple-Octet Coded Character Set (UCS), roughly "universal multi-octet coded character set" in plain words. It is a character encoding system developed by an organization called the Unicode Consortium. Unicode aims to include the scripts and symbols of the vast majority of the world's languages in a single character set, assigning every character of every language a unified and unique binary code (bit pattern), in order to meet the requirements of cross-language, cross-platform text conversion and processing, to support the exchange, processing, and display of written text in the world's different languages, and to let people everywhere communicate freely through computers. Put plainly, Unicode first collects the vast majority of characters in common use worldwide into one character set and then numbers them uniformly; the encoding (bit pattern) of a Unicode character is simply its serial number in the Unicode character table. Unlike the ANSI encodings above, the encoding of a Unicode character is therefore an integer, usually at least 2 bytes long. Because of this, the way Unicode text is stored differs across platforms in its byte order. For example, the character '中' (code point 0x4E2D) is stored under the standard Unicode encoding as '2d 4e' on Windows, but as '4e 2d' on a Sun Solaris machine; the sketch below demonstrates this.
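The byte-order difference can be observed directly. The sketch below (my own illustration) stores the code point of '中' (0x4E2D) in a 16-bit integer and dumps its bytes; a little-endian x86 machine prints '2d 4e', while a big-endian machine prints '4e 2d':

/* byteorder.c - dump the in-memory byte order of the 16-bit value 0x4E2D ('中') */
#include <stdio.h>

int main() {
    unsigned short cp = 0x4E2D;                /* code point of '中' */
    unsigned char *p = (unsigned char *)&cp;   /* view the same storage byte by byte */
    printf("%02x %02x\n", p[0], p[1]);         /* 2d 4e or 4e 2d depending on the platform */
    return 0;
}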
The "standard Unicode encoding" was mentioned above; are there other ways to encode Unicode? Indeed there are. The appearance of Unicode was a huge step toward unified computer encoding, but Unicode is, after all, only about ten years old; before it, everyone used ASCII and their respective ANSI encodings, and it is unrealistic to expect every computer system in the world to switch to Unicode overnight. More and more new systems support and use Unicode, so how these new systems exchange data with the old ones became the first real challenge. Hence a new term was born: UTF, the Unicode Transformation Format, that is, converting Unicode into a certain format. Why convert it at all? The conversion exists for the sake of transmission and exchange: a good UTF-x scheme should make it easy to move text of different languages and encodings between computers over the network, and in particular should let standard double-byte Unicode pass correctly through existing systems that process one byte at a time. There are currently three common UTF schemes:
UTF-16: essentially the standard Unicode encoding scheme, often equated with UCS-2; it uses 16-bit (two-byte) integers to represent characters (strictly speaking, UTF-16 extends UCS-2 with surrogate pairs for characters outside the basic plane).
UTF-32: also known as UCS-4; it uses a 32-bit (four-byte) integer to represent each character.
UTF-8: the most widely used UTF scheme. UTF-8 stores Unicode characters in a variable number of bytes: ASCII letters keep their 1-byte storage; accented text, Greek letters, and similar characters take 2 bytes; commonly used Chinese characters take 3 bytes; and characters in the supplementary planes take 4 bytes. UTF-8 makes it much easier for Unicode systems to transfer and exchange data with existing single-byte systems. Unlike the first two schemes, UTF-8 uses the byte as its encoding unit, so there is no byte-order problem. The sketch after this list shows how the variable-length scheme works.
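To see the variable-length scheme in action, here is a minimal sketch (my own, handling only code points up to 0xFFFF, i.e. without the 4-byte case and without validity checks) that turns a Unicode code point into UTF-8 bytes:

/* utf8_encode.c - encode a BMP code point (<= 0xFFFF) into UTF-8 bytes */
#include <stdio.h>

/* returns the number of bytes written to out (1 to 3) */
static int utf8_encode(unsigned int cp, unsigned char out[3]) {
    if (cp < 0x80) {                     /* ASCII stays a single unchanged byte */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {             /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else {                             /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
}

int main() {
    unsigned char buf[3];
    int i, n = utf8_encode(0x4E2D, buf);   /* '中' */
    for (i = 0; i < n; i++)
        printf("%02x ", buf[i]);           /* prints: e4 b8 ad */
    printf("\n");
    return 0;
}

Feeding it 0x4E2D ('中') yields the three bytes e4 b8 ad, which is why common Chinese characters occupy 3 bytes in UTF-8.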
With three UTF schemes available, how does a receiver or a storage system tell which one a piece of data uses? The UTF schemes define a character called "ZERO WIDTH NO-BREAK SPACE", whose code point is FEFF; the reversed value FFFE is not a valid character in UCS, so it should never appear in real transmission or storage. The recommendation is to transmit this "ZERO WIDTH NO-BREAK SPACE" character (the byte order mark, or BOM) at the front of a byte stream before the actual data; the receiver can then identify the encoding scheme from the leading bytes, as sketched in the code after the table below:
EF BB BF        UTF-8
FE FF           UTF-16/UCS-2, big-endian
FF FE           UTF-16/UCS-2, little-endian
FF FE 00 00     UTF-32/UCS-4, little-endian
00 00 FE FF     UTF-32/UCS-4, big-endian
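Based on this table, a receiver can sniff the first few bytes of a stream. Below is a minimal detection sketch (my own illustration, not a complete implementation); note that it tests the longer UTF-32 patterns before the UTF-16 ones, since they share a common prefix:

/* bom_detect.c - guess the UTF scheme from a byte order mark, if one is present */
#include <stdio.h>
#include <string.h>

static const char *detect_bom(const unsigned char *b, size_t len) {
    if (len >= 4 && !memcmp(b, "\x00\x00\xFE\xFF", 4)) return "UTF-32/UCS-4, big-endian";
    if (len >= 4 && !memcmp(b, "\xFF\xFE\x00\x00", 4)) return "UTF-32/UCS-4, little-endian";
    if (len >= 3 && !memcmp(b, "\xEF\xBB\xBF", 3))     return "UTF-8";
    if (len >= 2 && !memcmp(b, "\xFE\xFF", 2))         return "UTF-16/UCS-2, big-endian";
    if (len >= 2 && !memcmp(b, "\xFF\xFE", 2))         return "UTF-16/UCS-2, little-endian";
    return "no BOM (encoding unknown)";
}

int main() {
    unsigned char sample[] = { 0xFF, 0xFE, 0x2D, 0x4E };   /* '中' in UTF-16 little-endian, with BOM */
    printf("%s\n", detect_bom(sample, sizeof(sample)));
    return 0;
}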
That covers the basics of character encoding. Next, let's combine encoding with a concrete programming language to make the ideas more tangible, using C as the example.
The C language defines two character sets: the source character set is the set of characters that may appear in C source code, while the execution character set is the set of characters the running program can interpret. Every application has its own execution character set, that is, the character set or encoding used at run time to interpret the bit streams found in its various data storage media.
[Example1]
/* testwprintf.c, Windows XP, MinGW gcc-3.4.2 */
#include <stdio.h>
#include <wchar.h>

int main() {
    wchar_t ws[] = L"中文";      /* --- (1) a wide-character string literal */
    wprintf(L"%s\n", ws);
    return 0;
}
When this program is compiled, the GCC compiler reports an error on line (1): converting to execution character set: Illegal byte sequence.
Why does the conversion fail? Notice that the program uses a wide-character constant. Here we need a short digression into the C language's story of multibyte characters and wide characters.
C was originally designed in an English-speaking environment, and its principal character set was ASCII. Internationalized software, however, must be able to represent other characters, which are numerous and cannot fit into one-byte encodings. So in 1994 "Normative Addendum 1" was adopted, and ISO C standardized two ways of representing large character sets: wide characters (where every character in the set occupies the same number of bits) and multibyte characters (where a character may take one or more bytes, and the meaning of a byte sequence is determined by the locale environment of the string or stream). Since that 1994 addendum, C provides not only the char type but also the wchar_t type (wide character). Although the C standard itself does not mandate Unicode, many implementations use the Unicode transformation formats UTF-16 or UTF-32 for wide characters (the MinGW GCC I used works with UTF-16, while GCC on Sun Solaris uses UTF-32). In other words, in most standard C implementations a wchar_t holds a Unicode character by default, a wide character is effectively a Unicode character, and a wide string constant (L"...") is effectively a Unicode-encoded constant string. With this background we can explain the problem above; first, a quick check of wchar_t's width is sketched below.
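Since the width of wchar_t (and therefore whether it holds UTF-16 or UTF-32 code units) is implementation-defined, a quick sketch (my own) makes the difference visible:

/* wchar_size.c - wchar_t is 2 bytes (UTF-16) on Windows/MinGW and typically
   4 bytes (UTF-32) on Unix-like systems such as Solaris or Linux */
#include <stdio.h>
#include <wchar.h>

int main() {
    printf("sizeof(wchar_t) = %u bytes\n", (unsigned)sizeof(wchar_t));
    return 0;
}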
In the program above, when the compiler meets the wide string constant L"中文", it tries to convert it into Unicode for storage: MinGW GCC attempts to transcode the literal's bit pattern from the default source character set into Unicode, but finds that the bit pattern of the literal "中文" is not valid in that source character set. We therefore have to tell GCC that the literal is encoded in GB2312; compiling with gcc -finput-charset=GB2312 testwprintf.c solves the problem.
All right, compilation now succeeds. But when we run a.exe, nothing appears on the console. What's wrong? Analysis: ws now holds a Unicode bit pattern, and wprintf cannot output Unicode directly. Library functions such as printf and wprintf, which write to the console or to files, only support ANSI (multibyte) encodings. This actually conforms to the C specification, since the C standard does not mandate Unicode even though many implementations represent wide characters with Unicode bit patterns. So we need the setlocale function to tell the runtime how to convert the Unicode-encoded wide characters into an output encoding.
[Example2]
/* testwprintf.c, Windows XP, MinGW gcc-3.4.2 */
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    wchar_t ws[] = L"中文";
    setlocale(LC_ALL, "chs");   /* select the GB code; there is no "chs" locale on Unix, use locale -a there to list the available names */
    wprintf(L"%s\n", ws);
    return 0;
}
setlocale(...) takes effect only at run time. After compiling and running, the characters "中文" appear on our console.
Of course, we can also use the standard library to convert the wide characters into ANSI characters ourselves before printing them directly.
[Example3]
/* testwprintf.c, Windows XP, MinGW gcc-3.4.2 */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

int main() {
    wchar_t ws[] = L"中文";
    char ms[12];

    memset(ms, 0, sizeof(ms));
    setlocale(LC_ALL, "chs");        /* select the GB code; there is no "chs" locale on Unix, use locale -a there to list the available names */
    wcstombs(ms, ws, sizeof(ms));    /* wide (Unicode) string -> multibyte (GB) string */
    printf("%s\n", ms);
    return 0;
}
After compiling and running this, "中文" again appears on the console. wcstombs converts a wide string into a multibyte string in the ANSI encoding selected by setlocale; mbstowcs does the opposite, converting a multibyte string into a Unicode wide string according to the encoding set by setlocale. For the former, setlocale determines the target encoding; for the latter, it determines how the source string is interpreted before it becomes the Unicode target string. There are also the corresponding character-level functions, wctomb and mbtowc. A sketch of the reverse direction follows.
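For completeness, here is a sketch of the opposite direction (my own, making the same assumptions as Example 3: a Windows system where the "chs" locale selects the GB code, and a source file saved in that encoding). mbstowcs turns the GB-encoded multibyte string back into a wide (Unicode) string, and the loop prints the resulting code points:

/* testmbstowcs.c, Windows XP, MinGW gcc */
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main() {
    char ms[] = "中文";            /* multibyte (GB-encoded) source string */
    wchar_t ws[12];
    int i;

    setlocale(LC_ALL, "chs");      /* tell the runtime how the bytes of ms are encoded */
    mbstowcs(ws, ms, sizeof(ws) / sizeof(ws[0]));

    for (i = 0; ws[i] != L'\0'; i++)
        printf("%04x ", (unsigned)ws[i]);   /* prints the code points, e.g. 4e2d 6587 */
    printf("\n");
    return 0;
}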
Many useful open-source toolkits are available for character-encoding conversion, the famous iconv among them, so there is rarely any need to implement conversion yourself. The point of learning all of the above is simply to stop being puzzled when garbled text shows up; a conceptual grasp of computer character encoding is both necessary and helpful.
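As a pointer for further exploration, here is a minimal iconv sketch (my own, assuming an iconv implementation is installed and that it recognizes the converter names "GB2312" and "UTF-8"; on some systems you also need to link with -liconv):

/* gb2utf8.c - convert a GB2312 byte sequence to UTF-8 using iconv */
#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main() {
    char in[] = { (char)0xD6, (char)0xD0, 0 };   /* "中" in GB2312 */
    char out[16] = { 0 };
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out);
    size_t i;
    iconv_t cd;

    cd = iconv_open("UTF-8", "GB2312");          /* arguments: to-encoding, from-encoding */
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");
    iconv_close(cd);

    for (i = 0; out[i] != 0; i++)
        printf("%02x ", (unsigned char)out[i]);  /* expected output: e4 b8 ad */
    printf("\n");
    return 0;
}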
For more information, see: http://bigwhite.blogbus.com/logs/10617585.html