Abstract: This article describes how characters and encoding developed and how the related concepts should be understood. Examples show how encoding is implemented in several practical applications. The article then describes several common misunderstandings about characters and encoding, the causes of garbled text, and ways to eliminate it. It covers both the "Chinese text problem" and the "garbled text problem".
The key to mastering encoding problems is a correct understanding of the related concepts; the technology involved is actually very simple. Read this article slowly and think as you go.
Introduction
"Characters and encoding" is a frequently discussed topic, yet garbled text still plagues everyone from time to time. We have many ways to get rid of garbled text, but we do not necessarily understand why those methods work. Some garbled text is in fact caused by bugs in the underlying code itself. So it is not only beginners who are hazy about character encoding; some developers of underlying code also lack an accurate understanding of it.
1. The origin of encoding problems and an understanding of the related concepts
1.1 The development of characters and encoding
From the perspective of computer support for multiple languages, there are roughly three stages:
Stage ---> system internal code ---> description ---> typical systems
Stage 1 ---> ASCII ---> At first, computers supported only English; other languages could not be stored or displayed. ---> English DOS
Stage 2 ---> ANSI encoding (localization) ---> To let computers support more languages, two bytes in the range 0x80 to 0xFF are usually used to represent one character. For example, the Chinese character '中' is stored as the two bytes [0xD6, 0xD0] in a Chinese operating system. Different countries and regions developed their own standards, producing encodings such as GB2312, BIG5, and JIS. These schemes, which extend ASCII by using two bytes for one character, are collectively called ANSI encodings. On a Simplified Chinese system, "ANSI encoding" means GB2312; on a Japanese system, it means JIS. Different ANSI encodings are incompatible with one another, so when information is exchanged internationally, text in two languages cannot be stored in the same ANSI-encoded text. ---> Chinese DOS, Chinese Windows 95/98, Japanese Windows 95/98
Stage 3 ---> Unicode (internationalization) ---> To make international information exchange easier, international organizations developed the Unicode character set, which assigns a single, unique number to every character in every language, so that text can be converted and processed across languages and platforms. ---> Windows NT/2000/XP, Linux, Java
How strings are stored in memory:
In the ASCII stage, a single-byte character string (SBCS) uses one byte to store one character. For example, "Bob123" is stored in memory as:
42 6F 62 31 32 33 00
B  o  b  1  2  3  \0
In the stage of multi-language support through ANSI encodings, each character is represented by one or more bytes (MBCS), so characters stored this way are also called multi-byte characters. For example, "中文123" occupies 7 bytes in memory on Chinese Windows 95 (not counting the terminating \0): each Chinese character takes 2 bytes, and each English letter or digit takes 1 byte:
D6 D0  CE C4  31 32 33 00
中     文     1  2  3  \0
After Unicode was adopted, a computer stores a string as the sequence of each character's number in the Unicode character set. At present a computer generally uses 2 bytes (16 bits) to store one number (DBCS), so characters stored this way are also called wide characters. For example, the string "中文123" in Windows 2000 is actually stored in memory as 5 numbers:
2D 4E  87 65  31 00  32 00  33 00  00 00    (on x86 CPUs the bytes are swapped: low byte first)
中     文     1      2      3      \0
A total of 10 bytes, not counting the terminator.
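The byte layouts above can be checked with a few lines of Java. This is a minimal sketch rather than part of the original article: the class name MemoryLayoutDemo and the helper hex are invented for illustration, it assumes the JDK provides the GB2312 charset, and UTF-16LE stands in for the low-byte-first wide-character layout on x86.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MemoryLayoutDemo {
    // Formats a byte array as space-separated uppercase hex, e.g. "D6 D0".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D, U+6587) followed by "123".
        String text = "\u4e2d\u6587123";
        // Assumption: the GB2312 charset is available in this JDK.
        byte[] ansi = text.getBytes(Charset.forName("GB2312"));
        byte[] wide = text.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(ansi.length + " bytes: " + hex(ansi)); // 7 bytes: D6 D0 CE C4 31 32 33
        System.out.println(wide.length + " bytes: " + hex(wide)); // 10 bytes: 2D 4E 87 65 31 00 32 00 33 00
    }
}
```

Java's getBytes() does not append a terminating zero, which is why the counts are 7 and 10 rather than 8 and 12.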
1.2 Characters, bytes, and strings
The key to understanding encoding is an accurate grasp of the concepts of character and byte. The two are easily confused, so we distinguish them here:
Concept ---> description ---> example
Character ---> A mark that people use; a symbol in the abstract sense. ---> '1', '中', 'A', '$', '¥', ...
Byte ---> The unit of data storage in a computer: an 8-bit binary number, a very concrete piece of storage. ---> 0x01, 0x45, 0xFA, ...
ANSI string ---> A string in memory whose characters are stored in an ANSI encoding, so that one character may be represented by one or more bytes. Also called a multi-byte string. ---> "中文123" (7 bytes)
Unicode string ---> A string in memory whose characters are stored as their numbers in Unicode. Also called a wide-character string. ---> L"中文123" (10 bytes)
Because the various ANSI encoding standards differ, we must know which encoding rule a given multi-byte string uses before we can know which "characters" it contains. A Unicode string, by contrast, represents the same "characters" in any environment.
1.3 Character sets and encodings
The ANSI encoding standards defined by individual countries and regions each specify only the "characters" needed for their own language. For example, the Chinese standard (GB2312) does not specify how Korean characters are stored. What these ANSI standards specify has two layers of meaning:
1. Which characters are used, that is, which Chinese characters, letters, and symbols are included in the standard. The set of included "characters" is called the "character set".
2. Whether each "character" is stored in one byte or several, and which bytes are used. This rule is called the "encoding".
When countries and regions define such standards, they generally define the "character set" and the "encoding" together. Therefore what we usually call a "character set", such as GB2312, GBK, or JIS, implies the "encoding" as well as the set of characters.
The Unicode character set contains all the "characters" used in every language. There are many standards for encoding the Unicode character set, such as UTF-8, UTF-7, UTF-16, UnicodeLittle, and UnicodeBig.
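To make the split between "character set" and "encoding" concrete, the sketch below encodes one Unicode character with three different Unicode encodings. It is illustrative only; the class name UnicodeEncodingsDemo and the helper encodeHex are invented here and do not come from the original article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnicodeEncodingsDemo {
    // Encodes text with the given charset and returns the bytes as hex, e.g. "E4 B8 AD".
    static String encodeHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // one character, Unicode number 0x4E2D
        System.out.println("UTF-8    : " + encodeHex(zhong, StandardCharsets.UTF_8));    // E4 B8 AD
        System.out.println("UTF-16BE : " + encodeHex(zhong, StandardCharsets.UTF_16BE)); // 4E 2D
        System.out.println("UTF-16LE : " + encodeHex(zhong, StandardCharsets.UTF_16LE)); // 2D 4E
    }
}
```

The character and its number stay the same in all three cases; only the rule that turns the number into bytes changes.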
1.4 Introduction to common encodings
This section briefly introduces the common encoding rules in preparation for the sections that follow. Based on the characteristics of their rules, we divide all encodings into three categories:
Category ---> encoding standards ---> description
Single-byte character encoding ---> ISO-8859-1 ---> The simplest encoding rule: each byte is used directly as a Unicode character. For example, converting the two bytes [0xD6, 0xD0] to a string through ISO-8859-1 directly yields the two Unicode characters [0x00D6, 0x00D0], that is, "ÖÐ". Conversely, when a Unicode string is converted to a byte string through ISO-8859-1, only characters in the range 0 to 255 can be converted correctly.
ANSI encoding ---> GB2312, BIG5, Shift_JIS, ISO-8859-2, ... ---> When a Unicode string is converted to a byte string through an ANSI encoding, one Unicode character may become one byte or several, depending on the rules of that encoding. In the other direction, several bytes of a byte string may become a single character. For example, converting the two bytes [0xD6, 0xD0] to a string through GB2312 yields the single character [0x4E2D], that is, "中". Features of ANSI encodings: (1) each ANSI encoding standard can handle only the Unicode characters within its own language range; (2) the relationship between a Unicode character and the bytes it converts to is assigned by people.
Unicode encoding ---> UTF-8, UTF-16, UnicodeBig, ... ---> As with ANSI encodings, when a string is converted to a byte string through a Unicode encoding, one Unicode character may become one byte or several. Unlike ANSI encodings: (1) these Unicode encodings can handle all Unicode characters; (2) the relationship between a Unicode character and the bytes it converts to can be computed.
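The difference between the first two categories shows up when the same two bytes, [0xD6, 0xD0], are decoded with each rule. The following is a small sketch under the assumption that the JDK ships the GB2312 charset; DecodeTwoWaysDemo and codes are names made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeTwoWaysDemo {
    // Assumption: the GB2312 charset is available in this JDK.
    static final Charset GB2312 = Charset.forName("GB2312");

    // Lists the Unicode numbers of a string's characters, e.g. "[0x00D6, 0x00D0]".
    static String codes(String s) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(String.format("0x%04X", (int) s.charAt(i)));
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xD6, (byte) 0xD0};
        // One byte becomes one character.
        System.out.println(codes(new String(bytes, StandardCharsets.ISO_8859_1))); // [0x00D6, 0x00D0]
        // Two bytes become one character.
        System.out.println(codes(new String(bytes, GB2312)));                      // [0x4E2D]
        // Reverse direction: ISO-8859-1 cannot represent 0x4E2D, so Java substitutes '?' (0x3F).
        byte[] lossy = "\u4e2d".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(lossy.length + " byte, value 0x" + Integer.toHexString(lossy[0])); // 1 byte, value 0x3f
    }
}
```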
We do not actually need to study exactly which bytes each encoding produces; it is enough to understand that "encoding" means converting "characters" into "bytes". Because the Unicode encodings can be computed, we can look up the rule of a particular Unicode encoding whenever a special case calls for it.
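As an illustration of "can be computed", the sketch below derives UTF-8 bytes from a Unicode number using only shifts and masks, then compares the result with the library's own encoder. The class and method names are hypothetical, and the function is deliberately incomplete: it does not reject surrogate values or numbers above 0x10FFFF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHandDemo {
    // Computes the UTF-8 bytes of one Unicode number (code point).
    // Sketch only: no validation of surrogates or out-of-range values.
    static byte[] encodeUtf8(int cp) {
        if (cp < 0x80) {
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {
            return new byte[] {(byte) (0xC0 | (cp >> 6)), (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xD6, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            byte[] byHand = encodeUtf8(cp);
            byte[] byLibrary = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X matches library: %b%n", cp, Arrays.equals(byHand, byLibrary));
        }
    }
}
```

No comparable formula exists for GB2312 or BIG5; converting to those encodings requires a lookup table, which is the sense in which their mapping is assigned by people.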
2. Characters and encoding in programs
2.1 Characters and bytes in programs
The data types that represent "characters" and "bytes" in C++ and Java, and the methods used for encoding conversion, are:
Type or operation ---> C++ ---> Java
Character ---> wchar_t ---> char
Byte ---> char ---> byte
ANSI string ---> char[] ---> byte[]
Unicode string ---> wchar_t[] ---> String
Byte string → string ---> mbstowcs(), MultiByteToWideChar() ---> string = new String(bytes, "encoding")
String → byte string ---> wcstombs(), WideCharToMultiByte() ---> bytes = string.getBytes("encoding")
Note the following:
In Java, char represents a "Unicode character" (wide character), whereas in C++ char represents a byte.
MultiByteToWideChar() and WideCharToMultiByte() are Windows API functions.
2.2 Related implementation methods in C++
Declaring a string constant:
// ANSI string; the content is 7 bytes long
char sz[20] = "中文123";
// Unicode string; the content is 5 wchar_t long (10 bytes)
wchar_t wsz[20] = L"\x4E2D\x6587\x0031\x0032\x0033";
I/O operations on Unicode strings, and conversion between characters and bytes:
#include <locale.h>   // setlocale
#include <stdio.h>    // FILE
#include <stdlib.h>   // wcstombs, mbstowcs
#include <wchar.h>    // fwprintf
// Set the current ANSI encoding at run time (Visual C++ format)
setlocale(LC_ALL, ".936");
// GCC format
setlocale(LC_ALL, "zh_CN.GBK");
// Visual C++ uses lowercase %s; the output is written to the file in the encoding set by setlocale
// GCC uses uppercase %S (or the standard %ls)
fwprintf(fp, L"%s\n", wsz);
// Convert the Unicode string to bytes using the encoding set by setlocale
wcstombs(sz, wsz, 20);
// Convert the byte string to a Unicode string using the encoding set by setlocale
mbstowcs(wsz, sz, 20);
In Visual C++ there is a simpler way to write Unicode string constants. If the encoding of the source file differs from the current default ANSI encoding, use #pragma setlocale to tell the compiler which encoding the source file uses:
// If the source file's encoding differs from the current default ANSI encoding,
// this line is needed so that the compiler knows the encoding of the source file
#pragma setlocale(".936")
// Unicode string constant; the content is 10 bytes long
wchar_t wsz[20] = L"中文123";
Note that #pragma setlocale and setlocale(LC_ALL, "") do different jobs: #pragma setlocale takes effect at compile time, while setlocale() takes effect at run time.
2.3 Related implementation methods in Java
The content of a String is a Unicode string:
// Chinese characters can be written directly in Java source code
String string = "中文123";
// The length is 5, because there are 5 characters
System.out.println(string.length());
String I/O operations and conversion between characters and bytes. In the java.io package, classes whose names end in "Stream" generally operate on byte strings, while classes whose names end in "Reader" or "Writer" generally operate on strings.
// Requires: import java.io.*;
// Conversion between strings and byte strings
// Get the bytes according to GB2312 (yields a multi-byte string)
byte[] bytes = string.getBytes("GB2312");
// Get the Unicode string from the bytes according to GB2312
string = new String(bytes, "GB2312");
// There are two ways to write a string to a text file in a given encoding:
// Method 1: use a Stream class to write a byte string already converted with the chosen encoding
OutputStream os = new FileOutputStream("1.txt");
os.write(bytes);
os.close();
// Method 2: construct a Writer with the chosen encoding and write the string
Writer ow = new OutputStreamWriter(new FileOutputStream("2.txt"), "GB2312");
ow.write(string);
ow.close();
/* The resulting 1.txt and 2.txt are both 7 bytes */
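The reading side mirrors the writing side: a Reader built with an explicit encoding turns the file's bytes back into a string. The following self-contained sketch writes to a temporary file and reads it back twice, once with the right encoding and once with the wrong one. The class ReaderWriterDemo, the method roundTrip, and the "size:text" return format are all invented for this illustration, and the GB2312 charset is assumed to be available.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReaderWriterDemo {
    // Writes text to a temporary file with one charset, reads it back with another,
    // and returns "<file size in bytes>:<text read back>".
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                try (Writer w = new OutputStreamWriter(Files.newOutputStream(file), writeAs)) {
                    w.write(text);
                }
                StringBuilder sb = new StringBuilder();
                try (Reader r = new InputStreamReader(Files.newInputStream(file), readAs)) {
                    int c;
                    while ((c = r.read()) != -1) sb.append((char) c);
                }
                return Files.size(file) + ":" + sb;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be available
        String text = "\u4e2d\u6587123";
        // Same encoding on both sides: 7 bytes on disk, original text recovered.
        System.out.println(roundTrip(text, gb2312, gb2312).equals("7:" + text)); // true
        // Wrong encoding on the reading side: still 7 bytes, but 7 characters come back.
        String wrong = roundTrip(text, gb2312, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals("7:\u00d6\u00d0\u00ce\u00c4123"));       // true
    }
}
```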
If the encoding of a Java source file differs from the current default ANSI encoding, the source encoding must be specified at compile time. For example:
E:\> javac -encoding BIG5 Hello.java
In the examples above, distinguish between the encoding of the source file and the encoding used for I/O operations: the former takes effect at compile time, the latter at run time.
3. Several misunderstandings, the causes of garbled text, and solutions
3.1 Common misunderstandings
Misunderstandings about encoding:
Misunderstanding 1: When converting a "byte string" into a "Unicode string", for example when reading a text file or receiving text over the network, it is easy to treat the byte string as a single-byte string and convert it on the basis that one byte is one character.
In fact, in a non-English environment the byte string should be treated as an ANSI string and converted with the appropriate encoding to obtain the Unicode string, in which case several bytes may yield a single character.
Programmers who have always developed in an English environment are especially prone to this misunderstanding.
Misunderstanding 2: In non-Unicode environments such as DOS and Windows 98, strings exist as bytes in some ANSI encoding, and such a byte string can be used correctly only if you know which encoding it is in. This leads to a habit of thought: "the encoding of a string".
Once Unicode is supported, a String in Java is stored as the characters' numbers, not as bytes in some encoding, so the notion of "the encoding of a string" no longer exists. Encoding comes into play only when a string is converted to or from a byte string, or when a byte string is treated as an ANSI string.
Many people hold this misunderstanding.
Misunderstanding 1 is often the direct cause of garbled text. Misunderstanding 2 often makes garbled-text problems that would have been easy to fix more complicated.
Here we can see that the conversion described in Misunderstanding 1, treating each byte as one character, is in effect a conversion using ISO-8859-1. That is why we often use bytes = string.getBytes("iso-8859-1") to reverse the operation and recover the original byte string, and then apply the correct ANSI encoding, for example string = new String(bytes, "GB2312"), to obtain the correct Unicode string.
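The reverse operation described here fits in a two-line helper. The sketch below first reproduces Misunderstanding 1 on four GB2312 bytes and then undoes it; the class name MojibakeRepairDemo and the method repair are hypothetical, and the GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepairDemo {
    // Undoes a "one byte is one character" decode: recovers the original bytes
    // through ISO-8859-1, then decodes them with the encoding they were really in.
    static String repair(String garbled, Charset actual) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be available
        byte[] received = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        // Misunderstanding 1: each byte is taken as one character.
        String garbled = new String(received, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 4

        String fixed = repair(garbled, gb2312);
        System.out.println(fixed.length());                 // 2
        System.out.println(fixed.equals("\u4e2d\u6587"));   // true
    }
}
```

The repair works only because the wrong decode was ISO-8859-1, which maps every byte value to a distinct character; a wrong decode through another encoding may already have discarded bytes.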
3.2 Garbled text when non-Unicode programs are moved between language environments
Strings in a non-Unicode program exist in some particular ANSI encoding. If the program runs in a language environment different from the one it was developed in, its ANSI strings are displayed incorrectly.
For example, the interface of a non-Unicode Japanese program developed in a Japanese environment shows garbled text when run in a Chinese environment. If the program's interface is changed to record its strings in Unicode, it displays normal Japanese when run in a Chinese environment.
For practical reasons we sometimes have to run non-Unicode Japanese software on a Chinese operating system. In that case tools such as NJStar (Antarctic Star) and AppLocale can temporarily simulate a different language environment.
3.3 Strings submitted from web pages
When a form on a page submits a string, the string is first converted to a byte string using the encoding of the current page, and each byte is then converted to the "%XX" format and submitted to the web server. For example, when a page encoded in GB2312 submits the character "中", the content sent to the server is "%D6%D0".
On the server side, the web server converts the received "%D6%D0" into the two bytes [0xD6, 0xD0] and then obtains the character "中" according to the GB2312 encoding rules.
When request.getParameter() returns garbled text on a Tomcat server, the cause is often the Misunderstanding 1 described above. By default, when "%D6%D0" is submitted to Tomcat, request.getParameter() returns the two Unicode characters [0x00D6, 0x00D0] rather than the single character "中". We therefore need bytes = string.getBytes("iso-8859-1") to recover the original byte string, and then string = new String(bytes, "GB2312") to obtain the correct string "中".
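The "%XX" step can be tried outside a servlet container with the JDK's own URLEncoder and URLDecoder. This sketch only models the byte-level behaviour described above and is not Tomcat's actual code; the class FormEncodingDemo and its methods submit and receive are invented for illustration. It assumes Java 10 or later (for the overloads that take a Charset) and an available GB2312 charset.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FormEncodingDemo {
    // Assumption: the GB2312 charset is available in this JDK.
    static final Charset GB2312 = Charset.forName("GB2312");

    // What the browser does: encode with the page's charset, then write each byte as %XX.
    static String submit(String value, Charset pageCharset) {
        return URLEncoder.encode(value, pageCharset);
    }

    // What the server does: turn %XX back into bytes, then decode with the charset it assumes.
    static String receive(String encoded, Charset assumedCharset) {
        return URLDecoder.decode(encoded, assumedCharset);
    }

    public static void main(String[] args) {
        String sent = submit("\u4e2d", GB2312);
        System.out.println(sent); // %D6%D0

        // Server assumes the page's real charset: one character comes back.
        System.out.println(receive(sent, GB2312).equals("\u4e2d"));                 // true
        // Server assumes ISO-8859-1: two characters, 0x00D6 and 0x00D0, come back.
        String garbled = receive(sent, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("\u00d6\u00d0"));                         // true
        // The usual repair: back to bytes through ISO-8859-1, then decode as GB2312.
        String fixed = new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GB2312);
        System.out.println(fixed.equals("\u4e2d"));                                 // true
    }
}
```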
3.4 Reading strings from a database
When a database client (such as ODBC or JDBC) reads a string from the database server, the client needs to learn from the server which ANSI encoding is in use. When the server sends a byte stream to the client, the client is responsible for converting that byte stream into a Unicode string using the correct encoding.
If a string read from the database comes out garbled even though the data stored in the database is correct, the cause is often the Misunderstanding 1 described above. The fix is to use string = new String(string.getBytes("iso-8859-1"), "GB2312") to recover the original byte string and then convert it to a string again with the correct encoding.
3.5 Strings in email
When a piece of text or HTML is sent by email, the content is first converted to a byte string using a specified character encoding, and that byte string is then converted to another byte string using a specified transfer encoding (Content-Transfer-Encoding). For example, opening the source of an email reveals content like this:
Content-Type: text/plain;
        charset="gb2312"
Content-Transfer-Encoding: base64
SBG + qcrquqo17cf4yee74bgjz9w7 + b3wudza7dbq0mqnc1_kvpkzxqo6uqo17cnnsapw0ndedqoncg =
The most commonly used transfer encodings are Base64 and Quoted-Printable. When converting binary files or Chinese text, Base64 produces a shorter byte string than Quoted-Printable; when converting English text, Quoted-Printable produces a shorter byte string than Base64.
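The size trade-off can be seen by encoding a short Chinese text and a short English text both ways. The sketch below uses the JDK's java.util.Base64 and a deliberately simplified, hand-written Quoted-Printable encoder; the names TransferEncodingSizeDemo, base64, and quotedPrintable are invented, the encoder ignores soft line breaks and trailing-whitespace rules, and the GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class TransferEncodingSizeDemo {
    static String base64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Simplified Quoted-Printable: printable ASCII except '=' passes through,
    // every other byte becomes "=XX". Line-length limits are ignored (assumption).
    static String quotedPrintable(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            int v = b & 0xFF;
            if (v >= 0x20 && v <= 0x7E && v != '=') {
                sb.append((char) v);
            } else {
                sb.append(String.format("=%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] chinese = "\u4e2d\u6587".getBytes(Charset.forName("GB2312")); // 4 bytes
        byte[] english = "Hello".getBytes(StandardCharsets.US_ASCII);        // 5 bytes

        System.out.println(base64(chinese).length());          // 8
        System.out.println(quotedPrintable(chinese).length()); // 12
        System.out.println(base64(english).length());          // 8
        System.out.println(quotedPrintable(english).length()); // 5
    }
}
```

Base64 always grows the data by about one third, whereas Quoted-Printable triples every non-ASCII byte and leaves ASCII text unchanged.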
The mail subject uses a shorter format to indicate the character encoding and the transfer encoding. For example, if the subject is "中", the mail source contains:
// Correct subject format
Subject: =?GB2312?B?1tA=?=
where:
The part between the first "=?" and the next "?" specifies the character encoding, which is GB2312 in this example.
The "B" between the following pair of "?" stands for Base64; a "Q" there would stand for Quoted-Printable.
The part between the last "?" and "?=" is the subject content after it has been converted to a byte string with GB2312 and then encoded with Base64.
If the transfer encoding is changed to Quoted-Printable, the same subject "中" becomes:
// Correct subject format
Subject: =?GB2312?Q?=D6=D0?=
If garbled text appears when reading mail, it is usually because the character encoding or the transfer encoding was specified incorrectly or not at all. For example, some mail components send the subject "中" as:
// Incorrect subject format
Subject: =?ISO-8859-1?Q?=D6=D0?=
This tells the recipient that the subject is [0x00D6, 0x00D0], that is, "ÖÐ", rather than "中".
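The subject format can be built and taken apart with a small amount of code. The following is a rough sketch, not a complete mail-header implementation: the class EncodedWordDemo and its methods are invented for illustration, the "Q" encoder escapes everything except ASCII letters and digits, line-length limits are ignored, and the GB2312 charset is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;

class EncodedWordDemo {
    // Builds "=?charset?B?...?=" (character encoding first, then Base64).
    static String encodeB(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        return "=?" + charsetName + "?B?" + Base64.getEncoder().encodeToString(bytes) + "?=";
    }

    // Builds "=?charset?Q?...?=". Simplification: only ASCII letters and digits pass through.
    static String encodeQ(String text, String charsetName) {
        StringBuilder sb = new StringBuilder("=?" + charsetName + "?Q?");
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            int v = b & 0xFF;
            boolean plain = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z') || (v >= '0' && v <= '9');
            if (plain) sb.append((char) v); else sb.append(String.format("=%02X", v));
        }
        return sb.append("?=").toString();
    }

    // Decodes one encoded word of either kind back to a string.
    static String decode(String word) {
        String[] parts = word.split("\\?"); // "=", charset, "B" or "Q", payload, "="
        Charset charset = Charset.forName(parts[1]);
        String payload = parts[3];
        byte[] bytes;
        if (parts[2].equalsIgnoreCase("B")) {
            bytes = Base64.getDecoder().decode(payload);
        } else {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < payload.length(); i++) {
                char c = payload.charAt(i);
                if (c == '=') {
                    out.write(Integer.parseInt(payload.substring(i + 1, i + 3), 16));
                    i += 2;
                } else if (c == '_') {
                    out.write(' '); // in the Q form an underscore stands for a space
                } else {
                    out.write(c);
                }
            }
            bytes = out.toByteArray();
        }
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        System.out.println(encodeB("\u4e2d", "GB2312")); // =?GB2312?B?1tA=?=
        System.out.println(encodeQ("\u4e2d", "GB2312")); // =?GB2312?Q?=D6=D0?=
        // Same bytes, wrong charset label: two characters (0x00D6, 0x00D0) instead of one.
        System.out.println(decode("=?ISO-8859-1?Q?=D6=D0?=").length()); // 2
        System.out.println(decode("=?GB2312?Q?=D6=D0?=").length());     // 1
    }
}
```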
4. Corrections to a few mistaken ideas
Misconception: "Is ISO-8859-1 an international encoding?"
No. ISO-8859-1 is merely the simplest single-byte character set: its encoding rule makes the byte value equal to the Unicode character number. When we need to convert a byte string to a string without yet knowing which ANSI encoding it uses, we can temporarily convert every byte as one character, which loses no information. Later, bytes = string.getBytes("iso-8859-1") restores the original byte string.
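The "no loss of information" property can be checked over every possible byte value. The sketch below round-trips all 256 byte values through ISO-8859-1 and, for contrast, through UTF-8; the class name Latin1RoundTripDemo and its helpers are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTripDemo {
    // Returns an array holding every byte value from 0x00 to 0xFF once.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) all[i] = (byte) i;
        return all;
    }

    // Decodes the bytes with the given charset and encodes the result again.
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        // ISO-8859-1 maps each byte to exactly one character, so nothing is lost.
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1))); // true
        // UTF-8 rejects many byte sequences and replaces them, so the bytes do not survive.
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.UTF_8)));      // false
    }
}
```

This is why ISO-8859-1, and not an arbitrary encoding, serves as the temporary container for bytes whose real encoding is not yet known.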
Misconception: "How can I find out the internal encoding of a string in Java?"
In Java, the string class java.lang.String handles Unicode strings, not ANSI strings. We only need to treat a string as a sequence of abstract symbols, so the question of a string's internal encoding does not arise.
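One way to see that a String carries no encoding of its own is to build the same text from two differently encoded byte strings and compare the results. This is an illustrative sketch only; the class NoStringEncodingDemo and the method sameText are invented, and the GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NoStringEncodingDemo {
    // Decodes two byte strings with their respective charsets and compares the resulting text.
    static boolean sameText(byte[] a, Charset charsetA, byte[] b, Charset charsetB) {
        return new String(a, charsetA).equals(new String(b, charsetB));
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be available
        byte[] asGb2312 = {(byte) 0xD6, (byte) 0xD0};            // U+4E2D in GB2312
        byte[] asUtf8 = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD}; // U+4E2D in UTF-8

        // Different bytes, different encodings, identical String.
        System.out.println(sameText(asGb2312, gb2312, asUtf8, StandardCharsets.UTF_8)); // true

        String s = new String(asGb2312, gb2312);
        System.out.println(s.length());                       // 1
        System.out.println(Integer.toHexString(s.charAt(0))); // 4e2d
    }
}
```

Once decoded, nothing in the String records which byte string it came from; the encoding matters again only at the next conversion to bytes.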
Original article. When reprinting, please retain or cite the source: http://www.regexlab.com/zh/encoding.htm