Characters, bytes, and encoding


From: http://www.regexlab.com/zh/encoding.htm

Level: Intermediate

Abstract: This article describes how characters and encodings developed and how the related concepts should be understood, and uses examples to show how encoding is handled in practical applications. It then describes several common misunderstandings about characters and encodings, why garbled text appears, and how to eliminate it. The article covers both "Chinese-language problems" and "garbled-text problems".

The key to understanding encoding is to understand the related concepts correctly. The technology involved is actually very simple, so read this article slowly and think as you go.

Introduction

"Characters and encoding" is a frequently discussed topic, yet garbled text still plagues everyone. We have many ways to eliminate garbled text, but we do not necessarily understand the principles behind them. Some garbled text is actually caused by problems in the underlying code itself. So it is not only beginners who are hazy about character encoding; some developers of underlying code also lack an accurate understanding of it.

 

1. The origin of the encoding problem and related concepts

1.1 Development of characters and encoding

From the perspective of computer support for multiple languages, there are roughly three phases:

Phase 1: ASCII
  Description: In the beginning, computers supported only English; other languages could not be stored or displayed.
  Systems: English DOS

Phase 2: ANSI encoding (localization)
  Description: To let computers support more languages, 2 bytes in the range 0x80 to 0xFF are usually used to represent 1 character. For example, in a Chinese operating system the character "中" is stored as the bytes [0xD6, 0xD0].

  Different countries and regions developed different standards, producing their own encodings such as GB2312, BIG5, and JIS. These extended encodings, which use 2 bytes to represent one character, are collectively called ANSI encoding. On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese operating system, ANSI encoding means JIS.

  Different ANSI encodings are incompatible with each other. When information is exchanged internationally, text in two languages cannot be stored in the same piece of ANSI-encoded text.
  Systems: Chinese DOS, Chinese Windows 95/98, Japanese Windows 95/98

Phase 3: Unicode (internationalization)
  Description: To make international information exchange easier, international organizations developed the Unicode character set, which assigns a uniform and unique number to every character in every language, to meet the needs of cross-language and cross-platform text conversion and processing.
  Systems: Windows NT/2000/XP, Linux, Java

How to store strings in memory:

In the ASCII phase, a single-byte string uses one byte to store one character (SBCS). For example, "Bob123" in memory is:

42 6f 62 31 32 33 00
B  o  b  1  2  3  \0

In the ANSI phase, which supports multiple languages, each character is represented by one or more bytes (MBCS), so characters stored this way are also called multi-byte characters. For example, "中文123" occupies 7 bytes in memory on Chinese Windows 95: each Chinese character takes 2 bytes, and each English letter or digit takes 1 byte:

d6 d0  ce c4  31 32 33 00
中     文     1  2  3  \0

Once Unicode is used, the computer stores a string as the sequence of numbers that its characters have in the Unicode character set. Currently, computers generally use two bytes (16 bits) to store one number (DBCS), so characters stored this way are also called wide-byte characters. For example, the string "中文123" on Windows 2000 is actually stored in memory as 5 numbers:

2d 4e  87 65  31 00  32 00  33 00  00 00    (on an x86 CPU, the low byte comes first)
中     文     1      2      3      \0

A total of 10 bytes.
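
The three memory layouts above can be reproduced in Java. The following is only a minimal sketch: the class name `StringLayoutDemo` and the helper `hexOf` are illustrative names, not part of the original article, and it assumes the JDK in use ships the GB2312 charset. Java strings carry no terminating zero, so the trailing `00` bytes shown above do not appear.

```java
import java.nio.charset.Charset;

final class StringLayoutDemo {
    /** Encodes text with the named charset and formats the bytes as hex pairs. */
    static String hexOf(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+4E2D U+6587 is the two-character Chinese word used in the text
        String text = "\u4e2d\u6587123";
        System.out.println(hexOf("Bob123", "US-ASCII")); // single-byte string
        System.out.println(hexOf(text, "GB2312"));       // ANSI (multi-byte) string
        System.out.println(hexOf(text, "UTF-16LE"));     // 16-bit units, low byte first
        System.out.println(text.length());               // counts characters, not bytes
    }
}
```

The same five-character string yields 7 bytes under GB2312 and 10 bytes under UTF-16LE, which matches the byte dumps above.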

 

1.2 Characters, bytes, and strings

The key to understanding encoding is to understand the concept of character and byte accurately. These two concepts are easy to confuse. Here we will make a distinction:

Concept: Character
  Description: A mark used by people; a symbol in the abstract sense.
  Example: '1', '中', 'a', '$', '¥', ...

Concept: Byte
  Description: The unit in which a computer stores data, an 8-bit binary number. It is a very concrete piece of storage.
  Example: 0x01, 0x45, 0xFA, ...

Concept: ANSI string
  Description: If the "characters" exist in memory in ANSI-encoded form, one character may be represented by one or more bytes. We call such a string an ANSI string or a multi-byte string.
  Example: "中文123" (7 bytes)

Concept: Unicode string
  Description: If the "characters" exist in memory as their numbers in Unicode, we call such a string a Unicode string or a wide-byte string.
  Example: L"中文123" (10 bytes)

Because the ANSI encodings defined by different standards differ, for a given multi-byte string we must know which encoding it uses before we can know which "characters" it contains. A Unicode string, by contrast, always represents the same "characters" in any environment.
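
To make this concrete, the sketch below decodes the same four bytes (the GB2312 bytes of the two Chinese characters 中文) under two different ANSI encodings. The class and method names are hypothetical, and the sketch assumes the JDK provides the GB2312 and Shift_JIS charsets.

```java
import java.nio.charset.Charset;

final class SameBytesDemo {
    // The four content bytes of the ANSI string example in the text
    static final byte[] BYTES = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};

    /** Decodes the fixed byte string using the named charset. */
    static String decodeAs(String charsetName) {
        return new String(BYTES, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String zh = decodeAs("GB2312");    // two Chinese characters
        String ja = decodeAs("Shift_JIS"); // four half-width katakana
        System.out.println(zh + " (" + zh.length() + " characters)");
        System.out.println(ja + " (" + ja.length() + " characters)");
    }
}
```

The bytes never change; only the rule used to interpret them does. Once decoded, each result is a Unicode string whose content no longer depends on the environment.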

 

1.3 Character sets and encodings

The ANSI encoding standards defined by individual countries and regions each specify only the "characters" needed for their own languages. For example, the Chinese standard (GB2312) does not specify how Korean characters are stored. What these ANSI encoding standards specify has two parts:

  1. Which characters are used, that is, which Chinese characters, letters, and symbols are included in the standard. The set of included "characters" is called the "character set".
  2. Whether each "character" is stored in one byte or several, and which bytes are used. This rule is called the "encoding".

When countries and regions define their standards, they generally define the "character set" and the "encoding" together. As a result, what we usually call a "character set", such as GB2312, GBK, or JIS, refers not only to the set of characters but also to the encoding.

The "Unicode character set" contains all the "characters" used in the various languages. Several standards exist for encoding the Unicode character set, such as UTF-8, UTF-7, UTF-16, UnicodeLittle, and UnicodeBig.

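As a small illustration of one character set having several encodings, the sketch below encodes the single character U+4E2D (中) with a few of Java's Unicode charsets. The class name and helper are illustrative only. Note that Java's "UTF-16" charset writes a big-endian byte-order mark before the data when encoding.

```java
import java.nio.charset.Charset;

final class UnicodeEncodingsDemo {
    /** Encodes text with the named charset and formats the bytes as hex pairs. */
    static String hexOf(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // one character, Unicode number 0x4E2D
        for (String name : new String[] {"UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"}) {
            System.out.println(name + ": " + hexOf(zhong, name));
        }
    }
}
```

The character's number stays 0x4E2D throughout; only the byte representation differs from one encoding to the next.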
 

1.4 Introduction to common encodings

This section briefly introduces the common encoding rules in preparation for the sections that follow. Based on the characteristics of their rules, we divide all encodings into three categories:

Category: Single-byte character encoding
  Encoding standards: ISO-8859-1
  Description: The simplest encoding rule: each byte value is used directly as a Unicode character number. For example, converting the two bytes [0xD6, 0xD0] to a string through ISO-8859-1 yields the two Unicode characters [0x00D6, 0x00D0], that is, "ÖÐ".

  Conversely, converting a Unicode string to a byte string through ISO-8859-1 works only for characters in the range 0 to 255.

Category: ANSI encoding
  Encoding standards: GB2312, BIG5, Shift_JIS, ISO-8859-2, ...
  Description: When a Unicode string is converted to a byte string through an ANSI encoding, each Unicode character may become one or more bytes, according to that encoding's rules.

  Conversely, when a byte string is converted to a string, several bytes may become a single character. For example, converting the two bytes [0xD6, 0xD0] to a string through GB2312 yields the single character [0x4E2D], that is, "中".

  Features of ANSI encodings:
  1. Each ANSI encoding standard can handle only the Unicode characters within its own language range.
  2. The relationship between a Unicode character and its converted bytes is defined by people.

Category: Unicode encoding
  Encoding standards: UTF-8, UTF-16, UnicodeBig, ...
  Description: As with ANSI encodings, when a string is converted to a byte string through a Unicode encoding, one Unicode character may become one or more bytes.

  Differences from ANSI encodings:
  1. The Unicode encodings can handle all Unicode characters.
  2. The relationship between a Unicode character and its converted bytes can be computed.

In fact, there is no need to study exactly which bytes each encoding produces; we only need to know that "encoding" means converting "characters" into "bytes". For the Unicode encodings, because the bytes can be computed, we can look into the rules of a particular Unicode encoding when a special situation requires it.
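
As an illustration of what "can be computed" means, the sketch below derives the UTF-8 bytes of a character from its Unicode number using nothing but bit arithmetic, with no lookup table. It is a simplified sketch under stated assumptions: the class name `Utf8ByHand` is hypothetical, and it deliberately handles only numbers up to 0xFFFF (excluding surrogates), so the 4-byte form of UTF-8 is left out.

```java
final class Utf8ByHand {
    /** Encodes one Unicode number below 0x10000 as UTF-8 byte values. */
    static int[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0xffff
                || (codePoint >= 0xd800 && codePoint <= 0xdfff)) {
            throw new IllegalArgumentException("sketch handles 0x0000-0xFFFF without surrogates");
        }
        if (codePoint < 0x80) {
            return new int[] {codePoint};                    // 0xxxxxxx
        }
        if (codePoint < 0x800) {
            return new int[] {0xc0 | (codePoint >> 6),       // 110xxxxx
                              0x80 | (codePoint & 0x3f)};    // 10xxxxxx
        }
        return new int[] {0xe0 | (codePoint >> 12),          // 1110xxxx
                          0x80 | ((codePoint >> 6) & 0x3f),  // 10xxxxxx
                          0x80 | (codePoint & 0x3f)};        // 10xxxxxx
    }

    /** Formats byte values as hex pairs separated by spaces. */
    static String hex(int[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x4e2d))); // the character U+4E2D
        System.out.println(hex(encode(0x00d6))); // the character U+00D6
    }
}
```

No comparable formula exists for an ANSI encoding such as GB2312, where the mapping has to come from a table defined by the standard.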

 

2. Characters and encoding in program implementation

2.1 Characters and bytes in programs

In C++ and Java, the data types used to represent "characters" and "bytes", and the methods used for encoding conversion, are:

Type or operation       C++                                   Java
Character               wchar_t                               char
Byte                    char                                  byte
ANSI string             char[]                                byte[]
Unicode string          wchar_t[]                             String
Byte string → string    mbstowcs(), MultiByteToWideChar()     string = new String(bytes, "encoding")
String → byte string    wcstombs(), WideCharToMultiByte()     bytes = string.getBytes("encoding")

Note the following points:

  1. In Java, char represents a Unicode character (a wide-byte character), whereas in C++ char represents a byte.
  2. MultiByteToWideChar() and WideCharToMultiByte() are Windows API functions.

 

2.2 Related implementation methods in C++

Declaring string constants:

// ANSI string, with a content length of 7 bytes
char sz[20] = "中文123";

// Unicode string, with a content length of 5 wchar_t (10 bytes)
wchar_t wsz[20] = L"\x4e2d\x6587\x0031\x0032\x0033";

Unicode string I/O operations, and conversion between characters and bytes:

// Set the current ANSI encoding at run time (Visual C++ format)
setlocale(LC_ALL, ".936");

// GCC format
setlocale(LC_ALL, "zh_CN.GBK");

// Visual C++ uses lowercase %s here and writes to the file in the encoding set by setlocale
// GCC uses uppercase %S
fwprintf(fp, L"%s\n", wsz);

// Convert the Unicode string to a byte string using the encoding set by setlocale
wcstombs(sz, wsz, 20);
// Convert the byte string to a Unicode string using the encoding set by setlocale
mbstowcs(wsz, sz, 20);

In Visual C++, Unicode string constants can be written more simply. If the encoding of the source file differs from the current default ANSI encoding, use #pragma setlocale to tell the compiler which encoding the source file uses:

// If the source file's encoding differs from the current default ANSI encoding,
// this line is required. It tells the compiler which encoding the source file uses.
#pragma setlocale(".936")

// Unicode string constant, with a content length of 10 bytes
wchar_t wsz[20] = L"中文123";

Note that #pragma setlocale and setlocale(LC_ALL, "") do different jobs: #pragma setlocale takes effect at compile time, while setlocale() takes effect at run time.

 

2.3 Related implementation methods in Java

The content of a String is a Unicode string:

// Chinese characters can be written directly in Java code
String string = "中文123";

// The length is 5, because there are 5 characters
System.out.println(string.length());

String I/O operations, and conversion between characters and bytes. In the Java package java.io.*, classes whose names end in "Stream" generally operate on byte strings, while classes whose names end in "Reader" or "Writer" generally operate on strings.

// Conversion between strings and byte strings

// Get the bytes according to GB2312 (yields a multi-byte string)
byte[] bytes = string.getBytes("gb2312");

// Get the Unicode string back from the bytes according to GB2312
string = new String(bytes, "gb2312");

// There are two ways to write a string to a text file in a given encoding:

// Method 1: use a Stream class to write the byte string that has already been encoded
OutputStream os = new FileOutputStream("1.txt");
os.write(bytes);
os.close();

// Method 2: construct a Writer with the given encoding and write the string
Writer ow = new OutputStreamWriter(new FileOutputStream("2.txt"), "gb2312");
ow.write(string);
ow.close();

/* The resulting 1.txt and 2.txt are both 7 bytes long */
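
Reading works the same way in reverse: a Reader built with the right encoding turns the file's bytes back into a string. The sketch below is one possible way to check this, not the article's own code. The class `FileRoundTripDemo` and its methods are hypothetical names, it writes to a temporary file instead of 1.txt or 2.txt, and it assumes the JDK provides the GB2312 charset.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileRoundTripDemo {
    /** Writes text with one charset, reads it back with another, returns what was read. */
    static String writeThenRead(String text, String writeCharset, String readCharset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                try (Writer w = new OutputStreamWriter(
                        Files.newOutputStream(file), Charset.forName(writeCharset))) {
                    w.write(text); // string -> bytes happens inside the Writer
                }
                StringBuilder sb = new StringBuilder();
                try (Reader r = new InputStreamReader(
                        Files.newInputStream(file), Charset.forName(readCharset))) {
                    int c;
                    while ((c = r.read()) != -1) { // bytes -> string happens inside the Reader
                        sb.append((char) c);
                    }
                }
                return sb.toString();
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587123";
        System.out.println(writeThenRead(text, "GB2312", "GB2312").equals(text));
        // Reading with the wrong encoding gives one character per byte
        System.out.println(writeThenRead(text, "GB2312", "ISO-8859-1").length());
    }
}
```

Reading the 7-byte file back as ISO-8859-1 yields 7 characters instead of 5, which is the kind of garbled result discussed in section 3.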

If the encoding of the Java source file differs from the current default ANSI encoding, the source encoding must be specified at compile time. For example:

E:\>javac -encoding BIG5 Hello.java

Note the distinction between the encoding of the source file and the encoding used for I/O: the former takes effect at compile time, the latter at run time.

 

3. Several misunderstandings, and the causes of garbled text and how to fix it

3.1 Misunderstandings that arise easily

Misunderstandings about encoding:

Misunderstanding 1
When converting a byte string into a Unicode string, for example when reading a text file or receiving text over the network, it is easy to treat the byte string simply as a single-byte string and to take each byte as one character.

In fact, in a non-English environment the byte string should be treated as an ANSI string and converted with the appropriate encoding to obtain the Unicode string; several bytes may yield a single character.

Programmers who have always developed in English environments are prone to this misunderstanding.

Misunderstanding 2
In non-Unicode environments such as DOS and Windows 98, strings exist as ANSI-encoded bytes, and such a byte string can be used correctly only if its encoding is known. This has given us a habit of thought: "the encoding of a string".

Once Unicode is supported, a String in Java is stored as character numbers, not as bytes in some encoding, so the notion of "the encoding of a string" no longer applies. Encoding matters only when converting between a string and a byte string, or when a byte string is treated as an ANSI string.

Many people hold this misunderstanding.

The first misunderstanding is often what produces garbled text. The second often makes garbled-text problems that would have been easy to correct more complicated.

Note that the conversion described in misunderstanding 1, treating each byte as one character, is in fact equivalent to converting with ISO-8859-1. This is why we often use bytes = string.getBytes("iso-8859-1") to reverse the operation and recover the original byte string, and then apply the correct ANSI encoding, for example string = new String(bytes, "gb2312"), to obtain the correct Unicode string.
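
The repair described above can be sketched as follows. The class name `MojibakeRepair` and the method `repair` are illustrative names only, and the sketch assumes the JDK provides the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    /** Undoes a byte-per-character (ISO-8859-1) decoding and redecodes correctly. */
    static String repair(String garbled, String actualCharsetName) {
        // Each character 0x00XX maps back to the single byte 0xXX
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, Charset.forName(actualCharsetName));
    }

    public static void main(String[] args) {
        byte[] gb2312Bytes = {(byte) 0xd6, (byte) 0xd0};
        // Misunderstanding 1: every byte becomes one character
        String garbled = new String(gb2312Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());          // two characters instead of one
        System.out.println(repair(garbled, "GB2312")); // the single intended character
    }
}
```

This only works when the wrong step really was a byte-per-character conversion. If the text was decoded with some other encoding that cannot represent every byte value, the original bytes may already be lost.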

 

3.2 Garbled text when non-Unicode programs are moved between language environments

Strings in non-Unicode programs exist in some ANSI encoding. If the program runs in a language environment different from the one it was developed in, its ANSI strings are displayed incorrectly.

For example, a non-Unicode Japanese program developed in a Japanese environment shows garbled text in its interface when run in a Chinese environment. If the program's interface strings are instead recorded as Unicode, normal Japanese is displayed when it runs in a Chinese environment.

For practical reasons we sometimes have to run non-Unicode Japanese software on a Chinese operating system. In that case, tools such as NJStar (南极星) or AppLocale can temporarily simulate a different language environment.

 

3.3 Strings submitted from web pages

When a form on a page submits a string, the string is first converted to a byte string according to the encoding of the current page. Each byte is then converted to the form "%XX" and submitted to the web server. For example, when a page encoded in GB2312 submits the character "中", the content sent to the server is "%D6%D0".

On the server side, the web server converts the received "%D6%D0" into the two bytes [0xD6, 0xD0] and then obtains the character "中" according to the GB2312 encoding rules.

On a Tomcat server, garbled results from request.getParameter() are often caused by misunderstanding 1 above. By default, when "%D6%D0" is submitted to Tomcat, request.getParameter() returns the two Unicode characters [0x00D6, 0x00D0] instead of the single character "中". We therefore need to use bytes = string.getBytes("iso-8859-1") to recover the original byte string, and then string = new String(bytes, "gb2312") to obtain the correct string "中".
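
The percent-encoding step can be tried out with the JDK's own `URLEncoder` and `URLDecoder`, without a servlet container. This is only a sketch: the class `FormEncodingDemo` and its methods `submit` and `receive` are hypothetical stand-ins for the browser and the server, and it assumes Java 10 or later (for the overloads that take a `Charset`) and a JDK that provides GB2312.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;

final class FormEncodingDemo {
    /** What the browser does: string -> bytes in the page encoding -> %XX form. */
    static String submit(String value, String pageCharset) {
        return URLEncoder.encode(value, Charset.forName(pageCharset));
    }

    /** What the server does: %XX form -> bytes -> string in the assumed encoding. */
    static String receive(String encoded, String serverCharset) {
        return URLDecoder.decode(encoded, Charset.forName(serverCharset));
    }

    public static void main(String[] args) {
        String sent = submit("\u4e2d", "GB2312");
        System.out.println(sent);                        // the percent-encoded bytes
        System.out.println(receive(sent, "GB2312"));     // server assumes the page encoding
        System.out.println(receive(sent, "ISO-8859-1")); // server assumes one byte per character
    }
}
```

The submitted text carries no record of which encoding produced it, so the server has to be told, or has to assume, the encoding of the page.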

 

3.4 Reading strings from a database

When a database client (such as ODBC or JDBC) reads strings from a database server, the client needs to learn from the server which ANSI encoding is in use. When the server sends a byte stream to the client, the client is responsible for converting that byte stream into a Unicode string using the correct encoding.

If strings read from the database come out garbled even though the data stored in the database is correct, the cause is often misunderstanding 1 above. The fix is again string = new String(string.getBytes("iso-8859-1"), "gb2312"): recover the original byte string, then convert it to a string with the correct encoding.

 

3.5 Strings in email

When a piece of text or HTML is sent by email, the string is first converted to a byte string using a specified character encoding, and that byte string is then converted to another byte string using a specified transfer encoding (Content-Transfer-Encoding). For example, opening the source of an email shows content like this:

Content-Type: text/plain;
        charset="gb2312"
Content-Transfer-Encoding: base64

SBG + qcrquqo17cf4yee74bgjz9w7 + b3wudza7dbq0mqnc1_kvpkzxqo6uqo17cnnsapw0ndedqoncg =

The most commonly used Content-Transfer-Encodings are Base64 and Quoted-Printable. When converting binary files or Chinese text, Base64 produces a shorter byte string than Quoted-Printable. When converting English text, Quoted-Printable produces a shorter byte string than Base64.
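
The size difference can be checked with a small experiment. The JDK has a Base64 encoder but no Quoted-Printable encoder, so the sketch below includes a deliberately simplified one: it ignores soft line breaks and the rules for trailing whitespace, and is meant only for comparing lengths. All class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Base64;

final class TransferEncodingSizeDemo {
    /** Length of the Base64 form of the text's bytes in the named charset. */
    static int base64Length(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        return Base64.getEncoder().encodeToString(bytes).length();
    }

    /** Simplified Quoted-Printable: printable ASCII stays, everything else becomes =XX. */
    static String quotedPrintable(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            int v = b & 0xff;
            boolean literal = v == ' ' || (v >= 33 && v <= 126 && v != '=');
            if (literal) {
                sb.append((char) v);
            } else {
                sb.append(String.format("=%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // 4 bytes in GB2312
        String english = "Hello, world"; // 12 bytes in ASCII
        System.out.println(base64Length(chinese, "GB2312") + " vs "
                + quotedPrintable(chinese, "GB2312").length());
        System.out.println(base64Length(english, "US-ASCII") + " vs "
                + quotedPrintable(english, "US-ASCII").length());
    }
}
```

Base64 always costs 4 characters per 3 bytes, while Quoted-Printable costs 3 characters for every byte it must escape and 1 for every byte it can leave alone, which is why the winner depends on the content.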

The mail subject uses a shorter format to mark the "character encoding" and the "transfer encoding". For example, if the subject is "中", it is written as:

// Correct subject format
Subject: =?GB2312?B?1tA=?=

Where:

  • The part between the first "=?" and the next "?" specifies the character encoding, which is GB2312 in this example.
  • The "B" between the next two "?" marks stands for Base64. A "Q" would stand for Quoted-Printable.
  • The part between the last "?" and the closing "?=" is the subject content, converted to a byte string with GB2312 and then encoded with Base64.

If the "transfer encoding" is changed to Quoted-Printable, the same subject "中" becomes:

// Correct subject format
Subject: =?GB2312?Q?=D6=D0?=

If an email is garbled when read, it is generally because the "character encoding" or the "transfer encoding" was specified incorrectly or not specified at all. For example, some mail components send the subject "中" like this:

// Incorrect subject format
Subject: =?ISO-8859-1?Q?=D6=D0?=

This declares that the subject is [0x00D6, 0x00D0], that is, "ÖÐ", rather than "中".
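
The subject format can be explored with a small encoder and decoder. This is a minimal sketch, not a complete mail header parser: the class `EncodedWordDemo` and its methods are hypothetical, it handles a single encoded word only, and it assumes the JDK provides the GB2312 charset.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;

final class EncodedWordDemo {
    /** Builds =?charset?B?...?= from a string. */
    static String encodeB(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        return "=?" + charsetName + "?B?" + Base64.getEncoder().encodeToString(bytes) + "?=";
    }

    /** Decodes one =?charset?B|Q?...?= word back into a string. */
    static String decode(String word) {
        String[] parts = word.split("\\?");
        if (parts.length != 5 || !parts[0].equals("=") || !parts[4].equals("=")) {
            throw new IllegalArgumentException("not a single encoded word: " + word);
        }
        byte[] bytes;
        if (parts[2].equalsIgnoreCase("B")) {
            bytes = Base64.getDecoder().decode(parts[3]);
        } else if (parts[2].equalsIgnoreCase("Q")) {
            bytes = decodeQ(parts[3]);
        } else {
            throw new IllegalArgumentException("unknown transfer encoding: " + parts[2]);
        }
        // The declared charset decides which characters the bytes become
        return new String(bytes, Charset.forName(parts[1]));
    }

    private static byte[] decodeQ(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '=') {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '_') {
                out.write(' '); // in headers, underscore stands for a space
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeB("\u4e2d", "GB2312"));
        System.out.println(decode("=?GB2312?Q?=D6=D0?="));
        System.out.println(decode("=?ISO-8859-1?Q?=D6=D0?=")); // wrong charset declared
    }
}
```

The two Q-encoded subjects carry identical bytes; only the declared character encoding differs, and that alone decides whether the reader sees the intended character.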

 

4. Correcting several misunderstandings

Misunderstanding: "ISO-8859-1 is an international encoding, isn't it?"

No. ISO-8859-1 is just the simplest single-byte character set, that is, the encoding rule in which the "byte value" equals the "Unicode character number". When we need to convert a byte string into a string without yet knowing which ANSI encoding it uses, we can temporarily convert each byte as one character, which loses no information. Later, bytes = string.getBytes("iso-8859-1") restores the original byte string.

Misunderstanding: "How can I find out the internal code of a string in Java?"

In Java, the string class java.lang.String handles Unicode strings, not ANSI strings. We only need to treat a string as a sequence of abstract symbols. The question of a string's internal code therefore does not arise.
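
One way to see this is to build the same character from two different byte strings. In the sketch below (the class and method names are illustrative, and the JDK is assumed to provide GB2312), the resulting String objects are equal, and nothing about them records which encoding the bytes came from.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class NoInternalCodeDemo {
    /** The character U+4E2D decoded from its GB2312 bytes. */
    static String fromGb2312() {
        byte[] bytes = {(byte) 0xd6, (byte) 0xd0};
        return new String(bytes, Charset.forName("GB2312"));
    }

    /** The same character decoded from its UTF-8 bytes. */
    static String fromUtf8() {
        byte[] bytes = {(byte) 0xe4, (byte) 0xb8, (byte) 0xad};
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(fromGb2312().equals(fromUtf8()));
        System.out.println(Integer.toHexString(fromGb2312().codePointAt(0)));
    }
}
```

An encoding has to be named again only at the moment the string is turned back into bytes.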

 
