ASCII, ANSI, Unicode encoding


3.1 ASCII encoding

The following is adapted from Wikipedia:

ASCII (American Standard Code for Information Interchange) is a character encoding system based on the Latin alphabet. It is mainly used to represent modern English, and its extended version EASCII can barely cover other Western European languages. It is the most widely used single-byte encoding system today (though Unicode is steadily displacing it) and is equivalent to ISO/IEC 646.

ASCII was first published in canonical form in 1967 and last updated in 1986. It defines 128 characters in total, of which 33 are non-printing (on current operating systems, at least; in DOS mode some of these codes were rendered as glyphs such as smiley faces and card suits). These 33 are mostly obsolete control characters, whose main purpose is to manipulate the text stream. The other 95 characters are printable, counting the blank produced by pressing the space bar as one printable character (the space).

ASCII table: see http://zh.wikipedia.org/zh-cn/ASCII

Disadvantages of ASCII:

The biggest disadvantage of ASCII is that it covers only the 26 basic Latin letters, Arabic numerals, and English punctuation, so it can only represent modern American English (and when handling loanwords such as naïve, café, and élite, the accents must be dropped, even though this violates spelling conventions). EASCII solved part of the display problem for Western European languages, but for most other languages it is still powerless. Apple computers, for example, have since abandoned ASCII and switched to Unicode.

In the first stage, the internal code of the English DOS operating system was ASCII. Computers at this point supported only English; other languages could not be stored or displayed.

At this stage, strings are single-byte strings (SBCS, Single-Byte Character Set): one byte holds one character. For example, "Bob123" occupies 6 bytes.
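To see this concretely, here is a minimal Java sketch (Java is also used in section 7.3 below); it assumes nothing beyond the standard library:

    import java.nio.charset.StandardCharsets;

    public class SbcsDemo {
        public static void main(String[] args) {
            // In a single-byte encoding, one character = one byte.
            byte[] ascii = "Bob123".getBytes(StandardCharsets.US_ASCII);
            System.out.println(ascii.length); // prints 6
        }
    }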

3.2 ANSI Encoding

To enable computers to support more languages, the byte range 0x80~0xFF was used, with 2 bytes representing one character. For example, the Chinese character '中' is stored in a Chinese operating system as the two bytes [0xD6, 0xD0].

Different countries and regions set different standards, producing GB2312, Big5, JIS, and other encoding standards. These Chinese-character (and similar) extension encodings, which use 2 bytes to represent one character, are collectively called ANSI encodings. Under a Simplified Chinese system, ANSI encoding means GB2312; under a Japanese operating system, it means JIS.

Different ANSI encodings are mutually incompatible: when information is exchanged internationally, text in two different languages cannot be stored in the same piece of ANSI-encoded text.

Chinese DOS and the Chinese/Japanese Windows 95/98-era systems used ANSI encoding (localization) as their internal code.

At the stage where ANSI encoding is used to support multiple languages, each character is represented by one or more bytes (MBCS, Multi-Byte Character Set), so characters stored this way are also called multi-byte characters. For example, "中文123" occupies 7 bytes in memory on Chinese Windows 95: each Chinese character takes 2 bytes, and each English letter or digit takes 1 byte.
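As a quick check, a Java sketch of the byte count (assuming the JRE supports the GBK charset, as common JREs do):

    import java.nio.charset.Charset;

    public class MbcsDemo {
        public static void main(String[] args) {
            // "中文123": two Chinese characters plus three ASCII digits.
            byte[] ansi = "中文123".getBytes(Charset.forName("GBK"));
            System.out.println(ansi.length); // prints 7: 2 bytes per hanzi, 1 per digit
        }
    }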

In non-Unicode environments, characters can easily fail to display properly because different countries and regions adopted inconsistent character sets. Microsoft addressed this with code page (codepage) conversion tables: a specified table converts a non-Unicode character encoding into the Unicode encoding used internally for the same character. In the language and locale settings you can select a code page as the default encoding for non-Unicode programs, e.g. 936 for Simplified Chinese GBK and 950 for Traditional Chinese Big5 (as used on PCs). With the code page set for Chinese processing, software and documents written in some non-English European languages are likely to appear garbled, and vice versa; within the code page scheme this problem is unavoidable. Fundamentally, the solution is to adopt one unified encoding everywhere, but at the time that was not yet feasible.

Code page technology is now widely used across platforms. The code page for UTF-7 is 65000, and for UTF-8 it is 65001.

3.3 Unicode encoding

To facilitate international information exchange, international organizations developed the Unicode character set, which gives every character in every language a uniform, unique number, meeting the needs of cross-language, cross-platform text conversion and processing.

The Unicode character set can be abbreviated UCS (Universal Character Set). The early standards came in UCS-2 and UCS-4 flavors: UCS-2 encodes each character in two bytes, UCS-4 in four bytes.

After Unicode is adopted, when the computer stores a string it instead stores each character's ordinal number in the Unicode character set. Currently computers typically use 2 bytes (16 bits) per ordinal (DBCS, Double-Byte Character Set), so characters stored this way are also called wide characters. For example, under Windows 2000 the string "中文123" is stored in memory as 5 ordinals, 10 bytes in total.
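A sketch of the contrast, using the fact that Java strings store 16-bit code units internally:

    public class WideCharDemo {
        public static void main(String[] args) throws Exception {
            String s = "中文123";
            System.out.println(s.length());                    // 5 characters (ordinals)
            System.out.println(s.getBytes("UTF-16LE").length); // 10 bytes, 2 per character
        }
    }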

The Unicode character set contains all the characters used in all languages. There are many standards for encoding the Unicode character set, such as UTF-8, UTF-7, UTF-16, UnicodeLittle, UnicodeBig, and so on.

4. Common encoding rules

4.1 Single-byte character encoding

(1) Encoding standard: ISO-8859-1.

(2) Description: the simplest encoding rule, each byte is taken directly as one Unicode character. For example, the two bytes [0xD6, 0xD0], converted to a string via ISO-8859-1, become the two Unicode characters [0x00D6, 0x00D0], that is, "ÖÐ".

Conversely, when converting a Unicode string to a byte string via ISO-8859-1, only characters in the 0~255 range convert correctly.
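A Java sketch of both directions (standard library only):

    import java.nio.charset.StandardCharsets;

    public class Latin1Demo {
        public static void main(String[] args) {
            // Decoding: each byte becomes the Unicode character with the same value.
            byte[] bytes = { (byte) 0xD6, (byte) 0xD0 };
            String s = new String(bytes, StandardCharsets.ISO_8859_1);
            System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // 00D6 00D0

            // Encoding: characters above U+00FF cannot be represented and degrade to '?'.
            System.out.println(new String("中".getBytes(StandardCharsets.ISO_8859_1),
                                          StandardCharsets.ISO_8859_1)); // prints ?
        }
    }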

4.2 ANSI Encoding

(1) Encoding standards: GB2312, Big5, Shift_JIS, ISO-8859-2.

(2) When converting a Unicode string to a byte string via an ANSI encoding, one Unicode character may convert to one byte or to several bytes, depending on the encoding.

Conversely, when converting a byte string to a string, several bytes may convert to a single character. For example, the two bytes [0xD6, 0xD0], converted to a string via GB2312, yield the one character [0x4E2D], the character '中'.
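The corresponding Java sketch ("GB2312" is a charset name supported by common JREs):

    public class Gb2312Demo {
        public static void main(String[] args) throws Exception {
            byte[] bytes = { (byte) 0xD6, (byte) 0xD0 };
            // The two bytes decode to the single character U+4E2D.
            String s = new String(bytes, "GB2312");
            System.out.printf("%s -> %04X%n", s, (int) s.charAt(0)); // 中 -> 4E2D
        }
    }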

Features of ANSI encodings:

(1) Each ANSI encoding standard can only handle the Unicode characters within its own language's range.

(2) The mapping between Unicode characters and the converted bytes is defined by convention, in lookup tables.

4.3 Unicode encoding

(1) Encoding standards: UTF-8, UTF-16, UnicodeBig.

(2) As with ANSI encodings, when a string is converted to a byte string via a Unicode encoding, one Unicode character may convert to one byte or to several bytes.

Unlike ANSI encoding:

(1) These Unicode encodings can handle all Unicode characters.

(2) The mapping between Unicode characters and the converted bytes can be computed directly.

We don't really need to memorize how many bytes each encoding uses per character; we just need to know that "encoding" means converting characters into bytes. For Unicode encodings, because the mapping is computable, in a pinch we can even work out the rule of a given Unicode encoding ourselves.

5. Differences between the encodings

5.1 GB2312, GBK and GB18030

(1) GB2312

When Chinese users got computers, no byte values were left to represent Chinese characters, yet more than 6,000 commonly used characters had to be stored. So the odd symbols beyond code 127 in extended ASCII were dropped, and the rule became: a byte below 128 means the same as in ASCII, but two consecutive bytes greater than 127 represent one Chinese character, with the leading byte (called the high byte) running from 0xA1 to 0xF7 and the trailing byte (the low byte) from 0xA1 to 0xFE. This makes it possible to compose roughly 7,000 simplified Chinese characters. These codes also took in mathematical symbols, Roman and Greek letters, and Japanese kana; even the digits, punctuation, and letters already in ASCII were re-encoded as two-byte-long codes. These are the so-called "full-width" characters, while the original characters below 128 are called "half-width" characters. This Chinese character scheme is called GB2312. GB2312 is a Chinese extension of ASCII and is compatible with ASCII.

(2) GBK

But there are too many Chinese characters. It was soon found that many people's names still could not be typed, so the code points GB2312 had left unused were pressed into service. When even those were not enough, the requirement that the low byte be in the post-127 range was simply dropped: as long as the first byte is greater than 127, it always marks the start of a Chinese character, regardless of what follows. This expanded scheme is called the GBK standard. GBK includes all of GB2312 and adds nearly 20,000 new characters (including traditional characters) and symbols.

(3) GB18030

Later, ethnic minorities also needed to use computers, so the set was expanded again, adding several thousand minority-script characters: GBK grew into GB18030. From then on, the culture of the Chinese nation could be carried forward in the computer age.

Chinese programmers, seeing this series of Chinese character encoding standards, called them DBCS (Double-Byte Character Set). The defining feature of the DBCS family is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme, so a program that supports Chinese must watch the value of every byte in a string: if a byte is greater than 127, a character from the double-byte set has begun. In that world, "one Chinese character counts as two English characters!" In a Unicode environment, however, this is no longer always the case.
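A minimal sketch of that byte-scanning rule, under the simplifying DBCS assumption that any lead byte above 127 is followed by exactly one trailing byte:

    public class DbcsScan {
        // Count the characters in a DBCS (e.g. GBK) byte string.
        static int countChars(byte[] dbcs) {
            int count = 0;
            for (int i = 0; i < dbcs.length; i++) {
                if ((dbcs[i] & 0xFF) > 127) i++; // lead byte: skip the trailing byte
                count++;
            }
            return count;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(countChars("中文123".getBytes("GBK"))); // prints 5
        }
    }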

5.2 Unicode and big-endian Unicode

These two differ in storage order. For example, 'A' (Unicode code 65, hex 0x41) is stored as the bytes 41 00 in (little-endian) Unicode and as 00 41 in big-endian Unicode.
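In Java the two orders can be seen directly:

    public class ByteOrderDemo {
        public static void main(String[] args) throws Exception {
            for (byte b : "A".getBytes("UTF-16LE")) System.out.printf("%02X ", b & 0xFF); // 41 00
            System.out.println();
            for (byte b : "A".getBytes("UTF-16BE")) System.out.printf("%02X ", b & 0xFF); // 00 41
        }
    }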

5.3 UTF-7, UTF-8 and UTF-16

In Unicode, all characters are treated equally. A Chinese character is no longer "two extended-ASCII bytes" but "one Unicode character"; since a Chinese character is now one character, problems such as splitting strings and counting words solve themselves.

But the world is not ideal: it was impossible for every system to switch to Unicode overnight, so from the day it was born Unicode had to face a serious problem: incompatibility with the ASCII character set.

We know that an ASCII character is a single byte; the ASCII code of 'A', for example, is 65. Unicode is double-byte: 'A' is still 65 (U+0041), but it occupies two bytes. This creates a very big problem: the machinery that used to handle ASCII cannot be used to handle Unicode.

An even more serious problem is that C uses '\0' to terminate strings, and many Unicode characters contain a zero byte, so C string functions cannot handle Unicode properly, short of replacing every C program in the world along with every library they use.

Thus something even greater than Unicode was born. It is greater because it lets Unicode exist not merely on paper but for real inside all of our computers. That is: UTF.

UTF = UCS Transformation Format, a format for transforming (transferring) UCS.

It is the set of rules that maps the Unicode encoding onto the encodings computers actually use. Two UTFs are popular today: UTF-8 and UTF-16.

Both are implementations of the Unicode encoding.

5.3.1 UTF-8

UCS-2 code (hex)    UTF-8 byte stream (binary)

0000-007F           0xxxxxxx

0080-07FF           110xxxxx 10xxxxxx

0800-FFFF           1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of the character "汉" (Han) is 6C49. 6C49 lies between 0800 and FFFF, so the 3-byte template is used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 110001 001001; substituting this bit stream, in order, for the x's in the template yields 11100110 10110001 10001001, that is, E6 B1 89.
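The result can be verified with a short Java check:

    import java.nio.charset.StandardCharsets;

    public class Utf8Demo {
        public static void main(String[] args) {
            // '汉' is U+6C49; the 3-byte template yields E6 B1 89.
            for (byte b : "汉".getBytes(StandardCharsets.UTF_8)) {
                System.out.printf("%02X ", b & 0xFF); // prints E6 B1 89
            }
            System.out.println();
        }
    }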

So UTF-8 is a variable-length encoding of Unicode: characters 00000000-0000007F are represented in one byte, characters 00000080-000007FF in two bytes, and characters 00000800-0000FFFF in three bytes. Since the Unicode specification at the time defined no characters beyond FFFF, UTF-8 used at most 3 bytes per character; in theory, though, the original UTF-8 design allows up to 6 bytes per character.

UTF-8 is compatible with ASCII.

5.3.2 UTF-16 (standard Unicode is called UTF-16)

The UTF-16 encoding is consistent with the encoding of Unicode itself.

UTF-16 encodes UCS in 16-bit units. For UCS codes below 0x10000, the UTF-16 encoding equals the 16-bit unsigned integer of the UCS code. For UCS codes at or above 0x10000, an algorithm (surrogate pairs) is defined. Since the characters actually in use in UCS-2, and the BMP of UCS-4, all lie below 0x10000, UTF-16 and UCS-2 can for now be regarded as basically the same thing. But UCS-2 is merely an encoding scheme, while UTF-16 is used for actual transmission, so the question of byte order must be considered.
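For characters at or above 0x10000 the algorithm produces a surrogate pair, i.e. two 16-bit units; a small Java sketch (Java strings are sequences of UTF-16 code units):

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+20000 is a CJK ideograph outside the BMP.
            String s = new String(Character.toChars(0x20000));
            System.out.println(s.length());                      // 2 UTF-16 code units
            System.out.println(s.codePointCount(0, s.length())); // 1 actual character
        }
    }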

UTF-16 is not compatible with ASCII.

5.3.3 UTF-7

UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding that represents Unicode characters as a stream of ASCII characters, for use in applications such as e-mail that can only handle 7-bit data.

UTF-7 is not part of the Unicode standard. For more information, consult the references.

6. Unicode and UTF

Unicode is an in-memory representation scheme for characters (a specification), while UTF is a scheme for how Unicode is saved and transmitted (an implementation).

6.1 UTF byte-order and BOM

6.1.1 Byte order

UTF-8 uses the byte as its encoding unit, so it has no byte-order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting UTF-16 text we must first establish the byte order of each unit. For example, the Unicode code of "奎" (kuí) is 594E and that of "乙" (yǐ) is 4E59. If we receive the UTF-16 byte stream 59 4E, is it "奎" or "乙"?

The Unicode specification's recommended way to mark byte order is the BOM. BOM here does not stand for "Bill of Materials" but for Byte Order Mark. The BOM is a rather clever idea:

UCS contains a character called ZERO WIDTH NO-BREAK SPACE, whose code is FEFF. FFFE, on the other hand, is not a valid UCS character, so it should never appear in actual transmission. The UCS specification recommends transmitting the character ZERO WIDTH NO-BREAK SPACE before transmitting a byte stream.

Thus if a receiver sees FE FF, the byte stream is big-endian; if it sees FF FE, the stream is little-endian. This is why the character ZERO WIDTH NO-BREAK SPACE is also called the BOM.

UTF-8 needs no BOM to indicate byte order, but a BOM can be used to mark the encoding. The UTF-8 encoding of ZERO WIDTH NO-BREAK SPACE is EF BB BF (readers can verify this with the encoding method described above). So if a receiver sees a byte stream beginning with EF BB BF, it knows the stream is UTF-8.
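Putting the last three paragraphs together, a minimal Java sketch of BOM detection (it ignores the UTF-32 BOMs listed in section 6.2):

    import java.io.FileInputStream;
    import java.io.IOException;

    public class BomSniffer {
        // Returns the encoding implied by the file's opening bytes, or null if no BOM.
        static String sniffBom(String path) throws IOException {
            try (FileInputStream in = new FileInputStream(path)) {
                byte[] b = new byte[3];
                int n = in.read(b);
                if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                           && (b[2] & 0xFF) == 0xBF) return "UTF-8";
                if (n >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16BE";
                if (n >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16LE";
                return null; // no BOM: fall back to the methods of section 6.2
            }
        }
    }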

6.1.2 BOM

(1) The origin of the BOM

To identify Unicode files, Microsoft recommends that every Unicode file begin with the character ZERO WIDTH NO-BREAK SPACE (U+FEFF). This acts as a "signature" or "byte order mark (BOM)" identifying the encoding and byte order used in the file.

(2) Different systems' support for the BOM

Because some systems or programs do not support the BOM, Unicode files that carry a BOM can sometimes cause problems.

① Readers in JDK 1.5 and earlier cannot handle UTF-8 encoded files with a BOM; when parsing an XML file in this format, an exception is thrown: "Content is not allowed in prolog." (For the workaround, I'll write a separate article to discuss the issue.)

② Linux/Unix does not use the BOM, because it would break the syntax conventions of existing ASCII files.

③ Editing tools differ in how they handle the BOM. When saving a file as UTF-8 with the Notepad that ships with Windows, Notepad automatically inserts a BOM at the start of the file (even though the BOM is optional for UTF-8). Many other editors let you choose whether to write a BOM. This applies to UTF-8 and UTF-16 alike.

(3) BOM and XML

When an XML parser reads an XML document, the W3C defines three rules:

① If the document has a BOM, it defines the file's encoding;

② If the document has no BOM, look at the encoding attribute in the XML declaration;

③ If neither is present, the XML document is assumed to be UTF-8 encoded.

6.2 Determining the character set and encoding of text

Software typically has three ways to determine the character set and encoding of text.

(1) For Unicode text, the most standard way is to detect the first few bytes of the text, for example:

Opening bytes    Charset/encoding

EF BB BF         UTF-8

FF FE            UTF-16/UCS-2, little-endian (UTF-16LE)

FE FF            UTF-16/UCS-2, big-endian (UTF-16BE)

FF FE 00 00      UTF-32/UCS-4, little-endian (UTF-32LE)

00 00 FE FF      UTF-32/UCS-4, big-endian (UTF-32BE)

(2) A safer way is to determine the character set and encoding by consulting the user: pop up a dialog box and ask.

However, MBCS (ANSI) text has no such character-set mark at its beginning, and although much software now saves text as Unicode, writing these marks is optional. Software therefore should not rely on method (1) alone; when the marks are absent it can fall back to the safer approach of asking the user via a dialog box.

(3) Use its own "guess" method.

If the software does not want to bother the user, or cannot conveniently ask, it can only guess: it examines the statistical characteristics of the whole text to guess which charset the text probably belongs to, and the guess may well be wrong. This is exactly what happens when Notepad opens the famous "联通" (Unicom) file: the original ANSI-encoded file is treated as UTF-8 (for details see: http://blog.csdn.net/omohe/archive/2007/05/29/1630186.aspx).
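The effect can be reproduced in Java; this sketch relies only on the GBK and UTF-8 charsets being available:

    import java.nio.charset.StandardCharsets;

    public class UnicomDemo {
        public static void main(String[] args) throws Exception {
            // The GBK bytes of "联通" happen to fit UTF-8's 110xxxxx 10xxxxxx
            // two-byte pattern, so a guessing decoder mistakes the text for UTF-8.
            byte[] gbk = "联通".getBytes("GBK");
            for (byte b : gbk) System.out.printf("%02X ", b & 0xFF);
            System.out.println();
            // Decoded as UTF-8 the text is garbage, not "联通" (Java's strict
            // decoder replaces some of the sequences with U+FFFD).
            System.out.println(new String(gbk, StandardCharsets.UTF_8));
        }
    }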

6.3 Notepad's encoding options

(1) ANSI encoding

Notepad's default save encoding is ANSI, i.e. the local operating system's default internal code; on Simplified Chinese systems this is generally GB2312. How can this be verified? Save a file with Notepad, then open it with a text editor such as EmEditor, EditPlus, or UltraEdit. EmEditor is recommended: after opening the file, the corner of the window shows the encoding, GB2312.

(2) Unicode encoding

Save with Notepad choosing the "Unicode" encoding, open the file with EmEditor, and the encoding format shows as UTF-16LE + BOM (signed). Viewed in hexadecimal, the first two bytes are FF FE. This is the BOM.

(3) Unicode big endian

Save with Notepad choosing the "Unicode big endian" encoding, open the file with EmEditor, and the encoding format shows as UTF-16BE + BOM (signed). Viewed in hexadecimal, the first two bytes are FE FF. This is the BOM.

(4) UTF-8

Save with Notepad choosing the "UTF-8" encoding, open the file with EmEditor, and the encoding format shows as UTF-8 (signed). Viewed in hexadecimal, the first three bytes are EF BB BF. This is the BOM.

7. Common misunderstandings, and the causes of and solutions for garbled text

7.1 Misunderstanding One

When converting a byte string to a Unicode string, e.g. when reading a text file or receiving text over the network, it is tempting to treat the byte string as a single-byte string and convert it on the assumption that each one byte is one character.

In fact, in a non-English environment the byte string should be treated as an ANSI string and converted with the appropriate encoding to obtain the Unicode string; several bytes may well yield one character.

Developers who have always worked in an English environment are especially prone to this misunderstanding.

7.2 Misunderstanding Two

In non-Unicode environments such as DOS and Windows 98, strings exist as ANSI-encoded bytes. Such a byte string can only be used correctly if you know which encoding it uses. This bred a habit of mind: "the encoding of a string".

Once Unicode is supported, strings in Java are stored as character ordinals, not as encoded bytes, so the concept of "the string's encoding" no longer applies. Encoding only enters the picture when converting between strings and byte strings, or when treating a byte string as an ANSI string.

A lot of people have this misunderstanding.

7.3 Analysis and resolution

The first misunderstanding is the most frequent cause of garbled text. The second often turns garbling that could easily have been corrected into a more complicated problem.

Here we can see that misunderstanding one, converting on the assumption that each byte is one character, is in fact equivalent to converting with ISO-8859-1. Therefore, we can often reverse the operation with bytes = str.getBytes("ISO-8859-1") to recover the original byte string, and then use the correct ANSI encoding, e.g. str = new String(bytes, "GB2312"), to obtain the correct Unicode string.
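A sketch of the whole round trip:

    import java.nio.charset.StandardCharsets;

    public class RecoverDemo {
        public static void main(String[] args) throws Exception {
            byte[] original = "中文".getBytes("GB2312");   // the real ANSI bytes
            // Misunderstanding one: decode byte-per-character (i.e. ISO-8859-1).
            String mojibake = new String(original, StandardCharsets.ISO_8859_1);
            // Recovery: reverse the wrong decode, then decode correctly.
            byte[] recovered = mojibake.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(new String(recovered, "GB2312")); // prints 中文
        }
    }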

8. References and further reading

8.1 "Characters, Bytes and Encodings" http://www.regexlab.com/zh/encoding.htm (strongly recommended)

8.2 "About encoding: ASCII (ANSI), GB-2312, Unicode, UTF-8" http://blog.csdn.net/omohe/archive/2007/05/29/1630186.aspx

8.3 "The differences among ANSI, UTF-8, Unicode, and ASCII encodings" http://hi.baidu.com/%D6%F0%C4%BE/blog/item/772c5944d5e77e8bb3b7dcab.html

8.4 Baidu Encyclopedia: "Unicode" http://baike.baidu.com/view/40801.htm

8.5 "What is the relationship or difference between Unicode and UTF-8/UTF-16?" http://zhidao.baidu.com/question/52532619.html?fr=ala0

