An interesting read for programmers: Unicode encodings and related tips

Question one:

The Save As dialog in Windows Notepad can convert a file between the GBK, Unicode, Unicode big endian, and UTF-8 encodings. The results are all .txt files, so how does Windows recognize which encoding a file uses?

I noticed early on that .txt files saved as Unicode, Unicode big endian and UTF-8 start with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But what standard are these markers based on?

Question two:

I recently came across a file called ConvertUTF.c on the net, which converts between the three encodings UTF-32, UTF-16 and UTF-8. I already knew about Unicode (UCS-2), GBK and UTF-8, but this program confused me a little: I could not work out how UTF-16 relates to UCS-2.

After checking the relevant material I finally got these questions clear, and along the way learned some details of Unicode. I wrote this article for friends who have had similar questions. I have tried to make it easy to understand, but the reader is expected to know what a byte is and what hexadecimal is.

0, Big endian and little endian

Big endian and little endian are the two different ways a CPU handles multibyte numbers. For example, the Unicode encoding of the character "汉" is 6C49. When writing it to a file, should 6C come first, or 49? If 6C is written first, it is big endian. If 49 is written first, it is little endian.
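As a small illustration of the two byte orders, here is a minimal Python sketch. The variable names are only for this example; the value 6C49 is the one used in the text.

```python
# The Unicode encoding of "汉" treated as a 16-bit number.
code = 0x6C49

big = code.to_bytes(2, byteorder="big")        # high byte written first
little = code.to_bytes(2, byteorder="little")  # low byte written first

print(big.hex(" "))     # 6c 49
print(little.hex(" "))  # 49 6c
```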

The word "endian" comes from Gulliver's Travels. The civil war in Lilliput arose over whether an egg should be cracked open at the big end (Big-Endian) or at the little end (Little-Endian). This caused six rebellions, in which one emperor lost his life and another lost his throne.

We generally translate "endian" as "byte order", and call big endian and little endian "big tail" and "little tail".

1, Character encoding, internal code, and a brief introduction to Chinese character encodings

Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII; to handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains a total of 7,445 characters: 6,763 Chinese characters and 682 other symbols. In the Chinese character area, the internal code has a high byte in the range B0-F7 and a low byte in the range A1-FE, which gives 72*94=6768 code positions. Five of these, D7FA-D7FE, are vacant.

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21,003 characters. GB18030 (2000) is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters and also covers the scripts of major minority languages such as Tibetan, Mongolian and Uyghur. PC platforms are now required to support GB18030, while embedded products are not yet subject to this requirement, so mobile phones and MP3 players generally support only GB2312.

From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same encoding in all of them, and each later standard supports more characters. In these encodings English and Chinese text can be processed uniformly. A Chinese character code is recognized by the highest bit of its high byte not being 0. In programmers' terminology, GB2312, GBK and GB18030 are all double-byte character sets (DBCS).

The default internal code of some Chinese versions of Windows is still GBK, which can be upgraded to GB18030 with the GB18030 upgrade package. However, the characters that GB18030 adds to GBK are ones that ordinary people rarely use, so we usually still use GBK to refer to the internal code of Chinese Windows.

Here are some details:

The original GB2312 text defines location codes (row and cell numbers). To get from a location code to the internal code, add A0 to the high byte and to the low byte.

In a DBCS, the GB internal code is always stored big endian, that is, high byte first.

In GB2312 the highest bit of both bytes is 1. But only 128*128=16384 code positions satisfy this condition, so in GBK and GB18030 the highest bit of the low byte may not be 1. This does not affect the parsing of a DBCS character stream: when reading the stream, whenever a byte with the high bit set is encountered, that byte and the next one can be treated as one double-byte code, regardless of the high bit of the low byte.
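The details above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `quwei_to_internal` and `split_dbcs` are invented for this example, the input is assumed to be well-formed GB2312 or GBK text, and GB18030 four-byte codes are not handled.

```python
def quwei_to_internal(row, cell):
    """Turn a GB2312 location code (row, cell) into the internal code
    by adding 0xA0 to each part. The high byte is stored first."""
    return bytes([row + 0xA0, cell + 0xA0])


def split_dbcs(data):
    """Split a GB2312/GBK byte stream into one-byte and two-byte units.
    Assumes well-formed input; GB18030 four-byte codes are not handled."""
    units = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:
            # High bit set: this byte and the next form one double-byte
            # code, whatever the high bit of the second byte is.
            units.append(data[i:i + 2])
            i += 2
        else:
            # High bit clear: a single ASCII byte.
            units.append(data[i:i + 1])
            i += 1
    return units


# "汉" has location code 26-26, so its internal code is BA BA.
print(quwei_to_internal(26, 26).hex(" "))  # ba ba

# "丂" is a GBK character whose second byte (0x40) has the high bit clear.
sample = "汉A丂".encode("gbk")
print([u.hex(" ") for u in split_dbcs(sample)])  # ['ba ba', '41', '81 40']
```

Because the decision is made only on the first byte of each unit, the second byte of "丂" (which looks like an ASCII "@") is never misread as a separate character.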

2, Unicode, UCS and UTF

The encodings described so far, from ASCII through GB2312 and GBK to GB18030, are backward compatible. Unicode is compatible only with ASCII (more precisely, with ISO-8859-1) and is not compatible with the GB codes. For example, the Unicode encoding of the character "汉" is 6C49, while its GB code is BABA.

Unicode is also a character encoding method, but it was designed by international organizations and is intended to accommodate the writing systems of all the world's languages. The formal name of Unicode is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as an abbreviation of "Unicode Character Set".

According to Wikipedia (http://zh.wikipedia.org/wiki/), historically two organizations tried to design Unicode independently: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.

Around 1991, both sides realized that the world did not need two incompatible character sets. They began to merge their work and to cooperate on a single code table. Starting with Unicode 2.0, the Unicode project has used the same character repertoire and code points as ISO 10646-1.

At the time of writing, both projects still exist and publish their standards independently. The latest version from the Unicode Consortium is Unicode 4.1.0 (2005). The latest ISO standard is 10646-3:2003.

UCS specifies how to use multiple bytes to represent text in various scripts. How these codes are transmitted is defined by the UTF (UCS Transformation Format) specifications; common ones include UTF-8, UTF-7 and UTF-16.

The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings in the usual RFC style: clear, crisp and rigorous. I can never remember that IETF stands for Internet Engineering Task Force. However, the RFCs that the IETF maintains are the basis of all the specifications on the Internet.

3, UCS-2, UCS-4, BMP

UCS has two forms: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 encodes with four bytes (in fact only 31 bits are used; the highest bit must be 0). Now for some simple arithmetic:

UCS-2 has 2^16=65536 code positions, and UCS-4 has 2^31=2147483648 code positions.

UCS-4 is divided into 2^7=128 groups according to the highest byte (whose highest bit is 0). Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte; the other bytes are the same.

Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the code positions in UCS-4 whose upper two bytes are 0 are called the BMP.

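Removing the two leading zero bytes from a UCS-4 code in the BMP gives UCS-2. Adding two zero bytes in front of the two bytes of a UCS-2 code gives the UCS-4 code in the BMP. Note, however, that characters are also allocated outside the BMP (Unicode has assigned supplementary-plane characters since version 3.1), so UCS-2 cannot represent every UCS-4 character.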
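The group, plane, row and cell structure described in this section can be sketched in Python as follows. The function names `ucs4_parts` and `in_bmp` are invented for this illustration and are not taken from any standard or library.

```python
def ucs4_parts(code):
    """Split a 31-bit UCS-4 code into (group, plane, row, cell)."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("UCS-4 codes use only 31 bits")
    group = (code >> 24) & 0x7F  # highest byte, top bit always 0
    plane = (code >> 16) & 0xFF  # second-highest byte
    row = (code >> 8) & 0xFF     # third byte
    cell = code & 0xFF           # last byte
    return group, plane, row, cell


def in_bmp(code):
    """True when the upper two bytes are 0, i.e. plane 0 of group 0."""
    group, plane, _, _ = ucs4_parts(code)
    return group == 0 and plane == 0


print(ucs4_parts(0x6C49))  # (0, 0, 108, 73)
print(in_bmp(0x6C49))      # True

# UCS-4 form of a BMP code: two zero bytes in front of the UCS-2 bytes.
ucs4 = (0x6C49).to_bytes(4, byteorder="big")
print(ucs4.hex(" "))       # 00 00 6c 49
print(ucs4[2:].hex(" "))   # 6c 49
```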

4, UTF encodings

UTF-8 encodes UCS in 8-bit units. The encoding from UCS-2 to UTF-8 is as follows:

UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000 - 007F                 0xxxxxxx
0080 - 07FF                 110xxxxx 10xxxxxx
0800 - FFFF                 1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode encoding of the character "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the 3-byte template must be used: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Substituting these bits in order for the x's in the template gives 11100110 10110001 10001001, that is, E6 B1 89.

Readers can use Notepad to check whether this encoding is correct.
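The same check can also be done in code. The following is a minimal Python sketch of the three templates from the table; the function name `ucs2_to_utf8` is invented for this illustration, and the function covers only codes up to FFFF, as the table does.

```python
def ucs2_to_utf8(code):
    """Encode one UCS-2 code (0000-FFFF) with the templates in the table."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("only UCS-2 codes are handled in this sketch")
    if code <= 0x7F:
        # 0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])


print(ucs2_to_utf8(0x6C49).hex(" "))  # e6 b1 89
# Python's built-in encoder gives the same bytes.
print("汉".encode("utf-8").hex(" "))  # e6 b1 89
```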

UTF-16 encodes UCS in 16-bit units. For a UCS code less than 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to that code. For a UCS code not less than 0x10000, an algorithm is defined. Since every code in UCS-2, or in the UCS-4 BMP, is necessarily less than 0x10000, UTF-16 and UCS-2 are identical for these characters; only characters outside the BMP need the extra algorithm. But UCS-2 is just a coding scheme, while UTF-16 is used for actual transmission, so the question of byte order has to be considered.
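For codes of 0x10000 and above, the algorithm in question is the surrogate-pair scheme described in RFC 2781. Here is a minimal Python sketch of it; the function name `utf16_units` is invented for this illustration, and the example code point 1F600 is not from the article.

```python
def utf16_units(code):
    """Return the list of 16-bit units that encode one UCS code in UTF-16."""
    if not 0 <= code <= 0x10FFFF:
        raise ValueError("UTF-16 cannot represent codes above 0x10FFFF")
    if code < 0x10000:
        # BMP: the unit is the code itself, same as UCS-2.
        return [code]
    v = code - 0x10000                 # 20 bits remain
    high = 0xD800 | (v >> 10)          # top 10 bits
    low = 0xDC00 | (v & 0x3FF)         # bottom 10 bits
    return [high, low]


print([hex(u) for u in utf16_units(0x6C49)])   # ['0x6c49']
print([hex(u) for u in utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```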

5, UTF byte order and BOM

UTF-8 uses the byte as its coding unit, so it has no byte-order problem. UTF-16 uses two bytes as its coding unit, so before interpreting a piece of UTF-16 text, you must first know the byte order of each coding unit. For example, the Unicode encoding of "奎" is 594E, and the Unicode encoding of "乙" is 4E59. If we receive the UTF-16 byte stream "59 4E", is it "奎" or "乙"?
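The ambiguity is easy to reproduce. In this short Python sketch the same two bytes are decoded with each byte order, using Python's standard "utf-16-be" and "utf-16-le" codecs.

```python
stream = bytes([0x59, 0x4E])

as_big = stream.decode("utf-16-be")     # read as 594E
as_little = stream.decode("utf-16-le")  # read as 4E59

print(as_big, hex(ord(as_big)))        # 奎 0x594e
print(as_little, hex(ord(as_little)))  # 乙 0x4e59
```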

The method the Unicode specification recommends for marking byte order is the BOM. This BOM is not the "Bill of Material" BOM, but the Byte Order Mark. The BOM is a rather clever idea:

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE is not a character in UCS, so it should never appear in an actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream.

This way, if the receiver gets FEFF, the byte stream is big endian; if it gets FFFE, the byte stream is little endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if the receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8.

Windows uses a BOM to mark the encoding of a text file.
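A BOM check of this kind can be sketched in Python as follows. This is not how Notepad is implemented; the function name `sniff_bom` is invented for this illustration, only the three markers mentioned in this article are checked, and a file without a BOM is simply reported as unknown.

```python
import codecs


def sniff_bom(data):
    """Guess the encoding of a byte string from its leading BOM.
    Returns a Python codec name, or None when no known BOM is present.
    Only the three BOMs discussed in the article are checked."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE, Notepad's "Unicode"
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF, "Unicode big endian"
        return "utf-16-be"
    return None


print(sniff_bom(b"\xef\xbb\xbf" + "汉".encode("utf-8")))  # utf-8
print(sniff_bom(b"\xff\xfe" + "汉".encode("utf-16-le")))  # utf-16-le
print(sniff_bom(b"\xfe\xff" + "汉".encode("utf-16-be")))  # utf-16-be
print(sniff_bom("汉".encode("gbk")))                      # None
```

A file with no BOM, such as one saved as GBK, cannot be identified this way and needs some other rule or a default.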

6, Further reference materials

The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).

I also found two other articles that looked good, but since I had already answered the questions I started with, I did not read them:

"Understanding Unicode: A General Introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?SITE_ID=NRSI&ITEM_ID=IWS-CHAPTER04A)
"Character Set Encoding Basics: Understanding Character Set Encodings and Legacy Encodings" (http://scripts.sil.org/cms/scripts/page.php?SITE_ID=NRSI&ITEM_ID=IWS-CHAPTER03)

I have written software packages for converting between UTF-8, UCS-2 and GBK, including versions that use the Windows API and versions that do not. If I have time later, I will tidy them up and put them on my personal homepage (http://fmddlmyy.home4u.china.com/).

I had thought through all of these questions before I started writing, and I expected to finish in a short while. In the end it took a long time to work out the wording and verify the details, and I was writing from 1:30 to 9:00. I hope some readers can benefit from it.
