Solving garbled Chinese text in Java (3)-----Encoding details: the great genesis---Unicode encoding

Source: Internet
Author: User

As computers spread, each country and region designed its own character encoding to suit its own language and script. This lack of coordination produced many competing encodings, so the same binary value could be interpreted as different symbols. Unicode was created to solve this incompatibility.

Unicode

Unicode, also known as the unified code, universal code, or single code, was created to overcome the limitations of traditional character encoding schemes. It assigns a single, unique code to every character in every language, which meets the needs of cross-language, cross-platform text conversion and processing. You can think of Unicode as a large container of characters that holds every symbol in the world, each with its own unique code, which removes the root cause of garbled text. In short, Unicode is an encoding of all symbols [2].

Unicode is developed alongside the Universal Character Set standard and is also published in book form. It is an industry standard that organizes and encodes most of the world's writing systems so that computers can present and process text in a simpler way. Unicode is still being revised and now includes more than 100,000 characters. It is recognized across the industry and widely used in the internationalization and localization of software.

Unicode was designed to address the limitations of traditional character encoding schemes, which share a common problem: they cannot support a multilingual environment, and that is unacceptable on the open Internet. Almost every computer system supports the basic Latin alphabet, and each supports various other encodings as well. To stay compatible with them, Unicode reserves its first 256 code points for the characters defined by ISO 8859-1, so converting existing Western European text needs no special handling. Unicode also encodes a large number of identical characters repeatedly at different code points [1], so that the many legacy encodings can be converted directly to and from Unicode without losing any information.
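The ISO 8859-1 compatibility described above can be checked with a few lines of Java. This is only a minimal sketch: the class name Latin1Compat and the method codePointOf are illustrative and do not come from the original article. It decodes each possible byte value as ISO 8859-1 and confirms that the resulting Unicode code point has the same numeric value.

```java
import java.nio.charset.StandardCharsets;

class Latin1Compat {
    // Decodes a single ISO 8859-1 byte and returns the resulting Unicode code point.
    // (Hypothetical helper, introduced only for this illustration.)
    static int codePointOf(int latin1Byte) {
        String s = new String(new byte[] {(byte) latin1Byte}, StandardCharsets.ISO_8859_1);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Every one of the 256 byte values should map to the same-numbered code point.
        for (int b = 0; b < 256; b++) {
            if (codePointOf(b) != b) {
                throw new AssertionError("mismatch at byte " + b);
            }
        }
        System.out.printf("0xE9 decodes to U+%04X%n", codePointOf(0xE9)); // U+00E9
    }
}
```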

Implementation Method

The Unicode code of a character is fixed, but how that code is stored or transmitted varies with the design of each platform and with the need to save space. An implementation of Unicode is called a Unicode Transformation Format (UTF) [1].

Unicode is a character set with three main implementations: UTF-8, UTF-16, and UTF-32. UTF-8 is the mainstream implementation today, and UTF-16 and UTF-32 are used comparatively less, so the rest of this article focuses on UTF-8.

UCS

To understand Unicode it helps to know about UCS. UCS (Universal Character Set) is the standard character set defined by ISO 10646 (also written ISO/IEC 10646), a standard developed by ISO. It is a superset of all other character sets and guarantees round-trip compatibility with them: if you convert any text string to UCS and then convert it back to the original encoding, no information is lost.

UCS assigns each character not only a code but also a formal name. A hexadecimal number that represents a UCS or Unicode value is usually prefixed with "U+"; for example, "U+0041" is the character "A".
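As a small illustration of the notation and the formal names, the following Java sketch prints both for the character "A". The class name CodePointNotation and the method notation are made up for this example; Character.getName is part of the standard library.

```java
class CodePointNotation {
    // Formats a code point in the conventional notation: "U+" followed by
    // at least four uppercase hexadecimal digits.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int a = "A".codePointAt(0);
        System.out.println(notation(a));          // U+0041
        System.out.println(Character.getName(a)); // LATIN CAPITAL LETTER A
    }
}
```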

Little Endian & Big Endian

Because system platforms are designed differently, they may interpret characters differently, for example in their byte order. As a result, the same byte stream can be read as different content. Suppose the hexadecimal code of a character is 4E59, stored as the two bytes 4E and 59. A platform that reads the high-order byte first sees 4E59 and finds the character "乙" (U+4E59). A platform that reads the low-order byte first parses the same stream as 594E and finds the character "奎" (U+594E). In other words, a "乙" saved on the first platform turns into "奎" on the second. To avoid this confusion, Unicode distinguishes two byte orders, big endian and little endian: storing the first (high-order) byte first is big endian, and storing the second (low-order) byte first is little endian. This raises a question: how does the computer know which byte order a particular file uses?

The Unicode specification defines a character that can be placed at the start of a file to indicate the byte order. It is named "ZERO WIDTH NO-BREAK SPACE" and has the code FEFF. It occupies exactly two bytes, and FF is one greater than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
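The effect of byte order on the byte stream 4E 59 can be reproduced in Java with the standard UTF-16BE and UTF-16LE charsets. This is a sketch under the assumption that the two bytes form one 16-bit code unit; the class name ByteOrderDemo and the method decode are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Interprets the given bytes as 16-bit code units in the requested byte order.
    static String decode(byte[] bytes, boolean bigEndian) {
        return new String(bytes, bigEndian ? StandardCharsets.UTF_16BE : StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] stream = {0x4E, 0x59};
        // Print code points rather than glyphs so the output does not depend on the console encoding.
        System.out.printf("high byte first: U+%04X%n", decode(stream, true).codePointAt(0));  // U+4E59
        System.out.printf("low byte first:  U+%04X%n", decode(stream, false).codePointAt(0)); // U+594E
    }
}
```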
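The two-byte check just described might be sketched in Java as follows. The class name BomSniffer and the method byteOrder are hypothetical. The sketch looks only at the two marks discussed here and ignores other cases such as UTF-32, so it is not a complete encoding detector.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Returns "big-endian", "little-endian", or "unknown" based on the first two bytes.
    static String byteOrder(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF; // mask to treat Java's signed bytes as 0..255
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "big-endian";
            if (b0 == 0xFF && b1 == 0xFE) return "little-endian";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Java's "UTF-16" charset writes the mark FE FF followed by big-endian data when encoding.
        byte[] withMark = "\u4E25".getBytes(StandardCharsets.UTF_16);
        System.out.println(byteOrder(withMark));                 // big-endian
        System.out.println(byteOrder(new byte[] {0x4E, 0x25}));  // unknown (no mark present)
    }
}
```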

UTF-8

UTF-8 is a variable-length character encoding for Unicode. It represents each symbol with one to four bytes, and the number of bytes depends on the symbol. It can represent any character in the Unicode Standard, and its single-byte encodings are identical to ASCII, so existing software that handles ASCII text can keep working with little or no modification. As a result, UTF-8 has gradually become the preferred encoding for e-mail, web pages, and other applications that store or transmit text.
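To make the variable length and the ASCII compatibility concrete, the Java sketch below measures how many bytes the JDK's UTF-8 encoder produces for a few sample code points. The class name Utf8Lengths, the method utf8Length, and the choice of sample characters are assumptions made for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Lengths {
    // Number of bytes the JDK's UTF-8 encoder produces for one code point.
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));    // 1 (Latin letter A)
        System.out.println(utf8Length(0xE9));    // 2 (e with acute accent)
        System.out.println(utf8Length(0x4E25));  // 3 (a Chinese character)
        System.out.println(utf8Length(0x1F600)); // 4 (an emoji outside the 16-bit range)

        // Pure ASCII text yields the same bytes in ASCII and in UTF-8.
        byte[] ascii = "Java".getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = "Java".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(ascii, utf8)); // true
    }
}
```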

UTF-8 uses one to four bytes for each character encoding, and the encoding rules are as follows:

1) For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, the UTF-8 encoding is therefore identical to the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are 1, bit n+1 is 0, and the first two bits of each following byte are 10. All remaining bits hold the symbol's Unicode code.

The conversion table is as follows:

Unicode (hexadecimal)      UTF-8 (binary)
0000 ~ 007F                0xxxxxxx
0080 ~ 07FF                110xxxxx 10xxxxxx
0800 ~ FFFF                1110xxxx 10xxxxxx 10xxxxxx
1 0000 ~ 1F FFFF           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
20 0000 ~ 3FF FFFF         111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
400 0000 ~ 7FFF FFFF       1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

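With the conversion table above, the UTF-8 rules are easy to read: if the first bit of the first byte is 0, the byte is a complete character by itself; if it is 1, the number of consecutive leading 1 bits is the number of bytes the character occupies. Note that the five- and six-byte forms come from the original UTF-8 design; the current definition (RFC 3629) stops at four bytes, which covers U+0000 through U+10FFFF.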
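The rule for reading the first byte could be written in Java roughly as follows. The class name Utf8LeadByte and the method sequenceLength are illustrative. The sketch recognizes only the one- to four-byte forms and does not validate the bytes any further.

```java
class Utf8LeadByte {
    // Returns the number of bytes announced by a lead byte (1 to 4),
    // or -1 if the byte cannot start a one- to four-byte sequence.
    static int sequenceLength(int b) {
        b &= 0xFF;                         // treat the value as an unsigned byte
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;                         // 10xxxxxx continuation byte, or a longer legacy form
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0x41)); // 1
        System.out.println(sequenceLength(0xE4)); // 3
        System.out.println(sequenceLength(0xB8)); // -1 (continuation byte)
    }
}
```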

The following example uses the Chinese character "严" to show how UTF-8 encoding works [3].

The Unicode code of "严" is 4E25 (binary 100111000100101). According to the table above, 4E25 falls in the range of the third row (0800 ~ FFFF), so the UTF-8 encoding of "严" needs three bytes in the format "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last bit of "严", fill in the x positions from right to left, and pad any leftover positions with 0. The resulting UTF-8 encoding of "严" is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
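The bit-filling procedure for U+4E25 can be reproduced with shifts and masks. The Java sketch below is a hand-written encoder for the one- to four-byte forms, with its result compared against the JDK's own UTF-8 encoder. The class name Utf8ByHand and the methods encode and hex are hypothetical, and surrogate code points are not rejected in this simplified version.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encodes one code point following the table: a prefix in the first byte,
    // then 10xxxxxx continuation bytes that each carry 6 bits of the code.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {
            return new byte[] {(byte) (0xC0 | (cp >> 6)), (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),         // 1110xxxx: top 4 bits
                (byte) (0x80 | ((cp >> 6) & 0x3F)), // 10xxxxxx: middle 6 bits
                (byte) (0x80 | (cp & 0x3F))         // 10xxxxxx: low 6 bits
            };
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    // Renders bytes as uppercase hexadecimal without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4E25);
        byte[] library = "\u4E25".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                    // E4B8A5
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```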

Conversion between Unicode and UTF-8

The example above shows that the Unicode code of "严" is 4E25 while its UTF-8 encoding is E4B8A5. The two differ, so a program has to convert between them. On Windows, the simplest and most intuitive tool for this is Notepad.

The Encoding drop-down at the bottom of the Save As dialog offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

ANSI: Notepad's default encoding. For English files it is ASCII, and for Simplified Chinese files it is GB2312. Note: different ANSI encodings are incompatible with one another, so when information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded file.

Unicode: UCS-2 encoding, which stores each character's Unicode code directly in two bytes, in little endian order.

Unicode big endian: UCS-2 encoding in big endian order.

UTF-8: See the UTF-8 section above.

>>> Example: Enter the character "严" in Notepad and save it four times, once with each of the encodings ANSI, Unicode, Unicode big endian, and UTF-8. Then open each file in the hexadecimal viewer of the EditPlus text editor. The results are as follows:

ANSI: The file is two bytes, "D1 CF", which is the GB2312 code of "严".

Unicode: The file is four bytes, "FF FE 25 4E". "FF FE" indicates little endian storage, and the actual code is 4E25.

Unicode big endian: The file is four bytes, "FE FF 4E 25". "FE FF" indicates big endian storage, and the actual code is 4E25.

UTF-8: The file is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that the file is UTF-8 encoded, and the last three, "E4 B8 A5", are the encoding of "严". The bytes are stored in the same order as the encoding.
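The payload bytes from this experiment can be reproduced in Java without Notepad. The following is a sketch, with the class name NotepadEncodings and the method hex made up for illustration. Java's UTF-16LE, UTF-16BE, and UTF-8 encoders do not write the leading marker bytes (FF FE, FE FF, EF BB BF), so only the bytes of the character U+4E25 appear. GB2312 is not among the charsets every Java runtime is required to provide, so the sketch checks for it before use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NotepadEncodings {
    // Renders bytes as uppercase hexadecimal pairs separated by single spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String yan = "\u4E25"; // the character used in the Notepad example
        System.out.println("UTF-16LE: " + hex(yan.getBytes(StandardCharsets.UTF_16LE))); // 25 4E
        System.out.println("UTF-16BE: " + hex(yan.getBytes(StandardCharsets.UTF_16BE))); // 4E 25
        System.out.println("UTF-8:    " + hex(yan.getBytes(StandardCharsets.UTF_8)));    // E4 B8 A5

        // Optional charset: availability depends on the runtime, so guard the lookup.
        if (Charset.isSupported("GB2312")) {
            System.out.println("GB2312:   " + hex(yan.getBytes(Charset.forName("GB2312"))));
        } else {
            System.out.println("GB2312 is not available in this runtime");
        }
    }
}
```

When a file saved by Notepad is read back in Java, the marker bytes at the start may need to be skipped by the reading code, depending on the charset used to decode it.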

References & further reading

1. Unicode, Wikipedia: http://zh.wikipedia.org/wiki/Unicode

2. Unicode, Baidu Encyclopedia: http://baike.baidu.com/view/40801.htm

3. Character encoding notes: ASCII, Unicode and UTF-8: http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

4. UTF-8, Baidu Encyclopedia: http://baike.baidu.com/view/25412.htm

-----Original article: http://cmsblogs.com/?p=1458. Please respect the author's work and cite the source when reposting.

-----Personal site: http://cmsblogs.com
