Solving Garbled Chinese Text in Java (3). Encoding details: the great invention, Unicode encoding

Source: Internet
Author: User

As computers developed and spread, countries around the world designed their own encodings to suit their own languages and scripts. The result was a proliferation of incompatible encodings, in which the same binary number could be interpreted as different symbols. To solve this incompatibility, a great invention came into being: Unicode.

Unicode

Unicode is also known as the unified code, universal code, or single code. It was created to overcome the limitations of traditional character encoding schemes: it assigns a single, unique binary code to every character in every language, so that text can be converted and processed across languages and platforms. You can think of Unicode as a "character container" that holds all the symbols in the world, each with its own unique code, which removes the root cause of garbled text. In short, Unicode is one encoding for all symbols [2].

Unicode has been developed alongside the Universal Character Set standard and is also published in book form. It is an industry standard that organizes and encodes most of the world's writing systems, allowing computers to present and process text in a simpler way. Unicode is still being extended, and it now contains more than 100,000 characters. It is recognized across the industry and widely used in the internationalization and localization of software.

As noted above, Unicode was created to overcome the limitations of traditional character encoding schemes. Traditional encodings share a common problem: they cannot support multilingual environments, which is unacceptable on the open Internet. Almost all computer systems already support the basic Latin letters, each alongside its own additional encodings. To stay compatible with them, Unicode reserves its first 256 code points for the characters defined by ISO 8859-1, so that existing Western European text can be converted without special handling. In addition, a number of identical characters are deliberately encoded more than once at different code points, so that older, more complicated encodings can be converted to Unicode without losing any information [1].
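The ISO 8859-1 compatibility described above can be checked with a minimal Java sketch. The class and method names are illustrative only and do not come from the original article; the sketch relies solely on the JDK's `StandardCharsets.ISO_8859_1` charset.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Compat {
    /**
     * Returns true if every ISO 8859-1 byte value (0x00 to 0xFF) decodes to
     * the Unicode code point with the same numeric value.
     */
    static boolean latin1MatchesUnicode() {
        for (int value = 0; value < 256; value++) {
            String decoded = new String(new byte[] {(byte) value}, StandardCharsets.ISO_8859_1);
            if (decoded.length() != 1 || decoded.charAt(0) != value) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("ISO 8859-1 maps onto U+0000..U+00FF: " + latin1MatchesUnicode());
    }
}
```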

Implementation Method

The Unicode code of a character is fixed, but in actual storage and transmission, because platforms are not designed consistently and in order to save space, Unicode is implemented in different ways. An implementation of Unicode is called a Unicode Transformation Format (UTF for short) [1].

Unicode is a character set with three major implementations: UTF-8, UTF-16, and UTF-32. Because UTF-8 is the current mainstream implementation and UTF-16 and UTF-32 are used less often, the rest of this article focuses on UTF-8.
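To make the difference between the three implementations concrete, the following Java sketch counts how many bytes the same character occupies in each of them. The class and method names are hypothetical. The sketch also assumes that the runtime provides a charset named "UTF-32BE", which common JDKs do but which is not guaranteed on every platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingFormsDemo {
    /** Number of bytes the text occupies when encoded with the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: "UTF-32BE" is available; it is looked up by name because
        // older JDKs have no StandardCharsets constant for it.
        Charset utf32 = Charset.forName("UTF-32BE");
        // Latin letter A (U+0041) and a CJK character (U+4E25), written as
        // escapes so the source file encoding does not matter.
        String[] samples = {"A", "\u4E25"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes%n",
                    s.codePointAt(0),
                    encodedLength(s, StandardCharsets.UTF_8),
                    encodedLength(s, StandardCharsets.UTF_16BE),
                    encodedLength(s, utf32));
        }
    }
}
```

The explicit big-endian variants are used so that no byte-order mark is added to the byte counts.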

UCS

To understand Unicode it also helps to know about UCS. UCS (Universal Character Set) is the standard character set defined by ISO 10646 (or ISO/IEC 10646), a standard developed by ISO. It is a superset of all other character sets and guarantees round-trip compatibility with them: if you convert any text string to UCS and then convert it back to the original encoding, no information is lost.

In addition to assigning a code to each character, UCS also gives each character a formal name. The hexadecimal number that represents a UCS or Unicode value is conventionally prefixed with "U+". For example, "U+0041" denotes the character "A".
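As a small illustration of the "U+" notation and of formal character names, here is a Java sketch. `toUPlus` is a hypothetical helper written for this example, while `Character.getName` is a standard JDK method.

```java
final class CodePointNotation {
    /** Formats a code point in the "U+XXXX" notation, using at least four hex digits. */
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int a = "A".codePointAt(0);
        System.out.println(toUPlus(a));            // U+0041
        System.out.println(Character.getName(a));  // LATIN CAPITAL LETTER A
    }
}
```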

Little endian & Big endian

Different platforms may interpret the bytes of a character differently, for example in their byte order, so the same byte stream may be read as different content. Suppose the hexadecimal code of a character is 4E59, split into the two bytes 4E and 59. A platform that treats the first byte as the low-order byte parses this stream as 594E and finds the character "奎", whereas a platform that treats the first byte as the high-order byte parses it as 4E59 and finds the character "乙". In other words, a "乙" saved on one platform turns into "奎" on the other. This inevitably causes confusion, so Unicode distinguishes two byte orders: big-endian, in which the high-order byte comes first, and little-endian, in which the low-order byte comes first. That raises a question: how does a computer know which byte order a file uses?
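The effect of byte order on the two-byte stream 4E 59 can be reproduced with a short Java sketch. The class and method names are illustrative; the decoding itself uses the JDK's standard `UTF_16BE` and `UTF_16LE` charsets.

```java
import java.nio.charset.StandardCharsets;

final class ByteOrderDemo {
    /** Decodes the bytes as UTF-16 with the high-order byte first. */
    static String decodeBigEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    /** Decodes the bytes as UTF-16 with the low-order byte first. */
    static String decodeLittleEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] stream = {0x4E, 0x59};
        // The same two bytes yield two different code points.
        System.out.printf("big-endian:    U+%04X%n", decodeBigEndian(stream).codePointAt(0));    // U+4E59
        System.out.printf("little-endian: U+%04X%n", decodeLittleEndian(stream).codePointAt(0)); // U+594E
    }
}
```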

The Unicode specification defines a character that can be placed at the very beginning of a file to indicate the byte order. This character is named "zero width no-break space" (ZERO WIDTH NO-BREAK SPACE) and has the code FEFF. That is exactly two bytes, and FF is 1 greater than FE.

If the first two bytes of a text file are FE FF, the file is big-endian; if the first two bytes are FF FE, the file is little-endian.
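A minimal Java sketch of this check is shown below. `BomSniffer` and `detectUtf16Order` are hypothetical names, and the sketch only looks at the two-byte marks discussed here; a real detector would also have to consider other marks and files that carry no mark at all.

```java
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    /**
     * Inspects the first two bytes of a file and reports the byte order
     * they announce, or "unknown" if they are not FE FF or FF FE.
     */
    static String detectUtf16Order(byte[] head) {
        if (head.length >= 2) {
            int first = head[0] & 0xFF;   // mask to treat the byte as unsigned
            int second = head[1] & 0xFF;
            if (first == 0xFE && second == 0xFF) {
                return "UTF-16BE";
            }
            if (first == 0xFF && second == 0xFE) {
                return "UTF-16LE";
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Java's plain "UTF-16" encoder writes a big-endian mark before the text.
        byte[] encoded = "\u4E25".getBytes(StandardCharsets.UTF_16);
        System.out.println(detectUtf16Order(encoded)); // UTF-16BE
    }
}
```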

UTF-8

UTF-8 is a variable-length character encoding for Unicode. It uses one to four bytes to represent a symbol, with the number of bytes depending on the symbol. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software that originally handled ASCII text can continue to be used with little or no modification. As a result, it has gradually become the preferred encoding for storing and transmitting text in e-mail, web pages, and other applications.

UTF-8 uses one to four bytes to encode each character (the original design allowed up to six bytes, but the current standard limits UTF-8 to four bytes and to code points up to 10FFFF). The encoding rules are as follows:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining seven bits hold the Unicode code of the symbol. For English letters, therefore, the UTF-8 encoding is the same as the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits are filled with the Unicode code of the symbol.

The conversion table is as follows:

| Unicode (hex) | UTF-8 (binary) |
|---|---|
| 0000 ~ 007F | 0XXX XXXX |
| 0080 ~ 07FF | 110X XXXX 10XX XXXX |
| 0800 ~ FFFF | 1110 XXXX 10XX XXXX 10XX XXXX |
| 1 0000 ~ 1F FFFF | 1111 0XXX 10XX XXXX 10XX XXXX 10XX XXXX |
| 20 0000 ~ 3FF FFFF | 1111 10XX 10XX XXXX 10XX XXXX 10XX XXXX 10XX XXXX |
| 400 0000 ~ 7FFF FFFF | 1111 110X 10XX XXXX 10XX XXXX 10XX XXXX 10XX XXXX 10XX XXXX |

With this table, the UTF-8 encoding rules are easy to read: if the first bit of the first byte is 0, the byte is a complete character on its own; if it is 1, the number of consecutive leading 1 bits gives the number of bytes the character occupies.
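The rule for reading the first byte can be sketched in Java as follows. `Utf8LeadByte` and `sequenceLength` are hypothetical names, and the sketch assumes only the one- to four-byte forms are of interest, returning -1 for anything else.

```java
final class Utf8LeadByte {
    /**
     * Returns how many bytes a UTF-8 sequence occupies, judged from its
     * first byte, or -1 if the byte cannot start a 1- to 4-byte sequence.
     */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;                  // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;     // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;     // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;     // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;     // 11110xxx
        return -1;                            // 10xxxxxx (continuation byte) or a longer form
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 1
        System.out.println(sequenceLength((byte) 0xE4)); // 3
        System.out.println(sequenceLength((byte) 0xB8)); // -1
    }
}
```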

Take the Chinese character "严" (yán) as an example of how UTF-8 encoding works [3].

The Unicode code of "严" is 4E25 (binary 100111000100101). From the table above, 4E25 falls in the range of the third row (0800 ~ FFFF), so the UTF-8 encoding of "严" needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary bit of "严", fill in the x positions from back to front, padding any remaining positions with 0. The resulting UTF-8 encoding of "严" is "11100100 10111000 10100101", or E4B8A5 in hexadecimal.
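The bit-filling procedure of this worked example can be written out by hand. The following Java sketch is one possible implementation rather than the way any particular library does it. The names `ManualUtf8`, `encode`, and `toHex` are hypothetical, and the sketch assumes the current four-byte limit, so it rejects values above 10FFFF and surrogate code points.

```java
final class ManualUtf8 {
    /** Encodes a single Unicode code point as UTF-8 by filling in the bit templates. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    /** Renders bytes as upper-case hexadecimal without separators. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode(0x4E25))); // E4B8A5
    }
}
```

Each branch corresponds to one row of the conversion table: the shift picks out the group of bits that belongs in a given byte, and the OR adds that byte's fixed prefix.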

Conversion between Unicode and UTF-8

As the example shows, the Unicode code of "严" is 4E25 while its UTF-8 encoding is E4B8A5. The two are different, so a program is needed to convert between them. On Windows, the simplest and most intuitive tool for this is Notepad.
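In Java, the same conversion can be done with the standard library. The sketch below is a minimal illustration with hypothetical class and method names. It relies on the fact that a Java `String` holds Unicode text, so that `getBytes` and the `String` constructor perform the conversion to and from UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class UnicodeUtf8Conversion {
    /** Converts Unicode text to its UTF-8 byte sequence. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Converts a UTF-8 byte sequence back to Unicode text. */
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = toUtf8("\u4E25");          // the character with code 4E25
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF); // E4 B8 A5
        }
        System.out.println();
        System.out.printf("U+%04X%n", fromUtf8(utf8).codePointAt(0)); // U+4E25
    }
}
```

Passing the charset explicitly, instead of relying on the platform default, is what keeps the result the same on every machine.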

In the Save As dialog, the Encoding (E) drop-down at the bottom offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

ANSI: Notepad's default encoding. For English files this is ASCII, and for Simplified Chinese files it is GB2312. Note that different ANSI encodings are incompatible with one another: when exchanging information internationally, text in two languages cannot be stored in the same ANSI-encoded file.

Unicode: UCS-2 encoding, which stores the Unicode code of each character directly in two bytes. This option uses little-endian byte order.

Unicode big endian: UCS-2 encoding in big-endian byte order.

UTF-8: the encoding described in the UTF-8 section above.

Example: type the character "严" in Notepad and save it four times, once with each of the encodings ANSI, Unicode, Unicode big endian, and UTF-8. Then open the files in the hexadecimal viewer of the EditPlus text editor. The results are as follows:

ANSI: two bytes, "D1 CF", which is exactly the GB2312 encoding of "严".

Unicode: four bytes, "FF FE 25 4E", where "FF FE" indicates little-endian storage and the actual encoding is "25 4E".

Unicode big endian: four bytes, "FE FF 4E 25", where "FE FF" indicates big-endian storage and the actual encoding is "4E 25".

UTF-8: six bytes, "EF BB BF E4 B8 A5", where the first three bytes "EF BB BF" indicate that the file is UTF-8 encoded and the last three bytes "E4 B8 A5" are the encoding of "严", stored in the same order as the encoding itself.
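These byte sequences can be reproduced in Java without Notepad. The sketch below uses hypothetical class and method names and prepends the leading marker bytes by hand, because Java's `UTF_16LE`, `UTF_16BE`, and `UTF_8` encoders do not write them. The ANSI line is printed only if the runtime happens to provide a charset named "GB2312", which is an assumption about the installed JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class NotepadEncodings {
    /** Prepends marker bytes (given as unsigned values) to an encoded payload. */
    static byte[] withMark(int[] mark, byte[] payload) {
        byte[] out = new byte[mark.length + payload.length];
        for (int i = 0; i < mark.length; i++) {
            out[i] = (byte) mark[i];
        }
        System.arraycopy(payload, 0, out, mark.length, payload.length);
        return out;
    }

    static byte[] unicodeLittleEndian(String text) {
        return withMark(new int[] {0xFF, 0xFE}, text.getBytes(StandardCharsets.UTF_16LE));
    }

    static byte[] unicodeBigEndian(String text) {
        return withMark(new int[] {0xFE, 0xFF}, text.getBytes(StandardCharsets.UTF_16BE));
    }

    static byte[] utf8WithMark(String text) {
        return withMark(new int[] {0xEF, 0xBB, 0xBF}, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Renders bytes as space-separated upper-case hexadecimal. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4E25"; // the character with code 4E25
        if (Charset.isSupported("GB2312")) { // assumption: not every runtime ships this charset
            System.out.println("ANSI (GB2312):      " + hex(text.getBytes(Charset.forName("GB2312"))));
        }
        System.out.println("Unicode:            " + hex(unicodeLittleEndian(text)));
        System.out.println("Unicode big endian: " + hex(unicodeBigEndian(text)));
        System.out.println("UTF-8:              " + hex(utf8WithMark(text)));
    }
}
```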

References & more

1. Unicode, Wikipedia: http://zh.wikipedia.org/wiki/Unicode

2. Unicode, Baidu Encyclopedia: http://baike.baidu.com/view/40801.htm

3. Character encoding notes: ASCII, Unicode and UTF-8: http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

4. UTF-8, Baidu Encyclopedia: http://baike.baidu.com/view/25412.htm

----- Original from: http://cmsblogs.com/?p=1458. Please respect the author's hard work and cite the source when reposting.

----- Personal site: http://cmsblogs.com
