Java text encoding: ASCII, Unicode, UTF-8

Source: Internet
Author: User

1. Development of character encoding

Stage 1: ASCII (American Standard Code for Information Interchange). At that time, computers supported only English, and everything in a computer is stored as 0s and 1s. The 52 letters (uppercase and lowercase), the digits 0-9, and common symbols such as *, #, and @ therefore also have to be represented as binary numbers. To settle which binary number represents which symbol, the American standards organization issued the ASCII code (from Baidu Encyclopedia). ASCII stores each character, for example "A", in 1 byte, that is, a combination of 8 binary digits. One byte has 256 possible combinations, 00000000-11111111, so it could represent 256 different characters.

0-31: control characters or special communication characters, which cannot be displayed, such as LF (line feed) and CR (carriage return).

32-126: printable characters. 32 is the space, 48-57 are the Arabic digits 0-9, 65-90 are the 26 uppercase English letters, 97-122 are the 26 lowercase English letters, and the rest are punctuation marks, operators, and so on.

ASCII defines 128 characters in total, numbered 0 to 127 (127 is the DEL control character), that is, 00000000-01111111, so the highest bit is always 0.
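The ranges above can be checked with a few lines of Java. This is only a minimal sketch: the class name AsciiRanges and the method classify are made up for illustration and are not part of any library.

```java
class AsciiRanges {
    // Classifies a numeric code according to the ASCII ranges listed above.
    // (Hypothetical helper, written for this example only.)
    static String classify(int code) {
        if (code < 0 || code > 127) return "not ASCII";
        if (code <= 31 || code == 127) return "control";   // 127 is DEL
        if (code == 32) return "space";
        if (code >= 48 && code <= 57) return "digit";
        if (code >= 65 && code <= 90) return "uppercase letter";
        if (code >= 97 && code <= 122) return "lowercase letter";
        return "punctuation or symbol";
    }

    public static void main(String[] args) {
        for (char c : new char[] {'\n', ' ', '7', 'A', 'z', '#'}) {
            // Casting a char to int yields its numeric code.
            System.out.println((int) c + " -> " + classify(c));
        }
    }
}
```

In Java, a char in the ASCII range has the same numeric value as its ASCII code, which is why the cast to int is enough here.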

Stage 2: ANSI encoding (localization). ASCII can only represent English characters, so how are other characters represented? Chinese uses the following solution: one Chinese character is represented by two bytes, and neither byte uses the first 128 values. As mentioned in the previous article, those values are already taken by ASCII and cannot be reused, or the text would become ambiguous.

For example, a Chinese operating system stores the character "中" as the bytes [0xD6, 0xD0]. Why? Briefly: the zone-position code of "中" is 54 48, which is 3630H in hexadecimal. Its national standard code is 3630H + 2020H = 5650H, and its internal code is the national standard code + 8080H = D6D0H (this is covered in the previous article; refer to it if this is unclear). In this way each Chinese character gets its own code, which solves the problem of encoding Chinese characters. This is the Chinese GB2312 encoding standard.

But GB2312 only covers Chinese. What about other countries? On an operating system from another country, the bytes [0xD6, 0xD0] may be stored as one of that country's own characters rather than "中". Different countries and regions defined different standards, and these extended encodings that use two bytes to represent one character are called ANSI encodings. On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese operating system, it means JIS. Different ANSI encodings are incompatible with each other, so when information is exchanged internationally, text in two languages cannot be stored in the same ANSI-encoded text.
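The arithmetic that leads from the code 54 48 to the bytes [0xD6, 0xD0] can be written out as a short Java sketch. It only reproduces the additions described above and does not call any charset library; the class and method names are invented for this example.

```java
class Gb2312Arithmetic {
    // Turns a zone-position code (two decimal numbers) into the two-byte
    // internal code, following the two additions described in the text.
    // (Hypothetical helper, written for this example only.)
    static int internalCode(int zone, int position) {
        int zonePosition = (zone << 8) | position;  // 54, 48 -> 0x3630
        int nationalCode = zonePosition + 0x2020;   // 0x3630 + 0x2020 = 0x5650
        return nationalCode + 0x8080;               // 0x5650 + 0x8080 = 0xD6D0
    }

    public static void main(String[] args) {
        int code = internalCode(54, 48);
        System.out.printf("%04X%n", code);          // prints D6D0
        // Split the result into the two bytes that are actually stored.
        System.out.printf("[0x%02X, 0x%02X]%n", code >> 8, code & 0xFF);
    }
}
```

Adding 0x8080 sets the highest bit of both bytes, which is what keeps them out of the 0-127 range used by ASCII.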

Stage 3: Unicode (internationalization). International organizations developed the Unicode character set, which assigns a uniform and unique number to every character in every language, to meet the requirements of cross-language and cross-platform text conversion and processing. Unicode numbers these characters from 0 to 0x10FFFF, so it has room for at most 1114112 of them. These numbers are code points, the values that can be allocated to characters. UTF-8, UTF-16, and UTF-32 are all encoding schemes that convert these numbers into program data.
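Java exposes these numbers directly, so the range can be checked in code. The sketch below uses only Character.MAX_CODE_POINT and String.codePointAt from the standard library; the class CodePoints and its two methods are names chosen for this example.

```java
class CodePoints {
    // Number of values in the range 0 to 0x10FFFF.
    static int totalCodePoints() {
        return Character.MAX_CODE_POINT + 1;
    }

    // Formats the number of the first character of a string as U+XXXX.
    static String label(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints());  // prints 1114112
        System.out.println(label("A"));         // prints U+0041
        // \u4E2D is written as an escape so the source file stays pure ASCII.
        System.out.println(label("\u4E2D"));    // prints U+4E2D
    }
}
```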

Next, UTF-8. UTF-8 is one way of implementing Unicode: Unicode specifies the number that corresponds to each character in the world, and UTF-8 defines how that number is stored.

Their conversion rules are as follows:

Unicode symbol range | UTF-8 encoding method
(hexadecimal)        | (binary)
---------------------+---------------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

 

1) For a single-byte symbol, the first bit of the byte is set to 0 and the following 7 bits hold the symbol's Unicode number. For English letters, UTF-8 encoding is therefore identical to ASCII.

 

2) For an N-byte symbol (N > 1), the first N bits of the first byte are set to 1, bit N + 1 is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits are filled with the symbol's Unicode number.
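The two rules and the four rows of the table can be turned into a small hand-written encoder. This is a sketch for understanding the bit layout, not a replacement for the JDK's own encoder: the class Utf8Sketch and its methods are invented here, and the code does not reject the surrogate range D800-DFFF, which a real encoder would.

```java
class Utf8Sketch {
    // Encodes one Unicode number into UTF-8 bytes, one branch per table row.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode number: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {                                // 11110xxx + three 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    // Formats bytes as space-separated uppercase hex, e.g. "E8 BF 9E".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x41)));    // prints 41
        System.out.println(hex(encode(0x8FDE)));  // prints E8 BF 9E
    }
}
```

Each 0x3F mask keeps 6 bits, matching the six x positions in every 10xxxxxx byte, and the constants 0xC0, 0xE0, and 0xF0 are the prefixes 110, 1110, and 11110 of the first byte.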

 

The following example works out the UTF-8 encoding of "连通". The Unicode numbers of "连通" are 8FDE and 901A, which can be looked up on the Internet (or type the Chinese characters in Word and press Alt + X to convert them to Unicode). Both 8FDE and 901A fall in the third row of the table above, so each occupies 3 bytes. Take 8FDE: its binary form is 1000 1111 1101 1110, and filling these 16 bits into the x positions of 1110xxxx 10xxxxxx 10xxxxxx gives 11101000 10111111 10011110, that is, E8 BF 9E. 901A converts the same way, so the UTF-8 encoding of "连通" is

E8 BF 9E E9 80 9A, that is, the bytes stored in the computer.
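The result can be confirmed with the JDK's built-in encoder. The sketch below uses String.getBytes with StandardCharsets.UTF_8 on the two characters from the example (U+8FDE and U+901A); the class Utf8Check and the method utf8Hex are names made up for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns the UTF-8 bytes of a string as space-separated uppercase hex.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            // Mask with 0xFF because Java bytes are signed.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters are written as escapes so the source stays ASCII.
        System.out.println(utf8Hex("\u8FDE\u901A"));  // prints E8 BF 9E E9 80 9A
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding, which may differ from machine to machine.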
