Java Core Learning Note--3.char/unicode/Code point/code unit

Source: Internet
Author: User

Universal Character Set (UCS)

UCS is the standard character set defined by ISO/IEC 10646 (often written ISO 10646).

The UCS is intended as a superset of all other character sets, so it covers the characters of every known language.

ISO/IEC 10646 originally defined a 31-bit character set; the leading bit is always 0, so each code value occupies 32 bits (4 bytes).

Unicode (also translated as Universal Code, International Code, Unified Code, or Single Code) encoding:

The Unicode code space runs from U+0000 to U+10FFFF, a total of 1,114,112 code points (1,112,064 once the 2,048 surrogate code points are excluded). It is divided into 17 planes of 2^16 (65,536) code points each. The code points of a plane can be written as U+xx0000 to U+xxFFFF, where xx is the plane number in hexadecimal, from 00 to 10. The first plane (plane 0) is called the Basic Multilingual Plane (BMP); the other 16 are the supplementary planes.
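Since every plane holds 2^16 code points, the plane number of a code point is simply the value above its low 16 bits. A minimal sketch of that arithmetic follows; the class name PlaneDemo and the method planeOf are hypothetical names chosen for illustration, not part of the JDK.

```java
final class PlaneDemo {
    /** Returns the plane number (0 to 16) of a Unicode code point. */
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a code point: " + codePoint);
        }
        // Each plane holds 2^16 code points, so dropping the low 16 bits
        // leaves the plane number.
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(planeOf(0x0041));   // 'A' is in plane 0 (the BMP)
        System.out.println(planeOf(0x1F600));  // an emoji in plane 1
        System.out.println(planeOf(0x10FFFF)); // the last code point, plane 16
        // Size of the whole code space: 17 planes * 65,536 = 1,114,112
        System.out.println(Character.MAX_CODE_POINT + 1);
    }
}
```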

The Unicode version originally in practical use corresponds to UCS-2, which uses a 16-bit code space (2 bytes, at most 2^16 = 65,536 characters). Even this 16-bit space is not fully assigned: a large part of it is reserved for special uses or future expansion.

These 16-bit Unicode characters make up the Basic Multilingual Plane (BMP for short, also called plane 0). Later Unicode versions also define 16 supplementary planes. Together with the BMP they no longer fit in 16 bits, which is consistent with the 4-byte code space of UCS-4 (a 31-bit character set plus a leading 0 bit: 32 bits or 4 bytes in all, with room for up to 2^31 code values).

Characters in the Basic Multilingual Plane are written as U+hhhh, where each h is a hexadecimal digit, and their code values are identical to the UCS-2 encoding. The UCS-4 encoding of the same character ends with the same two bytes, preceded by two bytes that are 0.
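The relationship between the 2-byte and 4-byte forms can be sketched by writing a code point out byte by byte. This is only an illustration: the class UcsWidthDemo and its methods are hypothetical, and big-endian byte order is an assumption, since the text above does not specify one.

```java
final class UcsWidthDemo {
    /** 2-byte (UCS-2 style) form; only BMP code points fit. Big-endian is assumed. */
    static byte[] ucs2(int codePoint) {
        if (codePoint < 0 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("Not a BMP code point");
        }
        return new byte[] { (byte) (codePoint >>> 8), (byte) codePoint };
    }

    /** 4-byte (UCS-4 style) form of any code point. Big-endian is assumed. */
    static byte[] ucs4(int codePoint) {
        return new byte[] {
            (byte) (codePoint >>> 24), (byte) (codePoint >>> 16),
            (byte) (codePoint >>> 8), (byte) codePoint
        };
    }

    public static void main(String[] args) {
        // U+4E2D: the 4-byte form is the 2-byte form with two zero bytes in front.
        System.out.println(java.util.Arrays.toString(ucs2(0x4E2D))); // [78, 45]
        System.out.println(java.util.Arrays.toString(ucs4(0x4E2D))); // [0, 0, 78, 45]
    }
}
```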

Implementation method (UTF):

Unicode is serialized differently on different platforms, partly to save space. These serializations are the Unicode Transformation Formats (UTF).

For example, if a Unicode file contains only 7-bit ASCII characters, storing or transmitting every character as a 2-byte code would waste a lot of space. UTF-8 encodes each such character in a single byte: the high bit is 0 and the low 7 bits are the original ASCII code. When the text mixes ASCII with other Unicode characters, UTF-8 encodes each character in 1 to 4 bytes.
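The space saving is easy to observe by encoding a few strings and counting bytes. The sketch below uses the standard String.getBytes and StandardCharsets APIs; the class name Utf8LengthDemo and its helper methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the string occupies when encoded as UTF-16 (no byte order mark). */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";                                      // U+0041
        String latin = "\u00E9";                                 // U+00E9
        String cjk = "\u4E2D";                                   // U+4E2D
        String emoji = new String(Character.toChars(0x1F600));   // U+1F600

        System.out.println(utf8Length(ascii));  // 1 byte
        System.out.println(utf8Length(latin));  // 2 bytes
        System.out.println(utf8Length(cjk));    // 3 bytes
        System.out.println(utf8Length(emoji));  // 4 bytes

        // Pure ASCII text takes half the space in UTF-8 compared with UTF-16.
        System.out.println(utf8Length("hello"));   // 5
        System.out.println(utf16Length("hello"));  // 10
    }
}
```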

Code point/code unit (JDK 5.0)

A code point is the numeric value assigned to a character in a coded character set; in Unicode it is written as U+ followed by hexadecimal digits.

In Java, the char type represents one UTF-16 code unit.

UTF-16 is a variable-length encoding that can represent every Unicode code point. A character in the Basic Multilingual Plane is represented by a single 16-bit value, called a code unit; a supplementary character is represented by a pair of consecutive code units (32 bits), known as a surrogate pair.
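How a code point outside the BMP becomes two code units can be sketched with the standard UTF-16 arithmetic: subtract 0x10000, put the high 10 bits into a code unit starting at 0xD800 and the low 10 bits into one starting at 0xDC00. The pair is usually called a surrogate pair. The class name CodeUnitDemo and the method split are hypothetical; the result is checked against the JDK's own Character.toChars.

```java
import java.util.Arrays;

final class CodeUnitDemo {
    /** Splits a supplementary code point (above U+FFFF) into its two UTF-16 code units. */
    static char[] split(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        // A BMP character needs one code unit.
        System.out.println(Character.charCount(0x4E2D));   // 1

        // U+1D546 is outside the BMP and needs two.
        char[] units = split(0x1D546);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D835 DD46

        // The hand-written split agrees with the JDK, and the pair maps back.
        System.out.println(Arrays.equals(units, Character.toChars(0x1D546)));
        System.out.println(Character.toCodePoint(units[0], units[1]) == 0x1D546);
    }
}
```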

Reference links
    • Unicode: http://zh.wikipedia.org/wiki/unicode
    • UCS: http://zh.wikipedia.org/wiki/%e9%80%9a%e7%94%a8%e5%ad%97%e7%ac%a6%e9%9b%86
    • UTF-16: http://zh.wikipedia.org/wiki/utf-16

Closing remarks

The page layout is modeled on Wikipedia's, and each of the first three notes has ended up with different typesetting XD

UCS, Unicode, UTF: the more I read, the dizzier I get, and the more I write, the less I know what to write. And hardly any of it has much to do with Java!

It seems everything can be summed up in a single sentence: a String in Java is a sequence of char values, each of which is one code unit; the code point of most commonly used characters (those in the Basic Multilingual Plane) takes one code unit (2 bytes), while a supplementary character takes a pair of code units (4 bytes).
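That one-sentence summary can be checked directly: String.length() counts code units, while codePointCount counts code points. The following is a small sketch using only standard String and Character methods; the class name StringLengthDemo and the helper codePointsOf are hypothetical.

```java
final class StringLengthDemo {
    /** Returns the code points of a string, one int per character. */
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray(); // available since Java 8
    }

    public static void main(String[] args) {
        // 'A' (U+0041), a CJK character (U+4E2D), and an emoji (U+1F600).
        String s = "A" + "\u4E2D" + new String(Character.toChars(0x1F600));

        System.out.println(s.length());                       // 4 code units (char values)
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points

        // Iterating by code point keeps the two halves of the emoji together.
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X uses %d code unit(s)%n", cp, Character.charCount(cp));
        }
    }
}
```

Indexing with charAt would return the two halves of the emoji separately, which is why code that must handle supplementary characters usually iterates by code point.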
