Getting started with character set encoding in Java (VI): supplementary characters in Java

Last Update:2017-02-27 Source: Internet

Author: User


Java has long boasted of native Unicode support, but for a very long time that claim was only partly true. In fact, it was not until JDK 5.0 that Java caught up with Unicode and began to support supplementary characters.

The Unicode code space now runs from U+0000 to U+10FFFF, a total of 1,114,112 code points, of which only 1,112,064 are valid for characters (I have done the arithmetic for you: 2,048 code points are not). That does not mean Unicode has that many characters. Most of the code space is still unused, and the Unicode 4.0 specification assigns only 96,382 code points (which is still far more than the 65,536 characters many people assume). The range U+0000 to U+FFFF is called the Basic Multilingual Plane (BMP). Characters at U+10000 and above are called supplementary characters.

In Java (from Java 1.5 onward), a supplementary character is represented by two char values, which together form a so-called surrogate pair (the code point itself is handled as an int). The range of the first char is called the high-surrogates range (\uD800 to \uDBFF, 1,024 values), and the range of the second char is called the low-surrogates range (\uDC00 to \uDFFF, 1,024 values). Surrogate pairs can therefore represent 1,024 × 1,024 = 1,048,576 characters. Add the 65,536 code points of the BMP, subtract the 2,048 surrogate code points, and the result is exactly 1,112,064.
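The arithmetic above can be checked against the constants that java.lang.Character exposes. This is only a minimal sketch: the class name CodeSpaceArithmetic and its method names are made up for illustration, while the Character constants are standard.

```java
// Hypothetical helper class; only the Character constants come from the JDK.
class CodeSpaceArithmetic {

    // Size of the whole code space, U+0000 through U+10FFFF.
    static int totalCodePoints() {
        return Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    // Code points reserved for surrogates, U+D800 through U+DFFF.
    static int surrogateCodePoints() {
        return Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1;
    }

    // Characters reachable through a (high, low) surrogate pair.
    static int supplementaryCodePoints() {
        int high = Character.MAX_HIGH_SURROGATE - Character.MIN_HIGH_SURROGATE + 1;
        int low = Character.MAX_LOW_SURROGATE - Character.MIN_LOW_SURROGATE + 1;
        return high * low;
    }

    // BMP size plus supplementary characters, minus the surrogate block.
    static int validCodePoints() {
        int bmp = Character.MAX_VALUE + 1;
        return bmp + supplementaryCodePoints() - surrogateCodePoints();
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints());         // 1114112
        System.out.println(surrogateCodePoints());     // 2048
        System.out.println(supplementaryCodePoints()); // 1048576
        System.out.println(validCodePoints());         // 1112064
    }
}
```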

The Unicode code space has a few spots where a little carelessness leads to mistakes. For example, we all know that the range U+0000 to U+FFFF is called the Basic Multilingual Plane (BMP), and that in UTF-16 a character in this range fits in a single char. Look at the range and it appears to hold 65,536 values, so you might say that a single 16-bit UTF-16 code unit can represent 65,536 characters, and that the BMP contains 65,536 characters. But think again about the surrogate pairs just mentioned. How can a supplementary character in UTF-16 (again, a character that needs two char values) be recognized as one supplementary character rather than two ordinary characters? You know the answer: by checking whether the first char falls in the high-surrogates range and the second in the low-surrogates range. That means the 2,048 values shared by the high and low surrogates (0xD800 to 0xDFFF) cannot be assigned to any other character.
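The recognition rule described above can be sketched roughly as follows. The class SurrogateScan and its method countCharacters are hypothetical names; Character.isHighSurrogate and Character.isLowSurrogate are the standard JDK checks. In real code String.codePointCount does the same job.

```java
// Hypothetical illustration of how a decoder tells a surrogate pair
// apart from two ordinary BMP characters.
class SurrogateScan {

    // Counts characters (code points) by inspecting each char in turn.
    static int countCharacters(String s) {
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            char first = s.charAt(i);
            boolean pair = Character.isHighSurrogate(first)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            // A valid pair consumes two chars; anything else consumes one.
            i += pair ? 2 : 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "A" + new String(Character.toChars(0x2F81A)) + "B";
        System.out.println(s.length());                      // 4 chars
        System.out.println(countCharacters(s));              // 3 characters
        System.out.println(s.codePointCount(0, s.length())); // 3, the JDK's answer
    }
}
```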

But that is a rule of the UTF-16 encoding. What about the Unicode character set itself? Are any characters assigned to the numbers U+D800 through U+DFFF? The answer is no. This is a typical case of a character set being arranged for the convenience of an encoding. (Why do that? So that the characters of the Basic Multilingual Plane correspond one-to-one with the characters that UTF-16 encodes in a single char, which saves trouble.) It also shows how deeply UTF-16 and Unicode are intertwined. In other words, whether you count in Unicode or in UTF-16, the range 0x0000 to 0xFFFF has room for only 63,488 characters. It is rather like the way CPUs were at first pressed into service for multimedia applications, and once that became common, the CPU hardware itself was modified to support multimedia.
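The figure of 63,488 can be confirmed by brute force. The sketch below uses a hypothetical class BmpCount, which walks every 16-bit value and skips those that Character.isSurrogate (available since Java 7) reports as reserved.

```java
// Hypothetical helper; Character.isSurrogate is the standard JDK check.
class BmpCount {

    // Counts the 16-bit values that are not reserved as surrogates.
    static int nonSurrogateBmpCodePoints() {
        int count = 0;
        for (int value = Character.MIN_VALUE; value <= Character.MAX_VALUE; value++) {
            if (!Character.isSurrogate((char) value)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(nonSurrogateBmpCodePoints()); // 63488
    }
}
```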

I would rather not, but two concepts have to be explained here: code points and code units.

A code point is the number assigned to a character in Unicode, and a character occupies exactly one code point. For example, the code point of the character "汉" (Han) is U+6C49.

A code unit belongs to an encoding method: it is the smallest unit of storage that the encoding uses once a character has been encoded. In UTF-8, for example, the code unit is one byte, and a character is encoded as 1, 2, 3, or 4 bytes. In UTF-16 the code unit is two bytes (that is, one char), and a character is encoded as 1 or 2 chars (you will never find a UTF-16 encoded character smaller than one char). To labor the point: a character corresponds to exactly one code point, but it may take several code units (for example, it may be encoded as 2 chars).
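To make the distinction concrete, here is a small sketch that reports how many code units one code point needs in each encoding. The class CodeUnits and its methods are illustrative names, and the sample code points other than U+6C49 and U+2F81A are chosen here only as examples.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper comparing code-unit counts per encoding.
class CodeUnits {

    // Number of UTF-8 code units (bytes) needed for one code point.
    static int utf8Units(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of UTF-16 code units (chars) needed for one code point.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x6C49, 0x2F81A};
        for (int cp : samples) {
            System.out.printf("U+%04X: %d byte(s) in UTF-8, %d char(s) in UTF-16%n",
                    cp, utf8Units(cp), utf16Units(cp));
        }
    }
}
```

For these four samples the UTF-8 counts are 1, 2, 3, and 4 bytes, while the UTF-16 counts are 1, 1, 1, and 2 chars.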

These concepts are not academic tongue twisters. They mean that when you want a uniform way to specify which character to use, it is always better to use the code point (that is, to tell your program the character's number in Unicode) than to use code units (because then you have to distinguish between cases: sometimes you supply one 16-bit number, sometimes two).
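As a sketch of what working by code point looks like in practice, the example below builds a string from code points and reads it back the same way. The class CodePointApi and the method fromCodePoints are hypothetical; StringBuilder.appendCodePoint, String.codePointCount, String.codePointAt, and String.codePoints (Java 8 and later) are standard JDK calls.

```java
// Hypothetical helper showing the code-point-oriented String APIs.
class CodePointApi {

    // Builds a string from code points, without the caller having to
    // know whether each one needs one char or two.
    static String fromCodePoints(int... codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = fromCodePoints(0x6C49, 0x2F81A);
        System.out.println(s.length());                      // 3 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        // Iterate by code point rather than by char.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```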

For example, take the supplementary character ??? (yes, you are looking at three question marks, because my system cannot display this character). Its number in Unicode is U+2F81A, and when you need to use it in a program you can write:

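// Build the string from the code point; toChars returns the surrogate pair.
String s = String.valueOf(Character.toChars(0x2F81A));
char[] chars = s.toCharArray();
for (char c : chars) {
    // Print each UTF-16 code unit in hexadecimal.
    System.out.format("%x ", (int) c);
}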

The for loop prints the UTF-16 encoding of the character, and the result is

d87e dc1a

Have you noticed? The character has become two char values: 0xD87E is the high surrogate and 0xDC1A is the low surrogate.
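Where those two values come from can be shown with the standard UTF-16 arithmetic: subtract 0x10000 from the code point, put the upper 10 bits of the result in the high surrogate and the lower 10 bits in the low surrogate. The class SurrogateMath below is a hypothetical sketch of that calculation. The JDK already provides the equivalents Character.highSurrogate, Character.lowSurrogate, and Character.toCodePoint, which should be preferred in real code.

```java
// Hypothetical re-implementation of the UTF-16 surrogate arithmetic.
class SurrogateMath {

    // Upper 10 bits of (codePoint - 0x10000), offset into U+D800..U+DBFF.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Lower 10 bits of (codePoint - 0x10000), offset into U+DC00..U+DFFF.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    // Reverses the split: rebuilds the code point from a surrogate pair.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x2F81A;
        char h = high(cp);
        char l = low(cp);
        System.out.printf("%x %x%n", (int) h, (int) l); // d87e dc1a
        System.out.printf("%x%n", combine(h, l));       // 2f81a
    }
}
```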

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
