Introduction to Unicode supplementary characters (supplementary planes) and their use in Java

Source: Internet
Author: User
Tags: Java SE

Preface

Java added support for the Unicode supplementary planes starting with release 1.5. The examples in this article were tested on JDK 1.6.

The relevant APIs are mainly in the Character and String classes. The following is an excerpt from the documentation of the Character class.

==================================================================================================

The Character class wraps a value of the primitive type char in an object. An object of type Character contains a single field whose type is char.

In addition, the class provides several methods for determining a character's category (lowercase letter, digit, and so on) and for converting characters from uppercase to lowercase and vice versa.

Character information is based on the Unicode Standard, version 4.0. The methods and data of the Character class are defined by the information in the UnicodeData file, which is part of the Unicode Character Database maintained by the Unicode Consortium. This file specifies various properties, including the name and general category of every defined Unicode code point or character range. The file and its description are available from the Unicode Consortium at: http://www.unicode.org

The char data type (and therefore the value that a Character object encapsulates) is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar values. (Refer to the definition of the U+n notation in the Unicode standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, a supplementary character is represented as a pair of char values, the first from the high-surrogate range (\uD800-\uDBFF) and the second from the low-surrogate range (\uDC00-\uDFFF).

A char value therefore represents a Basic Multilingual Plane (BMP) code point, including the surrogate code points, or a code unit of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of the int are used to represent the Unicode code point, and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

Methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.

Methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).

In the Java SE API documentation, "Unicode code point" is used for character values in the range between U+0000 and U+10FFFF, and "Unicode code unit" is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.

====================================================================================================
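
The difference between the char-based and int-based methods described in the excerpt can be tried out with a small sketch like the one below. The class name and the two helper methods are hypothetical names chosen for illustration; only the Character calls come from the JDK.

```java
class CharVersusIntDemo {
    // Hypothetical helper: asks the char-based overload about the first UTF-16 code unit only.
    static boolean firstUnitIsLetter(int codePoint) {
        char[] units = Character.toChars(codePoint);
        return Character.isLetter(units[0]);
    }

    // Hypothetical helper: asks the int-based overload about the whole code point.
    static boolean codePointIsLetter(int codePoint) {
        return Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        int cp = 0x2F81A; // the CJK code point used in the Character documentation

        System.out.println(Character.isSupplementaryCodePoint(cp)); // true
        System.out.println(Character.charCount(cp));                // 2 (two char values needed)
        System.out.println(firstUnitIsLetter(cp));                  // false (a lone high surrogate)
        System.out.println(codePointIsLetter(cp));                  // true
    }
}
```

For a BMP character such as 'a', both helpers give the same answer, because the character fits in a single char.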

As the excerpt shows, a supplementary character is represented by a char array of length 2 that holds a high surrogate followed by a low surrogate. The following examples show how this works in practice.

Example One

The source code of the codePointAt method is as follows:

    // Excerpt from java.lang.Character
    public static int codePointAt(char[] a, int index) {
        return codePointAtImpl(a, index, a.length);
    }

    static int codePointAtImpl(char[] a, int index, int limit) {
        char c1 = a[index++];
        if (isHighSurrogate(c1)) {
            if (index < limit) {
                char c2 = a[index];
                if (isLowSurrogate(c2)) {
                    // A complete surrogate pair: combine both halves.
                    return toCodePoint(c1, c2);
                }
            }
        }
        // A BMP character or an unpaired surrogate: return the char itself.
        return c1;
    }

    public static int toCodePoint(char high, char low) {
        return ((high - '\uD800') << 10)
            + (low - '\uDC00') + 65536;
    }
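
To make the surrogate arithmetic concrete, here is a minimal sketch that splits U+1D306 into its two halves and puts them back together. The class name and the helpers combine and split are hypothetical and written only for illustration; combine is assumed to mirror the formula used by toCodePoint (shift the high half left by 10 bits, add the low half, add 0x10000), and the result is compared against the JDK's own Character.toCodePoint.

```java
class SurrogateMath {
    // Combine a surrogate pair into a code point.
    static int combine(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // Inverse operation: split a supplementary code point into {high, low}.
    static char[] split(int codePoint) {
        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = split(0x1D306);
        System.out.println(Integer.toHexString(pair[0])); // d834
        System.out.println(Integer.toHexString(pair[1])); // df06

        int cp = combine(pair[0], pair[1]);
        System.out.println(Integer.toHexString(cp));                    // 1d306
        System.out.println(cp == Character.toCodePoint(pair[0], pair[1])); // true
    }
}
```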

As the code shows, when the array holds a supplementary character and the index is 0, codePointAt returns the code point of the whole supplementary character. When the index is 1, it returns only the value of the second char in the array (the low surrogate).

    public static void main(String[] args) {
        // U+1D306 is a supplementary-plane character
        char[] c = Character.toChars(Integer.parseInt("1D306", 16));
        // Prints 119558, the decimal value of 0x1D306
        System.out.println(Character.codePointAt(c, 0));
        // Prints 57094, the decimal value of c[1] (the low surrogate)
        System.out.println(Character.codePointAt(c, 1));
    }
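
Because an index can land on the second half of a pair, it can be useful to check for that before calling codePointAt. The sketch below is one possible way to do so; isInsidePair is a hypothetical helper, not a JDK method. It also shows Character.codePointBefore, and the case of a high surrogate that is not followed by a low surrogate, where codePointAt falls back to returning the char itself.

```java
class PairBoundaryDemo {
    // Hypothetical helper: true when index points at the second half of a surrogate pair.
    static boolean isInsidePair(char[] a, int index) {
        return index > 0
                && Character.isLowSurrogate(a[index])
                && Character.isHighSurrogate(a[index - 1]);
    }

    public static void main(String[] args) {
        char[] c = Character.toChars(0x1D306);   // {0xD834, 0xDF06}
        System.out.println(isInsidePair(c, 0));  // false
        System.out.println(isInsidePair(c, 1));  // true

        // Reading backwards from the end of the pair yields the full code point.
        System.out.println(Character.codePointBefore(c, 2)); // 119558

        // An unpaired high surrogate is returned unchanged.
        char[] lone = { '\uD834', 'x' };
        System.out.println(Character.codePointAt(lone, 0)); // 55348
    }
}
```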

When every character in the array belongs to the Basic Multilingual Plane, codePointAt simply returns the code point of the character at the given index.

    public static void main(String[] args) {
        char[] c = {'a', 'b', '测', '试'};
        System.out.println(Character.codePointAt(c, 0)); // 97
        System.out.println(Character.codePointAt(c, 1)); // 98
        System.out.println(Character.codePointAt(c, 2)); // 27979
        System.out.println(Character.codePointAt(c, 3)); // 35797
        System.out.println((char) 97);    // a
        System.out.println((char) 98);    // b
        System.out.println((char) 27979); // 测
        System.out.println((char) 35797); // 试
    }

Example Two

The length and codePointCount methods of the String class return different values when the string contains supplementary characters. For strings that contain only BMP characters, the two values are the same. length returns the number of char values (UTF-16 code units) in the string, while codePointCount returns the number of code points.

    public static void main(String[] args) {
        // U+1D306 is a supplementary-plane character
        char[] c = Character.toChars(Integer.parseInt("1D306", 16));
        System.out.println(Character.codePointAt(c, 0));  // 119558, the decimal value of 0x1D306
        System.out.println(Character.codePointAt(c, 1));  // 57094, the decimal value of c[1]
        System.out.println(new String(c).codePointAt(0)); // 119558, the decimal value of 0x1D306
        System.out.println(new String(c).codePointAt(1)); // 57094, the decimal value of c[1]
        String str = "ABCDEFG" + new String(c);
        System.out.println(str.length());                          // 9
        System.out.println(str.codePointCount(0, str.length()));   // 8
    }

In the example above, the string length is 9 because the character U+1D306 occupies two char values, while it counts as only one code point. The two calls therefore return 9 and 8 respectively.
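
The same distinction matters when looping over a string. A minimal sketch of walking a string by code point rather than by char is shown below; the class name and the codePointsOf helper are hypothetical, and the loop relies only on String.codePointAt and Character.charCount, both available since Java 1.5.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointWalk {
    // Collects the code points of s, advancing by 1 or 2 chars per step.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 2 for supplementary characters, otherwise 1
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "ABCDEFG" + new String(Character.toChars(0x1D306));
        List<Integer> cps = codePointsOf(str);

        System.out.println(str.length());                    // 9
        System.out.println(cps.size());                      // 8
        System.out.println(Integer.toHexString(cps.get(7))); // 1d306
        // char index of the 8th code point (7 code points after index 0)
        System.out.println(str.offsetByCodePoints(0, 7));    // 7
    }
}
```

A plain for loop over charAt would instead visit the two surrogates of U+1D306 separately.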


