Unicode surrogate programming using the Java language


Early versions of Java used the 16-bit char data type to represent Unicode characters. This design was reasonable at the time, because all Unicode characters had values no greater than 65,535 (0xFFFF) and could be represented in 16 bits. However, Unicode later raised the maximum value to 1,114,111 (0x10FFFF). Because 16 bits are too small to represent all the characters in Unicode version 3.1, 32-bit values, called code points, are used in the UTF-32 encoding scheme.

However, 16-bit values use memory more efficiently than 32-bit values, so Unicode introduced a design that allows 16-bit values to remain in use. This design, used in the UTF-16 encoding scheme, assigns 1,024 values to 16-bit high surrogates and another 1,024 values to 16-bit low surrogates. It uses a high surrogate followed by a low surrogate, that is, a surrogate pair, to represent the 1,048,576 (0x100000) values (the product of 1,024 and 1,024) between 65,536 (0x10000) and 1,114,111 (0x10FFFF).
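The arithmetic behind a surrogate pair can be sketched as follows. This is a minimal illustration, not part of the original article: the class and method names are hypothetical, and the base values 0xD800 (high surrogates) and 0xDC00 (low surrogates) come from the UTF-16 definition rather than from the text above.

```java
class SurrogateMath {
    // Split a supplementary code point (0x10000..0x10FFFF) into a surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value, 0..0xFFFFF
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits: 1,024 values
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // lower 10 bits: 1,024 values
        return new char[] { high, low };
    }

    // Combine a surrogate pair back into a code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x20000);
        // prints "U+20000 -> D840 DC00"
        System.out.printf("U+20000 -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        // The manual formula agrees with the library method: prints "true"
        System.out.println(
            fromSurrogates(pair[0], pair[1]) == Character.toCodePoint(pair[0], pair[1]));
    }
}
```

In practice there is no need to write this arithmetic by hand, since Character.toCodePoint() and Character.toChars() perform the same conversions.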

Java 1.5 preserves the behavior of the char type, which represents UTF-16 values (for compatibility with existing programs), and it introduces the concept of a code point to represent UTF-32 values. This extension (implemented according to JSR 204: Unicode Supplementary Character Support) does not require you to remember the exact values of Unicode code points or the conversion algorithms, but it is important to understand how to use the surrogate API correctly.
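As a small sketch of the difference between char values and code points, the following uses only methods available since Java 1.5. The class name, the helper method, and the sample string are assumptions chosen for illustration.

```java
class CodePointApiDemo {
    // Summarize a string in both char units and code points.
    static String describe(String s) {
        int chars = s.length();                          // number of 16-bit char values
        int codePoints = s.codePointCount(0, s.length()); // number of code points
        return String.format("chars=%d codePoints=%d", chars, codePoints);
    }

    public static void main(String[] args) {
        // "a", then U+20000 written as a surrogate pair, then "b"
        String s = "a\uD840\uDC00b";

        System.out.println(describe(s));                           // chars=4 codePoints=3
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 20000
        System.out.println(Character.charCount(s.codePointAt(1))); // 2
    }
}
```

The string holds three characters from the user's point of view, but length() reports four because the middle character occupies two char values.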

East Asian countries and regions have increased the number of characters in their character sets in recent years to meet user needs. These standards include GB 18030 from China's national standards organization and JIS X 0213 from Japan. As a result, programs that seek to comply with these standards increasingly need to support Unicode surrogate pairs. This article explains the relevant Java APIs and coding options, and is intended for readers who plan to redesign software that can only handle characters of the char type into new versions that can handle surrogate pairs.

Sequential access

Sequential access is a basic operation for processing strings in the Java language. With this approach, each character in the input string is accessed in order from beginning to end, or sometimes from end to beginning. This section discusses seven technical examples that use sequential access to create an array of 32-bit code points from a string, and estimates their processing times.

Example 1-1: Baseline (no surrogate pair support)

Listing 1 assigns each 16-bit char value directly to a 32-bit code point value and does not consider surrogate pairs at all:

Listing 1. No surrogate pair support

int[] toCodePointArray(String str) { // Example 1-1
  int len = str.length();     // the length of str
  int[] acp = new int[len];   // an array of code points

  for (int i = 0, j = 0; i < len; i++) {
    acp[j++] = str.charAt(i);
  }
  return acp;
}

Although this example does not support surrogate pairs, it provides a baseline processing time against which the subsequent sequential access examples can be compared.
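To see what the lack of surrogate pair support means in practice, the sketch below runs the Listing 1 logic on a string containing one supplementary character. The wrapper class name and the sample string are assumptions added for illustration.

```java
class Listing1Demo {
    // Same logic as Listing 1: one array element per char value.
    static int[] toCodePointArray(String str) {
        int len = str.length();
        int[] acp = new int[len];
        for (int i = 0, j = 0; i < len; i++) {
            acp[j++] = str.charAt(i);
        }
        return acp;
    }

    public static void main(String[] args) {
        // "a", then U+20000 written as a surrogate pair, then "b"
        int[] acp = toCodePointArray("a\uD840\uDC00b");

        StringBuilder sb = new StringBuilder();
        for (int cp : acp) {
            sb.append(Integer.toHexString(cp)).append(' ');
        }
        // prints "61 d840 dc00 62": the pair stays split into two
        // surrogate values instead of the single code point 20000
        System.out.println(sb.toString().trim());
    }
}
```

The result has four elements rather than three, and neither of the two middle elements is a valid character on its own.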

Example 1-2: Using isSurrogatePair()

Listing 2 uses isSurrogatePair() to count the total number of surrogate pairs. After counting, it allocates a code point array large enough to hold the result. It then enters a sequential access loop, using isHighSurrogate() and isLowSurrogate() to determine whether each char is a high surrogate or a low surrogate. When it finds a high surrogate followed by a low surrogate, it uses toCodePoint() to convert the surrogate pair to a code point value and advances the current index by 2. Otherwise, it assigns the char value directly to a code point value and advances the current index by 1. The processing time of this example is 1.38 times longer than that of Example 1-1.

Listing 2. Limited support

int[] toCodePointArray(String str) { // Example 1-2
  int len = str.length();          // the length of str
  int[] acp;                       // an array of code points
  int surrogatePairCount = 0;      // the count of surrogate pairs

  for (int i = 1; i < len; i++) {
    if (Character.isSurrogatePair(str.charAt(i - 1), str.charAt(i))) {
      surrogatePairCount++;
      i++;
    }
  }
  acp = new int[len - surrogatePairCount];
  for (int i = 0, j = 0; i < len; i++) {
    char ch0 = str.charAt(i);         // the current char
    if (Character.isHighSurrogate(ch0) && i + 1 < len) {
      char ch1 = str.charAt(i + 1);   // the next char
      if (Character.isLowSurrogate(ch1)) {
        acp[j++] = Character.toCodePoint(ch0, ch1);
        i++;
        continue;
      }
    }
    acp[j++] = ch0;
  }
  return acp;
}
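One way to check the behavior of Listing 2 is to compare its output with String.codePoints(), which is assumed to be available here (it was added in Java 8, later than the Java 1.5 API this article covers). The class name and sample strings are hypothetical, and the samples include unpaired surrogates to show that those are passed through unchanged.

```java
import java.util.Arrays;

class Listing2Check {
    // Same logic as Listing 2, reproduced so the check is self-contained.
    static int[] toCodePointArray(String str) {
        int len = str.length();
        int surrogatePairCount = 0;
        for (int i = 1; i < len; i++) {
            if (Character.isSurrogatePair(str.charAt(i - 1), str.charAt(i))) {
                surrogatePairCount++;
                i++;
            }
        }
        int[] acp = new int[len - surrogatePairCount];
        for (int i = 0, j = 0; i < len; i++) {
            char ch0 = str.charAt(i);
            if (Character.isHighSurrogate(ch0) && i + 1 < len) {
                char ch1 = str.charAt(i + 1);
                if (Character.isLowSurrogate(ch1)) {
                    acp[j++] = Character.toCodePoint(ch0, ch1);
                    i++;
                    continue;
                }
            }
            acp[j++] = ch0;
        }
        return acp;
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc",              // no surrogates
            "a\uD840\uDC00b",   // one valid surrogate pair (U+20000)
            "a\uD840",          // unpaired high surrogate at the end
            "\uDC00\uD840"      // low surrogate first: not a valid pair
        };
        for (String s : samples) {
            int[] expected = s.codePoints().toArray(); // requires Java 8 or later
            // prints "true" for each sample
            System.out.println(Arrays.equals(toCodePointArray(s), expected));
        }
    }
}
```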
