How to correctly handle Unicode characters in Java

Source: Internet
Author: User

Recently, while developing an input method program, I ran into a small problem: deleting an emoji took two delete operations instead of one. My first guess was that this had to do with how Java handles Unicode characters, so I looked up the official Java documentation, solved the problem, and am writing a short summary here. The original article is at the link below for anyone interested.



http://www.oracle.com/technetwork/articles/java/supplementary-142654.html


Note: the 16-bit unit this article refers to as a "Java byte" or "Java char" is the Java platform's char type, which is one UTF-16 code unit. Please do not confuse it with C's 8-bit byte.


The first thing to know is that Unicode was originally designed as a 16-bit character set, which allows 65,536 characters (2 to the 16th power). That turned out not to be enough, especially once the full range of Chinese characters is taken into account (I certainly don't know all of them), so the code space was extended to the range U+0000 to U+10FFFF. That range needs 21 bits and holds 1,114,112 code points, of which 1,112,064 can represent characters because 2,048 values are reserved for surrogates (explained below). The first 65,536 code points (U+0000 to U+FFFF) are called the Basic Multilingual Plane (BMP); characters above U+FFFF are called supplementary characters.


UTF-16 is an encoding that represents Unicode characters as 16-bit unsigned code units. A BMP character takes one UTF-16 code unit; a supplementary character takes two.


In C, a char is typically one byte of 8 bits. In Java, however, a primitive char is 16 bits, exactly one UTF-16 code unit, because the Java platform represents text as UTF-16. This makes BMP characters easy to handle in Java: each one fits in a single char. A supplementary character, however, takes two chars (two UTF-16 code units).
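The difference between chars and characters can be seen by comparing String.length() with String.codePointCount(). The following is only a minimal sketch; the class and method names are made up for illustration, and U+1F60E is just one example of a character outside the BMP.

```java
// Hypothetical demo class; the names are illustrative only.
final class CharCountDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String basic = "A";                                             // U+0041, inside the BMP
        String supplementary = new String(Character.toChars(0x1F60E)); // outside the BMP
        System.out.println(codeUnits(basic) + " " + codePoints(basic));                 // prints: 1 1
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)); // prints: 2 1
    }
}
```

So length() counts chars, not characters, and the two numbers differ as soon as a supplementary character is involved.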


For example, capital A is U+0041, which is in the BMP, so it takes a single UTF-16 code unit, written [0041]. The character U+10400 (𐐀, Deseret capital letter Long I) is a supplementary character, so it takes two UTF-16 code units, written [D801][DC00] (each pair of square brackets stands for one UTF-16 code unit; the brackets themselves mean nothing). The first unit is called the high surrogate and ranges from U+D800 to U+DBFF; the second is called the low surrogate and ranges from U+DC00 to U+DFFF. This looks similar to a multibyte encoding, but there is one very important difference: the range U+D800 to U+DFFF is reserved for UTF-16 surrogates and is never assigned to any actual character. That means a program only has to check whether a char falls in this range to know whether it is a complete BMP character on its own or one half of a supplementary character.
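The arithmetic behind [D801][DC00] can be sketched as follows. The class and method names are hypothetical; in real code Character.toChars already does this, and the sketch only spells out the UTF-16 rule: subtract 0x10000, then put the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate.

```java
// Hypothetical helper that spells out the UTF-16 surrogate calculation.
final class SurrogateMath {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // high and low surrogate.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10400);
        System.out.printf("[%04X][%04X]%n", (int) pair[0], (int) pair[1]); // prints: [D801][DC00]
    }
}
```

For U+10400 the offset is 0x400, so the high surrogate is 0xD800 + 1 and the low surrogate is 0xDC00 + 0.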


Back to the problem from the beginning: in my Android input method, deleting an emoji always took two delete operations. The emoji I was using are all supplementary characters, so each one occupies two UTF-16 code units in the edit box, while each delete operation removed only one code unit (one Java char). The half that is left behind is an unpaired surrogate, so it is displayed as a broken character. Only after deleting that second char is the emoji really gone.


In real-world text, BMP characters and supplementary characters are mixed, so some characters occupy one Java char and others occupy two. Before deleting a character, the program therefore has to decide whether to delete one char or two. By the rule described earlier this is simple: check whether the char being deleted is in [U+D800, U+DFFF] (inclusive). If it is, it is not a valid character on its own but half of a supplementary character, and the program should delete both chars of the pair. If it is not in that range, it is a BMP character and only that one char needs to be deleted.
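The deletion rule can be sketched on a plain String as below. This is not the code from my input method; the class and method names are hypothetical, and a real Android input method would apply the same check to the text in the edit box rather than to a String.

```java
// Hypothetical helper illustrating the "delete one char or two" decision.
final class Backspace {
    // Removes the last character from text: two chars if the text ends with
    // a high surrogate followed by a low surrogate, otherwise one char.
    static String deleteLastCodePoint(String text) {
        int len = text.length();
        if (len == 0) {
            return text; // nothing to delete
        }
        int cut = 1;
        if (len >= 2
                && Character.isLowSurrogate(text.charAt(len - 1))
                && Character.isHighSurrogate(text.charAt(len - 2))) {
            cut = 2; // the last two chars form one supplementary character
        }
        return text.substring(0, len - cut);
    }
}
```

Checking for a proper high/low pair, rather than only the [U+D800, U+DFFF] range, means a stray unpaired surrogate is removed as a single char instead of taking its neighbour with it. The standard library method String.offsetByCodePoints(text.length(), -1) gives the same cut position.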


With the theory out of the way (it is not really difficult; if anything is unclear, blame my explanation, since engineers tend to think faster than they can put things into words), here is a brief introduction to the relevant functions in Java.



    • Character.toChars(int codePoint): The parameter codePoint is a Unicode code point. The method converts it to its UTF-16 form as a char array. If codePoint is a BMP character, the returned array contains one char; if it is a supplementary character, the array contains two chars.

      Example: to output a supplementary character in a program, you can write String.valueOf(Character.toChars(0x1F60E))


    • Character.isLowSurrogate(char ch): Determines whether ch is a low surrogate, that is, the second 16-bit code unit of a supplementary character.

    • Character.isHighSurrogate(char ch): Determines whether ch is a high surrogate, that is, the first 16-bit code unit of a supplementary character.
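The three functions can be tried together in a short program. This is only an illustrative sketch; the class name and the helper methods fromCodePoint and classify are made up for the example.

```java
// Hypothetical demo of Character.toChars, isHighSurrogate and isLowSurrogate.
final class SurrogateApiDemo {
    // Builds a String from a single code point using Character.toChars.
    static String fromCodePoint(int codePoint) {
        return String.valueOf(Character.toChars(codePoint));
    }

    // Labels each char of the string as HIGH, LOW, or BMP, separated by spaces.
    static String classify(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (i > 0) {
                sb.append(' ');
            }
            if (Character.isHighSurrogate(ch)) {
                sb.append("HIGH");
            } else if (Character.isLowSurrogate(ch)) {
                sb.append("LOW");
            } else {
                sb.append("BMP");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A" + fromCodePoint(0x1F60E);
        System.out.println(classify(s)); // prints: BMP HIGH LOW
    }
}
```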

This article is from the "Yaorugang" blog, make sure to keep this source http://yaorugang.blog.51cto.com/10286709/1704142
