How to correctly handle Unicode characters in Java

Last Update:2015-10-19 Source: Internet

Author: User

While developing an input method program recently, I ran into a small problem: an emoji could not be deleted in one step and took two delete operations. Intuitively, this had to be a problem with how Java handles Unicode characters, so I consulted the official Java documentation and solved it. Here is a brief summary. The original article is at the link below for anyone interested.

http://www.oracle.com/technetwork/articles/java/supplementary-142654.html

Note: In this article, a "Java character" (sometimes loosely called a "Java byte") means Java's 16-bit char, that is, one UTF-16 code unit. It is not Java's 8-bit byte type, and please do not confuse it with C's 8-bit char.

The first thing to know is that Unicode was originally designed as a 16-bit character set, so it could hold at most 65,536 characters (2 to the 16th power; my math is pretty good). But 65,536 characters turned out not to be enough, especially to cover the enormous number of Chinese characters (I certainly don't know them all), so the Unicode code space was extended to the range U+0000 to U+10FFFF, which needs 21 bits and can hold up to 1,112,064 characters once the 2,048 surrogate code points are excluded. The first 65,536 code points (U+0000 to U+FFFF) are called the Basic Multilingual Plane (BMP), and the characters beyond 16 bits (U+10000 to U+10FFFF) are called supplementary characters. This article also refers to BMP characters as "standard" or "basic" characters and to supplementary characters as "extended" characters.
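The figures above can be checked against the constants in java.lang.Character. This is only a minimal sketch: the class name CodePointRanges and its helper methods are made up for illustration, and Character.isBmpCodePoint assumes Java 7 or later.

```java
// Illustrative sketch; the class and method names are hypothetical.
final class CodePointRanges {

    // Total number of code points in U+0000..U+10FFFF.
    static int codeSpaceSize() {
        return Character.MAX_CODE_POINT + 1;
    }

    // Number of surrogate code points in U+D800..U+DFFF.
    static int surrogateCount() {
        return Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1;
    }

    // Labels a code point as "BMP", "supplementary", or "invalid".
    static String describe(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid";
        }
        return Character.isBmpCodePoint(codePoint) ? "BMP" : "supplementary";
    }

    public static void main(String[] args) {
        System.out.println(codeSpaceSize() - surrogateCount()); // 1112064
        System.out.println(describe(0x0041));  // BMP
        System.out.println(describe(0x4E2D));  // BMP
        System.out.println(describe(0x1F60E)); // supplementary
    }
}
```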

UTF-16 is an encoding that represents Unicode characters as 16-bit unsigned code units. A BMP character needs only one UTF-16 code unit, while a supplementary character takes two.

We all know that in C a primitive char occupies one byte, which is typically 8 bits. In Java, however, a primitive char occupies 16 bits, the same size as a BMP character, because the Java platform represents text in UTF-16. This makes it easy for Java to handle BMP characters. A supplementary character, however, takes two char values in Java, that is, two UTF-16 code units.
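The difference between counting char values and counting Unicode characters can be seen with a short sketch. The class name CodeUnitCount and its two helpers are hypothetical; Character.charCount, String.length, and String.codePointCount are standard methods.

```java
// Illustrative sketch; the class and method names are hypothetical.
final class CodeUnitCount {

    // Number of Java chars (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A" (U+0041) followed by the supplementary character U+10400.
        String text = "A" + new String(Character.toChars(0x10400));

        System.out.println(Character.charCount(0x0041));  // 1
        System.out.println(Character.charCount(0x10400)); // 2
        System.out.println(codeUnits(text));              // 3
        System.out.println(codePoints(text));             // 2
    }
}
```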

For example, the Unicode value of the capital letter A is U+0041. It belongs to the BMP, so it occupies only one UTF-16 code unit, written as [0041]. The character 𐐀 (Deseret capital letter long I) has the Unicode value U+10400. It is a supplementary character, so it occupies two UTF-16 code units, written as [D801][DC00] (each pair of square brackets represents one UTF-16 code unit; the brackets themselves have no meaning). The first unit is called the high surrogate and ranges from U+D800 to U+DBFF; the second unit is called the low surrogate and ranges from U+DC00 to U+DFFF. This looks similar to a multibyte encoding, but there is one very important difference: the range U+D800 to U+DFFF is reserved for UTF-16, used only to encode supplementary characters, and no actual character is assigned to it. In other words, a program only needs to check whether a char falls within this range to know whether it should be treated as a complete BMP character or as one half of a supplementary character.
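The [D801][DC00] pair for U+10400 follows from simple arithmetic: subtract 0x10000, put the upper 10 bits into the high surrogate and the lower 10 bits into the low surrogate. The sketch below works that out by hand; the class SurrogateMath and its methods are hypothetical names, and the standard library already offers Character.highSurrogate and Character.lowSurrogate (Java 7 or later) for real use.

```java
// Illustrative sketch; the class and method names are hypothetical.
final class SurrogateMath {

    // High surrogate of a supplementary code point (U+10000..U+10FFFF only).
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low surrogate of a supplementary code point (U+10000..U+10FFFF only).
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    // True if the char is one half of a supplementary character.
    static boolean isHalf(char ch) {
        return ch >= 0xD800 && ch <= 0xDFFF;
    }

    public static void main(String[] args) {
        int cp = 0x10400;
        // Prints [D801][DC00]
        System.out.println(String.format("[%04X][%04X]", (int) high(cp), (int) low(cp)));
        System.out.println(isHalf('A'));      // false
        System.out.println(isHalf(high(cp))); // true
    }
}
```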

Back to the problem I mentioned at the beginning: when I was developing an Android input method, deleting an emoji always took two delete operations to erase it completely. This is because the emoji I use are all supplementary characters, so each one occupies two UTF-16 code units in the edit box, while each delete operation removed only one UTF-16 code unit, that is, one Java char. The other unit was left behind and was displayed as a broken character. Only after deleting that remaining char was the emoji really removed.
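The symptom can be reproduced on a plain String, without any Android code. This is a sketch of the naive behavior only; the class NaiveDelete and its method are hypothetical and are not taken from the input method itself.

```java
// Illustrative sketch of the bug; the class and method names are hypothetical.
final class NaiveDelete {

    // Removes exactly one char (one UTF-16 code unit) from the end.
    static String deleteLastChar(String text) {
        return text.isEmpty() ? text : text.substring(0, text.length() - 1);
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F60E));
        String text = "A" + emoji; // 3 chars: 'A' + high surrogate + low surrogate

        String once = deleteLastChar(text);
        // One delete leaves a dangling high surrogate behind.
        System.out.println(once.length());                            // 2
        System.out.println(Character.isHighSurrogate(once.charAt(1))); // true

        // A second delete is needed to remove the emoji completely.
        System.out.println(deleteLastChar(once).equals("A"));          // true
    }
}
```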

In real-world applications, BMP characters and supplementary characters are mixed, which means that some characters occupy one Java char and others occupy two. So before deleting a character, I must first determine whether to delete one char or two. This is very simple using the rule described earlier: check whether the value of the char being handled lies in [U+D800, U+DFFF] (inclusive). If it does, the char is not a valid BMP character but one half of a supplementary character, and the program should delete two Java chars together. Conversely, if the char does not fall within [U+D800, U+DFFF], it is a BMP character and is handled as a single Java char.
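The rule above can be sketched on a plain String as follows. The class SafeDelete and its method are hypothetical names for illustration; a real input method would apply the same check through whatever editing API it uses. The sketch removes two chars only when the last two chars form a proper high/low pair, and falls back to one char otherwise.

```java
// Illustrative sketch; the class and method names are hypothetical.
final class SafeDelete {

    // Removes the last character: two chars for a surrogate pair, else one.
    static String deleteLastCharacter(String text) {
        int len = text.length();
        if (len == 0) {
            return text;
        }
        int cut = 1;
        if (len >= 2
                && Character.isLowSurrogate(text.charAt(len - 1))
                && Character.isHighSurrogate(text.charAt(len - 2))) {
            cut = 2; // the last two chars encode one supplementary character
        }
        return text.substring(0, len - cut);
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F60E));
        System.out.println(deleteLastCharacter("A" + emoji).equals("A")); // true
        System.out.println(deleteLastCharacter("AB"));                    // A
    }
}
```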

With the theory clear (I'll assume you followed; it really isn't difficult, I just don't explain it well, and engineers always think faster than they can put into words), I would like to briefly introduce the related functions in Java.

Character.toChars(int codePoint): The parameter codePoint is a Unicode value; the method converts it to a Java char array. If codePoint is a BMP character, the returned array contains only one char; if codePoint is a supplementary character, the returned array contains two chars.
Example: to output a supplementary character in a program, you can write String.valueOf(Character.toChars(0x1F60E))
Character.isLowSurrogate(char ch): Determines whether a char is the low surrogate (the low 16-bit code unit) of a supplementary character.
Character.isHighSurrogate(char ch): Determines whether a char is the high surrogate (the high 16-bit code unit) of a supplementary character.
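The three functions above can be tried together in a small sketch. The class SurrogateApiDemo and its helper methods are hypothetical; only Character.toChars, Character.isHighSurrogate, Character.isLowSurrogate, and String.valueOf come from the standard library.

```java
// Illustrative sketch; the class and method names are hypothetical.
final class SurrogateApiDemo {

    // Builds a String from a single Unicode value.
    static String fromCodePoint(int codePoint) {
        return String.valueOf(Character.toChars(codePoint));
    }

    // Labels every char of the text as "bmp", "high", or "low".
    static String classify(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (i > 0) {
                out.append(' ');
            }
            if (Character.isHighSurrogate(ch)) {
                out.append("high");
            } else if (Character.isLowSurrogate(ch)) {
                out.append("low");
            } else {
                out.append("bmp");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = fromCodePoint(0x1F60E);
        System.out.println(emoji.length());        // 2
        System.out.println(classify("A" + emoji)); // bmp high low
    }
}
```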

This article is from the "Yaorugang" blog. Please be sure to keep this source: http://yaorugang.blog.51cto.com/10286709/1704142
