Yesterday, colleagues in the iOS group ran into a tricky question: how do you get the number of characters in a text box when the input contains emoji (each emoji should count as one character)?

Let me start with Java, which I have been working with recently. In Java, the length() method of String works fine for ordinary Chinese and English characters, but if a character's Unicode code point is greater than 0xFFFF, length() does not return the correct number of characters: each such character is counted as 2. Java already has a ready-made way to solve this: codePointCount. Unfortunately, after a long search I could not find a similar solution in Objective-C. (It seems that if the string is split into substrings, the length of the resulting array is the exact number of characters.) I'm not an iOS programmer, so I can't provide an Objective-C solution for the time being. Still, yesterday's digging turned up a few things worth sharing.

1. Most emoji have Unicode code points greater than 0xFFFF, which means they occupy 4 bytes in UTF-16. Only a small portion of emoji have code points below 0xFFFF, and those occupy 2 bytes in UTF-16.

2. On both Android and iOS, the string read from a text box is held in memory as UTF-16 by default. (Byte order, such as big-endian, only comes into play when the code units are written out as bytes.)

3. For reference, here are the UTF-16 encoding rules (once you understand them, the problem of counting code points on iOS more or less solves itself):
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.

2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.

3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits.

4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.

Graphically, steps 2 through 4 look like:

U' = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
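
The four steps above can be sketched in Java as follows. This is only an illustration of the rules, not code from any platform library; the class name Utf16Sketch and the method name encode are made up for this example.

```java
class Utf16Sketch {
    /** Encodes one Unicode code point U into one or two UTF-16 code units. */
    static char[] encode(int u) {
        if (u < 0 || u > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + u);
        }
        if (u < 0x10000) {
            // Step 1: the value fits in a single 16-bit unit.
            return new char[] { (char) u };
        }
        // Step 2: U' = U - 0x10000, which fits in 20 bits.
        int uPrime = u - 0x10000;
        // Steps 3 and 4: W1 carries the 10 high-order bits of U',
        // W2 carries the 10 low-order bits.
        char w1 = (char) (0xD800 | (uPrime >> 10));
        char w2 = (char) (0xDC00 | (uPrime & 0x3FF));
        return new char[] { w1, w2 };
    }

    public static void main(String[] args) {
        char[] units = encode(0x1F600); // U+1F600, an emoji above 0xFFFF
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DE00
    }
}
```

The result matches what Java's own Character.toChars(0x1F600) produces, which is a quick way to check the sketch.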
About emoji and UTF-16 encoding
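
To make the counting problem concrete, here is a small Java sketch that compares length() with codePointCount, and also counts by hand using only the surrogate ranges from the UTF-16 rules (0xD800 to 0xDBFF for the first unit, 0xDC00 to 0xDFFF for the second). The class name CodePointCountSketch and the method name countCodePoints are made up for this example. The same loop should be portable to any platform that exposes the 16-bit units of a string, though I have not tried it in Objective-C. Note that it counts code points, so an emoji built from several code points would still count as more than one.

```java
class CodePointCountSketch {
    /** Counts code points by treating each valid surrogate pair as one character. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            count++;
            // A high surrogate (W1) followed by a low surrogate (W2) is a single
            // character, so skip the second unit of the pair.
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "ab\uD83D\uDE00"; // "ab" followed by U+1F600
        System.out.println(text.length());                         // 4
        System.out.println(text.codePointCount(0, text.length())); // 3
        System.out.println(countCodePoints(text));                 // 3
    }
}
```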