Handling the emoji characters built into phone input methods in the mobile front end

Source: Internet
Author: User

Today the testers filed a bug: input containing emoji typed on a phone could not be submitted. I had long assumed that the emoji built into phone input methods were just special characters, and since they are characters, they should submit like any other. So why did the form get stuck? A quick search showed that these emoji take 4 bytes in UTF-16 (UTF-16 encodes a character in either 2 or 4 bytes), while our database uses a UTF-8 character set that allows at most 3 bytes per character. Characters that need 4 bytes in UTF-16 also need 4 bytes in UTF-8, so the form cannot be submitted whenever it contains one of these emoji.

Once the cause is known, the next step is a solution. I considered two options. One is to have the background convert the UTF-16 characters (not discussed here). The other is to have the front end convert them to entity characters before submitting. That way the background needs no changes and the user's input is preserved: the best of both worlds. The rest of this article covers how to convert emoji characters into entity characters.

First, let's see what emoji characters a phone's input method offers. A chart of them is at http://computerism.ru/emoji-smiles.htm. Each emoji character has a fairly large entity code; for the smiley in the first row, for example, the entity character is &#128522;. Note also the hexadecimal code D83DDE0A listed after it. What is that code for? Read on.



1. Character detection

To convert these emoji characters into entity characters, we first need to detect them, and character detection is a job for regular expressions. First we determine the range of the characters. As we know, emoji are encoded as 4-byte UTF-16, and 4-byte UTF-16 is exactly what the background does not accept, so the task becomes detecting every 4-byte UTF-16 character. A search shows that the 4-byte UTF-16 range is U+010000 to U+10FFFF, so can the regular expression be written as /[\u010000-\u10ffff]/g? No: this expression does not work as expected. Why?
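The failure is easy to see in a console. The sketch below uses variable names of my own choosing. Without the `u` flag, a `\u` escape consumes exactly four hex digits, so the class is read as `\u0100`, a literal `0`, the range `0` to `\u10ff`, and two literal `f`s, which is not the intended range at all.

```javascript
// Naive pattern from the text: intended to cover U+010000 to U+10FFFF.
var naive = /[\u010000-\u10ffff]/;

// U+1F60A, written as the two 16-bit units a JavaScript string stores.
var emoji = "\uD83D\uDE0A";

// The emoji is not matched...
var matchesEmoji = naive.test(emoji);  // false

// ...while ordinary text is, because the class now contains "0" to "\u10ff".
var matchesPlain = naive.test("abc");  // true

console.log(matchesEmoji, matchesPlain);
```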

Some readers may already know the answer: the culprit is JavaScript's own encoding. JavaScript uses Unicode, or more precisely UCS-2, and as the name suggests, that scheme works in 2-byte units. Finding 4-byte characters in a 2-byte encoding is clearly not that simple, so we need to work out how such a character is represented in UCS-2. For this I turned to an article by Ruan Yifeng, "A Detailed Explanation of Unicode and JavaScript" (http://www.ruanyifeng.com/blog/2014/12/unicode.html). In short, a 4-byte UTF-16 character is split into two 2-byte UCS-2 characters. The exact algorithm is in Ruan's article and is not covered in detail here. What matters is that in JS a 4-byte UTF-16 character appears as two characters: the high one in the range 0xD800-0xDBFF and the low one in the range 0xDC00-0xDFFF. That gives us the regular expression for detection: /[\ud800-\udbff][\udc00-\udfff]/g. Now look back at the hexadecimal strings in the first chart, D83DDE0A and D83DDE03: they suddenly make sense, don't they?
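A short sketch of what this looks like in practice, with illustrative variable names. It also shows, as an aside that is not part of the original approach, that engines supporting ES2015 accept the full range directly when the `u` flag and the `\u{...}` escape are used.

```javascript
var emoji = "\uD83D\uDE0A"; // U+1F60A

// One visible character, but two 16-bit units in the string.
var length = emoji.length;                    // 2
var high = emoji.charCodeAt(0).toString(16);  // "d83d"
var low = emoji.charCodeAt(1).toString(16);   // "de0a"

// Detection pattern from the text: a high unit followed by a low unit.
var surrogatePair = /[\ud800-\udbff][\udc00-\udfff]/g;
var found = "ok \uD83D\uDE0A \uD83D\uDE03".match(surrogatePair);

// ES2015 alternative (assumes a modern engine): match by code point.
var withUFlag = /[\u{10000}-\u{10FFFF}]/u.test(emoji); // true

console.log(length, high + low, found.length); // 2 d83dde0a 2
```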


2. The conversion algorithm

Now we can detect the emoji characters in a form. So how do we convert one to an entity character? An entity character encodes a single character, yet in JS our emoji is represented by two characters. What can we do? Wait: who says an emoji is two characters? It is really a single 4-byte character; the emoji was UTF-16 to begin with. Here I also drew on another page, http://unicode-table.com/cn/sets/emoji/, to illustrate.


Let's take a smiley face character as the example again. Its UTF-16 code is U+1F600; let's convert that to decimal.
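The conversion can be checked in any JavaScript console. A minimal sketch, with variable names chosen here for illustration:

```javascript
var codePoint = 0x1F600;

var asDecimal = codePoint;                  // 128512
var parsed = parseInt("1F600", 16);         // 128512
var backToHex = (128512).toString(16);      // "1f600"

// The decimal value is what goes inside a numeric entity.
var entity = "&#" + codePoint + ";";        // "&#128512;"

console.log(asDecimal, parsed, backToHex, entity);
```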


Isn't 128512 exactly our entity code, &#128512;? So the problem is now how to get the UTF-16 code of an emoji character. But as we have just seen, in JS an emoji is held in UCS-2 form, that is, as two characters. So our problem finally becomes how to convert from the UCS-2 form back to the UTF-16 code.

Thanks again to Ruan Yifeng: his article gives the formula for converting UTF-16 to UCS-2 (Unicode):

    H = Math.floor((c - 0x10000) / 0x400) + 0xD800 // high
    L = (c - 0x10000) % 0x400 + 0xDC00 // low

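As a sanity check, the two formulas can be wrapped in a small function and applied to the smiley from the first chart. The name `toSurrogates` is made up for this sketch.

```javascript
// Split a code point c (0x10000 to 0x10FFFF) into its two 16-bit units.
function toSurrogates(c) {
    var H = Math.floor((c - 0x10000) / 0x400) + 0xD800; // high
    var L = (c - 0x10000) % 0x400 + 0xDC00;             // low
    return [H, L];
}

var pair = toSurrogates(0x1F60A);
console.log(pair[0].toString(16), pair[1].toString(16)); // d83d de0a

// The two units form the same string the input method produces.
var same = String.fromCharCode(pair[0], pair[1]) === "\uD83D\uDE0A"; // true
```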
But this converts UTF-16 to UCS-2, and what we want is UCS-2 to UTF-16. What to do? Derive the inverse. Let's look at what the two formulas do. The high formula subtracts 0x10000 from the character c, divides by 0x400, takes the quotient, and adds 0xD800. The low formula subtracts 0x10000 from c, takes the remainder modulo 0x400, and adds 0xDC00. So the character is split into two parts, a quotient and a remainder; the processed quotient becomes the high character and the processed remainder the low one, and together they form the UCS-2 pair. Going the other way, if we recover the quotient and the remainder of c divided by 0x400, then multiplying the quotient by 0x400 and adding the remainder gives back c. To simplify the calculation, let q be the quotient of c divided by 0x400 and m the remainder; then we have:

    H = q - 0x10000 / 0x400 + 0xD800
    L = m - 0x10000 % 0x400 + 0xDC00
    c = q * 0x400 + m

    // because 0x10000 % 0x400 = 0, this gives:
    H = q - 0x10000 / 0x400 + 0xD800
    L = m + 0xDC00
    c = q * 0x400 + m

    // following the formula for c, multiply H by 0x400 and add L:
    H * 0x400 + L = q * 0x400 - 0x10000 / 0x400 * 0x400 + 0xD800 * 0x400 + m + 0xDC00

    // finally, replace q * 0x400 + m with c:
    H * 0x400 + L = c - 0x10000 + 0xD800 * 0x400 + 0xDC00

    // rearranging, the final formula is:
    c = (H - 0xD800) * 0x400 + 0x10000 + L - 0xDC00

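One way to gain confidence in the derived formula is to check that it undoes the forward formula. The sketch below does this for a few code points, including both ends of the range; `toSurrogates` and `fromSurrogates` are names invented for this check.

```javascript
// Forward direction: code point to high and low units.
function toSurrogates(c) {
    var H = Math.floor((c - 0x10000) / 0x400) + 0xD800;
    var L = (c - 0x10000) % 0x400 + 0xDC00;
    return [H, L];
}

// Derived inverse: high and low units back to the code point.
function fromSurrogates(H, L) {
    return (H - 0xD800) * 0x400 + 0x10000 + L - 0xDC00;
}

var smiley = fromSurrogates(0xD83D, 0xDE0A);
console.log(smiley); // 128522

// Round trip over sample code points, including the range boundaries.
var samples = [0x10000, 0x1F600, 0x1F60A, 0x10FFFF];
var allRoundTrip = samples.every(function (c) {
    var p = toSurrogates(c);
    return fromSurrogates(p[0], p[1]) === c;
});
console.log(allRoundTrip); // true
```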
With the formula in hand, I expect you already know what to do. Still, at the risk of embarrassing myself, here is a processing function I wrote, for your reference:

    /**
     * Converts UTF-16 encoded characters into entity characters for background storage.
     * @param  {string} str  the string to convert; UTF-16 characters in it are detected automatically
     * @return {string}      the converted string, with UTF-16 characters replaced by entities of the form &#xxxx;
     */
    function utf16toentities(str) {
        var patt = /[\ud800-\udbff][\udc00-\udfff]/g; // regular expression that detects UTF-16 characters
        str = str.replace(patt, function (char) {
            var H, L, code;
            if (char.length === 2) {
                H = char.charCodeAt(0); // take the high character
                L = char.charCodeAt(1); // take the low character
                code = (H - 0xD800) * 0x400 + 0x10000 + L - 0xDC00; // conversion formula
                return "&#" + code + ";";
            } else {
                return char;
            }
        });
        return str;
    }

The results of the operation are as follows:
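For instance, a sample string converts as shown below. To keep the example self-contained it uses a compact variant, `toEntities` (a name made up here), built on `String.prototype.codePointAt` from ES2015, which yields the same number as the formula derived above. It is a sketch under that assumption and not the function from this article.

```javascript
// Hypothetical compact variant: codePointAt(0) reads a pair as one code point.
function toEntities(str) {
    return str.replace(/[\ud800-\udbff][\udc00-\udfff]/g, function (pair) {
        return "&#" + pair.codePointAt(0) + ";";
    });
}

var input = "Nice \uD83D\uDE0A\uD83D\uDE03!"; // U+1F60A and U+1F603
var output = toEntities(input);

console.log(output); // Nice &#128522;&#128515;!
```

Text without any 4-byte characters passes through unchanged, so the function can be applied to every form field before submitting.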


Careful readers may have noticed, while looking through the referenced articles, that not all emoji characters need 4 bytes in UTF-16; some fall within the UCS-2 range (that is, they take only two bytes). But that is beside the point. The point is that we have successfully converted the 4-byte UTF-16 emoji into entity characters.
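The difference is easy to observe. In the sketch below, U+263A is used as an example of an emoji-style character inside the 2-byte range; the variable names are illustrative. The detection pattern leaves such characters alone, and since they need at most 3 bytes in UTF-8 they should not trigger the storage problem described at the start.

```javascript
var pattern = /[\ud800-\udbff][\udc00-\udfff]/;

var bmpSmiley = "\u263A";           // U+263A, inside the 2-byte range
var astralSmiley = "\uD83D\uDE0A";  // U+1F60A, needs two units

var bmpMatched = pattern.test(bmpSmiley);        // false
var astralMatched = pattern.test(astralSmiley);  // true

console.log(bmpSmiley.length, bmpMatched);       // 1 false
console.log(astralSmiley.length, astralMatched); // 2 true
```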

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
