How can I escape emoji so that it can be stored in the utf8 database?

Source: Internet
Author: User
An emoji encoded in UTF-8 is 4 bytes long and cannot be stored in a MySQL utf8 column, which holds at most 3 bytes per character. I found an escape library (code.iamcal.com/php/emoji), but after converting to its Unicode format the emoji is still 4 bytes and still cannot be saved; in effect nothing was converted. I am also wary of converting emoji into some other format, because newly added emoji would then be hard to handle. How do you solve this without changing the database encoding?

Method 1: base64_encode

This works, but the existing rows were never passed through the encode step. If decode is applied to everything that is read back, the old data will be corrupted.
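One possible way around the old-data problem is to tag every encoded value so that the reader can tell encoded rows from legacy rows. The sketch below assumes a marker prefix; the prefix "b64:" and the function names are made up for illustration and are not from the thread.

```php
<?php
// Hypothetical marker; any prefix that legacy rows are known not to start with would do.
const B64_MARK = 'b64:';

/** Encode text for storage and tag it so it can be recognised on read. */
function b64Pack(string $text): string
{
    return B64_MARK . base64_encode($text);
}

/** Decode only values that carry the marker; legacy rows pass through unchanged. */
function b64Unpack(string $stored): string
{
    if (strncmp($stored, B64_MARK, strlen(B64_MARK)) !== 0) {
        return $stored; // legacy row that was never encoded
    }
    $decoded = base64_decode(substr($stored, strlen(B64_MARK)), true);
    return $decoded === false ? $stored : $decoded;
}
```

A legacy value that happens to start with the marker would still be misread, so the marker has to be chosen with the existing data in mind.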

Method 2: urlencode

This seems workable. Decoding data that was never encoded appears to leave it unchanged, and decoding the same data more than once appears to do no harm either. Does this method have any drawbacks?

======================================
I noticed that when fetching a user's basic information, print_r showed the emoji as \ud83d\ude02, but when I tried to store it, MySQL reported an error saying that the value \xF0\x9F\x98\x82 cannot be stored. What is going on? Is the text being transcoded automatically somewhere, and if so, into what?
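For what it is worth, the two notations describe the same character. \ud83d\ude02 is the JSON (UTF-16 surrogate pair) spelling of U+1F602, and \xF0\x9F\x98\x82 is its 4-byte UTF-8 encoding, which is what MySQL actually receives. A small sketch to check this (describeEmoji is a made-up helper name):

```php
<?php
/** Decode a JSON string literal and report how the result looks as raw UTF-8. */
function describeEmoji(string $jsonLiteral): array
{
    $text = json_decode($jsonLiteral);
    return [
        'hex'   => bin2hex($text),      // UTF-8 bytes as hexadecimal
        'bytes' => strlen($text),       // strlen counts bytes, not characters
        'json'  => json_encode($text),  // back to the \uXXXX spelling
    ];
}

// Single quotes keep the backslashes literal, so json_decode sees \ud83d\ude02.
print_r(describeEmoji('"\ud83d\ude02"'));
```

The 'bytes' value is 4, which is one more than a MySQL utf8 column can hold per character.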

======================================

Method 3: This is the method I finally adopted, because I think it has the following advantages:

1. It only converts emoji and leaves Chinese characters alone, so the stored data stays directly readable.
What gets stored in the database looks like \ud83d\udca5, which can be copied and pasted freely.

2. There is a single, simple, fixed conversion algorithm that does not map emoji onto some other standard. No emoji lookup table is needed, so anyone who uses this data later can easily tell which emoji each sequence stands for. Even if Apple adds new emoji, no changes are required.

3. Decoding can be applied any number of times and still yields the correct content. Sometimes decode runs in two places within one request; with other encodings, decoding already-correct data a second time turns it into something else.

Disadvantages:
1. As the code below shows, this forcibly rewrites every escape sequence within a fixed range of code units. It may therefore catch characters that are not emoji (false positives), and emoji outside that range may be missed. However, the only change is a backslash added before the sequence, so even a false positive is easy to revert once it is discovered.
In the database I did find one such false positive, though I do not know why it matched. That character could have been stored directly.

    /** Escape user-entered text (mainly special symbols and emoji). */
    function userTextEncode($str) {
        if (!is_string($str)) return $str;
        if ($str === '' || $str === 'undefined') return '';

        $text = json_encode($str); // expose non-ASCII characters as \uXXXX escapes
        // Add a backslash before every \uD... or \uE... escape so that json_decode
        // leaves it as literal text; everything else is untouched. Compared with
        // the original answer the regex also matches "d": many of my emoji start
        // with \ud, and I have not yet seen one that starts with \ue.
        $text = preg_replace_callback('/(\\\\u[ed][0-9a-f]{3})/i', function ($m) {
            return addslashes($m[0]);
        }, $text);
        return json_decode($text);
    }

    /** Decode text escaped by userTextEncode(). */
    function userTextDecode($str) {
        $text = json_encode($str); // expose non-ASCII characters as \uXXXX escapes
        // Collapse every doubled backslash into one so the \uXXXX escapes become
        // active again; nothing else is changed. Caveat: text that contains a
        // literal backslash is rewritten too and may then fail to decode.
        $text = preg_replace_callback('/\\\\\\\\/i', function ($m) {
            return '\\';
        }, $text);
        return json_decode($text);
    }
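The [ed] range in the regex above also matches characters that are not emoji, for example Hangul syllables from U+D000 upward and the private-use block U+E000 to U+EFFF, which may explain the false positive. Below is a sketch of a narrower variant that only touches characters outside the Basic Multilingual Plane and only decodes complete surrogate pairs. It is not the code from the question, and the names emojiEscape and emojiUnescape are made up.

```php
<?php
/** Replace each character above U+FFFF with its \uXXXX\uXXXX surrogate-pair text. */
function emojiEscape(string $text): string
{
    return preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]/u',
        function (array $m): string {
            // json_encode returns e.g. "\ud83d\ude02" including quotes; drop the quotes.
            return trim(json_encode($m[0]), '"');
        },
        $text
    ) ?? $text; // fall back to the input if it is not valid UTF-8
}

/** Turn complete \uD8xx\uDCxx-style pairs back into the characters they stand for. */
function emojiUnescape(string $text): string
{
    return preg_replace_callback(
        '/\\\\ud[89ab][0-9a-f]{2}\\\\ud[c-f][0-9a-f]{2}/i',
        function (array $m): string {
            return json_decode('"' . $m[0] . '"');
        },
        $text
    ) ?? $text;
}
```

Literal backslashes and Chinese text pass through both functions unchanged, and running emojiUnescape twice gives the same result as running it once. The remaining false positive is a user who literally types a sequence such as \ud83d\ude02; that text would be shown as an emoji on read.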

Reply content:

This is how I do it.


  

Here is the standard solution:

  1. The MySQL version must be 5.5.3 or later.

  2. Change the database encoding to utf8mb4 (UTF-8 Unicode).

  3. For the columns that store emoji, choose the collation utf8mb4_general_ci.

  4. Change the database connection character set to utf8mb4.

After these settings, the character set variables should look similar to the output below, and emoji can be stored in the database directly without any extra processing.

    mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
    +--------------------------+--------------------+
    | Variable_name            | Value              |
    +--------------------------+--------------------+
    | character_set_client     | utf8mb4            |
    | character_set_connection | utf8mb4            |
    | character_set_database   | utf8mb4            |
    | character_set_filesystem | binary             |
    | character_set_results    | utf8mb4            |
    | character_set_server     | utf8mb4            |
    | character_set_system     | utf8               |
    | collation_connection     | utf8mb4_unicode_ci |
    | collation_database       | utf8mb4_unicode_ci |
    | collation_server         | utf8mb4_unicode_ci |
    +--------------------------+--------------------+
    rows in set (0.00 sec)
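From PHP, the steps above could look roughly like the sketch below. The helper names, the host, the database name and the credentials are all placeholders, and the collation simply follows the output shown above. The helpers only build strings, so they can be inspected without a server.

```php
<?php
/** Build a PDO DSN that asks the MySQL driver to use utf8mb4 on the connection. */
function utf8mb4Dsn(string $host, string $database): string
{
    return "mysql:host={$host};dbname={$database};charset=utf8mb4";
}

/** Build the statements that switch one database and one table to utf8mb4. */
function utf8mb4Statements(string $database, string $table): array
{
    foreach ([$database, $table] as $name) {
        // Identifiers cannot be bound as parameters, so accept only simple names.
        if (!preg_match('/^[A-Za-z0-9_]+$/', $name)) {
            throw new InvalidArgumentException("Unsafe identifier: {$name}");
        }
    }
    return [
        "ALTER DATABASE `{$database}` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci",
        "ALTER TABLE `{$table}` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci",
    ];
}

// Usage against a real server (needs pdo_mysql and real credentials, so commented out):
// $pdo = new PDO(utf8mb4Dsn('localhost', 'app'), 'user', 'secret');
// foreach (utf8mb4Statements('app', 'user') as $sql) {
//     $pdo->exec($sql);
// }
```

Converting a large table rewrites it, so it is probably worth trying on a copy of the data first.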

I ran into this problem while developing a public platform, where a user's nickname can contain emoji. I converted the whole nickname into a hex string and stored that in MySQL. The system now has over a million users and is stable. You can consider this approach as well.

MySQL supports the HEX() and UNHEX() functions. In Java you can use the org.apache.commons.codec.binary.Hex utility class. Other languages have equivalents.
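In PHP the equivalents are bin2hex() and hex2bin(). A minimal sketch, with made-up wrapper names:

```php
<?php
/** Turn arbitrary text into a hex string that fits in any ASCII-safe column. */
function hexPack(string $text): string
{
    return bin2hex($text);
}

/** Reverse hexPack(); also accepts the uppercase digits that MySQL HEX() produces. */
function hexUnpack(string $hex): string
{
    if (!preg_match('/^(?:[0-9a-fA-F]{2})*$/', $hex)) {
        throw new InvalidArgumentException('Not a hex string');
    }
    return hex2bin($hex);
}

echo hexPack("\u{1F602}"), "\n"; // f09f9882
```

The trade-offs are that the stored value is twice as long as the original bytes and is no longer readable or searchable as text.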

How about doing what Weibo or QQ do? Map each emoticon to a simple code, for example [wx] or /wx for a smile. However, with codes that short there may not be enough of them to cover every emoticon...

urldecode
I looked at the decode source code and it should not be a problem.

As long as the string contains no %, the decoded result is guaranteed to be the original string. If it does contain %, there are two cases: either decoding succeeds, in which case the result is definitely not the original string, or decoding fails and throws an exception (this exception can in fact be used as a sign that the string was never encoded).

Decoding is strict. A user name may contain %, but the probability that it then decodes successfully is fairly small. Those few rows can be fixed in the database by hand.
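In PHP specifically, rawurldecode() does not throw; it leaves a malformed % sequence as it is. The sketch below (wrapper names are made up) shows the cases described above, including the one where an old, never-encoded value is silently changed.

```php
<?php
/** Percent-encode text before storing it in a 3-byte utf8 column. */
function urlPack(string $text): string
{
    return rawurlencode($text);
}

/** Decode on read; malformed %-sequences are left untouched by rawurldecode. */
function urlUnpack(string $stored): string
{
    return rawurldecode($stored);
}

$stored = urlPack("hi \u{1F602}");      // hi%20%F0%9F%98%82

// Legacy rows that were never encoded:
$plain  = urlUnpack('plain name');      // unchanged, no % present
$lonely = urlUnpack('100%');            // unchanged, % is not followed by two hex digits
$broken = urlUnpack('100%41');          // becomes "100A": a silent false decode

// Decoding twice is only safe when the original text contained no %:
$once  = urlUnpack(urlPack('100%41'));  // "100%41" (correct)
$twice = urlUnpack($once);              // "100A"   (corrupted)
```

Note also that percent-encoding rewrites every non-ASCII character, Chinese text included, so the stored value is not readable in the database.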

You can try this function. I also ran into emoji when fetching a custom menu, and at the time I saw that this function was used to encode them.

    /** Return the UTF-8 byte string for the Unicode code point $cp. */
    function utf8_bytes($cp) {
        if ($cp >= 0x10000) {
            # 4 bytes
            return chr(0xF0 | (($cp & 0x1C0000) >> 18)) .
                   chr(0x80 | (($cp & 0x3F000) >> 12)) .
                   chr(0x80 | (($cp & 0xFC0) >> 6)) .
                   chr(0x80 | ($cp & 0x3F));
        } else if ($cp >= 0x800) {
            # 3 bytes
            return chr(0xE0 | (($cp & 0xF000) >> 12)) .
                   chr(0x80 | (($cp & 0xFC0) >> 6)) .
                   chr(0x80 | ($cp & 0x3F));
        } else if ($cp >= 0x80) {
            # 2 bytes
            return chr(0xC0 | (($cp & 0x7C0) >> 6)) .
                   chr(0x80 | ($cp & 0x3F));
        } else {
            # 1 byte
            return chr($cp);
        }
    }
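utf8_bytes() expects a code point, while the values seen earlier in this thread (\ud83d\ude02) are UTF-16 surrogate pairs. A sketch of the conversion between the two, using a made-up helper name:

```php
<?php
/** Combine a UTF-16 surrogate pair into a single Unicode code point. */
function surrogatePairToCodePoint(int $high, int $low): int
{
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException('Not a valid surrogate pair');
    }
    // Each half contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", surrogatePairToCodePoint(0xD83D, 0xDE02)); // U+1F602
```

Passing that code point to utf8_bytes() should then give the 4-byte sequence \xF0\x9F\x98\x82.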

  1. Use the BLOB type.

  2. Change the database encoding to utf8mb4.

I have just solved this problem (the backend is implemented in Java and the database is MySQL); here is what I did, for reference.
1. Change the character set of the column that stores emoji, for example the username column:

    ALTER TABLE user CHANGE username username VARCHAR(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci default null;

2. Before performing insert and update operations from Java, first execute the SQL statement "set names utf8mb4".

https://github.com/iamcal/php-emoji
This is what I use to handle it.

http://www.emoji-cheat-sheet.com/

MySQL has a 4-byte UTF-8 character set called utf8mb4.

Your MySQL version must be 5.5.3 or later.

No escaping is needed. Convert the database directly to utf8mb4; that is what I did.

You do not need to change the whole database. Create just that one table as utf8mb4: create table xxx () charset=utf8mb4

My project needs to limit the length of text, which involves counting characters, so I wrote a complete implementation of the emoji handling. If you need the code, let me know and I will share it.
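That code was not posted, so the following is only a guess at the simplest part of it: counting characters rather than bytes. The function name is made up, and it counts Unicode code points, so an emoji built from several code points (a skin-tone variant, for instance) counts as more than one.

```php
<?php
/** Count Unicode code points in a UTF-8 string (not bytes). */
function countCodePoints(string $text): int
{
    // With the /u flag "." matches one code point; /s lets it match newlines too.
    $count = preg_match_all('/./us', $text);
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $count;
}

$nick = "a\u{1F602}b";
echo strlen($nick), " bytes, ", countCodePoints($nick), " characters\n";
```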
