A Thorough Look at Emoji Handling Approaches

Source: Internet
Author: User
Tags: php, mysql

Emoji information

Today I studied emoji. It is an interesting topic with a lot of material, so I have summarized some of it here to share with everyone and to keep a record of my own learning.

Emoji Introduction

Emoji (絵文字, from the Japanese えもじ, e-moji; moji means "character" in Japanese) is a set of 12x12 pixel emoticons that originated in Japan, created by Shigetaka Kurita (栗田穣崇). It first became popular among Japanese internet and mobile phone users. After emoji was added to the input methods of Apple's iOS 5, it swept across the globe. It has since been adopted by most modern computer systems that support Unicode and is widely used in mobile messaging and on social networks. Recently, many netizens have been using emoji to play word-guessing games and enjoying the fun this emoticon culture brings.

On pronunciation: at first sight many people instinctively read emoji as "ee-mo-ji". In fact the transliteration is closer to "eh-mo-ji", where the "e" sounds much like the name of the letter A in the alphabet.

Originally, Japan's three major telecom operators, DoCoMo, KDDI and SoftBank, each defined their own emoji characters. Because iOS (before version 5) had SoftBank's version built in, that version became popular worldwide. Google also defined its own set of emoji characters. From iOS 5 onward, Apple adopted the emoji characters defined by Unicode.

A Unicode-defined emoji takes four bytes in UTF-8, while a SoftBank one takes three. From storage to display, most systems never planned for four-byte characters, which makes handling them a real disaster.

The problem:

Inserting an emoji and saving it to the database raises an error:

SQLException: Incorrect string value: '\xF0\x9F\x98\x84' for column 'review' at row 1

A UTF-8 character can occupy one to four bytes. An emoji such as the one above takes 4 bytes, while MySQL's utf8 character set stores at most 3 bytes per character, so the data cannot be inserted.
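The byte counts behind this error can be checked directly. The following is a minimal sketch, not part of the original article; the class and method names (Utf8Lengths, utf8Length, hasFourByteCodePoint) are made up for illustration. It measures the UTF-8 length of a few strings and detects code points that would not fit in a 3-byte-per-character column.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the UTF-8 encoding of the given text occupies.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the text contains a code point above U+FFFF;
    // such code points need 4 bytes in UTF-8.
    static boolean hasFourByteCodePoint(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        // U+1F604 is the emoji from the error message (bytes F0 9F 98 84).
        String smile = new String(Character.toChars(0x1F604));
        System.out.println(utf8Length("A"));             // ASCII letter
        System.out.println(utf8Length("\u4E2D"));        // a common Chinese character
        System.out.println(utf8Length(smile));           // the emoji
        System.out.println(hasFourByteCodePoint(smile)); // whether a 3-byte column would reject it
    }
}
```

A check like hasFourByteCodePoint could be run before saving, to decide whether a string needs filtering or escaping at all.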

Solution: Filtering

Filter the emoji out directly. This is simple, convenient and effective. A few emoji characters are lost, but that is far better than losing the entire record.

public static String removeNonBmpUnicode(String str) {
    if (str == null) {
        return null;
    }
    // Remove every code point outside the Basic Multilingual Plane (U+0000 to U+FFFF).
    return str.replaceAll("[^\\u0000-\\uFFFF]", "");
}

This approach works around the problem and also makes the program more robust, but it is poor for the user experience because the user's emoji are lost. See the following solution.

Solution: Convert the MySQL character set from utf8 to utf8mb4

Starting with version 5.5.3, MySQL supports the utf8mb4 character set, which can store 4-byte UTF-8 characters. utf8mb4 is fully backward compatible with utf8 strings. In terms of storage, an ordinary Chinese character still occupies 3 bytes, and a Unicode emoji occupies 4 bytes. As a result, input and output are no longer garbled.

To use this feature, you first need to upgrade MySQL to version 5.5.3 or later.

Second, you need to change the character set in the table definitions to utf8mb4, with a collation such as utf8mb4_general_ci. Since utf8mb4 is a superset of utf8, converting from utf8 to utf8mb4 should not cause problems and can be done directly. If you are converting from another character set such as gb2312 or gbk, you must back up the database first.

Then modify the MySQL configuration file /etc/my.cnf so that the default connection character set is utf8mb4. If the data is written by a PHP script, you can instead execute the SQL statement SET NAMES utf8mb4; right after connecting to the database. At this point PHP should be able to save emoji to the database correctly.
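The schema change and the connection statement described above can be sketched as SQL strings. This is only an illustration under assumptions: the class name Utf8mb4Statements and the table name reviews are hypothetical, and the collation is the utf8mb4_general_ci mentioned in the text. Some client drivers manage the connection character set through their own settings, in which case that setting would be used instead of SET NAMES.

```java
import java.util.Arrays;
import java.util.List;

class Utf8mb4Statements {
    // Statement the article suggests running right after connecting.
    static final String SET_NAMES = "SET NAMES utf8mb4";

    // Builds the statement that converts an existing table to utf8mb4.
    // The table name is restricted to simple identifiers so it can be
    // embedded in the SQL text safely.
    static String convertTable(String table) {
        if (!table.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("unexpected table name: " + table);
        }
        return "ALTER TABLE `" + table
                + "` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci";
    }

    public static void main(String[] args) {
        // "reviews" is a hypothetical table name used only for this example.
        List<String> statements = Arrays.asList(convertTable("reviews"), SET_NAMES);
        statements.forEach(System.out::println);
    }
}
```

As the text notes, take a backup before running a conversion like this on data that is not already utf8.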

Problems this approach can bring:

Storage: for variable-length column types (such as VARCHAR and TEXT), a utf8mb4 column can hold fewer characters at most than a utf8 column, and an index on such a column can cover fewer characters. InnoDB, for example, limits an index key to 767 bytes. utf8mb4 reserves 4 bytes per indexed character while utf8 reserves 3, so the former can index 191 characters and the latter 255.
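The 191 and 255 figures follow from integer division of the 767-byte limit by the bytes reserved per character. A small sketch of that arithmetic, with a hypothetical class name:

```java
class IndexPrefixLimit {
    // InnoDB index key limit quoted in the text.
    static final int MAX_KEY_BYTES = 767;

    // Largest number of characters that fit in one index key when each
    // character reserves the given number of bytes.
    static int maxIndexableChars(int bytesPerChar) {
        return MAX_KEY_BYTES / bytesPerChar;
    }

    public static void main(String[] args) {
        System.out.println("utf8mb4: " + maxIndexableChars(4)); // 767 / 4
        System.out.println("utf8:    " + maxIndexableChars(3)); // 767 / 3
    }
}
```

In practice this is why an indexed VARCHAR(255) column may need to be shortened to VARCHAR(191), or indexed by prefix, when it is converted to utf8mb4.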

Performance: for the reasons above, plus the larger character set, utf8mb4 may perform worse than the utf8 collations. See the test results on Stack Overflow: http://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci. The difference is not particularly large.

Operations: in a large environment where the other databases use utf8, setting one database to utf8mb4 may cause problems in later operations and handovers, leaving a trap for whoever comes next.

Upstream and downstream: even if the database can store Unicode emoji, the systems upstream and downstream of it may not. For example, the MySQL client driver may not support utf8mb4 (older JDBC versions do not), DDL middleware may not support it, and the web front end must be able to display utf8mb4 characters. Any of these can determine whether an emoji survives.

In summary, if storage, performance and operations are not major concerns and the upstream and downstream problems can be solved, the database is fully capable of supporting emoji. One new problem remains unresolved, however: emoji display correctly on iOS, but how should they be displayed on Android devices?

Solution: Escaping

1: Convert Unicode emoji to SoftBank emoji.

We know that a Unicode emoji takes 4 bytes, while a SoftBank-defined emoji takes 3 bytes of storage. With emoji for PHP (http://code.iamcal.com/php/emoji/) we can convert Unicode emoji to the SoftBank form so that they can be stored in the existing database. Compared with solving the problem at the database level, the change is much smaller, and there are no performance or operations issues. There is one unavoidable drawback: the SoftBank set is no longer maintained, so newly added emoji have no SoftBank equivalent and some emoji will be lost.

2: UBB

UBB code is a variant of HTML (an application of the Standard Generalized Markup Language) and is the special tag syntax used by Ultimate Bulletin Board (a BBS program from abroad). You may already be familiar with it. UBB code is very simple and has few features, but because its tag syntax is easy to check and implement, many websites have adopted it so that users can easily display images, links, bold text and other common formatting.

Take the emoji sun symbol as an example: its Unicode code point is U+2600. When storing it in the database, you can convert it to the UBB code [emoji]2600[/emoji], and convert it back when reading. For other devices, such as Android, it can instead be converted to an emoji symbol that Android can handle.
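The round trip between an emoji and its UBB form can be sketched as follows. This is an illustration, not the article's implementation: the class name EmojiUbbCodec is hypothetical, and the rule for what counts as an emoji (any code point above U+FFFF, plus the range U+2600 to U+27BF that contains the sun symbol) is a simplifying assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiUbbCodec {
    // Matches tags such as [emoji]2600[/emoji] and captures the hex code point.
    private static final Pattern TAG =
            Pattern.compile("\\[emoji\\]([0-9a-fA-F]{1,6})\\[/emoji\\]");

    // Assumption for this sketch: non-BMP code points and U+2600..U+27BF are emoji.
    static boolean isEmoji(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint)
                || (codePoint >= 0x2600 && codePoint <= 0x27BF);
    }

    // Replaces each emoji with a UBB tag that holds its code point in hex.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isEmoji(cp)) {
                out.append("[emoji]").append(Integer.toHexString(cp)).append("[/emoji]");
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Turns UBB tags back into characters; tags with an invalid code point are left as-is.
    static String decode(String text) {
        Matcher m = TAG.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1), 16);
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The encoded text contains only ASCII in place of each emoji, so it fits in a utf8 column unchanged. A device-specific decode step could map the stored code point to whatever the target platform can display.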

This conversion is a good solution to the problem of displaying emoji on both iOS and Android, but a few problems remain.

1: Android and iOS emoji are not identical. The same code may be a sun on iOS and a cloud on Android. The best way to solve this is to maintain an emoji mapping between iOS and Android. On the web, the conversion can also be done in JavaScript.

2: Performance. Escaping will certainly reduce performance, but the cost is tolerable.

The counterpart of UBB is HTML escaping, which is in fact quite similar: it uses an HTML character reference such as &#x2600;. The result and the performance are about the same as with UBB, but in terms of conventions UBB is the better choice.
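For comparison, the HTML variant can be sketched the same way. The class name EmojiHtmlEscape is hypothetical, and the emoji test (code points above U+FFFF, plus U+2600 to U+27BF) is the same simplifying assumption rather than a complete definition of emoji.

```java
import java.util.Locale;

class EmojiHtmlEscape {
    // Replaces each emoji with a hexadecimal HTML character reference.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            boolean emoji = Character.isSupplementaryCodePoint(cp)
                    || (cp >= 0x2600 && cp <= 0x27BF); // assumed emoji range for this sketch
            if (emoji) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

A browser renders the reference back into the character without any extra decoding step, whereas the UBB form needs an explicit conversion before display.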


Resources

php-emoji conversion table: http://code.iamcal.com/php/emoji/

Unicode Emoji Symbols: http://www.unicode.org/~scherer/emoji4unicode/20091221/utc.html

Emoji icons and their Unicode code points: http://www.easyapns.com/iphone-emoji-alerts

Filtering non-BMP characters from user input (our company): http://www.atatech.org/articles/15677

The pain of Unicode characters outside UCS-2 (BMP) (our company): http://www.atatech.org/articles/13587

Emoji character compatibility handling (our company): http://www.atatech.org/articles/27568

A discussion of Unicode encoding, briefly explaining UCS, UTF, BMP, BOM and other terms: http://www.fmddlmyy.cn/text6.html

Emoji online conversion tool: http://unicodey.com/js-emoji/demo.htm

Emoji communication between iOS and PHP, and storage in MySQL: http://blog.csdn.net/wildfireli/article/details/9370161

The difference between the collations utf8_unicode_ci and utf8_general_ci in MySQL: http://hi.baidu.com/PHPKOO/ITEM/38238BD8505899E955347FCA, http://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci

The difference between the MySQL collations utf8_unicode_ci and utf8_general_ci: http://justdo2008.iteye.com/blog/2162842

Unicode Character Sets: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

Setting the utf8mb4 encoding in MySQL: http://www.linuxidc.com/Linux/2014-07/104231.htm

Solution for emoji support on Android: http://blog.csdn.net/waylife/article/details/11095113

Supporting New Emojis on iOS 6: http://blog.manbolo.com/2012/10/29/supporting-new-emojis-on-ios-6

Making MySQL support emoji (how to store 4-byte UTF-8 characters in MySQL): http://www.w2bc.com/Article/8533

How to handle 4-byte Unicode characters such as emoji: http://zhidao.baidu.com/link?url=z6PW1ya6plRBgFN7M2zdVLXUnmxYcH2_ Vyk8nw9yi9-kh2estgmjomw1lssmsa853wyhsrtulkjn2okq0a3taudqhiime7b0vs-fegmnyuu
