Handling special characters (including emoji)

Source: Internet
Author: User

Background knowledge
    • Emoji were created in the 1990s by Shigetaka Kurita at NTT DoCoMo. The word comes from Japanese (えもじ, e-moji; moji means "character"). Emoji bring some of the expressiveness of face-to-face communication to digital messages and help avoid misunderstandings.
    • After emoji were added to the input methods of Apple's iOS 5, they spread around the world. They are now supported by most modern Unicode-compatible computer systems and are widely used in mobile text messages and on social networks.
    • The emoji discussed here are characters in Unicode ranges such as U+1F601 to U+1F64F. These code points lie beyond U+0000 to U+FFFF, the range that a single 16-bit code unit (or a UTF-8 sequence of at most three bytes, as in MySQL's utf8 character set) can represent.

Knowledge points
    • In Java, a char is a 16-bit UTF-16 code unit and covers only \u0000-\uffff, so an emoji is written as two chars (a surrogate pair), for example 🐴 (U+1F434) = "\ud83d\udc34"
    • According to the Symbola table, the target ranges are roughly:
      • 1F300-1F3FF = "\ud83c\udf00"-"\ud83c\udfff"
      • 1F400-1F4FF = "\ud83d\udc00"-"\ud83d\udcff"
      • 1F500-1F5FF = "\ud83d\udd00"-"\ud83d\uddff"
      • 1F600-1F6FF = "\ud83d\ude00"-"\ud83d\udeff"
      • 1F700-1F7FF = "\ud83d\udf00"-"\ud83d\udfff"
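The surrogate pairs listed above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper named toJavaEscape (not part of the original article) built on the standard Character.toChars method:

```java
class SurrogateRanges {
    /** Returns the Java escape sequence for the UTF-16 form of a code point. */
    static String toJavaEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars yields one char for BMP code points and a
        // surrogate pair (two chars) for supplementary ones.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The range endpoints from the list above.
        int[][] ranges = {
            {0x1F300, 0x1F3FF}, {0x1F400, 0x1F4FF}, {0x1F500, 0x1F5FF},
            {0x1F600, 0x1F6FF}, {0x1F700, 0x1F7FF}
        };
        for (int[] r : ranges) {
            System.out.printf("%X-%X = \"%s\"-\"%s\"%n",
                    r[0], r[1], toJavaEscape(r[0]), toJavaEscape(r[1]));
        }
    }
}
```

Every code point in these five ranges has a high surrogate of either D83C or D83D, which is what the filtering approach later in this article relies on.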
Coding knowledge
Code point   UTF-8         UTF-16 LE     Surrogates
1F7FF        F0 9F 9F BF   3D D8 FF DF   D83D DFFF
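The byte sequences for U+1F7FF can be reproduced with the standard String.getBytes method. This is only an illustrative sketch; the class and the helper names hex and encodeHex are assumptions made for this example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes a single code point in the given charset and returns the hex bytes. */
    static String encodeHex(int codePoint, Charset charset) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(charset));
    }

    public static void main(String[] args) {
        int cp = 0x1F7FF;
        System.out.println("UTF-8:     " + encodeHex(cp, StandardCharsets.UTF_8));    // F0 9F 9F BF
        System.out.println("UTF-16 LE: " + encodeHex(cp, StandardCharsets.UTF_16LE)); // 3D D8 FF DF
        System.out.println("UTF-16 BE: " + encodeHex(cp, StandardCharsets.UTF_16BE)); // D8 3D DF FF
    }
}
```

StandardCharsets.UTF_16LE and UTF_16BE are used instead of UTF_16 because they do not prepend a byte order mark to the output.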
UTF-16 description

Unicode's code space runs from U+0000 to U+10FFFF; excluding the surrogate range, 1,112,064 code points are available for mapping characters. The code space is divided into 17 planes, each containing 2^16 (65,536) code points. The code points of the 17 planes can be written as U+xx0000 to U+xxFFFF, where xx is a hexadecimal value from 0x00 to 0x10. The first plane is called the Basic Multilingual Plane (BMP), or Plane 0. The other planes are called supplementary planes. Within the BMP, the code points from U+D800 to U+DFFF are permanently reserved and are not mapped to any Unicode character. UTF-16 uses this reserved 0xD800-0xDFFF range to encode the code points of characters in the supplementary planes.
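The plane arithmetic described above can be sketched in a few lines of Java. The class and method names here are hypothetical and exist only for illustration; the constants come from java.lang.Character:

```java
class UnicodePlanes {
    /** Returns the plane number (0 to 16) of a code point. */
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        // Each plane holds 2^16 code points, so the plane is the value above bit 15.
        return codePoint >>> 16;
    }

    /** True for the reserved range U+D800 to U+DFFF. */
    static boolean isSurrogateCodePoint(int codePoint) {
        return codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE;
    }

    /** Number of code points left once the surrogate range is excluded. */
    static int scalarValueCount() {
        int total = Character.MAX_CODE_POINT + 1;                               // 0x110000
        int surrogates = Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1; // 2048
        return total - surrogates;
    }

    public static void main(String[] args) {
        System.out.println("planes: " + (plane(Character.MAX_CODE_POINT) + 1));
        System.out.println("usable code points: " + scalarValueCount());
        System.out.println("plane of U+1F601: " + plane(0x1F601));
    }
}
```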

UTF-16 decoding

Lead \ Trail   DC00     DC01     ...   DFFF
D800           10000    10001    ...   103FF
D801           10400    10401    ...   107FF
...            ...      ...      ...   ...
DBFF           10FC00   10FC01   ...   10FFFF
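Each cell of the table above follows from one formula: code point = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00). Below is a minimal sketch of that calculation; the class and method names are assumptions for illustration, and the result is compared with the standard Character.toCodePoint:

```java
class SurrogateDecode {
    /** Combines a lead (high) and trail (low) surrogate into a code point. */
    static int decode(char lead, char trail) {
        if (!Character.isHighSurrogate(lead) || !Character.isLowSurrogate(trail)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        // The lead surrogate carries the upper 10 bits, the trail the lower 10 bits.
        return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
    }

    public static void main(String[] args) {
        char[][] pairs = {
            {0xD800, 0xDC00}, {0xD801, 0xDC01}, {0xDBFF, 0xDC00}, {0xDBFF, 0xDFFF}
        };
        for (char[] p : pairs) {
            int manual = decode(p[0], p[1]);
            int library = Character.toCodePoint(p[0], p[1]);
            System.out.printf("%04X %04X -> %X (library: %X)%n",
                    (int) p[0], (int) p[1], manual, library);
        }
    }
}
```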
Example: encoding U+10437:

    • Subtract 0x10000 from 0x10437. The result is 0x00437, which in binary is 0000 0000 0100 0011 0111.
    • Split this into its upper 10 bits and its lower 10 bits: 0000000001 and 0000110111.
    • Add 0xD800 to the upper value to form the high (lead) surrogate: 0xD800 + 0x0001 = 0xD801.
    • Add 0xDC00 to the lower value to form the low (trail) surrogate: 0xDC00 + 0x0037 = 0xDC37.
    • The following table summarizes this conversion along with a few others.
Character  Code point  Code point (binary)       UTF-16 (binary)                          UTF-16 hex code units  UTF-16BE hex bytes  UTF-16LE hex bytes
$          U+0024      0000 0000 0010 0100       0000 0000 0010 0100                      0024                   00 24               24 00
€          U+20AC      0010 0000 1010 1100       0010 0000 1010 1100                      20AC                   20 AC               AC 20
𐐷          U+10437     0001 0000 0100 0011 0111  1101 1000 0000 0001 1101 1100 0011 0111  D801 DC37              D8 01 DC 37         01 D8 37 DC
𤭢          U+24B62     0010 0100 1011 0110 0010  1101 1000 0101 0010 1101 1111 0110 0010  D852 DF62              D8 52 DF 62         52 D8 62 DF
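The encoding steps for U+10437 translate directly into code. The sketch below uses a hypothetical encode method written for this example and checks the two supplementary rows of the table:

```java
class SurrogateEncode {
    /** Splits a supplementary code point into its high and low surrogates. */
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // step 1: a 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));  // steps 2 and 3: upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // steps 2 and 4: lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x10437, 0x24B62}) {
            char[] pair = encode(cp);
            System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]);
        }
        // prints:
        // U+10437 -> D801 DC37
        // U+24B62 -> D852 DF62
    }
}
```

In application code the standard Character.toChars method performs the same split, so a hand-written version like this is mainly useful for understanding the bit layout.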
Solution 1: Database
    • JAR package: the MySQL Connector/J version must be higher than 5.1.13
    • MySQL: the utf8mb4 character set requires MySQL 5.5.3 or later
      • Changing the server configuration from utf8 to utf8mb4 requires a restart of MySQL
      • Because developers (RD) should not change the MySQL configuration, call SET NAMES utf8mb4 in the business application so that the data is stored in the database as utf8mb4
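The reason utf8mb4 is needed is that MySQL's older utf8 character set stores at most three bytes per character, while any character outside the BMP takes four bytes in UTF-8. As a rough illustration, the following sketch detects such characters; the class and the helper name needsFourByteUtf8 are hypothetical and not part of any MySQL or JDBC API:

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteLength {
    /** True if any code point in the text lies outside the BMP (4 bytes in UTF-8). */
    static boolean needsFourByteUtf8(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String euro = "\u20AC";        // U+20AC, inside the BMP
        String emoji = "\uD83D\uDE36"; // U+1F636, outside the BMP
        System.out.println(utf8Length(euro) + " bytes, four-byte: " + needsFourByteUtf8(euro));
        System.out.println(utf8Length(emoji) + " bytes, four-byte: " + needsFourByteUtf8(emoji));
    }
}
```

A check like this could be used to decide whether a value is safe to write to a column that has not yet been converted to utf8mb4.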
Solution 2: Filtering

The database change is the fundamental fix, but it only works if every system that stores the data meets the conditions above, which is often not the case. A stopgap is therefore also needed: remove the emoji from the text before it is stored.

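public class EmojiFilter {
    public static void main(String[] args) {
        String source = "a\uD83D\uDE36b\uD83D\uDE36\uD83D\uDE36\uD83D\uDE36\uD83C\uDE3612312\uD83C\uDE36";
        while (true) {
            // Find the next high surrogate 0xD83D or 0xD83C (U+1F000 to U+1F7FF).
            int pos = source.indexOf("\uD83D");
            if (pos == -1) {
                pos = source.indexOf("\uD83C");
            }
            if (pos == -1) {
                break;
            }
            // Drop the surrogate pair; the bound guards against a lone
            // high surrogate at the very end of the string.
            int end = Math.min(pos + 2, source.length());
            source = source.substring(0, pos) + source.substring(end);
        }
        System.out.println(source); // prints "ab12312"
    }
}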
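The loop above matches only the two high surrogates 0xD83C and 0xD83D and rebuilds the string on every pass. An alternative, shown here as a sketch and not as the original author's method, is to walk the string by code point and drop everything outside the BMP. Note that this is a broader filter: it also removes non-emoji supplementary characters such as rare CJK ideographs. The class and method names are hypothetical.

```java
class CodePointEmojiFilter {
    /** Removes every supplementary (non-BMP) code point from the text. */
    static String stripSupplementary(String source) {
        StringBuilder sb = new StringBuilder(source.length());
        source.codePoints()
              .filter(cp -> !Character.isSupplementaryCodePoint(cp))
              .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String source = "a\uD83D\uDE36b\uD83D\uDE36\uD83D\uDE36\uD83D\uDE36"
                + "\uD83C\uDE3612312\uD83C\uDE36";
        System.out.println(stripSupplementary(source)); // prints "ab12312"
    }
}
```

To restrict the filter to the ranges listed earlier, the predicate could instead test whether the code point falls between 0x1F300 and 0x1F7FF. In that case the two U+1F236 characters in this sample string would be kept, because they lie below 0x1F300.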
Reference document: http://files.cnblogs.com/files/lanelim/Symbola.pdf
