A first look at dictionary tree (trie) lookup for Emoji and keyword retrieval, Part 1

Source: Internet
Author: User
Series Index
    1. Unicode and Emoji
    2. Dictionary tree (trie) and performance testing
    3. Production Practice
Objective

Users commonly edit their own profile data, and we stipulate that a nickname must be between 2 and 10 characters long. Suppose a user tries to use an emoji as the nickname: is the request valid?

Open the browser console, enter ''.length with the emoji between the quotes, and the printed result is 11.
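
To make the nickname question concrete, here is a minimal sketch. It assumes the emoji is a four-person family ZWJ sequence (U+1F469 U+200D U+1F469 U+200D U+1F467 U+200D U+1F467), which is consistent with the length 11 reported above. The helper isNicknameLengthValid is hypothetical, and counting code points is only one possible policy.

```javascript
// Assumed input: family (woman, woman, girl, girl), written with escapes
// so the sequence survives copy and paste.
const family = '\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F467}';

// .length counts UTF-16 code units: 4 emoji * 2 units + 3 joiners = 11.
const unitLength = family.length;

// Spreading a string iterates by code point: 4 emoji + 3 joiners = 7.
const codePointLength = [...family].length;

// Hypothetical validator: applies the 2..10 rule to code points
// instead of UTF-16 code units.
function isNicknameLengthValid(nickname, min = 2, max = 10) {
  const count = [...nickname].length;
  return count >= min && count <= max;
}

console.log(unitLength, codePointLength, isNicknameLengthValid(family));
```

Under this policy the single visible emoji passes as a 7-character nickname, while a plain .length check rejects it. Neither number matches the one character the user sees, which is why the question cannot be answered without looking at Unicode.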

Our company's project involves printing content. Emoji used to come out as garbled text or empty boxes as a matter of course, and the inconsistencies between phones, browsers, and printed output were painful. I gritted my teeth, read through unicode.org/emoji, and temporarily solved the problem with a hash lookup.

Some time ago the project needed sensitive-word filtering. After going through various references and combining them with the earlier Emoji approach, I had the feeling described in "The Peach Blossom Spring" ("after a few dozen steps, it suddenly opened up"), and the solution was upgraded.

What follows is a first exploration of the dictionary tree (trie) and how it was applied in practice to locating Emoji and filtering sensitive words.

Unicode

For us programmers, Emoji raise a number of questions:

    • What is the length?
    • How do I display them consistently across platforms?

Answering these questions is inseparable from Unicode.

When we talk about Unicode, what are we talking about?
    • "Talk about Emoji and character encoding length": not long, and a good introduction to what Emoji are and how they relate to Unicode characters;
    • "Character Set and character encoding (Charset & Encoding)": more academic, a systematic introduction to how character encodings developed;
    • "Unicode and JavaScript in detail": focused on JavaScript, but applicable to many other languages.

Because JavaScript was designed around UCS-2, every character in the language is treated as a 2-byte unit, and a 4-byte character is treated as two 2-byte characters. JavaScript's character functions are affected by this and cannot return correct results for such characters.
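
A small sketch of how the classic string functions behave on a 4-byte character. The character used is U+1D306, written as the surrogate pair "\ud834\udf06" that the article uses later on; nothing here goes beyond standard String methods.

```javascript
const s = '\ud834\udf06'; // U+1D306, one visible character

// length counts 16-bit code units, so one character reports 2.
console.log(s.length); // 2

// charCodeAt and charAt only see the first half (the high surrogate).
console.log(s.charCodeAt(0).toString(16)); // 'd834'
console.log(s.charAt(0) === s);            // false

// Reversing by code unit swaps the two halves and corrupts the character.
const reversed = s.split('').reverse().join('');
console.log(reversed === s); // false
```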

After reading the material above, you should have a preliminary idea about the two questions. Below is how the Unicode character "𝌆" (U+1D306) is written in several programming languages and versions.

Language        Character set   Encoding        Literal for "𝌆"
C#              Unicode         UTF-16          "\ud834\udf06"
Java            Unicode         UTF-16          "\ud834\udf06"
ECMAScript 5    Unicode         UCS-2           "\ud834\udf06"
ECMAScript 6    Unicode         UCS-2, UTF-16   "\ud834\udf06", "\u{1d306}"
Python          ?               ?               u'\U0001d306'

In summary, UTF-16 uses the following rules to extend the character set beyond 16 bits.

  • If the code point U is less than 0x10000, that is, within 0 to 65535 in decimal, it is represented directly with two bytes;
  • If the code point U is 0x10000 or greater, then since the upper limit of Unicode is 0x10FFFF, there are 0x100000 code points between 0x10000 and 0x10FFFF, which means 20 bits are needed to label them. Let U' = U - 0x10000, a value in the range 0 to 0xFFFFF. The upper 10 bits of U' are combined with 0xD800 by a bitwise OR to form the high half, and the lower 10 bits are combined with 0xDC00 by a bitwise OR to form the low half; the resulting 4 bytes are the encoding of U.
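
The second rule can be sketched directly in code. The function names toSurrogatePair and fromSurrogatePair are illustrative, not part of any library; the arithmetic follows the rule above, with U' obtained by subtracting 0x10000.

```javascript
// Encode a supplementary code point (0x10000..0x10FFFF) as two
// UTF-16 code units, following the rule described above.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('code point is outside 0x10000..0x10FFFF');
  }
  const offset = codePoint - 0x10000;    // U', a 20-bit value
  const high = 0xD800 | (offset >> 10);  // upper 10 bits
  const low = 0xDC00 | (offset & 0x3FF); // lower 10 bits
  return [high, low];
}

// Decode the two code units back into the original code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high & 0x3FF) << 10) + (low & 0x3FF);
}

const pair = toSurrogatePair(0x1D306);
console.log(pair.map(unit => unit.toString(16))); // ['d834', 'df06']
console.log(fromSurrogatePair(pair[0], pair[1]).toString(16)); // '1d306'
```

The result matches the "\ud834\udf06" literal used for U+1D306 throughout this article.
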
Support for 4-byte Unicode in some programming languages

Java

String str = "\ud834\udf06";
System.out.printf("str: %s, length: %d", str, str.length());
// str: 𝌆, length: 2

C#

String str = "\ud834\udf06";
Console.WriteLine("str: {0}, length: {1}", str, str.Length);
// str: 𝌆, length: 2

Javascript

> let str = "\ud834\udf06";
> str
< "𝌆"
> console.log("str: %s, length: %d", str, str.length);
  str: 𝌆, length: 2

Python 3

>>> s = "\ud834\udf06"
>>> s
'\ud834\udf06'
>>> len(s)
2

Python 2

>>> s = "\ud834\udf06"
>>> s
'\\ud834\\udf06'
>>> len(s)
12
>>> s = u'\ud834\udf06'
>>> s
u'\U0001d306'
>>> len(s)
2

In most programming languages, "string length" means the number of storage units (bytes or 16-bit code units) the string occupies, not the number of characters a reader sees. Counting or locating visible characters requires first converting that sequence of units into a sequence of Unicode characters. Languages built on UTF-16 are able to apply the rules above, but for historical reasons the UCS-2-based ECMAScript 5 and Python 2 are left in a sorry state.

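C#: Char.IsHighSurrogate and StringInfo

using System;
using System.Collections.Generic;
using System.Globalization;

public static class StringExtensions {
  // Get the Unicode code points
  public static IEnumerable<Int32> CodePoints(this String s) {
    for (int i = 0; i < s.Length; ++i) {
      yield return Char.ConvertToUtf32(s, i);
      // Skip the low surrogate that belongs to the code point just returned
      if (Char.IsHighSurrogate(s, i))
        i++;
    }
  }

  // Get the text elements (what a reader perceives as characters)
  public static IEnumerable<String> TextElements(String s) {
    var enumerator = StringInfo.GetTextElementEnumerator(s);
    while (enumerator.MoveNext()) {
      yield return enumerator.GetTextElement();
    }
  }
}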

ECMAScript 6: String.prototype.codePointAt(index: number)

Note that the index parameter counts 2-byte units, not Unicode characters. For a 4-byte character, if the index lands on its second half, String.prototype.codePointAt still works but degenerates into String.prototype.charCodeAt and returns only that half.
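
A short sketch of that behavior, using U+1D306 as the 4-byte character. Only standard String methods are used.

```javascript
const s = '\u{1D306}'; // one character, stored as two UTF-16 code units

// Index 0 is the high surrogate: the full code point is returned.
const atStart = s.codePointAt(0);

// Index 1 is the low surrogate: only that surrogate is returned,
// exactly what charCodeAt(1) gives.
const atMiddle = s.codePointAt(1);
const viaCharCode = s.charCodeAt(1);

console.log(atStart.toString(16));     // '1d306'
console.log(atMiddle.toString(16));    // 'df06'
console.log(atMiddle === viaCharCode); // true
```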

Therefore, it cannot simply be implemented as let codePoints = s => Array.from([...s].keys()).map(i => s.codePointAt(i));

// family (man, woman, boy): U+1F468 U+200D U+1F469 U+200D U+1F466
let s = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';
let codePoints = s => Array.from([...s].keys()).map(i => s.codePointAt(i));
codePoints(s)
// [128104, 56424, 8205, 128105, 56425] ERROR!!!

The right approach

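// family (man, woman, boy): U+1F468 U+200D U+1F469 U+200D U+1F466
let s = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';
let codePoints = s => [...s].length === 1
  ? Array.from([...s].keys()).map(i => s.codePointAt(i))
  : Array.prototype.concat.call(...[...s].map(codePoints));
codePoints(s)
// (5) [128104, 8205, 128105, 8205, 128102]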
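
As an alternative sketch, not the author's version: the string iterator already yields whole code points, so the recursion can be replaced by a single Array.from call with a mapping function. This form also returns an empty array for an empty string.

```javascript
// Array.from iterates the string by code point and maps each
// one-code-point substring to its numeric value.
const codePoints = s => Array.from(s, ch => ch.codePointAt(0));

// family (man, woman, boy): U+1F468 U+200D U+1F469 U+200D U+1F466
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';

console.log(codePoints(family)); // [128104, 8205, 128105, 8205, 128102]
console.log(codePoints(''));     // []
```
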
Emoji

Emoji were first developed in Japan, were later adopted by Apple, and are now an international standard; see Unicode Emoji. That history left behind a variety of legacy issues (mentioned later), and Emoji themselves keep evolving, so today's data may become obsolete.

With the Unicode primer above, we now know that an Emoji is just a Unicode character or sequence; when the text rendering engine encounters it during parsing, it replaces it with its own implementation.

    • Some Emoji can be represented by a single 2-byte character
    • Some Emoji can be represented by a single 4-byte character
    • Some Emoji are a combination of several Unicode characters
    • Some Emoji are a combination of other Emoji and may have a fallback (degraded) display
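
The four cases can be illustrated with one sample each. The samples are my own picks (U+263A, U+1F600, the keycap sequence "1" + U+FE0F + U+20E3, and the man, woman, boy family); the "parts" column splits on the zero-width joiner U+200D, which is roughly what a platform falls back to when it cannot draw the combined glyph.

```javascript
// One hand-picked sample per category from the list above.
const samples = [
  { label: 'single 2-byte character', text: '\u263A' },
  { label: 'single 4-byte character', text: '\u{1F600}' },
  { label: 'combination of characters (keycap)', text: '1\uFE0F\u20E3' },
  { label: 'combination of other Emoji (ZWJ)', text: '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}' },
];

const report = samples.map(({ label, text }) => ({
  label,
  units: text.length,                  // UTF-16 code units
  codePoints: [...text].length,        // Unicode code points
  parts: text.split('\u200D').length,  // pieces joined by U+200D
}));

for (const row of report) {
  console.log(row.label, row.units, row.codePoints, row.parts);
}
```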

As an aside, the keywords for the solutions used by macOS and Android are Apple Color Emoji and Noto Color Emoji respectively; they involve TTF font programming and so on, so please search for them yourself if needed.

So the length of an Emoji is well defined, it just does not match what is seen on screen. How an Emoji is displayed is the job of the text rendering engine, and there are huge differences between platforms, browsers, vendors, and even versions.

What is the length?

"Explore the emoji character length" contains a piece of code that demonstrates emoji character lengths:

// neutral family
// U+1F46A
// length: 2
> 👪

// ZWJ sequence: family (man, woman, boy)
// U+1F468 + U+200D + U+1F469 + U+200D + U+1F466
// 👨 + U+200D + 👩 + U+200D + 👦
// length: 8
> 👨‍👩‍👦

// ZWJ sequence: family (woman, woman, girl)
// U+1F469 + U+200D + U+1F469 + U+200D + U+1F467
// 👩 + U+200D + 👩 + U+200D + 👧
// length: 8
> 👩‍👩‍👧

// ZWJ sequence: family (woman, woman, girl, girl)
// U+1F469 + U+200D + U+1F469 + U+200D + U+1F467 + U+200D + U+1F467
// 👩 + U+200D + 👩 + U+200D + 👧 + U+200D + 👧

Depending on the browser version and other factors, this text may show up as a sequence of separate Emoji rather than the combined one, so I captured how it renders in Chrome.

How do I display them consistently across platforms?

Twitter's solution for displaying Emoji consistently across platforms is twitter/twemoji. It has the following issues:

    • As of its latest update, the keycap Emoji '\u0031\ufe0f\u20e3' (a digit inside a box) is not supported
    • Its output relies on resources served from its own CDN.

We want to know which Emoji a text contains, where they are, how to replace them, and how to customize their display; that requires more control.
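
As a rough illustration of that kind of control, here is a minimal sketch of the hash-lookup approach mentioned in the Objective. Everything in it is an assumption made for illustration: KNOWN_EMOJI is a tiny hand-picked table (a real one would be generated from the Unicode emoji data), and findEmoji and replaceEmoji are hypothetical helpers, not the project's actual code.

```javascript
// Hypothetical, hand-picked table of known Emoji sequences.
const KNOWN_EMOJI = new Set([
  '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}', // family: man, woman, boy
  '\u{1F468}',                               // man
  '\u{1F469}',                               // woman
  '\u{1F466}',                               // boy
  '1\uFE0F\u20E3',                           // keycap 1
]);

// Longest entry in UTF-16 code units; bounds each lookup window.
const MAX_UNITS = Math.max(...Array.from(KNOWN_EMOJI, e => e.length));

// Scan left to right, trying the longest candidate first so that a
// ZWJ sequence wins over the single Emoji it starts with.
function findEmoji(text) {
  const hits = [];
  let i = 0;
  while (i < text.length) {
    let matched = null;
    const longest = Math.min(MAX_UNITS, text.length - i);
    for (let len = longest; len > 0 && matched === null; len--) {
      const candidate = text.slice(i, i + len);
      if (KNOWN_EMOJI.has(candidate)) matched = candidate;
    }
    if (matched === null) {
      i += 1;
    } else {
      hits.push({ index: i, emoji: matched });
      i += matched.length;
    }
  }
  return hits;
}

// Replace every located Emoji with whatever the render callback returns.
function replaceEmoji(text, render) {
  let out = '';
  let last = 0;
  for (const { index, emoji } of findEmoji(text)) {
    out += text.slice(last, index) + render(emoji);
    last = index + emoji.length;
  }
  return out + text.slice(last);
}

const text = 'Hi \u{1F468}\u200D\u{1F469}\u200D\u{1F466}! \u{1F469} 1\uFE0F\u20E3';
console.log(findEmoji(text).map(hit => hit.index));            // [3, 13, 16]
console.log(replaceEmoji(text, e => '[' + [...e].length + ']')); // 'Hi [5]! [1] [3]'
```

The repeated slicing at every position is the obvious cost of this approach; a dictionary tree, the subject of the next part in the series index, is one way to avoid it.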
