Applying basic dictionary tree (Trie) lookup to Emoji and keyword retrieval, Part 2

Source: Internet
Author: User
Series Index
    1. Unicode and Emoji
    2. Dictionary tree (Trie-tree) and performance testing
    3. Production Practice

With the Unicode and Emoji background covered, this article moves on to the code.

Once we know that an Emoji is a sequence of Unicode characters, it follows naturally that finding Emoji and finding sensitive words are the same problem: index the Emoji list or the keywords, segment the user input, then traverse and filter.

This article does not discuss the word segmentation techniques of Lucene and Elasticsearch.

That is exactly what my first version of the Emoji lookup did, and it had two problems.

    1. Traditional segmentation relies on a double traversal of the long sentence;
    2. Comparing clauses requires many SubString() calls, which creates heavy GC pressure;

The double traversal can be optimized by letting the inner traversal advance the position of the outer traversal, but extracting clauses remains unavoidable. This is covered later in this article.
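As a rough illustration of that first approach, here is a minimal JavaScript sketch. It is not the author's original code: the function name `classicFindAll` and the use of a `Set` as the keyword index are assumptions made for this example.

```javascript
// Naive keyword search: for every start offset, try every clause length
// and look the extracted clause up in a Set. Each candidate clause is a
// newly extracted substring, which is the source of the GC pressure.
function classicFindAll(keywords, sentence) {
    const index = new Set(keywords);
    const segments = [];
    for (let offset = 0; offset < sentence.length; offset += 1) {
        for (let count = 1; offset + count <= sentence.length; count += 1) {
            const clause = sentence.substring(offset, offset + count);
            if (index.has(clause)) {
                segments.push({offset, count});
            }
        }
    }
    return segments;
}

// Two matches: 'Hey' at offset 0 and 'Hello' at offset 5
console.log(classicFindAll(['Hello', 'Hey'], 'Hey, Hello'));
```

For a sentence of n characters this sketch extracts n(n+1)/2 clauses, which is why both the traversal and the slicing are worth optimizing.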

Dictionary tree (Trie-tree)

The Trie-tree algorithm itself is simple and easy to understand; a basic implementation takes about 100 lines in most programming languages.

A highly optimized implementation also exists. Its home page links to the author's blog and notes on the optimization experience.

    • Toolgood/toolgood.words

For more in-depth reading, see

    • How to overcome the shortcomings of the dictionary tree (Trie-tree)?

This article aims not only to detect Emoji and keywords but also to support locating, replacing and similar operations, so the implementation starts from scratch.

JavaScript implementation

Given the verbosity of static languages, the more expressive JavaScript version is used as the example below, with irrelevant parts removed. The full source code is at Github.com/jusfr/chuye.character.

The implementation uses Symbol from ECMAScript 6; see Symbol in the MDN Web Docs (developer.mozilla.org/zh-cn/). It does not get in the way of reading the code.

    const COUNT_SYMBOL = Symbol('count');
    const END_SYMBOL = Symbol('end');

    class TrieFilter {
        constructor() {
            this.root = {[COUNT_SYMBOL]: 0};
        }

        // Index a keyword, one character per tree level
        apply(word) {
            let node = this.root;
            for (let ch of word) {
                let child = node[ch];
                if (child) {
                    child[COUNT_SYMBOL] += 1;
                } else {
                    node[ch] = child = {[COUNT_SYMBOL]: 1};
                }
                node = child;
            }
            // Mark the last node as the end of a keyword
            node[END_SYMBOL] = true;
        }

        // Walk the tree from the start of the sentence until a character misses
        findFirst(sentence) {
            let node = this.root;
            let sequence = [];
            for (let ch of sentence) {
                let child = node[ch];
                if (!child) {
                    break;
                }
                sequence.push(ch);
                node = child;
            }
            if (node[END_SYMBOL]) {
                return sequence.join('');
            }
        }

        // Collect {offset, count} for every keyword found in the sentence
        findAll(sentence) {
            let offset = 0;
            let segments = [];
            while (offset < sentence.length) {
                let child = this.root[sentence[offset]];
                if (!child) {
                    offset += 1;
                    continue;
                }
                if (child[END_SYMBOL]) {
                    segments.push({offset: offset, count: 1});
                }
                let count = 1;
                let proceeded = 1;
                while (child && offset + count < sentence.length) {
                    child = child[sentence[offset + count]];
                    if (!child) {
                        break;
                    }
                    count += 1;
                    if (child[END_SYMBOL]) {
                        proceeded = count;
                        segments.push({offset: offset, count: count});
                    }
                }
                // Skip past the longest keyword matched at this offset
                offset += proceeded;
            }
            return segments;
        }
    }

    module.exports = TrieFilter;

Including blank lines the code is only 87 lines long, and it exposes just 3 methods:

    • apply(word): adds the keyword word
    • findFirst(sentence): retrieves the first match in sentence
    • findAll(sentence): retrieves all matches in sentence

Usage example

Index the keywords Hello, Hey and He, then search the sentence 'Hey guys, we know "Hello World" is the beginning of all programming languages'

    const assert     = require('assert');
    const base64     = require('../src/base64');
    const TrieFilter = require('../src/TrieFilter');

    describe('TrieFilter', function () {
        it('feature', function () {
            let trie  = new TrieFilter();
            let words = ['Hello', 'Hey', 'He'];
            words.forEach(x => trie.apply(x));

            let findFirst = trie.findFirst('Hello world');
            console.log('findFirst: %s', findFirst);

            let sentence = 'Hey guys, we know "Hello World" is the beginning of all programming languages';
            let findAll  = trie.findAll(sentence);
            console.log('findAll:\noffset\tcount\tsubString');
            for (let {offset, count} of findAll) {
                console.log('%s\t%s\t%s', offset, count, sentence.substr(offset, count));
            }
        });
    })

Output results

    $ mocha .
    findFirst: Hello
    findAll:
    offset  count   subString
    0       2       He
    0       3       Hey
    19      2       He
    19      5       Hello

The double traversal used in the source code is an optimized version, which is discussed later in this article.

Once the TrieFilter implementation is more complete, for example by declaring a node type that keeps a reference to its parent node, features such as keyword removal become possible. When the indexed phrases are all Emoji, retrieving the Emoji in user input is straightforward.
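Keyword removal can also be sketched without parent references, because the JavaScript listing earlier in this article already keeps a per-node count under COUNT_SYMBOL. The following is a hypothetical, self-contained sketch: the `MiniTrie` class and its `has` and `remove` methods are assumptions for illustration and are not part of the author's TrieFilter. Unlike the original apply(), this version skips keywords that are already indexed so that the counts stay exact.

```javascript
const COUNT = Symbol('count');
const END = Symbol('end');

class MiniTrie {
    constructor() {
        this.root = {[COUNT]: 0};
    }

    // True only when the exact keyword has been indexed
    has(word) {
        let node = this.root;
        for (const ch of word) {
            node = node[ch];
            if (!node) {
                return false;
            }
        }
        return node[END] === true;
    }

    apply(word) {
        if (this.has(word)) {
            return; // one increment per distinct keyword keeps counts exact
        }
        let node = this.root;
        for (const ch of word) {
            const child = node[ch] || (node[ch] = {[COUNT]: 0});
            child[COUNT] += 1; // number of keywords passing through this node
            node = child;
        }
        node[END] = true;
    }

    remove(word) {
        if (!this.has(word)) {
            return false;
        }
        let node = this.root;
        for (const ch of word) {
            const child = node[ch];
            child[COUNT] -= 1;
            if (child[COUNT] === 0) {
                delete node[ch]; // no other keyword uses this branch: prune it
                return true;
            }
            node = child;
        }
        delete node[END]; // node survives as a prefix of longer keywords
        return true;
    }
}

const trie = new MiniTrie();
['He', 'Hey', 'Hello'].forEach(w => trie.apply(w));
trie.remove('He');
console.log(trie.has('He'), trie.has('Hey')); // false true
```

Counting keywords per node trades a little memory for removal without back-pointers; a parent reference, as the text suggests, is the alternative when nodes are declared as a dedicated type.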

C# implementation

The C# implementation is somewhat verbose. The author first implemented a generic node and tree (Github.com/jusfr/chuye.character), later found that version difficult to optimize, and finally adopted a simplified version based on Char.

    using System;
    using System.Collections.Generic;

    class CharTrieNode {
        // Created lazily so that leaf nodes do not allocate a dictionary
        private Dictionary<Char, CharTrieNode> _children;

        public Char Key { get; private set; }
        internal Boolean IsTail { get; set; }

        public CharTrieNode this[Char key] {
            get {
                if (_children == null) {
                    return null;
                }
                CharTrieNode child;
                if (!_children.TryGetValue(key, out child)) {
                    return null;
                }
                return child;
            }
            set {
                if (_children == null) {
                    _children = new Dictionary<Char, CharTrieNode>();
                }
                _children[key] = value;
            }
        }

        public Int32 Count {
            get {
                if (_children == null) {
                    return 0;
                }
                return _children.Count;
            }
        }

        public CharTrieNode(Char key) {
            Key = key;
        }

        // Return the child for key, creating it when it does not exist yet
        public CharTrieNode Apppend(Char key) {
            CharTrieNode child;
            if (_children == null) {
                _children = new Dictionary<Char, CharTrieNode>();
                child = new CharTrieNode(key);
                _children[key] = child;
                return child;
            }
            if (!_children.TryGetValue(key, out child)) {
                child = new CharTrieNode(key);
                _children[key] = child;
            }
            return child;
        }

        public Boolean TryGetValue(Char key, out CharTrieNode child) {
            child = null;
            if (_children == null) {
                return false;
            }
            return _children.TryGetValue(key, out child);
        }
    }

    public interface IPhraseContainer {
        void Apply(String phrase);
        Boolean Contains(String phrase);
        Boolean Contains(String phrase, Int32 offset, Int32 length);
    }

To allow comparison with a hash-based implementation, IPhraseContainer is defined as the data entry point. The Trie-tree-based implementation, CharTriePhraseContainer, has an Apply() that matches the JavaScript version, while the hash-based HashPhraseContainer maintains and operates on a HashSet<String> internally.

The high-level API is provided by PhraseFilter, which depends internally on an IPhraseContainer implementation.

Since the test results are already in, the hash-based implementation will be removed later to reduce code redundancy.

Inside PhraseFilter, the retrieval methods are as follows. Note that ClassicSearchAll() is the optimized version of the double traversal and does not differ materially from the JavaScript version. When the container is a CharTriePhraseContainer, however, the traversal is handed over to that container's own SearchAll(), because a Trie-tree lookup only needs to traverse the input once.

    public IEnumerable<ArraySegment<Char>> SearchAll(String phrase) {
        var container = _container as CharTriePhraseContainer;
        if (container != null) {
            return container.SearchAll(phrase);
        }
        return ClassicSearchAll(phrase);
    }

    public IEnumerable<ArraySegment<Char>> ClassicSearchAll(String phrase) {
        if (phrase == null) {
            throw new ArgumentNullException(nameof(phrase));
        }

        var chars = phrase.ToCharArray();
        var offset = 0;
        while (offset < phrase.Length) {
            // Clause length and the value by which offset will be advanced
            var count = 1;
            var proceeded = 1;
            // Check whether the characters following offset form a keyword
            while (offset + count <= phrase.Length) {
                // Quick assertions
                if (_assertors.Count == 0 || _assertors.All(x => x.Contains(phrase, offset, count))) {
                    // Check whether the clause exists; _container may be based on HashSet, etc.
                    if (_container.Contains(phrase, offset, count)) {
                        // Record the offset advance value
                        proceeded = count;
                        yield return new ArraySegment<Char>(chars, offset, count);
                    }
                }
                count += 1;
            }
            // Advance the offset
            offset += proceeded;
        }
    }

The Trie-tree lookup is the process of matching the input sentence against CharTrieNode nodes.

    public IEnumerable<ArraySegment<Char>> SearchAll(String phrase) {
        if (phrase == null) {
            throw new ArgumentNullException(nameof(phrase));
        }

        var chars = phrase.ToCharArray();
        var offset = 0;
        while (offset < phrase.Length) {
            var current = _root[phrase[offset]];
            if (current == null) {
                // Advance the offset
                offset += 1;
                continue;
            }

            // If this node is a tail, a single character hits a keyword
            if (current.IsTail) {
                yield return new ArraySegment<Char>(chars, offset, 1);
            }

            // Clause length and the value by which offset will be advanced
            var count = 1;
            var proceeded = 1;
            // Check whether the characters following offset continue a keyword
            while (current != null && offset + count < phrase.Length) {
                current = current[phrase[offset + count]];
                if (current == null) {
                    break;
                }

                count += 1;
                if (current.IsTail) {
                    // Record how far the offset has been advanced
                    proceeded = count;
                    yield return new ArraySegment<Char>(chars, offset, proceeded);
                }
            }

            // Advance the offset
            offset += proceeded;
        }
    }

Because there is neither a double traversal nor a SubString() call, performance and memory overhead both improve considerably compared with the hash-based or regular-expression methods.

Usage example

The project source has been packaged and published to NuGet

PM> Install-Package chuye.triefilter

For Emoji retrieval, you need to prepare a list of Emoji or take it from chuye-emoji.txt.

    var filter = new PhraseFilter();
    var filename = Path.Combine(Directory.GetCurrentDirectory(), "chuye-emoji.txt");
    filter.ApplyFile(filename);

    var clause = @"颠簸了三小时飞机️两小时公交地铁四小时大巴一小时 终于到了我们的目的地像面粉一样的沙滩和碧绿的大海 这就是我们第一次旅行的地方in沙美岛";
    var segments = filter.SearchAll(clause).ToArray();
    var searched = new SearchResult(clause, segments);
    var replaced = searched.Replace(x => new String('*', x.Length));
    var comparsion = "颠簸了三小时飞机*️*两小时公交地铁***四小时大巴*一小时** 终于到了我们的目的地像面粉一样的沙滩和碧绿的大海 这就是我们第一次旅行的地方in沙美岛**";
    Assert.Equal(comparsion, replaced);

The chuye-emoji.txt file was compiled by the author from the Unicode website.

Retrieving keywords or sensitive words works in exactly the same way. Prepare your own word list; that is not discussed further here. The Dump() method used in the following code is LINQPad's shortcut for printing output.

    var filter = new PhraseFilter();
    filter.Apply("Hello", "Hey");

    var sentence = "Hey guys, we know \"Hello World\" is the beginning of all programming languages";
    var searched = filter.SearchAll(sentence).ToArray();

    searched.Select(x => new {
        x.Offset,
        x.Count,
        Substring = sentence.Substring(x.Offset, x.Count)
    }).Dump("Searched");

    new SearchResult(sentence, searched)
        .Replace(x => new String('*', x.Length))
        .Dump("Replaced");
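The replacement step can be pictured with a short JavaScript sketch. This is an assumption about the general idea and not the library's SearchResult.Replace() implementation: the helper name `maskSegments` is hypothetical, and the segments are plain {offset, count} pairs like those returned by the JavaScript findAll() earlier in this article.

```javascript
// Overwrite every matched range with a mask character.
// Overlapping segments (for example 'He' inside 'Hey') are harmless
// because each position is simply masked again.
function maskSegments(sentence, segments, mask = '*') {
    const chars = sentence.split(''); // UTF-16 code units, same unit as the offsets
    for (const {offset, count} of segments) {
        for (let i = offset; i < offset + count && i < chars.length; i += 1) {
            chars[i] = mask;
        }
    }
    return chars.join('');
}

const sample = 'Hey guys, we know "Hello World"';
const found = [{offset: 0, count: 3}, {offset: 19, count: 5}];
console.log(maskSegments(sample, found)); // *** guys, we know "***** World"
```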

An example of implementing your own IPhraseProvider and integrating it with Autofac

    class EmojiPhraseProvider : IPhraseProvider {
        private readonly IEmojiRepository _emojiRepository;

        public EmojiPhraseProvider(IEmojiRepository emojiRepository) {
            _emojiRepository = emojiRepository;
        }

        public IEnumerable<String> Fetch() {
            var values = _emojiRepository.GetValues();
            return values.Select(x => x.value);
        }
    }

    public class EmojiFinderModule : Module {
        protected override void Load(ContainerBuilder builder) {
            builder.RegisterType<EmojiPhraseProvider>().As<IPhraseProvider>();
            builder.RegisterType<PhraseFilter>()
                .OnActivated(OnPhraseFilterActivated)
                .As<IPhraseFilter>()
                .SingleInstance();
            base.Load(builder);
        }

        private void OnPhraseFilterActivated(IActivatedEventArgs<PhraseFilter> obj) {
            var provider = obj.Context.Resolve<IPhraseProvider>();
            obj.Instance.Apply(provider);
        }
    }

Performance testing

100,000 iterations

    trieFilter.SearchAll
        Time Elapsed : 65ms
        CPU Cycles   : 174,521,817
        Memory cost  : 1,192
        Gen 0        : 7
        Gen 1        : 2
        Gen 2        : 2

    hashFilter.SearchAll
        Time Elapsed : 627ms
        CPU Cycles   : 1,694,437,899
        Memory cost  : 2,440
        Gen 0        : 137
        Gen 1        : 2
        Gen 2        : 2

JavaScript version

    $ node trieFilter.js
    Show pretty:
    depth 00 count 002: │H
    depth 01 count 002: │─e
    depth 02 count 001: │──l
    depth 03 count 001: │───l
    depth 04 count 001: └────o
    depth 02 count 001: └──y
    findFirst: Hello
    findAll:
    offset  count   subString
    0       3       Hey
    19      5       Hello
    marky: loop 100000 times
    [ { startTime: 5.214011,
        name: 'findAll',
        duration: 180.187891,

Optimization methods

The performance bottleneck of the classic lookup comes from word segmentation based on double traversal, and the large number of clause slices adds garbage collection pressure.

Optimization I: linearizing the "word breaker"

The basic idea is that when the inner (second) traversal finds a match, the end position of the current clause is used as the starting point of the next outer traversal.

This method reduces the number of SubString() calls in the same proportion as it reduces computation, but substring slicing itself remains unavoidable.

    public IEnumerable<ArraySegment<Char>> SearchAll(String phrase) {
        if (phrase == null) {
            throw new ArgumentNullException(nameof(phrase));
        }

        var chars = phrase.ToCharArray();
        var offset = 0;
        while (offset < phrase.Length) {
            // Clause length and the value by which offset will be advanced
            var count = 1;
            var proceeded = 1;
            // Check whether the characters following offset form a keyword
            while (offset + count <= phrase.Length) {
                // A quick assertion can be added here to defer clause slicing:
                // var clause = phrase.Substring(offset, count);
                // Check whether the clause exists; _container may be based on HashSet, etc.
                // Whether phrase.Substring(offset, count) is called depends on the implementation
                if (_container.Contains(phrase, offset, count)) {
                    // Record the offset advance value
                    proceeded = count;
                    yield return new ArraySegment<Char>(chars, offset, count);
                }
                count += 1;
            }
            // Advance the offset
            offset += proceeded;
        }
    }

Optimization II: quick clause checks after the "word breaker"
    1. Optimization based on the length of clauses

Use a single integer to store the combination of all keyword lengths. For example, when the keywords Hi and Hello are indexed, the length combination is computed as follows:

    1. Initial value: 0
    2. Bit arithmetic 0 | 1 << len('Hi'), giving a length combination of 4
    3. Bit arithmetic 4 | 1 << len('Hello'), giving a length combination of 36

Lookup examples

    • Find 'Hey'
      • 36 & (1 << len('Hey')) = 0, the lookup ends
    • Find 'Hell'
      • 36 & (1 << len('Hell')) = 0, the lookup ends
    • Find 'Hello'
      • 36 & (1 << len('Hello')) = 32, proceed to the subsequent lookup
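The length check above can be sketched in a few lines of JavaScript. The function names `buildLengthMask` and `lengthMayMatch` are hypothetical, and the sketch assumes no keyword is longer than 31 characters, because JavaScript bitwise operators work on 32-bit integers.

```javascript
// Bit n of the mask is set when at least one keyword has length n.
// Assumption: keyword lengths stay within 0..31 (32-bit bitwise operators).
function buildLengthMask(keywords) {
    let mask = 0;
    for (const word of keywords) {
        mask |= 1 << word.length;
    }
    return mask;
}

// A clause can only be a keyword if its length bit is set in the mask.
function lengthMayMatch(mask, clause) {
    return (mask & (1 << clause.length)) !== 0;
}

const lengthMask = buildLengthMask(['Hi', 'Hello']);
console.log(lengthMask);                          // 36
console.log(lengthMayMatch(lengthMask, 'Hey'));   // false
console.log(lengthMayMatch(lengthMask, 'Hell'));  // false
console.log(lengthMayMatch(lengthMask, 'Hello')); // true
```

A `true` result only means that the container lookup is still needed; it does not confirm a match.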
    2. Optimization based on the character positions of clauses

Use an integer array of length char.MaxValue to store, for each character, the combination of positions at which it appears in any keyword. For example, when the keywords Hi and Hello are indexed, the array is computed as follows:

    1. Add 'Hi'
    • array['H'] = 1 << 0 = 1
    • array['i'] = 1 << 1 = 2
    2. Add 'Hello'
    • array['H'] |= 1 << 0 = 1
    • array['e'] |= 1 << 1 = 2
    • array['l'] |= 1 << 2 = 4
    • array['l'] |= 1 << 3 = 4 | 8 = 12
    • array['o'] |= 1 << 4 = 16

The final character combinations are

    • 'H': 1
    • 'i': 2
    • 'e': 2
    • 'l': 12
    • 'o': 16

Lookup examples, ignoring the preceding length check

    • Find 'Hey'
      • Compare 'H': array['H'] = 1 means 'H' appears at index 0, so the check passes
      • Compare 'e': array['e'] = 2 means 'e' appears at index 1, so the check passes
      • Compare 'y': array['y'] = 0 means 'y' never appears, so the check fails and the lookup fails

If the lookup is 'Helll', the 5th character (index 4) is 'l', and array['l'] = 12 means 'l' appears only at index 2 or 3, so the check fails.
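The position check can be sketched in JavaScript as follows. This is an illustration under stated assumptions, not the author's code: a `Map` replaces the integer array of length char.MaxValue, the function names `buildPositionMasks` and `positionsMayMatch` are hypothetical, and keywords are assumed to be no longer than 32 characters so that every index fits in a 32-bit mask.

```javascript
// For each character, set bit i when the character occurs at index i
// of at least one keyword.
function buildPositionMasks(keywords) {
    const masks = new Map();
    for (const word of keywords) {
        for (let i = 0; i < word.length; i += 1) {
            masks.set(word[i], (masks.get(word[i]) || 0) | (1 << i));
        }
    }
    return masks;
}

// Reject a clause as soon as one character is at a position where
// no keyword has that character.
function positionsMayMatch(masks, clause) {
    for (let i = 0; i < clause.length; i += 1) {
        if (((masks.get(clause[i]) || 0) & (1 << i)) === 0) {
            return false;
        }
    }
    return true;
}

const masks = buildPositionMasks(['Hi', 'Hello']);
console.log(masks.get('l'));                    // 12
console.log(positionsMayMatch(masks, 'Hey'));   // false
console.log(positionsMayMatch(masks, 'Helll')); // false
console.log(positionsMayMatch(masks, 'Hillo')); // true, although it is not a keyword
```

The last line shows that this check is only a pre-filter: 'Hillo' mixes positions from both keywords and passes, so the container lookup must still run afterwards.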

Across multiple performance comparisons, the quick clause checks turned out to be very unstable and sometimes even slowed the lookup down. This may be related to the test samples; no further testing was done.

Because a Trie-tree lookup is a step-by-step approximation process, the length optimization can only degenerate into a "not greater than the maximum length" check.

Performance comparison

    # Traditional double traversal
    hashFilter.SearchAll
        Time Elapsed : 3,613ms
        CPU Cycles   : 9,758,423,828
        Memory cost  : 19,176
        Gen 0        : 880
        Gen 1        : 2
        Gen 2        : 2

    # Optimized traversal
    hashFilter.SearchAll
        Time Elapsed : 1,310ms
        CPU Cycles   : 3,538,391,198
        Memory cost  : 14,696
        Gen 0        : 440
        Gen 1        : 2
        Gen 2        : 2

    # Trie lookup
    trieFilter.SearchAll
        Time Elapsed : 63ms
        CPU Cycles   : 171,441,680
        Memory cost  : 1,192
        Gen 0        : 7
        Gen 1        : 2
        Gen 2        : 2
