The application of the primary dictionary tree lookup in Emoji and keyword retrieval Part-2

Last Update:2018-08-31 Source: Internet

Author: User

Tags arithmetic assert base64 garbage collection elastic search


Series Index

Unicode and Emoji
Dictionary tree (Trie-tree) and performance testing
Production Practice

With the background on Unicode and Emoji covered, this article moves on to the code.

Once we know that an Emoji is a sequence of Unicode characters, it follows naturally that finding Emoji and finding sensitive words are exactly the same problem: index the Emoji list or the keywords, segment the user input, then traverse and filter.
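To make "a sequence of Unicode characters" concrete, here is a small sketch in JavaScript. The two sample strings and the codePoints helper are illustrative assumptions, not part of the original project; they only show that one visible Emoji can span several UTF-16 code units and several code points.

```javascript
// Illustrative values only: a single-code-point emoji and a ZWJ sequence.
const thumbsUp = '\u{1F44D}';
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

// Hypothetical helper: Array.from walks a string by code point,
// not by UTF-16 code unit, just like for...of does.
function codePoints(text) {
    return Array.from(text, ch => 'U+' + ch.codePointAt(0).toString(16).toUpperCase());
}

console.log(thumbsUp.length, codePoints(thumbsUp));
console.log(family.length, codePoints(family));
```

The difference between .length (code units) and iteration (code points) matters for any index that is built with one and searched with the other.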

This article does not discuss the word segmentation techniques of Lucene and Elastic Search.

That is how my first version of the Emoji lookup worked, and it had two problems:

Traditional segmentation is based on a double traversal of the long sentence;
Comparing clauses requires a large number of SubString() calls, which creates heavy GC pressure.

The double traversal can be optimized by letting the inner traversal advance the position of the outer traversal, but extracting clauses remains unavoidable; this is covered later in this article.
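The two problems above can be sketched as follows. This is a minimal illustration under assumptions, not the author's first version: naiveFindAll is a hypothetical helper, and a plain Set stands in for whatever keyword index was actually used.

```javascript
// Hypothetical helper: naive double traversal over a Set of keywords.
function naiveFindAll(sentence, keywords) {
    const segments = [];
    let substringCalls = 0;
    // Outer traversal: every possible start position.
    for (let offset = 0; offset < sentence.length; offset++) {
        // Inner traversal: every possible clause length from that position.
        for (let count = 1; offset + count <= sentence.length; count++) {
            // Each candidate clause allocates a new string.
            const clause = sentence.substring(offset, offset + count);
            substringCalls += 1;
            if (keywords.has(clause)) {
                segments.push({offset, count});
            }
        }
    }
    return {segments, substringCalls};
}

const naiveResult = naiveFindAll('Hey, Hello', new Set(['Hello', 'Hey']));
console.log(naiveResult.segments);
console.log(naiveResult.substringCalls);
```

For a sentence of n characters this performs n(n+1)/2 substring extractions, which is where both the computation and the garbage come from.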

Dictionary Tree Trie-tree

The dictionary tree (Trie-tree) algorithm itself is simple and easy to understand; a basic implementation takes about 100 lines in any programming language.

There is also a heavily optimized implementation; its homepage links to the author's Blog Park address and optimization notes.

Toolgood/toolgood.words

For more in-depth reading, please go to

How to overcome the shortcomings of the dictionary tree (Trie trees)?

This article aims not only to detect Emoji and keywords but also to locate and replace them, so the implementation starts from scratch.

JavaScript Version Implementation

Given the verbosity of statically typed languages, the following uses the more expressive JavaScript version as the example, with irrelevant parts removed; the source code can be found at Github.com/jusfr/chuye.character.

The following implementation uses the ECMAScript 6 Symbol syntax; see Symbol in the MDN Web Docs (developer.mozilla.org/zh-cn/). It does not affect reading.

    const COUNT_SYMBOL = Symbol('count');
    const END_SYMBOL = Symbol('End');

    class TrieFilter {
        constructor() {
            this.root = {[COUNT_SYMBOL]: 0};
        }

        // Add a keyword: one node per character, counting how many keywords pass through it
        apply(word) {
            let node = this.root;
            for (let ch of word) {
                let child = node[ch];
                if (child) {
                    child[COUNT_SYMBOL] += 1;
                } else {
                    node[ch] = child = {[COUNT_SYMBOL]: 1};
                }
                node = child;
            }
            node[END_SYMBOL] = true;
        }

        // Follow the sentence from its first character as far as the trie allows
        findFirst(sentence) {
            let node = this.root;
            let sequence = [];
            for (let ch of sentence) {
                let child = node[ch];
                if (!child) {
                    break;
                }
                sequence.push(ch);
                node = child;
            }
            if (node[END_SYMBOL]) {
                return sequence.join('');
            }
        }

        // Note: indexes the sentence by UTF-16 code unit, whereas apply() iterates by code point
        findAll(sentence) {
            let offset = 0;
            let segments = [];
            while (offset < sentence.length) {
                let child = this.root[sentence[offset]];
                if (!child) {
                    offset += 1;
                    continue;
                }
                if (child[END_SYMBOL]) {
                    segments.push({offset: offset, count: 1});
                }
                let count = 1;
                let proceeded = 1;
                while (child && offset + count < sentence.length) {
                    child = child[sentence[offset + count]];
                    if (!child) {
                        break;
                    }
                    count += 1;
                    if (child[END_SYMBOL]) {
                        proceeded = count;
                        segments.push({offset: offset, count: count});
                    }
                }
                offset += proceeded;
            }
            return segments;
        }
    }

    module.exports = TrieFilter;

Including blank lines, the source is only 87 lines long and exposes just 3 methods:

apply(word): adds the keyword word
findFirst(sentence): retrieves the first match in the sentence
findAll(sentence): finds all matches in the sentence

Usage example

Index the keywords Hello, Hey and He, then search the sentence 'Hey guys, we know "Hello World" is the beginning of all programming languages'

    const assert     = require('assert');
    const base64     = require('../src/base64');
    const TrieFilter = require('../src/TrieFilter');

    describe('TrieFilter', function () {
        it('feature', function () {
            let trie  = new TrieFilter();
            let words = ['Hello', 'Hey', 'He'];
            words.forEach(x => trie.apply(x));

            let findFirst = trie.findFirst('Hello world');
            console.log('findFirst: %s', findFirst);

            let sentence = 'Hey guys, we know "Hello World" is the beginning of all programming languages';
            let findAll  = trie.findAll(sentence);
            console.log('findAll:\noffset\tcount\tsubString');
            for (let {offset, count} of findAll) {
                console.log('%s\t%s\t%s', offset, count, sentence.substr(offset, count));
            }
        });
    })

Output results

    $ mocha .
    findFirst: Hello
    findAll:
    offset  count   subString
    0       2       He
    0       3       Hey
    19      2       He
    19      5       Hello

The double traversal used in the source code is the optimized version, which is described later.

Once the TrieFilter implementation is more complete, for example by declaring a node type that keeps a reference to its parent node, features such as keyword removal become possible. When every indexed phrase is an Emoji, retrieving the Emoji in user input is a cinch.
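As a rough sketch of the parent-reference idea, the following shows one way keyword removal could work. TrieNode and RemovableTrie are hypothetical names invented for this illustration and are not taken from the project source; a Map is assumed for the children.

```javascript
// Hypothetical node type: each node remembers its parent and its own key.
class TrieNode {
    constructor(key, parent) {
        this.key = key;
        this.parent = parent;
        this.children = new Map();
        this.isEnd = false;
    }
}

class RemovableTrie {
    constructor() {
        this.root = new TrieNode(null, null);
    }

    apply(word) {
        let node = this.root;
        for (const ch of word) {
            let child = node.children.get(ch);
            if (!child) {
                child = new TrieNode(ch, node);
                node.children.set(ch, child);
            }
            node = child;
        }
        node.isEnd = true;
    }

    // Walk down the trie; returns undefined when the path does not exist.
    locate(word) {
        let node = this.root;
        for (const ch of word) {
            node = node.children.get(ch);
            if (!node) {
                return undefined;
            }
        }
        return node;
    }

    contains(word) {
        const node = this.locate(word);
        return node !== undefined && node.isEnd;
    }

    remove(word) {
        let node = this.locate(word);
        if (!node || !node.isEnd) {
            return false;
        }
        node.isEnd = false;
        // Walk back up via parent references, pruning nodes that no longer lead to any keyword.
        while (node.parent && !node.isEnd && node.children.size === 0) {
            node.parent.children.delete(node.key);
            node = node.parent;
        }
        return true;
    }
}

const removable = new RemovableTrie();
['Hello', 'Hey', 'He'].forEach(x => removable.apply(x));
console.log(removable.remove('Hey'), removable.contains('Hey'), removable.contains('He'));
```

Pruning stops at the first ancestor that either ends another keyword or still has other children, so removing Hey leaves He and Hello intact.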

C# Implementation

The C# implementation is a bit verbose. The author first implemented a generic node and tree (Github.com/jusfr/chuye.character), later found them difficult to optimize, and finally adopted a simplified version based on Char.

    class CharTrieNode {
        private Dictionary<Char, CharTrieNode> _children;

        public Char Key { get; private set; }
        internal Boolean IsTail { get; set; }

        public CharTrieNode this[Char key] {
            get {
                if (_children == null) {
                    return null;
                }
                CharTrieNode child;
                if (!_children.TryGetValue(key, out child)) {
                    return null;
                }
                return child;
            }
            set {
                _children[key] = value;
            }
        }

        public Int32 Count {
            get {
                if (_children == null) {
                    return 0;
                }
                return _children.Count;
            }
        }

        public CharTrieNode(Char key) {
            Key = key;
        }

        // Return the child for key, creating it (and the dictionary) on first use
        public CharTrieNode Apppend(Char key) {
            CharTrieNode child;
            if (_children == null) {
                _children = new Dictionary<Char, CharTrieNode>();
                child = new CharTrieNode(key);
                _children[key] = child;
                return child;
            }
            if (!_children.TryGetValue(key, out child)) {
                child = new CharTrieNode(key);
                _children[key] = child;
            }
            return child;
        }

        public Boolean TryGetValue(Char key, out CharTrieNode child) {
            child = null;
            if (_children == null) {
                return false;
            }
            return _children.TryGetValue(key, out child);
        }
    }

    public interface IPhraseContainer {
        void Apply(String phrase);
        Boolean Contains(String phrase);
        Boolean Contains(String phrase, Int32 offset, Int32 length);
    }

For comparison with the hash-based implementation, IPhraseContainer is defined as the data entry point. The Trie-tree based implementation, CharTriePhraseContainer, works like the JavaScript version, while the Apply() method of the hash-based HashPhraseContainer internally maintains and operates on a HashSet<String>.

The high-level API is provided by PhraseFilter, which internally depends on an IPhraseContainer implementation.

Since the test results are already in, the hash-based implementation will be removed later to reduce code redundancy.

Inside PhraseFilter, the retrieval methods are as follows. Note that ClassicSearchAll(), the optimized double traversal, does not differ materially from the JavaScript version. The traversal is dispatched through the SearchAll() method defined by IPhraseFilter, because the Trie-tree lookup in CharTriePhraseContainer only needs to traverse once.

    public IEnumerable<ArraySegment<Char>> SearchAll(String phrase) {
        var container = _container as CharTriePhraseContainer;
        if (container != null) {
            return container.SearchAll(phrase);
        }
        return ClassicSearchAll(phrase);
    }

    public IEnumerable<ArraySegment<Char>> ClassicSearchAll(String phrase) {
        if (phrase == null) {
            throw new ArgumentNullException(nameof(phrase));
        }
        var chars = phrase.ToCharArray();
        var offset = 0;
        while (offset < phrase.Length) {
            // Set the clause length and the value by which offset will later advance
            var count = 1;
            var proceeded = 1;
            // Check whether the characters following offset form a keyword
            while (offset + count <= phrase.Length) {
                // Fast assertion
                if (_assertors.Count == 0 || _assertors.All(x => x.Contains(phrase, offset, count))) {
                    // Check whether the clause exists; _container may be based on HashSet etc.
                    if (_container.Contains(phrase, offset, count)) {
                        // Record the offset advance value
                        proceeded = count;
                        yield return new ArraySegment<Char>(chars, offset, count);
                    }
                }
                count += 1;
            }
            // Advance the offset position
            offset += proceeded;
        }
    }

The Trie-tree lookup is the process of matching the input sentence against CharTrieNode nodes.

    public IEnumerable<ArraySegment<Char>> SearchAll(String phrase) {
        if (phrase == null) {
            throw new ArgumentNullException(nameof(phrase));
        }
        var chars = phrase.ToCharArray();
        var offset = 0;
        while (offset < phrase.Length) {
            var current = _root[phrase[offset]];
            if (current == null) {
                // Advance the offset position
                offset += 1;
                continue;
            }
            // If this node is a tail, a single character matches a keyword
            if (current.IsTail) {
                yield return new ArraySegment<Char>(chars, offset, 1);
            }
            // Set the clause length and the value by which offset will later advance
            var count = 1;
            var proceeded = 1;
            // Check whether the characters following offset form a keyword
            while (current != null && offset + count < phrase.Length) {
                current = current[phrase[offset + count]];
                if (current == null) {
                    break;
                }
                count += 1;
                if (current.IsTail) {
                    // Record how far the offset has advanced
                    proceeded = count;
                    yield return new ArraySegment<Char>(chars, offset, proceeded);
                }
            }
            // Advance the offset position
            offset += proceeded;
        }
    }

Because there is neither a double traversal nor any SubString() call, both performance and overhead improve over the hash-based or regular approach.

Usage example

The project source has been packaged and published to NuGet

PM> Install-Package chuye.triefilter

For emoji retrieval, you need to prepare a list of emoji or get it from chuye-emoji.txt.

    var filter = new PhraseFilter();
    var filename = Path.Combine(Directory.GetCurrentDirectory(), "chuye-emoji.txt");
    filter.ApplyFile(filename);

    var clause = @"颠簸了三小时飞机️两小时公交地铁四小时大巴一小时 终于到了我们的目的地像面粉一样的沙滩和碧绿的大海 这就是我们第一次旅行的地方in沙美岛";
    var segments = filter.SearchAll(clause).ToArray();
    var searched = new SearchResult(clause, segments);
    // Replace every matched segment with asterisks of the same length
    var replaced = searched.Replace(x => new String('*', x.Length));
    var comparison = "颠簸了三小时飞机*️*两小时公交地铁***四小时大巴*一小时** 终于到了我们的目的地像面粉一样的沙滩和碧绿的大海 这就是我们第一次旅行的地方in沙美岛**";
    Assert.Equal(comparison, replaced);

The chuye-emoji.txt file was compiled by the author from the Unicode website.

Retrieving keywords or sensitive words works exactly the same way; prepare your own word list, which is not discussed further here. The Dump() method used in the following code is LINQPad's shortcut for output.

    var filter = new PhraseFilter();
    filter.Apply("Hello", "Hey");

    var sentence = "Hey guys, we know \"Hello World\" is the beginning of all programming languages";
    var searched = filter.SearchAll(sentence).ToArray();
    searched.Select(x => new { x.Offset, x.Count, Substring = sentence.Substring(x.Offset, x.Count) }).Dump("Searched");
    new SearchResult(sentence, searched).Replace(x => new String('*', x.Length)).Dump("Replaced");

An example of implementing your own IPhraseProvider and integrating it with Autofac

    class EmojiPhraseProvider : IPhraseProvider {
        private readonly IEmojiRepository _emojiRepository;

        public EmojiPhraseProvider(IEmojiRepository emojiRepository) {
            _emojiRepository = emojiRepository;
        }

        public IEnumerable<String> Fetch() {
            var values = _emojiRepository.GetValues();
            return values.Select(x => x.value);
        }
    }

    public class EmojiFinderModule : Module {
        protected override void Load(ContainerBuilder builder) {
            builder.RegisterType<EmojiPhraseProvider>().As<IPhraseProvider>();
            builder.RegisterType<PhraseFilter>().OnActivated(OnPhraseFilterActivated).As<IPhraseFilter>().SingleInstance();
            base.Load(builder);
        }

        // Load the phrases into the filter once it has been activated
        private void OnPhraseFilterActivated(IActivatedEventArgs<PhraseFilter> obj) {
            var provider = obj.Context.Resolve<IPhraseProvider>();
            obj.Instance.Apply(provider);
        }
    }

Performance testing

100,000 loop iterations

    trieFilter.SearchAll
        Time Elapsed : 65ms
        CPU Cycles   : 174,521,817
        Memory cost  : 1,192
        Gen 0        : 7
        Gen 1        : 2
        Gen 2        : 2
    hashFilter.SearchAll
        Time Elapsed : 627ms
        CPU Cycles   : 1,694,437,899
        Memory cost  : 2,440
        Gen 0        : 137
        Gen 1        : 2
        Gen 2        : 2

JavaScript version

    $ node trieFilter.js
    Show pretty:
    depth 00 count 002: │H
    depth 01 count 002: │─e
    depth 02 count 001: │──l
    depth 03 count 001: │───l
    depth 04 count 001: └────o
    depth 02 count 001: └──y
    findFirst: Hello
    findAll:
    offset  count   subString
    0       3       Hey
    19      5       Hello
    marky: loop 100000 times
    [ { startTime: 5.214011,
        name: 'findAll',
        duration: 180.187891,

Optimization methods

The performance bottleneck of the classic lookup comes from word segmentation based on double traversal, and the large number of clause extractions puts pressure on garbage collection.

Optimization I: linearizing the "word breaker"

The basic idea is that, during the first (outer) traversal, once the second (inner) traversal finds a match, the end position of the current clause becomes the starting point of the next outer traversal.

This method reduces the number of SubString() calls in the same proportion as it reduces computation, but substring extraction itself remains unavoidable.

    public IEnumerable<ArraySegment<Char>> SearchAll(String phrase) {
        if (phrase == null) {
            throw new ArgumentNullException(nameof(phrase));
        }
        var chars = phrase.ToCharArray();
        var offset = 0;
        while (offset < phrase.Length) {
            // Set the clause length and the value by which offset will later advance
            var count = 1;
            var proceeded = 1;
            // Check whether the characters following offset form a keyword
            while (offset + count <= phrase.Length) {
                // A quick assertion can be added here to defer clause slicing
                var clause = phrase.Substring(offset, count);
                // Check whether the clause exists; _container may be based on HashSet etc.,
                // and whether phrase.Substring(offset, count) is called depends on its implementation
                if (_container.Contains(phrase, offset, count)) {
                    // Record the offset advance value
                    proceeded = count;
                    yield return new ArraySegment<Char>(chars, offset, count);
                }
                count += 1;
            }
            // Advance the offset position
            offset += proceeded;
        }
    }

Optimization II: quick judgment of clauses after "word breaking"

Optimization based on clause length

Use a single integer to store the combination of all keyword lengths. For example, when the keywords Hi and Hello are initialized, the length combination is computed as follows:

Initial value: 0
Bit arithmetic: 0 | 1 << len('Hi') gives a length combination of 4
Bit arithmetic: 4 | 1 << len('Hello') gives a length combination of 36

Lookup examples

Find 'Hey'
- 36 & (1 << len('Hey')) = 0, so the lookup ends
Find 'Hell'
- 36 & (1 << len('Hell')) = 0, so the lookup ends
Find 'Hello'
- 36 & (1 << len('Hello')) = 32, so the lookup continues with the subsequent steps
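The length combination above can be sketched in a few lines of JavaScript. LengthAssertor is a hypothetical name chosen for this illustration and is not the project's actual assertor type; the sketch only reproduces the arithmetic worked through above.

```javascript
// Hypothetical helper: bit n of `mask` is set when some keyword has length n.
class LengthAssertor {
    constructor() {
        this.mask = 0;
    }

    apply(word) {
        this.mask |= 1 << word.length;
    }

    // false means "no keyword has this length"; true means "worth a real lookup".
    mayContain(clause) {
        return (this.mask & (1 << clause.length)) !== 0;
    }
}

const lengths = new LengthAssertor();
lengths.apply('Hi');
lengths.apply('Hello');
console.log(lengths.mask);
console.log(lengths.mayContain('Hey'), lengths.mayContain('Hell'), lengths.mayContain('Hello'));
```

JavaScript shifts use only the low 5 bits of the shift count, so lengths of 32 or more wrap around. Because keywords and clauses wrap the same way, this can only produce extra positives, never a missed keyword.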

Optimization based on the character positions in clauses

Use an integer array of length char.MaxValue to store, for each character, the combination of all positions at which it occurs in any keyword. For example, when the keywords Hi and Hello are initialized, the array is computed as follows:

Add 'Hi'

array['H'] = 1 << 0 = 1
array['i'] = 1 << 1 = 2

Add 'Hello'

array['H'] |= 1 << 0 = 1
array['e'] |= 1 << 1 = 2
array['l'] |= 1 << 2 = 4
array['l'] |= 1 << 3 = 4 | 8 = 12
array['o'] |= 1 << 4 = 16

The final character combination is

'H': 1
'i': 2
'e': 2
'l': 12
'o': 16

Lookup examples, ignoring the preceding length check

Find 'Hey'
- Compare 'H': array['H'] = 1 means 'H' occurs at index 0, so the comparison passes
- Compare 'e': array['e'] = 2 means 'e' occurs at index 1, so the comparison passes
- Compare 'y': array['y'] = 0 means 'y' does not occur at all, so the comparison fails and the lookup fails

If the lookup is 'Helll', the 5th character is 'l', and array['l'] = 12 means 'l' occurs only at index 2 or 3, so the comparison fails.
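The per-character position masks can be sketched in JavaScript as follows. PositionAssertor is a hypothetical name for this illustration, and a Map is substituted for the fixed-size integer array described above; the bit arithmetic is the same.

```javascript
// Hypothetical helper: for each character, bit i is set when the character
// occurs at index i of at least one keyword.
class PositionAssertor {
    constructor() {
        this.masks = new Map(); // assumption: a Map instead of an array of length char.MaxValue
    }

    apply(word) {
        for (let i = 0; i < word.length; i++) {
            const ch = word[i];
            this.masks.set(ch, (this.masks.get(ch) || 0) | (1 << i));
        }
    }

    // false means some character never occurs at that index in any keyword.
    mayContain(clause) {
        for (let i = 0; i < clause.length; i++) {
            const mask = this.masks.get(clause[i]) || 0;
            if ((mask & (1 << i)) === 0) {
                return false;
            }
        }
        return true;
    }
}

const positions = new PositionAssertor();
positions.apply('Hi');
positions.apply('Hello');
console.log(positions.masks.get('l'));
console.log(positions.mayContain('Hey'), positions.mayContain('Helll'), positions.mayContain('Hello'));
console.log(positions.mayContain('Hillo'));
```

The last call returns true even though Hillo is not a keyword: the masks mix positions from different keywords, so this check can only rule clauses out and must still be followed by a real lookup.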

Across multiple performance comparisons, the quick clause checks turned out to be very unstable and sometimes even slowed the lookup down. This may be related to the test samples; no further testing was done.

Because a Trie-tree lookup is a step-by-step approximation process, the length optimization can only degenerate into a "not greater than the maximum length" check.

Performance comparison

    # Traditional double traversal
    hashFilter.SearchAll
        Time Elapsed : 3,613ms
        CPU Cycles   : 9,758,423,828
        Memory cost  : 19,176
        Gen 0        : 880
        Gen 1        : 2
        Gen 2        : 2
    # Optimized traversal
    hashFilter.SearchAll
        Time Elapsed : 1,310ms
        CPU Cycles   : 3,538,391,198
        Memory cost  : 14,696
        Gen 0        : 440
        Gen 1        : 2
        Gen 2        : 2
    # Trie lookup
    trieFilter.SearchAll
        Time Elapsed : 63ms
        CPU Cycles   : 171,441,680
        Memory cost  : 1,192
        Gen 0        : 7
        Gen 1        : 2
        Gen 2        : 2

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
