Efficient keyword filtering and search algorithms (trie beats hash)

Source: Internet
Author: User

In practice, chat filtering can end up consuming a considerable (sometimes astonishing) amount of resources, both in development effort and in raw computing power. Some well-known SNS games spend more than 50% of their computing resources on chat filtering alone, so optimizing this part is particularly important.

Recently, our game server needed chat filtering. My first thought was to use a HashSet<string>.

The basic idea is to store all the keywords to be filtered in a HashSet<string>.

The user's message is then split by brute force into every possible substring.

For example, take the message "hello".

Split into single characters: h/e/l/l/o

Split into two-character pieces: he/el/ll/lo

Split into three-character pieces: hel/ell/llo

...and so on.

For long strings, we cap the split length at the length of the longest keyword to be filtered.

Each resulting substring is then looked up in the set.
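As a rough illustration of the splitting step on its own, the sketch below lists every substring up to a maximum length. The `Segmenter` class and its `Split` method are hypothetical names introduced here for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;

public static class Segmenter
{
    // Hypothetical helper: lists every substring of `text` whose length is
    // between 1 and maxLen, shortest lengths first.
    public static List<string> Split(string text, int maxLen)
    {
        var pieces = new List<string>();
        int limit = Math.Min(maxLen, text.Length);
        for (int len = 1; len <= limit; len++)
        {
            for (int index = 0; index + len <= text.Length; index++)
            {
                pieces.Add(text.Substring(index, len));
            }
        }
        return pieces;
    }

    public static void Main()
    {
        // With maxLen = 3, "hello" yields 5 + 4 + 3 = 12 candidates.
        Console.WriteLine(string.Join("/", Split("hello", 3)));
        // prints: h/e/l/l/o/he/el/ll/lo/hel/ell/llo
    }
}
```

The number of candidates grows with both the message length and the maximum keyword length, which is why the cap matters for long messages.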

 


using System;
using System.Collections.Generic;

public class HashFilter
{
    int m_maxlen; // length of the longest keyword
    HashSet<string> m_keys = new HashSet<string>();

    /// <summary>
    /// Inserts a new keyword.
    /// </summary>
    /// <param name="key">the keyword to add</param>
    public void AddKey(string key)
    {
        if (!string.IsNullOrEmpty(key) && m_keys.Add(key) && key.Length > m_maxlen)
        {
            m_maxlen = key.Length;
        }
    }

    /// <summary>
    /// Searches the text for a banned keyword.
    /// </summary>
    /// <param name="text">input text</param>
    /// <returns>the first banned keyword found; string.Empty if there is none.</returns>
    public string FindOne(string text)
    {
        // Cap the substring length at the longest keyword; longer pieces cannot match.
        int maxLen = Math.Min(m_maxlen, text.Length);
        for (int len = 1; len <= maxLen; len++)
        {
            int maxIndex = text.Length - len;
            for (int index = 0; index <= maxIndex; index++)
            {
                string key = text.Substring(index, len);
                if (m_keys.Contains(key))
                {
                    return key;
                }
            }
        }
        return string.Empty;
    }
}

The drawback of this method is that string.Substring creates a large number of temporary objects.

Even when the split length is capped at the maximum keyword length, it is still inefficient when the string to be filtered is long.

A trie, also known as a word search tree or key tree, is a tree structure that is sometimes described as a variant of the hash tree. Its advantage is that it minimizes unnecessary string comparisons, which makes this kind of lookup more efficient than with a hash table.

Let's look at an example of a trie structure.

This trie stores eight strings: A, to, tea, ted, ten, i, in, and inn.
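To make the shape of that example concrete, here is a minimal sketch that inserts the same eight strings and shows how shared prefixes reuse nodes. `TrieNode`, `IsWord`, and `CountNodes` are hypothetical names chosen for this illustration; the sketch marks word endings with a boolean flag, which is an assumption made here and differs from the article's own implementation that follows.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal trie, used only to illustrate the structure.
public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord; // true if a stored string ends at this node

    public void Insert(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children.Add(c, child);
            }
            node = child;
        }
        node.IsWord = true;
    }

    public bool Contains(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
            {
                return false;
            }
        }
        return node.IsWord;
    }

    // Counts this node plus all of its descendants.
    public int CountNodes()
    {
        int count = 1;
        foreach (TrieNode child in Children.Values)
        {
            count += child.CountNodes();
        }
        return count;
    }
}

public static class TrieDemo
{
    public static void Main()
    {
        var root = new TrieNode();
        foreach (string w in new[] { "A", "to", "tea", "ted", "ten", "i", "in", "inn" })
        {
            root.Insert(w);
        }
        Console.WriteLine(root.Contains("tea")); // True
        Console.WriteLine(root.Contains("te"));  // False: "te" is only a prefix
        Console.WriteLine(root.CountNodes());    // 11: the root plus 10 character nodes
    }
}
```

The eight strings contain 18 characters in total, yet only 10 character nodes are needed because "to", "tea", "ted", and "ten" share the "t" node and "i", "in", and "inn" share a single path.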

First, let's look at how we implement this structure:

using System.Collections.Generic;

public class TrieFilter
{
    private char m_key;
    private Dictionary<char, TrieFilter> m_values;

    // Root node, which holds no character (m_key = '\0').
    public TrieFilter()
    {
        m_values = new Dictionary<char, TrieFilter>();
    }

    // Used to create a child node.
    TrieFilter(char key)
    {
        m_key = key;
        m_values = new Dictionary<char, TrieFilter>();
    }

    /// <summary>
    /// Adds a keyword.
    /// </summary>
    /// <param name="key">the keyword to add</param>
    public void AddKey(string key)
    {
        if (string.IsNullOrEmpty(key))
        {
            return;
        }
        TrieFilter node = this;
        foreach (var c in key)
        {
            TrieFilter subnode;
            if (!node.m_values.TryGetValue(c, out subnode))
            {
                subnode = new TrieFilter(c);
                node.m_values.Add(c, subnode);
            }
            node = subnode;
        }
        // The last node marks the end of a word with a '\0' entry that maps to null.
        node.m_values['\0'] = null;
    }
}

With that, a C# trie structure is complete.

Next, let's look at how to implement keyword search. The following method belongs inside the TrieFilter class:

    /// <summary>
    /// Searches the text for a banned keyword.
    /// </summary>
    /// <param name="text">input text</param>
    /// <returns>the first banned keyword found; string.Empty if there is none.</returns>
    public string FindOne(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // Walk down from the root, starting at position i.
            TrieFilter node = this;
            for (int j = i; j < text.Length; j++)
            {
                if (!node.m_values.TryGetValue(text[j], out node))
                {
                    break;
                }
                // Check for the end marker after every step, including the first,
                // so that single-character keywords are matched as well.
                if (node.m_values.ContainsKey('\0'))
                {
                    return text.Substring(i, j + 1 - i);
                }
            }
        }
        return string.Empty;
    }

Simple, isn't it?

Switching from brute-force matching to trie-based matching greatly reduces execution time, and the gain is more pronounced when matching long strings.

For a string of about 20 characters, trie matching was nearly 10 times faster than the hash approach.
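For readers who want to check a comparison like this on their own machine, the sketch below times both approaches with Stopwatch. It is not the author's test: the `FilterBenchmark` class, the three keywords, the message, and the round count are all made up for illustration, and the trie here uses a boolean end flag rather than the '\0' marker. Timings depend heavily on hardware, runtime, keyword list, and message content, so no expected figures are given.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class FilterBenchmark
{
    // Trie node for this sketch; End marks the end of a keyword.
    public sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public bool End;
    }

    public static Node BuildTrie(IEnumerable<string> keys)
    {
        var root = new Node();
        foreach (string key in keys)
        {
            Node node = root;
            foreach (char c in key)
            {
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next.Add(c, child);
                }
                node = child;
            }
            node.End = true;
        }
        return root;
    }

    // Trie scan: walk the trie from every start position.
    public static string FindWithTrie(Node root, string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            Node node = root;
            for (int j = i; j < text.Length; j++)
            {
                if (!node.Next.TryGetValue(text[j], out node)) break;
                if (node.End) return text.Substring(i, j + 1 - i);
            }
        }
        return string.Empty;
    }

    // Hash scan: test every substring of length 1..maxLen against the set.
    public static string FindWithHash(HashSet<string> keys, int maxLen, string text)
    {
        int limit = Math.Min(maxLen, text.Length);
        for (int len = 1; len <= limit; len++)
        {
            for (int i = 0; i + len <= text.Length; i++)
            {
                string piece = text.Substring(i, len);
                if (keys.Contains(piece)) return piece;
            }
        }
        return string.Empty;
    }

    public static void Main()
    {
        // Made-up keywords and message, purely for illustration.
        var words = new List<string> { "spam", "scam", "phish" };
        string text = "this is a clean message";
        var set = new HashSet<string>(words);
        int maxLen = 5; // length of the longest keyword above
        Node trie = BuildTrie(words);

        const int rounds = 100000;
        var sw = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++) FindWithHash(set, maxLen, text);
        Console.WriteLine("hash: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        for (int r = 0; r < rounds; r++) FindWithTrie(trie, text);
        Console.WriteLine("trie: " + sw.ElapsedMilliseconds + " ms");
    }
}
```

A clean message is used on purpose: it forces both scans to examine the whole input, which is the common case in chat and the worst case for the hash scan. Note that when a message contains several keywords, the two scans may report different ones, since the hash scan finds the shortest match first and the trie scan finds the match that starts earliest.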

The complete code is available for download below.

The file badword.txt contains more than 7,000 sensitive keywords.

Complete code and performance comparison test download: http://files.cnblogs.com/yeerh/FilterTest.rar

I hope this article helps. Criticism is welcome.
