An Improved ASP.NET Dirty-Word Filtering Algorithm

Source: Internet
Author: User

The old algorithm (http://www.jb51.net/article/20575.htm) is simple: it calls String.Replace once for every dirty word, using a StringBuilder of course. In my tests, Regex was faster (about 1.6 times in the run reported below). However, I was still not satisfied, because our website needs to do a lot of dirty-word filtering. After some thought, I wrote an algorithm of my own. I tested it on my machine with the dirty-word list from the original article, an input string of length 0x19c (412 characters), and 1000 loops: plain text search took 1933.47 ms, Regex took 1216.719 ms, and my algorithm took only 34.125 ms.

The key to the algorithm is trading space for time: it uses two global BitArrays with one bit for each possible char value. One BitArray records whether any dirty word starts with a given char, and the other records whether a given char occurs anywhere in any dirty word. These two BitArrays allow a quick rejection of most positions. Only then is a hash lookup used to check for a complete dirty word, and the maximum dirty-word length is used to bound the inner traversal.

The required variables are as follows:

    // Requires: using System.Collections; using System.Collections.Generic;
    private Dictionary<string, object> hash = new Dictionary<string, object>();
    // char.MaxValue + 1 bits, so that '\uFFFF' is also a valid index.
    private BitArray firstCharCheck = new BitArray(char.MaxValue + 1);
    private BitArray allCharCheck = new BitArray(char.MaxValue + 1);
    private int maxLength = 0;

Here, hash only uses its keys; every value is set to null. In .NET 3.5 you can use a HashSet instead, or use a Dictionary<string, int> to record how many times each dirty word appears.

The data is initialized as follows:

    // Requires: using System; badWords is the collection of dirty words.
    foreach (string word in badWords)
    {
        if (!hash.ContainsKey(word))
        {
            hash.Add(word, null);
            maxLength = Math.Max(maxLength, word.Length);
            firstCharCheck[word[0]] = true;   // some dirty word starts with this char

            foreach (char c in word)
            {
                allCharCheck[c] = true;       // this char occurs in some dirty word
            }
        }
    }

The code used to determine whether a dirty word appears in a string is as follows:

    // Body of a method that returns bool; text is the input string.
    int index = 0;
    while (index < text.Length)
    {
        // Skip ahead to the next char that can start a dirty word.
        if (!firstCharCheck[text[index]])
        {
            while (index < text.Length - 1 && !firstCharCheck[text[++index]]) ;
        }

        // Try every candidate length up to the longest dirty word.
        for (int j = 1; j <= Math.Min(maxLength, text.Length - index); j++)
        {
            // A char that occurs in no dirty word rules out all longer candidates.
            if (!allCharCheck[text[index + j - 1]])
            {
                break;
            }

            string sub = text.Substring(index, j);

            if (hash.ContainsKey(sub))
            {
                return true;
            }
        }

        index++;
    }

    return false;
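
Putting the three fragments above together, a minimal self-contained sketch might look like the following. The class name BadWordFilter, the method name HasBadWord, and the constructor that takes the word list are assumptions made for illustration; they do not come from the original code. The sketch also uses a HashSet in place of the Dictionary and skips empty entries in the word list.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical wrapper class; the names BadWordFilter and HasBadWord are
// illustrative and do not come from the original article.
public class BadWordFilter
{
    private readonly HashSet<string> words = new HashSet<string>();
    // char.MaxValue + 1 bits so that '\uFFFF' is a valid index too.
    private readonly BitArray firstCharCheck = new BitArray(char.MaxValue + 1);
    private readonly BitArray allCharCheck = new BitArray(char.MaxValue + 1);
    private int maxLength = 0;

    public BadWordFilter(IEnumerable<string> badWords)
    {
        foreach (string word in badWords)
        {
            // Skip empty entries and duplicates (Add returns false for duplicates).
            if (string.IsNullOrEmpty(word) || !words.Add(word))
            {
                continue;
            }
            maxLength = Math.Max(maxLength, word.Length);
            firstCharCheck[word[0]] = true;
            foreach (char c in word)
            {
                allCharCheck[c] = true;
            }
        }
    }

    public bool HasBadWord(string text)
    {
        for (int index = 0; index < text.Length; index++)
        {
            // Quick reject: no dirty word starts with this char.
            if (!firstCharCheck[text[index]])
            {
                continue;
            }
            int limit = Math.Min(maxLength, text.Length - index);
            for (int j = 1; j <= limit; j++)
            {
                // A char that occurs in no dirty word ends every candidate here.
                if (!allCharCheck[text[index + j - 1]])
                {
                    break;
                }
                if (words.Contains(text.Substring(index, j)))
                {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Under these assumptions, a filter built from the words "bad" and "ugly" reports a match for "this is bad" and no match for "this is fine".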

I won't paste the replacement code here. It is similar to the detection code above, except that it cannot exit the loop as soon as a dirty word is found. If dirty words are unlikely to appear, there is no need to create a temporary StringBuilder up front; it can be created only when the first dirty word is found.
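
Since the replacement code is not shown, here is one way it could be sketched. This is not the author's implementation. The class name BadWordReplacer, the method name Replace, the '*' mask character, and the choice to mask the longest match at each position are all assumptions. The StringBuilder is created only after the first hit, as suggested above.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;

// Hypothetical class; names, the '*' mask and longest-match policy are assumptions.
public class BadWordReplacer
{
    private readonly HashSet<string> words = new HashSet<string>();
    private readonly BitArray firstCharCheck = new BitArray(char.MaxValue + 1);
    private readonly BitArray allCharCheck = new BitArray(char.MaxValue + 1);
    private int maxLength = 0;

    public BadWordReplacer(IEnumerable<string> badWords)
    {
        foreach (string word in badWords)
        {
            if (string.IsNullOrEmpty(word) || !words.Add(word))
            {
                continue;
            }
            maxLength = Math.Max(maxLength, word.Length);
            firstCharCheck[word[0]] = true;
            foreach (char c in word)
            {
                allCharCheck[c] = true;
            }
        }
    }

    // Returns text with every matched dirty word masked by '*' characters.
    public string Replace(string text)
    {
        StringBuilder result = null; // created lazily, only after the first hit
        int index = 0;
        while (index < text.Length)
        {
            int matched = 0; // length of the longest dirty word starting at index
            if (firstCharCheck[text[index]])
            {
                int limit = Math.Min(maxLength, text.Length - index);
                for (int j = 1; j <= limit; j++)
                {
                    if (!allCharCheck[text[index + j - 1]])
                    {
                        break;
                    }
                    if (words.Contains(text.Substring(index, j)))
                    {
                        matched = j; // keep going: a longer word may also match
                    }
                }
            }
            if (matched == 0)
            {
                index++;
                continue;
            }
            if (result == null)
            {
                result = new StringBuilder(text);
            }
            for (int k = 0; k < matched; k++)
            {
                result[index + k] = '*';
            }
            index += matched; // resume after the masked word
        }
        // No hit: return the original string without allocating anything.
        return result == null ? text : result.ToString();
    }
}
```

Because each char is masked in place, the output has the same length as the input, and clean input is returned as the original string instance.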

Furthermore, we can borrow the implementation of string.GetHashCode() from the .NET source code and compute the hash directly over the text, which avoids the Substring call for each candidate and improves performance. You can also design an incremental hash code. For example, the hash of "helloworld" can be computed from the hash of "helloworl", which makes the inner loop more efficient.
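
The incremental idea can be sketched as follows. This does not reuse the actual string.GetHashCode() implementation; it swaps in a simple polynomial hash (hash * 31 + c) so that each longer candidate extends the previous hash by one char. The class name IncrementalHashMatcher and the helper Step are hypothetical, and the two BitArray pre-checks are left out to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class illustrating an incremental hash; not the author's code.
public class IncrementalHashMatcher
{
    // Maps a hash value to the dirty words that share it (collisions are possible).
    private readonly Dictionary<int, List<string>> buckets =
        new Dictionary<int, List<string>>();
    private int maxLength = 0;

    public IncrementalHashMatcher(IEnumerable<string> badWords)
    {
        foreach (string word in badWords)
        {
            if (string.IsNullOrEmpty(word))
            {
                continue;
            }
            int h = 0;
            foreach (char c in word)
            {
                h = Step(h, c);
            }
            List<string> bucket;
            if (!buckets.TryGetValue(h, out bucket))
            {
                bucket = new List<string>();
                buckets[h] = bucket;
            }
            if (!bucket.Contains(word))
            {
                bucket.Add(word);
            }
            maxLength = Math.Max(maxLength, word.Length);
        }
    }

    // Extends the hash of a prefix by one more char.
    private static int Step(int hash, char c)
    {
        unchecked { return hash * 31 + c; }
    }

    public bool HasBadWord(string text)
    {
        for (int index = 0; index < text.Length; index++)
        {
            int limit = Math.Min(maxLength, text.Length - index);
            int h = 0;
            for (int j = 1; j <= limit; j++)
            {
                // Hash of text[index .. index + j) derived from the previous prefix.
                h = Step(h, text[index + j - 1]);
                List<string> bucket;
                if (!buckets.TryGetValue(h, out bucket))
                {
                    continue;
                }
                // Confirm in place, without allocating a substring.
                foreach (string word in bucket)
                {
                    if (word.Length == j &&
                        string.CompareOrdinal(text, index, word, 0, j) == 0)
                    {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
```

The in-place comparison on a hash hit is what guards against collisions; without it, two different strings with the same hash would produce a false positive.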

Alternatively, you can discard the hash, keep the dirty words in a sorted string[], and use BinarySearch to determine whether sub is a dirty word. The BinarySearch results can also be used incrementally: the result for "helloworl" can speed up the check for "helloworld". (In a test with 700 dirty words, BinarySearch was sometimes much less efficient.)

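
One possible reading of the incremental binary search is sketched below. Instead of calling Array.BinarySearch on each substring, it keeps the range of sorted words that share the current prefix and narrows that range by one char at a time with hand-written lower-bound searches. The class name SortedArrayMatcher and the helpers KeyAt and LowerBound are hypothetical, and ordinal sorting is an assumption the narrowing relies on.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class illustrating prefix-range narrowing over a sorted array.
public class SortedArrayMatcher
{
    private readonly string[] words; // sorted with ordinal comparison

    public SortedArrayMatcher(IEnumerable<string> badWords)
    {
        List<string> list = new List<string>();
        foreach (string w in badWords)
        {
            if (!string.IsNullOrEmpty(w))
            {
                list.Add(w);
            }
        }
        words = list.ToArray();
        Array.Sort(words, StringComparer.Ordinal);
    }

    // Char of word at position pos, or -1 when the word is too short.
    private static int KeyAt(string word, int pos)
    {
        return pos < word.Length ? word[pos] : -1;
    }

    // First index in [lo, hi) whose key at pos is >= target.
    private int LowerBound(int lo, int hi, int pos, int target)
    {
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (KeyAt(words[mid], pos) < target)
            {
                lo = mid + 1;
            }
            else
            {
                hi = mid;
            }
        }
        return lo;
    }

    public bool HasBadWord(string text)
    {
        for (int index = 0; index < text.Length; index++)
        {
            // [lo, hi) holds the words that match text[index .. index + pos).
            int lo = 0;
            int hi = words.Length;
            for (int pos = 0; index + pos < text.Length && lo < hi; pos++)
            {
                int c = text[index + pos];
                int newLo = LowerBound(lo, hi, pos, c);
                hi = LowerBound(newLo, hi, pos, c + 1);
                lo = newLo;
                // In ordinal order, a word equal to the prefix is first in range.
                if (lo < hi && words[lo].Length == pos + 1)
                {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Each extra char costs two binary searches over an already narrowed range, and the inner loop stops as soon as the range is empty, which plays a role similar to the allCharCheck test in the hash version.
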
One last complaint. The first time I wrote about this (http://www.jb51.net/article/20576.htm), I only wanted to illustrate my algorithm, and the code even contained a small error. Two things then bothered me. First, the post was reposted by numerous low-quality websites without citing the source, so my later improvements and bug fixes never reached those copies. Second, some people only want to see the final code, rather than understanding the core design I was trying to convey and then working out an implementation themselves.
