ASP.NET dirty word filtering algorithm, improved version

Source: Internet
Author: User
The old algorithm (http://www.jb51.net/article/20575.htm) simply called string.Replace once for every dirty word, using a StringBuilder of course. In my tests a regex was faster, by up to about a factor of two. I was still not satisfied, because our website uses dirty word filtering heavily, so after some thought I wrote my own algorithm. Tested on my own machine with the dirty word list from the original article, an input string of length 0x19c (412 characters), and 1000 iterations, the plain text lookup took 1933.47 ms, the regex took 1216.719 ms, and my algorithm took only 34.125 ms.

The key to the algorithm is trading space for time. It uses two global BitArray instances indexed by char value (the original sizes them as char.MaxValue; char.MaxValue + 1 is needed for '\uffff' itself to be a valid index). One BitArray records whether any dirty word begins with a given char, and the other records whether any dirty word contains a given char. These two BitArrays allow most positions to be rejected quickly. A hash lookup then confirms the complete dirty word, and the maximum dirty word length bounds how far the traversal has to look ahead from each position.

The required variables are as follows:

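// Requires: using System.Collections; using System.Collections.Generic;
private Dictionary<string, object> hash = new Dictionary<string, object>();
// One bit per possible char value; + 1 so that char.MaxValue itself is a valid index.
private BitArray firstCharCheck = new BitArray(char.MaxValue + 1);
private BitArray allCharCheck = new BitArray(char.MaxValue + 1);
private int maxLength = 0;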

Only the keys of hash are used; the values are all null. You can also use HashSet<string> from .NET 3.5, or use Dictionary<string, int> to record how many times each dirty word occurs.

The methods for initializing this data are as follows:

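// badwords is the list of dirty words; empty entries are assumed to be absent.
foreach (string word in badwords)
{
    if (!hash.ContainsKey(word))
    {
        hash.Add(word, null);
        maxLength = Math.Max(maxLength, word.Length);
        // Mark the first char of the word.
        firstCharCheck[word[0]] = true;

        // Mark every char that occurs anywhere in the word.
        foreach (char c in word)
        {
            allCharCheck[c] = true;
        }
    }
}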

The code that determines whether a dirty word appears in a string is as follows:

int index = 0;
while (index < text.Length)
{
    // Skip ahead to the next char that can start a dirty word.
    if (!firstCharCheck[text[index]])
    {
        while (index < text.Length - 1 && !firstCharCheck[text[++index]]) ;
    }

    // Try every candidate length, up to the longest dirty word.
    for (int j = 1; j <= Math.Min(maxLength, text.Length - index); j++)
    {
        // A char that occurs in no dirty word ends all candidates at this position.
        if (!allCharCheck[text[index + j - 1]])
        {
            break;
        }

        string sub = text.Substring(index, j);

        if (hash.ContainsKey(sub))
        {
            return true;
        }
    }

    index++;
}

return false;
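
For readers who want to try the fragments above as one unit, here is a minimal sketch that wraps them in a class. The class name BadWordFilter, the method name ContainsBadWord, and the use of HashSet<string> instead of the Dictionary are assumptions made for illustration; the logic follows the fragments above.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical wrapper class; the names are assumptions, not from the original post.
public sealed class BadWordFilter
{
    private readonly HashSet<string> words = new HashSet<string>();
    // char.MaxValue + 1 bits so that every char value is a valid index.
    private readonly BitArray firstCharCheck = new BitArray(char.MaxValue + 1);
    private readonly BitArray allCharCheck = new BitArray(char.MaxValue + 1);
    private int maxLength;

    public BadWordFilter(IEnumerable<string> badWords)
    {
        foreach (string word in badWords)
        {
            // Skip empty entries and duplicates.
            if (string.IsNullOrEmpty(word) || !words.Add(word)) continue;
            maxLength = Math.Max(maxLength, word.Length);
            firstCharCheck[word[0]] = true;
            foreach (char c in word) allCharCheck[c] = true;
        }
    }

    public bool ContainsBadWord(string text)
    {
        for (int index = 0; index < text.Length; index++)
        {
            // Quick rejection: no dirty word starts with this char.
            if (!firstCharCheck[text[index]]) continue;

            int limit = Math.Min(maxLength, text.Length - index);
            for (int j = 1; j <= limit; j++)
            {
                // A char that occurs in no dirty word ends all candidates here.
                if (!allCharCheck[text[index + j - 1]]) break;
                if (words.Contains(text.Substring(index, j))) return true;
            }
        }
        return false;
    }
}
```

With the word list { "bad", "ugly" }, ContainsBadWord("this is bad") is true, while ContainsBadWord("a bug") is false even though every char of "bug" passes the two BitArray checks, because the final hash lookup rejects it.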

The replacement code is not pasted here. It is similar to the check above, except that it cannot exit the loop as soon as a dirty word is found. If the probability of a dirty word appearing is not very high, there is no need to create a temporary StringBuilder up front.
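
Since the replacement code is not shown, the following is only a guess at what it might look like, not the author's implementation. It assumes each matched char is overwritten with a mask char, that the longest match at a position wins, and that the StringBuilder is created lazily on the first match. The class name BadWordReplacer and the mask parameter are hypothetical.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;

// Hypothetical class; a sketch under the assumptions stated above.
public sealed class BadWordReplacer
{
    private readonly HashSet<string> words = new HashSet<string>();
    private readonly BitArray firstCharCheck = new BitArray(char.MaxValue + 1);
    private readonly BitArray allCharCheck = new BitArray(char.MaxValue + 1);
    private int maxLength;

    public BadWordReplacer(IEnumerable<string> badWords)
    {
        foreach (string word in badWords)
        {
            if (string.IsNullOrEmpty(word) || !words.Add(word)) continue;
            maxLength = Math.Max(maxLength, word.Length);
            firstCharCheck[word[0]] = true;
            foreach (char c in word) allCharCheck[c] = true;
        }
    }

    public string Replace(string text, char mask = '*')
    {
        StringBuilder sb = null; // created only after the first match
        int index = 0;
        while (index < text.Length)
        {
            int matched = 0;
            if (firstCharCheck[text[index]])
            {
                int limit = Math.Min(maxLength, text.Length - index);
                for (int j = 1; j <= limit; j++)
                {
                    if (!allCharCheck[text[index + j - 1]]) break;
                    // Keep scanning so that the longest match wins (an assumption).
                    if (words.Contains(text.Substring(index, j))) matched = j;
                }
            }

            if (matched == 0)
            {
                index++;
                continue;
            }

            if (sb == null) sb = new StringBuilder(text);
            for (int k = 0; k < matched; k++) sb[index + k] = mask;
            index += matched;
        }
        // Clean text is returned as-is, without any allocation.
        return sb == null ? text : sb.ToString();
    }
}
```

With the word list { "bad", "badger" }, Replace("a badger is bad") yields "a ****** is ***", and a text without dirty words comes back as the same string instance.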

Further, you can follow the implementation of String.GetHashCode() in the .NET source code to compute the hash directly from the text and avoid the Substring call, which improves performance. You can also design an incremental hash code, so that the hash of "HelloWorld" is computed from the hash of "HelloWorl", for further efficiency.
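
One way the incremental hash idea could look is sketched below. It does not reproduce the .NET String.GetHashCode() implementation; a simple polynomial hash (h * 31 + c) is swapped in as an assumption, and the class name RollingHashMatcher and the helper Extend are hypothetical. The BitArray checks are left out to keep the sketch short. Candidates whose hash matches are verified with string.CompareOrdinal, so no substring is ever allocated.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class illustrating an incremental hash; not the author's code.
public sealed class RollingHashMatcher
{
    // Custom hash -> dirty words that have this hash.
    private readonly Dictionary<int, List<string>> byHash = new Dictionary<int, List<string>>();
    private int maxLength;

    // Extends the hash of a prefix by one more char.
    private static int Extend(int hash, char c)
    {
        unchecked { return hash * 31 + c; }
    }

    public RollingHashMatcher(IEnumerable<string> badWords)
    {
        foreach (string word in badWords)
        {
            if (string.IsNullOrEmpty(word)) continue;
            int h = 0;
            foreach (char c in word) h = Extend(h, c);

            List<string> bucket;
            if (!byHash.TryGetValue(h, out bucket))
            {
                bucket = new List<string>();
                byHash[h] = bucket;
            }
            if (!bucket.Contains(word)) bucket.Add(word);
            maxLength = Math.Max(maxLength, word.Length);
        }
    }

    public bool ContainsBadWord(string text)
    {
        for (int index = 0; index < text.Length; index++)
        {
            int h = 0;
            int limit = Math.Min(maxLength, text.Length - index);
            for (int j = 1; j <= limit; j++)
            {
                // The hash of the j-char candidate is derived from the (j-1)-char one.
                h = Extend(h, text[index + j - 1]);

                List<string> bucket;
                if (!byHash.TryGetValue(h, out bucket)) continue;

                // Equal hashes can collide, so compare the chars in place.
                foreach (string word in bucket)
                {
                    if (word.Length == j &&
                        string.CompareOrdinal(text, index, word, 0, j) == 0)
                    {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
```

Whether this is actually faster than the Substring version depends on the word list and the input, so it should be measured before adopting it.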

Alternatively, you can discard the hash and keep the dirty words in a sorted string[], using BinarySearch to decide whether sub is a dirty word. The BinarySearch results can be used progressively, that is, the result of searching for "HelloWorl" can speed up the check for "HelloWorld". (In my tests with 700 dirty words, BinarySearch was sometimes much slower.)
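
The progressive search over a sorted array might be sketched as follows. This is an illustration under assumptions, not the tested code mentioned above: it uses a hand-written binary search instead of Array.BinarySearch, sorts with StringComparer.Ordinal, and the names SortedArrayMatcher, KeyAt, and LowerBound are hypothetical. For each start position it keeps the range of words that share the prefix read so far and narrows that range by one char at a time.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class illustrating the sorted-array variant; not the author's code.
public sealed class SortedArrayMatcher
{
    private readonly string[] words;

    public SortedArrayMatcher(IEnumerable<string> badWords)
    {
        var list = new List<string>();
        foreach (string w in badWords)
        {
            if (!string.IsNullOrEmpty(w)) list.Add(w);
        }
        words = list.ToArray();
        // Ordinal order: compared char by char, and a prefix sorts before longer words.
        Array.Sort(words, StringComparer.Ordinal);
    }

    // Sort key of word i at position p; -1 when the word is too short.
    private int KeyAt(int i, int p)
    {
        return p < words[i].Length ? words[i][p] : -1;
    }

    // First index in [lo, hi) whose key at position p is >= key.
    private int LowerBound(int lo, int hi, int p, int key)
    {
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (KeyAt(mid, p) < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public bool ContainsBadWord(string text)
    {
        for (int index = 0; index < text.Length; index++)
        {
            // [lo, hi) holds the words that start with the chars read so far.
            int lo = 0, hi = words.Length;
            for (int p = 0; index + p < text.Length && lo < hi; p++)
            {
                int c = text[index + p];
                lo = LowerBound(lo, hi, p, c);
                hi = LowerBound(lo, hi, p, c + 1);
                // A word equal to the current prefix sorts first in the range.
                if (lo < hi && words[lo].Length == p + 1) return true;
            }
        }
        return false;
    }
}
```

The narrowing stops as soon as the range becomes empty, which plays the same role as the allCharCheck test in the hash version.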

Finally, a small grumble. The first time I posted this (http://www.jb51.net/article/20576.htm), it was only to illustrate my algorithm, and the code even contained a few mistakes. Two things bother me. First, countless websites reposted it without stating the source, so my later improvements and corrections never reached their readers. Second, some people only want to see the final code, rather than understand the core design I was trying to convey and then work out their own implementation.