See http://www.cnblogs.com/xingd/archive/2008/01/23/1050443.html.
Thanks to sumtech for the replies and discussion. The original efficiency was already good enough for practical use on the website, so although I had thought of some improvements, I was too lazy to implement them. After sumtech discussed it with me by email, I finally took the time to improve the algorithm. The new version is 400% faster than the original algorithm, that is, it needs only 1/5 of the original time.
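The key to the algorithm is to merge the two BitArrays into a single byte[char.MaxValue]. In each byte, seven bits record whether the character occurs at one of the first seven positions of any dirty word (one bit per position), and the remaining bit records whether it occurs at any later position. minWordLength and charCheck are added to skip candidates that are too short and to quickly detect dirty words that consist of a single character.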
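To make the bit layout concrete, here is a minimal sketch of how each character's byte could be filled. The names `FastCheckDemo`, `PositionMask`, and `Build` are hypothetical helpers introduced only for illustration, and sizing the array as `char.MaxValue + 1` is an assumption of this sketch (so that '\uFFFF' is a valid index), not something the original code does.

```csharp
using System;

static class FastCheckDemo
{
    // Hypothetical helper: the bit set by a character at position pos.
    // Positions 0..6 get their own bit; position 7 and later share bit 7 (0x80).
    public static byte PositionMask(int pos)
    {
        return (byte)(1 << Math.Min(pos, 7));
    }

    // Builds the per-character position table for a list of words.
    public static byte[] Build(params string[] words)
    {
        // Assumption: one extra slot so every char value can be used as an index.
        var fastCheck = new byte[char.MaxValue + 1];
        foreach (string word in words)
        {
            for (int i = 0; i < word.Length; i++)
            {
                fastCheck[word[i]] |= PositionMask(i);
            }
        }
        return fastCheck;
    }
}
```

For example, with the words "bad" and "ab", the byte for 'b' ends up as 0x03 (seen at positions 0 and 1), the byte for 'a' is also 0x03, and the byte for 'd' is 0x04 (position 2 only). A text character whose byte lacks the bit for its offset can be rejected without touching the hash set.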
Data used:
private HashSet<string> hash = new HashSet<string>();
private byte[] fastCheck = new byte[char.MaxValue];
private BitArray charCheck = new BitArray(char.MaxValue);
private int maxWordLength = 0;
private int minWordLength = int.MaxValue;
Code to initialize the data:
foreach (string word in badwords)
{
    maxWordLength = Math.Max(maxWordLength, word.Length);
    minWordLength = Math.Min(minWordLength, word.Length);
    // Bits 0..6: the character occurs at position i of some word.
    for (int i = 0; i < 7 && i < word.Length; i++)
    {
        fastCheck[word[i]] |= (byte)(1 << i);
    }
    // Bit 7: the character occurs at position 7 or later.
    for (int i = 7; i < word.Length; i++)
    {
        fastCheck[word[i]] |= 0x80;
    }
    if (word.Length == 1)
    {
        charCheck[word[0]] = true;
    }
    else
    {
        hash.Add(word);
    }
}
Code used to determine whether a dirty word is contained:
public bool HasBadWord(string text)
{
    int index = 0;
    while (index < text.Length)
    {
        // Skip ahead to the next character that can start a dirty word.
        if ((fastCheck[text[index]] & 1) == 0)
        {
            while (index < text.Length - 1 && (fastCheck[text[++index]] & 1) == 0) ;
        }
        // Single-character dirty words are answered directly by charCheck.
        if (minWordLength == 1 && charCheck[text[index]])
        {
            return true;
        }
        for (int j = 1; j <= Math.Min(maxWordLength, text.Length - index - 1); j++)
        {
            // Stop as soon as a character cannot occur at offset j of any word.
            if ((fastCheck[text[index + j]] & (1 << Math.Min(j, 7))) == 0)
            {
                break;
            }
            if (j + 1 >= minWordLength)
            {
                string sub = text.Substring(index, j + 1);
                if (hash.Contains(sub))
                {
                    return true;
                }
            }
        }
        index++;
    }
    return false;
}
Revision: a bug was found. The charCheck test for single-character words should be moved out of the for loop, the j == 1 condition should be removed, and the outer condition should be changed to if (j + 1 >= minWordLength).
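Putting the pieces together, the following is a self-contained sketch of the data, the initialization, and the check in one class, with the revision applied. The class name `BadWordFilter`, the constructor, the skipping of empty words, and the `char.MaxValue + 1` array sizes are assumptions made for this sketch and are not part of the original code.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class BadWordFilter
{
    private readonly HashSet<string> hash = new HashSet<string>();
    // Assumption: sized char.MaxValue + 1 so that '\uFFFF' is a valid index.
    private readonly byte[] fastCheck = new byte[char.MaxValue + 1];
    private readonly BitArray charCheck = new BitArray(char.MaxValue + 1);
    private readonly int maxWordLength = 0;
    private readonly int minWordLength = int.MaxValue;

    public BadWordFilter(IEnumerable<string> badwords)
    {
        foreach (string word in badwords)
        {
            if (string.IsNullOrEmpty(word)) continue; // assumption: ignore empty entries
            maxWordLength = Math.Max(maxWordLength, word.Length);
            minWordLength = Math.Min(minWordLength, word.Length);
            // Positions 0..6 set bits 0..6; position 7 and later set bit 7.
            for (int i = 0; i < word.Length; i++)
            {
                fastCheck[word[i]] |= (byte)(1 << Math.Min(i, 7));
            }
            if (word.Length == 1) charCheck[word[0]] = true;
            else hash.Add(word);
        }
    }

    public bool HasBadWord(string text)
    {
        int index = 0;
        while (index < text.Length)
        {
            // Skip to the next character that can start a word.
            if ((fastCheck[text[index]] & 1) == 0)
            {
                while (index < text.Length - 1 && (fastCheck[text[++index]] & 1) == 0) { }
            }
            if (minWordLength == 1 && charCheck[text[index]]) return true;
            for (int j = 1; j <= Math.Min(maxWordLength, text.Length - index - 1); j++)
            {
                // The character at offset j must be allowed at that position.
                if ((fastCheck[text[index + j]] & (1 << Math.Min(j, 7))) == 0) break;
                if (j + 1 >= minWordLength && hash.Contains(text.Substring(index, j + 1)))
                {
                    return true;
                }
            }
            index++;
        }
        return false;
    }
}
```

With a filter built from the words "bad" and "x", `HasBadWord("this is bad")` and `HasBadWord("x")` return true, while `HasBadWord("hello")` and `HasBadWord("ba")` return false. The hash set is only consulted for substrings whose characters all pass the per-position bit test, which is where the speedup comes from.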
Finally, let me introduce myself. I am currently a system architect at Dianping.com, and my personal blog is www.steven xu.com. The entire development team of comming.com is about to launch a team blog. Articles will generally be posted to the blog site and the team blog first, and then to my personal blog after feedback and revision.
PS: matching currently stops at the shortest match and is case-sensitive. Both points need to be handled when you implement dirty-word replacement.