(Repost) .NET Dirty-Word Filtering Algorithm

Source: Internet
Author: User
See http://www.cnblogs.com/xingd/archive/2008/01/23/1050443.html.

Thanks to sumtech for the replies and discussion. The original algorithm was already efficient enough for practical use on a website, and although I had thought of some improvements, I had been too lazy to implement them. After sumtech discussed the matter with me by email, I finally took the time to make the improvements. The algorithm is now 400% faster than the original; that is, it needs only 1/5 of the original time.

The key to the improvement is merging the two original BitArrays into a single byte[char.MaxValue]. For each character, seven bits of its byte record whether the character occurs at one of the first seven positions of any dirty word, and the remaining bit records whether it occurs at any later position. minWordLength and charCheck were added to skip candidates that are too short and to match single-character words quickly.
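To make the bit layout concrete, here is a minimal sketch that computes the position mask of one character over a small word list. The class name FastCheckDemo, the helper PositionMask, and the sample words are hypothetical and only illustrate the idea; the real code fills the whole fastCheck table in a single pass.

```csharp
using System;

static class FastCheckDemo
{
    // Returns the position mask of character c over the given words:
    // bit i (0..6) is set if c occurs at index i of some word,
    // bit 7 is set if c occurs at index 7 or later.
    public static byte PositionMask(char c, string[] words)
    {
        byte mask = 0;
        foreach (string word in words)
        {
            for (int i = 0; i < word.Length; i++)
            {
                if (word[i] == c)
                {
                    mask |= (byte)(1 << Math.Min(i, 7));
                }
            }
        }
        return mask;
    }

    static void Main()
    {
        string[] words = { "bad", "abc" };
        // 'b' is at index 0 of "bad" and index 1 of "abc", so bits 0 and 1 are set.
        Console.WriteLine(Convert.ToString(PositionMask('b', words), 2).PadLeft(8, '0'));
        // prints 00000011
    }
}
```

A character whose mask lacks bit j cannot be the character at offset j of any dirty word, which is what lets the scan abandon a candidate early.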

Data used:

private HashSet<string> hash = new HashSet<string>();
private byte[] fastCheck = new byte[char.MaxValue];
private BitArray charCheck = new BitArray(char.MaxValue);
private int maxWordLength = 0;
private int minWordLength = int.MaxValue;


Code to initialize the data:

foreach (string word in badwords)
{
    maxWordLength = Math.Max(maxWordLength, word.Length);
    minWordLength = Math.Min(minWordLength, word.Length);

    // Bits 0-6: the character occurs at position i of some word.
    for (int i = 0; i < 7 && i < word.Length; i++)
    {
        fastCheck[word[i]] |= (byte)(1 << i);
    }

    // Bit 7: the character occurs at position 7 or later.
    for (int i = 7; i < word.Length; i++)
    {
        fastCheck[word[i]] |= 0x80;
    }

    if (word.Length == 1)
    {
        // Single-character words are matched through charCheck alone.
        charCheck[word[0]] = true;
    }
    else
    {
        hash.Add(word);
    }
}

Code used to determine whether a dirty word is contained:

public bool HasBadWord(string text)
{
    int index = 0;

    while (index < text.Length)
    {
        // Skip ahead to the next character that can start a dirty word.
        if ((fastCheck[text[index]] & 1) == 0)
        {
            while (index < text.Length - 1 && (fastCheck[text[++index]] & 1) == 0) ;
        }

        // Single-character dirty words.
        if (minWordLength == 1 && charCheck[text[index]])
        {
            return true;
        }

        for (int j = 1; j <= Math.Min(maxWordLength, text.Length - index - 1); j++)
        {
            // Stop as soon as a character cannot occur at offset j of any word.
            if ((fastCheck[text[index + j]] & (1 << Math.Min(j, 7))) == 0)
            {
                break;
            }

            if (j + 1 >= minWordLength)
            {
                string sub = text.Substring(index, j + 1);

                if (hash.Contains(sub))
                {
                    return true;
                }
            }
        }

        index++;
    }

    return false;
}

Revision: a bug was found and fixed. The charCheck test for single-character words belongs outside the for loop, the special case for j == 1 has been removed, and the length check inside the loop is now if (j + 1 >= minWordLength).
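For readers who want to try the fragments together, the following is a sketch that assembles them into one class. The class name BadWordFilter, the constructor, the sample word list, and the null and empty-string guards are assumptions added for illustration. The arrays are also sized char.MaxValue + 1 so that '\uFFFF' is a valid index, which is an adjustment to the sizes shown in the post.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class BadWordFilter
{
    private readonly HashSet<string> hash = new HashSet<string>();
    // Sized char.MaxValue + 1 (an adjustment) so every char is a valid index.
    private readonly byte[] fastCheck = new byte[char.MaxValue + 1];
    private readonly BitArray charCheck = new BitArray(char.MaxValue + 1);
    private int maxWordLength = 0;
    private int minWordLength = int.MaxValue;

    public BadWordFilter(IEnumerable<string> badWords)
    {
        foreach (string word in badWords)
        {
            if (string.IsNullOrEmpty(word)) continue; // guard added for the sketch

            maxWordLength = Math.Max(maxWordLength, word.Length);
            minWordLength = Math.Min(minWordLength, word.Length);

            // Bits 0-6 mark positions 0-6; bit 7 marks any later position.
            for (int i = 0; i < 7 && i < word.Length; i++)
            {
                fastCheck[word[i]] |= (byte)(1 << i);
            }
            for (int i = 7; i < word.Length; i++)
            {
                fastCheck[word[i]] |= 0x80;
            }

            if (word.Length == 1)
            {
                charCheck[word[0]] = true;
            }
            else
            {
                hash.Add(word);
            }
        }
    }

    public bool HasBadWord(string text)
    {
        if (string.IsNullOrEmpty(text)) return false; // guard added for the sketch

        int index = 0;
        while (index < text.Length)
        {
            // Skip ahead to the next character that can start a word.
            if ((fastCheck[text[index]] & 1) == 0)
            {
                while (index < text.Length - 1 && (fastCheck[text[++index]] & 1) == 0) ;
            }

            if (minWordLength == 1 && charCheck[text[index]])
            {
                return true;
            }

            for (int j = 1; j <= Math.Min(maxWordLength, text.Length - index - 1); j++)
            {
                if ((fastCheck[text[index + j]] & (1 << Math.Min(j, 7))) == 0)
                {
                    break;
                }
                if (j + 1 >= minWordLength && hash.Contains(text.Substring(index, j + 1)))
                {
                    return true;
                }
            }
            index++;
        }
        return false;
    }
}

static class Program
{
    static void Main()
    {
        var filter = new BadWordFilter(new[] { "x", "bad", "verybadword" });
        Console.WriteLine(filter.HasBadWord("this is bad")); // True
        Console.WriteLine(filter.HasBadWord("all good"));    // False
    }
}
```

The hash set is consulted only for substrings whose every character passes the position test, so most of the text is rejected by a single array lookup and bit test per character.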

Finally, let me introduce myself. I am currently a System Architect at Dianping.com, and my personal blog is www.stevenxu.com. The entire development team of comming.com is about to launch a team blog. Articles will generally be posted to the blog site and the team blog first, and then to my personal blog after feedback and revision.

PS: Matching is currently shortest-match and case-sensitive. If you use this code to replace dirty words rather than just detect them, you will need to handle these aspects yourself.
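As one possible starting point for replacement, the sketch below masks dirty words with '*', matching case-insensitively and preferring the longest word at each position. It deliberately uses a plain HashSet scan rather than the fastCheck table, so it illustrates the matching policy and not the optimization. The class name BadWordMasker, the method Mask, and the sample words are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class BadWordMasker
{
    // Replaces every occurrence of a listed word with '*' characters.
    // Matching ignores case and prefers the longest word at each position.
    public static string Mask(string text, IEnumerable<string> badWords)
    {
        var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        int maxLen = 0;
        foreach (string w in badWords)
        {
            if (string.IsNullOrEmpty(w)) continue;
            words.Add(w);
            maxLen = Math.Max(maxLen, w.Length);
        }

        var result = new StringBuilder(text.Length);
        int index = 0;
        while (index < text.Length)
        {
            int matched = 0;
            // Try the longest candidate first so "badword" wins over "bad".
            for (int len = Math.Min(maxLen, text.Length - index); len >= 1; len--)
            {
                if (words.Contains(text.Substring(index, len)))
                {
                    matched = len;
                    break;
                }
            }

            if (matched > 0)
            {
                result.Append('*', matched);
                index += matched;
            }
            else
            {
                result.Append(text[index]);
                index++;
            }
        }
        return result.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Mask("A BadWord here", new[] { "bad", "badword" }));
        // prints: A ******* here
    }
}
```

With shortest-match, only "Bad" would be masked and "Word" would remain visible, which is why replacement usually wants the longest match.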
