Java implementation of sensitive word filtering (DFA algorithm)

Source: Internet
Author: User

Little Alan recently ran into a sensitive word filtering requirement during development, looked up a lot of material on the subject online, and is sharing his understanding here.

Before getting started, Little Alan recommends a post at http://cmsblogs.com/?p=1031; this article draws on some of its content.

Sensitive word filtering probably needs little explanation. Put plainly, when certain words are entered in the project (for example, text related to "xxoo"), the system should be able to detect them. Many projects have a sensitive word management module where you can add sensitive words; the input is then checked against the added words and handled accordingly: prompt the user, highlight the words, or replace them directly with other text or symbols.

There are many ways to filter sensitive words. Here is a brief description of the few I currently understand:

① Query the sensitive words from the database, loop over every sensitive word, and search the input text from start to finish for each one; if the word is present, handle it accordingly. Put plainly, this approach finds one and deals with one.

Advantage: So easy. Implementing it in Java is basically no trouble at all.

Disadvantage: The efficiency is dreadful, and the matching itself is painful. With English text you run into something quite exasperating: if the letter "a" is a sensitive word and the input is an English document, how many times does the program have to handle a sensitive word? Can anyone tell me?
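To make approach ① concrete, here is a minimal sketch of it. The class name NaiveWordFilter and the word list are made up for illustration, and the list stands in for the words that would come from the database. The text is rescanned once per sensitive word, so the work grows roughly with the number of words multiplied by the length of the text.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of approach ①: one full scan of the text per sensitive word. */
class NaiveWordFilter {

    /** Returns every word from the list that occurs somewhere in the text. */
    static Set<String> findSensitiveWords(String text, List<String> words) {
        Set<String> found = new LinkedHashSet<>();
        for (String word : words) {
            // String.contains searches the text from the beginning again for each word
            if (!word.isEmpty() && text.contains(word)) {
                found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up word list standing in for the database query result
        List<String> words = Arrays.asList("bad", "worse");
        System.out.println(findSensitiveWords("a bad day", words)); // prints [bad]
    }
}
```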

② The legendary DFA algorithm (deterministic finite automaton), which is what I want to share with you, since it feels fairly general-purpose. I hope you can look up the principle of the algorithm online; it is not described in detail here.

Advantage: At least it is more efficient than the silly approach above.

Disadvantage: For anyone who has studied algorithms it should not be hard, and for anyone who has not it is still not hard to use, just a bit painful to understand. The matching efficiency is still not that high, and it consumes more memory: the more sensitive words there are, the more memory it occupies.

③ The third way needs a special mention: write an algorithm yourself, or optimize on top of an existing one. This is one of the highest levels Little Alan aspires to. If any fellow programmers have ideas of their own, please don't forget Little Alan; you can add Little Alan on QQ (810104041) and teach him a trick or two.

So, how is the legendary DFA algorithm implemented?

Step one: initialize the sensitive thesaurus (the sensitive words are packed into the thesaurus according to the principle of the DFA algorithm, and the thesaurus is stored in a HashMap). The code is as follows:

package com.cfwx.rox.web.sysmgr.util;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

import com.cfwx.rox.web.common.model.entity.SensitiveWord;

/**
 * Sensitive thesaurus initialization
 *
 * @author AlanLee
 *
 */
public class SensitiveWordInit
{
    /**
     * Sensitive thesaurus
     */
    public HashMap sensitiveWordMap;

    /**
     * Initialize the sensitive words
     *
     * @return
     */
    public Map initKeyWord(List<SensitiveWord> sensitiveWords)
    {
        try
        {
            // Take the sensitive words out of the collection of sensitive word objects and put them into a Set
            Set<String> keyWordSet = new HashSet<String>();
            for (SensitiveWord s : sensitiveWords)
            {
                keyWordSet.add(s.getContent().trim());
            }
            // Add the sensitive words to the HashMap
            addSensitiveWordToHashMap(keyWordSet);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        return sensitiveWordMap;
    }

    /**
     * Build the sensitive thesaurus
     *
     * @param keyWordSet
     */
    @SuppressWarnings("rawtypes")
    private void addSensitiveWordToHashMap(Set<String> keyWordSet)
    {
        // Initialize the HashMap object and control the size of the container
        sensitiveWordMap = new HashMap(keyWordSet.size());
        // Sensitive word
        String key = null;
        // Used to store the thesaurus data in the appropriate format
        Map nowMap = null;
        // Used to assist in building the thesaurus
        Map<String, String> newWorMap = null;
        // Use an iterator to loop over the collection of sensitive words
        Iterator<String> iterator = keyWordSet.iterator();
        while (iterator.hasNext())
        {
            key = iterator.next();
            // nowMap refers to the same HashMap object as sensitiveWordMap,
            // so changes made through nowMap also change sensitiveWordMap
            nowMap = sensitiveWordMap;
            for (int i = 0; i < key.length(); i++)
            {
                // Take one character of the sensitive word; it is the key in the thesaurus HashMap
                char keyChar = key.charAt(i);
                // Determine whether this character already exists in the thesaurus
                Object wordMap = nowMap.get(keyChar);
                if (wordMap != null)
                {
                    nowMap = (Map) wordMap;
                }
                else
                {
                    newWorMap = new HashMap<String, String>();
                    newWorMap.put("isEnd", "0");
                    nowMap.put(keyChar, newWorMap);
                    nowMap = newWorMap;
                }
                // If this is the last character of the current sensitive word, mark it as the end
                if (i == key.length() - 1)
                {
                    nowMap.put("isEnd", "1");
                }
                System.out.println("Building the sensitive thesaurus: " + sensitiveWordMap);
            }
            System.out.println("Sensitive thesaurus data: " + sensitiveWordMap);
        }
    }
}
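The shape of the thesaurus is easier to see on a tiny input. The following is only a sketch of the same nested-map idea, not the author's class: the name ThesaurusShapeDemo and the two words "ab" and "ac" are made up for illustration. Each level maps a character to the map for the next character, and the "isEnd" entry records whether a sensitive word ends at that level.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical demo: build the nested-map thesaurus for a tiny, made-up word list. */
class ThesaurusShapeDemo {

    /** Each level maps a Character to the next level; "isEnd" marks the end of a word. */
    static Map<Object, Object> build(List<String> words) {
        Map<Object, Object> root = new HashMap<>();
        for (String word : words) {
            Map<Object, Object> node = root;
            for (int i = 0; i < word.length(); i++) {
                char c = word.charAt(i);
                @SuppressWarnings("unchecked")
                Map<Object, Object> next = (Map<Object, Object>) node.get(c);
                if (next == null) {
                    // First time this character is seen at this level
                    next = new HashMap<>();
                    next.put("isEnd", "0");
                    node.put(c, next);
                }
                node = next;
            }
            // Mark the level reached by the last character as the end of a word
            if (!word.isEmpty()) {
                node.put("isEnd", "1");
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // HashMap does not guarantee key order, so the printed order may vary
        System.out.println(build(Arrays.asList("ab", "ac")));
    }
}
```

For these two words the root holds a single key 'a'. The map under 'a' has "isEnd" set to "0" plus the keys 'b' and 'c', and each of those maps has "isEnd" set to "1". Shared prefixes are stored once, which is why lookups do not depend on how many words are in the thesaurus.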

Step two: write a sensitive word filtering utility class, in which you can write whatever methods you need. The code is as follows:

package com.cfwx.rox.web.sysmgr.util;

import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/**
 * Sensitive word filtering utility class
 *
 * @author AlanLee
 *
 */
public class SensitiveWordEngine
{
    /**
     * Sensitive thesaurus
     */
    public static Map sensitiveWordMap = null;

    /**
     * Minimum match: stop at the shortest sensitive word
     */
    public static int minMatchType = 1;

    /**
     * Maximum match: keep going to the longest sensitive word
     */
    public static int maxMatchType = 2;

    /**
     * Number of entries in the sensitive thesaurus
     *
     * @return
     */
    public static int getWordSize()
    {
        if (SensitiveWordEngine.sensitiveWordMap == null)
        {
            return 0;
        }
        return SensitiveWordEngine.sensitiveWordMap.size();
    }

    /**
     * Whether the text contains sensitive words
     *
     * @param txt
     * @param matchType
     * @return
     */
    public static boolean isContaintSensitiveWord(String txt, int matchType)
    {
        for (int i = 0; i < txt.length(); i++)
        {
            // One hit is enough, so return as soon as a word is found
            if (checkSensitiveWord(txt, i, matchType) > 0)
            {
                return true;
            }
        }
        return false;
    }

    /**
     * Get the sensitive words in the text
     *
     * @param txt
     * @param matchType
     * @return the sensitive words found
     */
    public static Set<String> getSensitiveWord(String txt, int matchType)
    {
        Set<String> sensitiveWordList = new HashSet<String>();
        for (int i = 0; i < txt.length(); i++)
        {
            int length = checkSensitiveWord(txt, i, matchType);
            if (length > 0)
            {
                // Save the detected sensitive word to the collection
                sensitiveWordList.add(txt.substring(i, i + length));
                // Skip past the matched word
                i = i + length - 1;
            }
        }
        return sensitiveWordList;
    }

    /**
     * Replace sensitive words
     *
     * @param txt
     * @param matchType
     * @param replaceChar
     * @return
     */
    public static String replaceSensitiveWord(String txt, int matchType, String replaceChar)
    {
        String resultTxt = txt;
        Set<String> set = getSensitiveWord(txt, matchType);
        Iterator<String> iterator = set.iterator();
        String word = null;
        String replaceString = null;
        while (iterator.hasNext())
        {
            word = iterator.next();
            replaceString = getReplaceChars(replaceChar, word.length());
            // String.replace treats the word literally; replaceAll would interpret it as a regular expression
            resultTxt = resultTxt.replace(word, replaceString);
        }
        return resultTxt;
    }

    /**
     * Build the replacement string for a sensitive word
     *
     * @param replaceChar
     * @param length
     * @return
     */
    private static String getReplaceChars(String replaceChar, int length)
    {
        String resultReplace = replaceChar;
        for (int i = 1; i < length; i++)
        {
            resultReplace += replaceChar;
        }
        return resultReplace;
    }

    /**
     * Check for a sensitive word starting at beginIndex
     *
     * @param txt
     * @param beginIndex
     * @param matchType
     * @return the length of the sensitive word found, or 0 if there is none
     */
    public static int checkSensitiveWord(String txt, int beginIndex, int matchType)
    {
        // Number of characters matched in the thesaurus so far
        int matchFlag = 0;
        // Length of the last complete sensitive word found; counting only up to an
        // end marker keeps a partial match of a longer word from being returned
        int lastEnd = 0;
        char word = 0;
        Map nowMap = SensitiveWordEngine.sensitiveWordMap;
        for (int i = beginIndex; i < txt.length(); i++)
        {
            word = txt.charAt(i);
            // Determine whether the character exists in the thesaurus
            nowMap = (Map) nowMap.get(word);
            if (nowMap != null)
            {
                matchFlag++;
                // If this character ends a sensitive word, record the length and decide whether to continue
                if ("1".equals(nowMap.get("isEnd")))
                {
                    lastEnd = matchFlag;
                    // For minimum matching leave the loop, otherwise keep looking for a longer word
                    if (SensitiveWordEngine.minMatchType == matchType)
                    {
                        break;
                    }
                }
            }
            else
            {
                break;
            }
        }
        return lastEnd;
    }
}
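The difference between the two match types can be sketched on a small example. This is an illustration under assumptions rather than the author's code: the class MatchTypeDemo, its Node type (used here in place of the raw nested HashMap) and the words "ab" and "abcd" are all made up. The length is recorded only when an end-of-word marker is reached, so characters walked past the last complete word are not counted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical demo: minimum vs. maximum matching on a small typed trie. */
class MatchTypeDemo {

    /** Made-up node type standing in for one level of the nested HashMap. */
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd;
    }

    static Node build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            // Skip empty words so the root is never marked as a word end
            if (node != root) {
                node.isEnd = true;
            }
        }
        return root;
    }

    /** Length of the sensitive word starting at begin, or 0 if there is none. */
    static int matchLength(Node root, String txt, int begin, boolean longest) {
        int matched = 0;
        Node node = root;
        for (int i = begin; i < txt.length(); i++) {
            node = node.next.get(txt.charAt(i));
            if (node == null) {
                break;
            }
            if (node.isEnd) {
                matched = i - begin + 1;
                if (!longest) {
                    break; // minimum match: stop at the first complete word
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        Node root = build(Arrays.asList("ab", "abcd"));
        System.out.println(matchLength(root, "abcdx", 0, false)); // prints 2
        System.out.println(matchLength(root, "abcdx", 0, true));  // prints 4
        System.out.println(matchLength(root, "abcx", 0, true));   // prints 2
    }
}
```

In the last call the walk gets as far as "abc" before failing, but only "ab" is a complete word, so the reported length is 2.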

Step three: everything is ready. Query the sensitive words from the database and start filtering. The code is as follows:

    @SuppressWarnings("rawtypes")
    @Override
    public Set<String> sensitiveWordFiltering(String text)
    {
        // Initialize the sensitive thesaurus object
        SensitiveWordInit sensitiveWordInit = new SensitiveWordInit();
        // Get the collection of sensitive word objects from the database
        // (the method called comes from the DAO layer; this method belongs to the service-layer implementation class)
        List<SensitiveWord> sensitiveWords = sensitiveWordDao.getSensitiveWordListAll();
        // Build the sensitive thesaurus
        Map sensitiveWordMap = sensitiveWordInit.initKeyWord(sensitiveWords);
        // Pass the thesaurus into the SensitiveWordEngine class
        SensitiveWordEngine.sensitiveWordMap = sensitiveWordMap;
        // Get the sensitive words; passing 2 means get all the sensitive words
        Set<String> set = SensitiveWordEngine.getSensitiveWord(text, 2);
        return set;
    }

Final step: write a method in the controller layer for the front end to request; the front end obtains the data it needs and handles it accordingly. The code is as follows:

    /**
     * Sensitive word filter
     *
     * @param text
     * @return
     */
    @RequestMapping(value = "/word/filter")
    @ResponseBody
    public RespVo sensitiveWordFiltering(String text)
    {
        RespVo respVo = new RespVo();
        try
        {
            Set<String> set = sensitiveWordService.sensitiveWordFiltering(text);
            respVo.setResult(set);
        }
        catch (Exception e)
        {
            throw new RoxException("Error filtering sensitive words, contact maintenance staff");
        }
        return respVo;
    }

  

Little Alan wrote a lot of comments in the code; I hope you will put your brains to work and understand it well.

 
