Implement sensitive word filtering in Java

Source: Internet
Author: User

Filtering sensitive words and text is an essential feature of a website, so it is worth designing an efficient filtering algorithm. Some time ago, a friend of mine (about to graduate and still new to programming) asked me to look at a text filter he had written, because its retrieval was very slow. He walked me through the process: read the sensitive-word dictionary into a HashSet, take the text submitted from the page, and then match the text against the words in the set. I could see that this process was bound to be slow. Having never worked on this problem myself, it was the only approach I could think of too; a slightly more advanced one is regular expressions. Unfortunately, neither method is feasible. I did not know of an algorithm that solves the problem, but Google did!

DFA Overview

Among the algorithms for text filtering that I looked at, DFA was the only reasonably good one. DFA stands for Deterministic Finite Automaton. It obtains the next state from an event and the current state, that is, event + state = nextstate. The following shows its state transitions:

    a             b
S -----> U    S -----> V    U -----> V

An algorithm that implements sensitive word filtering must keep the number of operations low. A DFA involves almost no computation; it only performs state transitions.
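As a rough illustration of event + state = nextstate, the transitions above can be held in a lookup table. This is only a sketch: the class name `DfaSketch` and its methods are made up for this example, and the event on the U to V transition is assumed to be `b`.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal table-driven DFA: (state, event) -> next state. */
class DfaSketch {
    // transitions.get(state).get(event) yields the next state
    private final Map<String, Map<Character, String>> transitions = new HashMap<>();

    void addTransition(String from, char event, String to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(event, to);
    }

    /** Returns the next state, or null when no transition is defined. */
    String next(String state, char event) {
        Map<Character, String> row = transitions.get(state);
        return row == null ? null : row.get(event);
    }

    /** Builds the three-transition machine from the diagram. */
    static DfaSketch example() {
        DfaSketch dfa = new DfaSketch();
        dfa.addTransition("S", 'a', "U");
        dfa.addTransition("S", 'b', "V");
        dfa.addTransition("U", 'b', "V"); // assumption: event label taken to be 'b'
        return dfa;
    }

    public static void main(String[] args) {
        DfaSketch dfa = example();
        String state = "S";
        for (char event : "ab".toCharArray()) {
            state = dfa.next(state, event); // no computation, only a table lookup
            System.out.println(event + " -> " + state);
        }
    }
}
```

Each step is a single map lookup, which is why the cost of running the automaton depends on the length of the input and not on the number of words it encodes.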

References: http://www.iteye.com/topic/336577

Implement the DFA Algorithm in Java to filter sensitive words

The key to sensitive word filtering in Java is the implementation of the DFA algorithm. Let us analyze it first. For this purpose, the nested structure described in the comment of the following method is easier to follow.

    // Requires: java.util.HashMap, java.util.Iterator, java.util.Map, java.util.Set

    /**
     * Reads the sensitive-word set and builds the DFA model as nested maps.
     * Each key is a single character (shown here by its English gloss):
     * medium = {
     *     isEnd = 0
     *     country = {
     *         isEnd = 1
     *         person = {
     *             isEnd = 0
     *             citizen = {isEnd = 1}
     *         }
     *         man = {
     *             isEnd = 0
     *             person = {isEnd = 1}
     *         }
     *     }
     * }
     * five = {
     *     isEnd = 0
     *     star = {
     *         isEnd = 0
     *         red = {
     *             isEnd = 0
     *             flag = {isEnd = 1}
     *         }
     *     }
     * }
     *
     * @author chenming
     * @date April 20, 2014, 3:04:20
     * @param keyWordSet the sensitive-word set
     * @version 1.0
     */
    @SuppressWarnings({"rawtypes", "unchecked"})
    private void addSensitiveWordToHashMap(Set<String> keyWordSet) {
        // Size the container up front to reduce resizing
        sensitiveWordMap = new HashMap(keyWordSet.size());
        String key = null;
        Map nowMap = null;
        Map<String, String> newWorMap = null;
        // Iterate over keyWordSet
        Iterator<String> iterator = keyWordSet.iterator();
        while (iterator.hasNext()) {
            key = iterator.next();       // one sensitive word
            nowMap = sensitiveWordMap;   // start at the root for every word
            for (int i = 0; i < key.length(); i++) {
                char keyChar = key.charAt(i);          // current character
                Object wordMap = nowMap.get(keyChar);  // child map for this character, if any

                if (wordMap != null) {
                    // The character already exists at this level: descend into it
                    nowMap = (Map) wordMap;
                } else {
                    // Otherwise create a child map, marked as "not the end of a word"
                    newWorMap = new HashMap<String, String>();
                    newWorMap.put("isEnd", "0");
                    nowMap.put(keyChar, newWorMap);
                    nowMap = newWorMap;
                }

                if (i == key.length() - 1) {
                    nowMap.put("isEnd", "1");  // last character: a complete word ends here
                }
            }
        }
    }

Running this method produces the following HashMap structure:

{five={star={red={isEnd=0, flag={isEnd=1}}, isEnd=0}, isEnd=0}, medium={isEnd=0, country={isEnd=0, person={isEnd=1}, man={isEnd=0, person={isEnd=1}}}}}
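The same structure can also be expressed with a small node class instead of raw nested maps and an "isEnd" string entry. The sketch below is an alternative representation, not the article's code: `TrieNodeSketch`, `insert`, and `describe` are names invented here, and the sample words "ab" and "ac" are placeholders. A TreeMap is used only so that the printed order is predictable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** One node per character; 'end' plays the role of isEnd = 1. */
class TrieNodeSketch {
    final Map<Character, TrieNodeSketch> children = new TreeMap<>();
    boolean end;

    /** Adds one word, creating a child node for each new character. */
    void insert(String word) {
        TrieNodeSketch node = this;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), k -> new TrieNodeSketch());
        }
        if (node != this) {
            node.end = true; // the last character closes a complete word
        }
    }

    /** Renders the tree in the same shape as the HashMap printout above. */
    String describe() {
        return render(true);
    }

    private String render(boolean isRoot) {
        List<String> parts = new ArrayList<>();
        if (!isRoot) {
            parts.add("isEnd=" + (end ? 1 : 0)); // the root itself carries no isEnd entry
        }
        for (Map.Entry<Character, TrieNodeSketch> e : children.entrySet()) {
            parts.add(e.getKey() + "=" + e.getValue().render(false));
        }
        return "{" + String.join(", ", parts) + "}";
    }

    public static void main(String[] args) {
        TrieNodeSketch root = new TrieNodeSketch();
        root.insert("ab");
        root.insert("ac");
        System.out.println(root.describe()); // {a={isEnd=0, b={isEnd=1}, c={isEnd=1}}}
    }
}
```

Words that share a prefix share nodes, which is what keeps the structure compact and the lookup independent of the dictionary size.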

We now have a simple sensitive-word library. How do we search it? Retrieval is nothing more than repeated calls to HashMap's get: if the lookups succeed all the way to the end of a word, the text contains a sensitive word; otherwise it does not. Suppose we match the text "Long live the Chinese people". The process is as follows:

1. Look up the first character, "medium" (中), in the hashMap and get a new map = hashMap.get("中").

2. If map == null, the text is not a sensitive word. Otherwise, go to step 3.

3. Read isEnd from map. If isEnd equals 1, the characters matched so far form a sensitive word. Otherwise, return to step 1 with the next character, this time looking it up in map.

Through these steps we can determine that "Chinese people" is a sensitive word, whereas "Chinese Women" is not.
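To make the three steps concrete, here is a small trace that records each decision while walking the structure. It is only a sketch under assumptions: the class `LookupTraceSketch`, its `Node` type, and the sample word "cat" are invented for illustration and use a typed node instead of the article's raw maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupTraceSketch {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end; // true where the article stores isEnd = 1
    }

    /** Builds the lookup structure from a few words. */
    static Node build(String... words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            if (n != root) {
                n.end = true;
            }
        }
        return root;
    }

    /** Follows steps 1 to 3 from the first character and logs each decision. */
    static List<String> trace(Node root, String text) {
        List<String> log = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            node = node.next.get(c);   // step 1: look up the character
            if (node == null) {        // step 2: no entry, so not a sensitive word
                log.add(c + ": no entry, stop");
                return log;
            }
            if (node.end) {            // step 3: isEnd = 1, a word ends here
                log.add(c + ": found, isEnd=1, match of length " + (i + 1));
                return log;
            }
            log.add(c + ": found, isEnd=0, continue");
        }
        log.add("text ended before any word was completed");
        return log;
    }

    public static void main(String[] args) {
        Node root = build("cat");
        trace(root, "cats").forEach(System.out::println);
        trace(root, "cab").forEach(System.out::println);
    }
}
```

With the single word "cat", the text "cats" stops at the third character with a match, while "cab" stops at "b" because the map reached after "a" has no entry for it.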

    /**
     * Checks whether the text contains a sensitive word starting at beginIndex.
     *
     * @author chenming
     * @date April 20, 2014, 4:31:03
     * @param txt        the text to check
     * @param beginIndex index at which matching starts
     * @param matchType  matching rule (minimum or maximum match)
     * @return the length of the sensitive word found, or 0 if there is none
     * @version 1.0
     */
    @SuppressWarnings({"rawtypes"})
    public int CheckSensitiveWord(String txt, int beginIndex, int matchType) {
        int matchFlag = 0;      // number of characters matched along the current path
        int matchedLength = 0;  // length of the last complete sensitive word, 0 if none
        char word = 0;
        Map nowMap = sensitiveWordMap;
        for (int i = beginIndex; i < txt.length(); i++) {
            word = txt.charAt(i);
            nowMap = (Map) nowMap.get(word);   // look up the current character
            if (nowMap != null) {
                matchFlag++;                   // character found: one more matched
                if ("1".equals(nowMap.get("isEnd"))) {
                    matchedLength = matchFlag; // a complete word ends here
                    if (SensitivewordFilter.minMatchTYpe == matchType) {
                        break;                 // minimum rule: stop at the first complete word
                    }
                    // maximum rule: keep looking for a longer word
                }
            } else {
                break;                         // no entry for this character: stop
            }
        }
        // A prefix that never reaches isEnd = 1 is not a match,
        // so only the length of a completed word is returned.
        return matchedLength;
    }
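A method like the one above only answers the question for a single start position. To collect every sensitive word in a text, the check has to be repeated along the text, skipping past each hit. The following sketch shows one way this could look, together with the difference between the minimum and maximum rules. It is not the downloadable implementation: `ScanSketch`, `matchLength`, `findAll`, and the constants `MIN_MATCH` and `MAX_MATCH` are hypothetical names and values chosen for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ScanSketch {
    static final int MIN_MATCH = 1; // hypothetical stand-ins for the article's
    static final int MAX_MATCH = 2; // match types; the real values are not shown

    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end; // true where the article stores isEnd = 1
    }

    static Node build(String... words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            if (n != root) {
                n.end = true;
            }
        }
        return root;
    }

    /** Length of the word starting at 'begin', or 0 if none starts there. */
    static int matchLength(Node root, String txt, int begin, int matchType) {
        int matched = 0;
        Node node = root;
        for (int i = begin; i < txt.length(); i++) {
            node = node.next.get(txt.charAt(i));
            if (node == null) {
                break;                     // path ends: keep the last complete word, if any
            }
            if (node.end) {
                matched = i - begin + 1;   // remember the complete word seen so far
                if (matchType == MIN_MATCH) {
                    break;                 // minimum rule: the shortest word is enough
                }
            }
        }
        return matched;
    }

    /** Scans the whole text and returns the distinct words found, in order. */
    static Set<String> findAll(Node root, String txt, int matchType) {
        Set<String> found = new LinkedHashSet<>();
        int i = 0;
        while (i < txt.length()) {
            int len = matchLength(root, txt, i, matchType);
            if (len > 0) {
                found.add(txt.substring(i, i + len));
                i += len;                  // continue after the matched word
            } else {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Node root = build("ab", "abc", "cd");
        System.out.println(findAll(root, "xabcdx", MIN_MATCH)); // [ab, cd]
        System.out.println(findAll(root, "xabcdx", MAX_MATCH)); // [abc]
    }
}
```

The two calls in `main` show why the rule matters: the minimum rule stops at "ab" and then finds "cd", while the maximum rule consumes "abc" and leaves nothing for "cd".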

At the end of the article I provide a download of the Java files that implement sensitive word filtering. The following test method demonstrates the efficiency and reliability of the algorithm.

    public static void main(String[] args) {
        SensitivewordFilter filter = new SensitivewordFilter();
        System.out.println("Number of sensitive words: " + filter.sensitiveWordMap.size());
        String string = "too many sad feelings may be limited to the plot on the screen of the breeding base, the hero tried to gradually explain his suicide guide in a certain way with the sadness he had experienced. "
                + " Method. wheel. the role we play is to add our emotions to the screen plot with the anger and sorrow of the hero's red guest alliance, and then we are moved to tears, "
                + " when you are sad, you can lie in the arms of a person to express your heart or cell phone card copy. One person has a cup of red wine and one movie is at night. level. closed the phone for a quiet evening and sat quietly. ";
        System.out.println("Number of characters in the text to check: " + string.length());
        long beginTime = System.currentTimeMillis();
        Set<String> set = filter.getSensitiveWord(string, 1);
        long endTime = System.currentTimeMillis();
        System.out.println("Number of sensitive words in the text: " + set.size() + ". They are: " + set);
        System.out.println("Total time (ms): " + (endTime - beginTime));
    }

Running result: the sensitive-word library contains 771 words, the text under test is 184 characters long, and 6 sensitive words are found. The whole run takes 1 ms, so the speed is quite impressive.

Download the following two files:

Top.rar (http://pan.baidu.com/s/1o66teGU) contains two Java files. One reads the sensitive-word dictionary (SensitiveWordInit); the other is the sensitive-word utility class (SensitivewordFilter), which provides three methods: isContaintSensitiveWord(String txt, int matchType) to check whether a text contains sensitive words, getSensitiveWord(String txt, int matchType) to retrieve them, and replaceSensitiveWord(String txt, int matchType, String replaceChar) to replace them.

Sensitive word library: http://pan.baidu.com/s/1pJoGhVP
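The downloadable code is not reproduced here. As a rough idea of what a replacement method such as replaceSensitiveWord might do, the sketch below masks every matched word with a replacement string, one copy per character. Everything in it is an assumption made for illustration: the class `ReplaceSketch`, its helper methods, the use of the longest match only, and the sample words "bad" and "badge".

```java
import java.util.HashMap;
import java.util.Map;

class ReplaceSketch {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end; // true where the article stores isEnd = 1
    }

    static Node build(String... words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            if (n != root) {
                n.end = true;
            }
        }
        return root;
    }

    /** Length of the longest word starting at 'begin', or 0 if none. */
    static int longestMatch(Node root, String txt, int begin) {
        int matched = 0;
        Node node = root;
        for (int i = begin; i < txt.length(); i++) {
            node = node.next.get(txt.charAt(i));
            if (node == null) {
                break;
            }
            if (node.end) {
                matched = i - begin + 1;
            }
        }
        return matched;
    }

    /** Replaces each character of every matched word with replaceChar. */
    static String replace(Node root, String txt, String replaceChar) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < txt.length()) {
            int len = longestMatch(root, txt, i);
            if (len == 0) {
                out.append(txt.charAt(i)); // ordinary character: copy it through
                i++;
            } else {
                for (int k = 0; k < len; k++) {
                    out.append(replaceChar); // one replacement per matched character
                }
                i += len;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = build("bad", "badge");
        System.out.println(replace(root, "a bad badge", "*")); // a *** *****
    }
}
```

Masking character by character keeps the length of the text unchanged, which can be convenient when positions in the original text still matter afterwards.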

 

Note: all sensitive words in this article are used for testing purposes only.
