Java implements sensitive word filtering (RPM)

Source: Internet
Author: User

Sensitive words, text filtering is an essential function of a website, how to design a good, efficient filtering algorithm is very necessary. Some time ago I a friend (immediately graduated, contact programming soon) want me to help him to see a text filter thing, it says retrieval efficiency is very slow. I took it to the program to see the whole process is as follows: Read the sensitive thesaurus, if the HashSet collection, get the page upload text, and then make a match. I just thought the process must be very slow. I think I can only think of this for someone who has no contact, and the more advanced point is the regular expression. Unfortunately, neither of these approaches is feasible. Of course, without me in my mind, I didn't realize that the algorithm would solve the problem, but Google knows it!

About DFA

In the implementation of text filtering algorithm, DFA is the only better implementation algorithm. The DFA is deterministic finite automaton, which is determined to have a poor automaton, it is through the event and the current state to get the next state, that is, event+state=nextstate. Shows the transformation of its state

In this image, capital letters (S, U, V, Q) are states, and lowercase A and B are actions. We can see the following relationship through

A b b
s-–> U s-–> v u-–> v

In the algorithm that implements the filtering of sensitive words, we have to reduce the computation, and the DFA has little computation in the DFA algorithm, and some are just the transformation of the state.

Reference: http://www.iteye.com/topic/336577

Java implementation of DFA algorithm for sensitive word filtering

The key to implement sensitive word filtering in Java is the implementation of the DFA algorithm. First of all, we analyze. In this process, we think that the following structure will be more clearly understood.

At the same time there is no state transition, no action, some just query (find). We can assume that through the S query U, V, through the U query V, p, through the V query U p. With this transition we can transform the state into a lookup using the Java collection.

Admittedly, there are a few sensitive words to join in our sensitive thesaurus: The Japanese, the Japs, Obama. So what kind of structure do I need to build into?

First: query date {This}, query Ben->{person, devil}, query person->{null}, query ghost---{child}. form the following structure:

Let's expand on this diagram:

This allows us to build our sensitive thesaurus into a tree similar to one, so that when we judge whether a word is a sensitive word, we greatly reduce the scope of the search. For example, we have to judge the Japanese, according to the first word we can confirm the need to retrieve the tree, and then in this tree to search.

But how to judge if a sensitive word is over? Use the identity bit to judge.

So for this key is how to build a tree such a tree of sensitive words. Below I have hashmap in Java as an example to implement the DFA algorithm. The process is as follows:

Japanese, JAP, for example

1, in the HashMap query "Day" to see whether it exists in the HashMap, if not exist, it proves that the "day" beginning of the sensitive word does not exist, we directly build such a tree.

Jump to 3.

2, if found in the HashMap, indicates that there is a "day" beginning of the sensitive word, set hashMap = Hashmap.get ("Day"), jump to 1, in turn, matching "Ben", "person".

3, judge whether the word is the last word in the word. If the expression of the sensitive word ends, set the flag bit isend = 1, otherwise set the flag bit isend = 0;

The program is implemented as follows:

/** * Read the sensitive word library, put the sensitive words into the hashset, build a DFA algorithm model:<br> * = {* Isend = 0 * Country = {<br> *           Isend = 1 * person = {isend = 0 * min = {isend = 1} *} *                       Male = {* Isend = 0 * person = {* Isend = 1 *  } *} *} *} * five = {* Isend = 0 * Star =                   {* Isend = 0 * Red = {* Isend = 0 * flag = {* Isend = 1 *} *} *} *} * @author chenming * @ Date April 20, 2014 afternoon 3:04:20 * @param keywordset Sensitive Thesaurus * @version 1.0 */@SuppressWarnings ({"Rawtypes", "Unch Ecked "}) private void Addsensitivewordtohashmap (set<string> keywordset) {sensitivewordmap = new HashMap (     Keywordset.size ());Initializes the sensitive word container, reducing the expansion operation of the String key = null;        Map nowmap = null;        map<string, string> newwormap = null;        Iteration Keywordset iterator<string> Iterator = Keywordset.iterator ();    while (Iterator.hasnext ()) {key = Iterator.next ();            Keyword Nowmap = sensitivewordmap;       for (int i = 0; i < key.length (); i++) {Char KeyChar = Key.charat (i);       Convert to char type Object wordmap = Nowmap.get (KeyChar);                 Gets if (wordmap! = null) {//If the key exists, direct assignment Nowmap = (Map) wordmap; } else{//Does not exist, then build a map and set Isend to 0, because he is not the last Newwormap = new Hashm                    Ap<string,string> ();     Newwormap.put ("Isend", "0");                    Not the last Nowmap.put (KeyChar, Newwormap);                Nowmap = Newwormap;    } if (i = = Key.length ()-1) {                Nowmap.put ("Isend", "1"); Last}}}

The resulting HASHMAP structure is as follows:

{Five ={star ={Red ={isend=0, Flag ={isend=1}}, Isend=0}, Isend=0}, Zhong ={isend=0, Guo ={isend=0, man ={isend=1}, male ={isend=0, Person ={isend=1}}}}

Sensitive Thesaurus We have a simple way to implement, so how to implement retrieval? The retrieval process is nothing more than a get implementation of HashMap, found to prove that the word is a sensitive word, otherwise it is not a sensitive word. The process is as follows: if we match "long live the Chinese people".

1, the first word "in", we can find in the HashMap. Get a new map = Hashmap.get ("").

2. If map = = NULL, it is not a sensitive word. Otherwise jump to 3

3, to obtain the map of Isend, through the isend is equal to one to determine whether the word is the last. if Isend = = 1 indicates that the word is a sensitive word, skip to 1.

By this step we can judge the "Chinese people" as sensitive words, but if we enter "Chinese woman" is not a sensitive word.

/** * Check if the text contains sensitive characters, check the rules as follows:<br> * @author chenming * @date April 20, 2014 afternoon 4:31:03 * @param txt * @param beginindex * @param matchtype * @return, if present, returns the length of the sensitive word character, does not exist return 0 * @version 1.0 */@SuppressWarni NGS ({"Rawtypes"}) public int checksensitiveword (String txt,int beginindex,int matchtype) {Boolean flag = False    ;     Sensitive word end identification bit: for sensitive words only 1 bits of case int matchflag = 0;        The default number of matching identities is 0 char word = 0;        Map nowmap = Sensitivewordmap;            for (int i = beginindex; I < txt.length (); i++) {word = Txt.charat (i);     Nowmap = (MAP) nowmap.get (word);     Gets the specified key if (nowmap! = null) {//exists, then determines whether it is the last matchflag++;                    Locate the corresponding key, matching the identification +1 if ("1". Equals (Nowmap.get ("Isend"))) {//If the last matching rule, end loop, returns the number of matching identities       Flag = true;                  The end flag bit is true if (Sensitivewordfilter.minmatchtype = = MatchType) {//min rule, direct return, maximum rule still needs to be searched      Break            }}} else{//does not exist and directly returns a break;        }} if (Matchflag < 2 &&!flag) {matchflag = 0;    } return Matchflag; }

At the end of the article I provided a file download using Java to implement sensitive word filtering. The following is a test class to demonstrate the efficiency and reliability of this algorithm.

public static void Main (string[] args) {        sensitivewordfilter filter = new Sensitivewordfilter ();        System.out.println ("Number of sensitive words:" + filter.sensitiveWordMap.size ());        String string = "Too much sentimental might only be confined to the plot on the base screen, and the protagonist tries to use some sort of a way to release the suicide guide gradually and with the sadness of his own experience." "                        +" Our role is to follow the protagonist of the Red Guest League angry funeral music and too far-fetched to their own feelings attached to the screen plot, and then moved on the tears, "                        +" Sad to lie in the bosom of a person to indulge in the elaboration of the heart or mobile phone card replicator a glass of red wine a movie in the night three-level film deep quiet night, shut the phone quietly in a daze. ";        System.out.println ("Number of words to be detected:" + string.length ());        Long beginTime = System.currenttimemillis ();        set<string> set = Filter.getsensitiveword (String, 1);        Long endTime = System.currenttimemillis ();        System.out.println ("The number of sensitive words contained in the statement is:" + set.size () + ".) Include: "+ Set");        System.out.println ("Total consumption time:" + (Endtime-begintime));    }

Operation Result:

From the above results can be seen, the sensitive thesaurus has 771, the detection statement length of 184 characters, detect 6 sensitive words. It takes a total of 1 milliseconds. The visible speed is still very impressive.

Two documents are available below for download:

Desktop.rar (Http://pan.baidu.com/s/1o66teGU) contains two Java files, one is to read the sensitive thesaurus (sensitivewordinit), One is the sensitive Word tool class (Sensitivewordfilter), which contains the judgment whether there are sensitive words (Iscontaintsensitiveword (String txt,int matchtype)), Gets the sensitive Word (getsensitiveword (string txt, int matchtype)), the sensitive word substitution (Replacesensitiveword (string Txt,int matchtype,string Replacechar)) Three methods.

Sensitive thesaurus: Http://pan.baidu.com/s/1pJoGhVP

http://cmsblogs.com/?p=1031

Java implements sensitive word filtering (RPM)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.