A Java implementation of a sensitive-word filter, with an example

Source: Internet
Author: User

Filtering sensitive words out of text is an essential function for a website, so designing a good, efficient filtering algorithm matters. A while ago a friend of mine (about to graduate, and new to programming) asked me to look at a text filter he had written; he said its lookups were very slow. I took the program and had a look. The whole process was: read the sensitive word library, store it in a HashSet, get the text uploaded from the page, and then match. I figured this process was bound to be slow. For someone with so little experience, this is probably the only approach that comes to mind; a slightly more advanced one is regular expressions. Unfortunately, neither method is practical. Admittedly, I did not know offhand which algorithm would solve the problem either, but Google does!

Introduction to the DFA

Among text-filtering algorithms, the DFA is one of the better choices. A DFA, or deterministic finite automaton, obtains the next state from an event and the current state, that is, event + state = nextstate. The following figure shows its state transitions.
In this figure, the uppercase letters (S, U, V, Q) are states, and the lowercase letters a and b are actions. From the diagram we can read off the following relationships:

S --a--> U,   S --b--> V,   U --b--> V

When filtering sensitive words we need to keep computation to a minimum, and a DFA involves almost no computation, only state transitions.
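To make event + state = nextstate concrete, here is a minimal sketch that stores the three transitions listed above in nested maps. The names TinyDfa, addTransition and next are hypothetical, and state Q is left out because the text gives no transition for it.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a transition table for S --a--> U, S --b--> V, U --b--> V.
class TinyDfa {
    // TRANSITIONS.get(state).get(event) is the next state.
    private static final Map<Character, Map<Character, Character>> TRANSITIONS = new HashMap<>();

    static {
        addTransition('S', 'a', 'U');
        addTransition('S', 'b', 'V');
        addTransition('U', 'b', 'V');
    }

    private static void addTransition(char from, char event, char to) {
        TRANSITIONS.computeIfAbsent(from, k -> new HashMap<>()).put(event, to);
    }

    // Returns the next state, or null when no transition is defined.
    static Character next(char state, char event) {
        Map<Character, Character> row = TRANSITIONS.get(state);
        return row == null ? null : row.get(event);
    }

    public static void main(String[] args) {
        System.out.println(next('S', 'a')); // U
        System.out.println(next('U', 'b')); // V
        System.out.println(next('V', 'a')); // null
    }
}
```

Each step is a single map lookup, which is why the approach involves so little computation.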

Implementing the DFA algorithm for sensitive-word filtering in Java

The key to filtering sensitive words in Java is implementing the DFA algorithm. First, let us analyze the figure above. For our purposes, the following structure is clearer.

[Figure: the diagram redrawn as a lookup structure]

Here there are no state transitions and no actions, only queries (lookups). We can think of it this way: through S we query U and V, through U we query V and P, and through V we query U and P. With this change of view, a state transition becomes a lookup in a Java collection.
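The three lookups just described can be sketched with a plain map of sets. This is only an illustration under my own naming; QueryTable, build and canReach are hypothetical and do not come from the original program.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: "through S query U, V" becomes table.get('S').
class QueryTable {
    static Map<Character, Set<Character>> build() {
        Map<Character, Set<Character>> table = new HashMap<>();
        table.put('S', new LinkedHashSet<>(Arrays.asList('U', 'V')));
        table.put('U', new LinkedHashSet<>(Arrays.asList('V', 'P')));
        table.put('V', new LinkedHashSet<>(Arrays.asList('U', 'P')));
        return table;
    }

    // True if `to` can be queried directly through `from`.
    static boolean canReach(Map<Character, Set<Character>> table, char from, char to) {
        Set<Character> next = table.get(from);
        return next != null && next.contains(to);
    }

    public static void main(String[] args) {
        Map<Character, Set<Character>> table = build();
        System.out.println(table.get('S'));           // [U, V]
        System.out.println(canReach(table, 'U', 'P')); // true
        System.out.println(canReach(table, 'S', 'P')); // false
    }
}
```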

Suppose our sensitive word library contains the following words: 日本人 ("Japanese person"), 日本鬼子 ("Japanese devils"), and 毛泽东 ("Mao Zedong"). What kind of structure do we need to build?

First: query 日 ---> {本}, query 本 ---> {人, 鬼}, query 人 ---> {null}, query 鬼 ---> {子}. This gives the following structure:

Next we extend this diagram:

In this way we build the sensitive word library into a set of trees, so that when we check whether a word is sensitive we greatly narrow the range that has to be searched. For example, to check 日本人 ("Japanese person"), the first character tells us which tree to search, and we then search only within that tree.

But how do we know where a sensitive word ends? We use a flag to mark it.

So the key is how to build such a tree of sensitive words. Below I use Java's HashMap to implement the DFA algorithm. The process is as follows, taking 日本人 ("Japanese person") and 日本鬼子 ("Japanese devils") as examples:

1. Look up 日 in the HashMap. If it is not there, no sensitive word beginning with 日 exists yet, so we build such a tree directly. Go to 3.

2. If it is found, a sensitive word beginning with 日 already exists. Set hashMap = hashMap.get("日") and go back to 1 to match 本 and 人 in turn.

3. Check whether the character is the last one in the word. If it is, the sensitive word ends here, so set the flag isEnd = 1; otherwise set isEnd = 0.

The program is implemented as follows:

// Requires: java.util.HashMap, java.util.Iterator, java.util.Map, java.util.Set.
/**
 * Reads the sensitive word library, puts the sensitive words into a HashSet,
 * and builds a DFA model:<br>
 * 中 = {
 *     isEnd = 0
 *     国 = {
 *         isEnd = 1
 *         人 = {
 *             isEnd = 0
 *             民 = {isEnd = 1}
 *         }
 *         男 = {
 *             isEnd = 0
 *             人 = {isEnd = 1}
 *         }
 *     }
 * }
 * 五 = {
 *     isEnd = 0
 *     星 = {
 *         isEnd = 0
 *         红 = {
 *             isEnd = 0
 *             旗 = {isEnd = 1}
 *         }
 *     }
 * }
 * @author chenming
 * @date April 20, 2014, 3:04:20 PM
 * @param keyWordSet the sensitive word library
 * @version 0
 */
@SuppressWarnings({ "rawtypes", "unchecked" })
private void addSensitiveWordToHashMap(Set<String> keyWordSet) {
    // Size the container up front to reduce rehashing
    sensitiveWordMap = new HashMap(keyWordSet.size());
    String key = null;
    Map nowMap = null;
    Map<String, String> newWorMap = null;
    // Iterate over keyWordSet
    Iterator<String> iterator = keyWordSet.iterator();
    while (iterator.hasNext()) {
        key = iterator.next();          // the current sensitive word
        nowMap = sensitiveWordMap;      // start again from the root
        for (int i = 0; i < key.length(); i++) {
            char keyChar = key.charAt(i);
            Object wordMap = nowMap.get(keyChar);   // look up this character

            if (wordMap != null) {      // the key exists: descend into it
                nowMap = (Map) wordMap;
            } else {                    // the key does not exist: build a map with isEnd = 0,
                                        // because this is not known to be the last character
                newWorMap = new HashMap<String, String>();
                newWorMap.put("isEnd", "0");
                nowMap.put(keyChar, newWorMap);
                nowMap = newWorMap;
            }

            if (i == key.length() - 1) {
                nowMap.put("isEnd", "1");   // last character of the word
            }
        }
    }
}

The resulting HashMap structure is as follows:

{五={星={红={isEnd=0, 旗={isEnd=1}}, isEnd=0}, isEnd=0}, 中={isEnd=0, 国={isEnd=0, 人={isEnd=1}, 男={isEnd=0, 人={isEnd=1}}}}}
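For comparison, the same kind of tree can be sketched with a small typed node class instead of raw maps that hold an "isEnd" entry. This is an alternative illustration, not the author's code: WordTree, Node, add and isWord are hypothetical names, and the sample words are 中国人 ("Chinese person") and 中国男人 ("Chinese man").

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a typed version of the nested-map tree.
class WordTree {
    // One node per character; `end` plays the role of the isEnd flag.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean end;
    }

    final Node root = new Node();

    // Walks or creates one node per character, then marks the last node as a word end.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        if (!word.isEmpty()) {
            node.end = true;
        }
    }

    // True only if the whole string is a stored word, not merely a prefix of one.
    boolean isWord(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.end;
    }

    public static void main(String[] args) {
        WordTree tree = new WordTree();
        tree.add("中国人");
        tree.add("中国男人");
        System.out.println(tree.isWord("中国人"));   // true
        System.out.println(tree.isWord("中国"));     // false: only a prefix
        System.out.println(tree.isWord("中国女人")); // false: no such path
    }
}
```

A boolean field avoids mixing character keys and the string key "isEnd" in one map, at the cost of one extra class.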

We now have a simple way to build the sensitive word library, so how do we search it? Retrieval is nothing more than a series of HashMap get calls: if the lookups reach the end of a word, the text is a sensitive word; otherwise it is not. Suppose we match 中国人民万岁 ("long live the Chinese people"). The process is as follows:

1. The first character, 中, can be found in the HashMap. Get a new map = hashMap.get("中").

2. If map == null, the text is not a sensitive word. Otherwise go to 3.

3. Get isEnd from the map and check whether it equals 1 to decide whether this character is the last one of a word. If isEnd == 1, the word is a sensitive word; otherwise go back to 1 with the next character.

With these steps we can determine that 中国人民 ("Chinese people") is a sensitive word, whereas 中国女人 ("Chinese women") is not.

/**
 * Checks whether the text contains a sensitive word starting at beginIndex.<br>
 * @author chenming
 * @date April 20, 2014, 4:31:03 PM
 * @param txt        the text to check
 * @param beginIndex the index at which matching starts
 * @param matchType  the match rule (minimum or maximum)
 * @return the length of the sensitive word if one is found, otherwise 0
 * @version 0
 */
@SuppressWarnings({ "rawtypes" })
public int checkSensitiveWord(String txt, int beginIndex, int matchType) {
    int matchFlag = 0;       // number of characters matched so far
    int matchedLength = 0;   // length of the longest complete word seen so far
    char word = 0;
    Map nowMap = sensitiveWordMap;
    for (int i = beginIndex; i < txt.length(); i++) {
        word = txt.charAt(i);
        nowMap = (Map) nowMap.get(word);    // look up the current character
        if (nowMap != null) {               // found: check whether it ends a word
            matchFlag++;
            if ("1".equals(nowMap.get("isEnd"))) {
                matchedLength = matchFlag;  // a complete word ends here
                if (SensitivewordFilter.minMatchType == matchType) {
                    break;                  // minimum rule: return at once; maximum rule keeps looking
                }
            }
        } else {                            // not found: stop
            break;
        }
    }
    // Only complete words count. Returning the raw matchFlag would report a
    // prefix that never reaches an isEnd = 1 entry as if it were a match.
    return matchedLength;
}
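A per-position check like the one above still needs a driver that walks the whole text. The sketch below shows one way such a driver could look, including the difference between the minimum and maximum match rules. It is not the code from the download: WordScanner, matchLength, findWords and maskWords are hypothetical names, the constants 1 and 2 for the two rules are an assumption, and ASCII sample words are used only to keep the example short.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: scans a text with a character tree and collects or masks matches.
class WordScanner {
    static final int MIN_MATCH = 1; // assumed value: stop at the shortest word
    static final int MAX_MATCH = 2; // assumed value: keep going for the longest word

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    WordScanner(String... words) {
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            if (!word.isEmpty()) {
                node.end = true;
            }
        }
    }

    // Length of the word that starts at `begin`, or 0 if no word starts there.
    int matchLength(String text, int begin, int matchType) {
        Node node = root;
        int matched = 0;
        for (int i = begin; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.end) {
                matched = i - begin + 1;
                if (matchType == MIN_MATCH) {
                    break;
                }
            }
        }
        return matched;
    }

    // Tries every start position and collects the words found, in order of appearance.
    Set<String> findWords(String text, int matchType) {
        Set<String> found = new LinkedHashSet<>();
        int i = 0;
        while (i < text.length()) {
            int length = matchLength(text, i, matchType);
            if (length > 0) {
                found.add(text.substring(i, i + length));
                i += length;
            } else {
                i++;
            }
        }
        return found;
    }

    // Replaces every character of each matched word with `maskChar`.
    String maskWords(String text, int matchType, char maskChar) {
        StringBuilder out = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            int length = matchLength(text, i, matchType);
            for (int j = i; j < i + length; j++) {
                out.setCharAt(j, maskChar);
            }
            i += Math.max(length, 1);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        WordScanner scanner = new WordScanner("bad", "badge", "evil");
        String text = "a badge for evil deeds";
        System.out.println(scanner.findWords(text, MIN_MATCH));      // [bad, evil]
        System.out.println(scanner.findWords(text, MAX_MATCH));      // [badge, evil]
        System.out.println(scanner.maskWords(text, MAX_MATCH, '*')); // a ***** for **** deeds
    }
}
```

Skipping ahead by the matched length means overlapping words are not reported twice; whether that is the desired behavior depends on the application.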

At the end of the article I provide the Java sensitive-word filter files for download. Here is a test method to demonstrate the efficiency and reliability of the algorithm.

public static void main(String[] args) {
    SensitivewordFilter filter = new SensitivewordFilter();
    System.out.println("Number of sensitive words: " + filter.sensitiveWordMap.size());
    String string = "Too much sadness may be confined to the plot of the base screen, and the protagonist tries to use some kind of way gradually to release the suicide Guide with the sadness of their own experience. "
            + "and then Falun Gong our role is to follow the protagonist's Joy Red Guest union anger and music, and too far-fetched to their own feelings are attached to the screen plot, and then moved to tears, "
            + "Sad to lie in the bosom of a person to indulge in the elaboration of the heart or mobile phone card replicator A person a glass of wine a movie in the night, three-level film deep quiet night, shut the phone quietly in a daze. ";
    System.out.println("Number of characters in the sentence to check: " + string.length());
    long beginTime = System.currentTimeMillis();
    Set<String> set = filter.getSensitiveWord(string, 1);
    long endTime = System.currentTimeMillis();
    System.out.println("Number of sensitive words in the sentence: " + set.size() + ". Contains: " + set);
    System.out.println("Total time taken (ms): " + (endTime - beginTime));
}

Run Result:

From the results above we can see that the library holds 771 sensitive words, the sentence checked is 184 characters long, and 6 sensitive words were found, all in 1 millisecond. The speed is clearly quite respectable.

The following two document downloads are available:

Desktop.rar (http://xiazai.jb51.net/201611/yuanma/Desktop_ Jb51.rar) contains two Java files: one reads the sensitive word library (SensitiveWordInit), and the other is a sensitive-word utility class (SensitivewordFilter) with three methods: checking whether a sensitive word is present (isContaintSensitiveWord(String txt, int matchType)), getting the sensitive words (getSensitiveWord(String txt, int matchType)), and replacing sensitive words (replaceSensitiveWord(String txt, int matchType, String replaceChar)).

Sensitive Word Library: Click to download

That is all for this article. I hope it helps with your learning, and I hope you will support the Cloud Habitat community.
