In a previous project, the customer raised a requirement: the text that users submit through the system's text-to-speech feature must be checked for sensitive words, and submissions that contain sensitive words must be rejected. After looking through various sources, I came up with the following options:
- Method 1: When the application starts, load the sensitive word list into a cache (a large map with each sensitive word as a key and any value). Split the text of each incoming request into words with a word breaker, then look up every word in the map; if any word is found, the text contains a sensitive word.
- Method 2: Join the sensitive word list into one large regular expression and match it directly against the text.
- Method 3: Use the DFA (deterministic finite automaton) algorithm.
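The regular expression option from the list above can be sketched in a few lines. This is only an illustration, not code from the project: the class name `RegexBanWords` and the sample words are made up, and the sketch assumes the word list contains at least one non-empty word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: join the word list into one alternation pattern. */
final class RegexBanWords {
    private final Pattern pattern;

    RegexBanWords(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            // Quote each word so regex metacharacters in it are matched literally
            sb.append(Pattern.quote(w));
        }
        pattern = Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns every match in the order it appears in the text. */
    List<String> find(String text) {
        List<String> hits = new ArrayList<String>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        RegexBanWords checker = new RegexBanWords(Arrays.asList("small ads", "spam"));
        System.out.println(checker.find("No small ads and no SPAM here")); // [small ads, SPAM]
    }
}
```

The pattern is compiled once and reused, but every alternative is still tried at every position of the text, which is probably where the efficiency concern with large word lists comes from.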
To choose between them, I looked at a lot of other people's code online. The simplest is Method 2 with regular expressions, but it is said to become very inefficient on large texts. As for Method 3, the DFA algorithm, I did not pay attention in the algorithms and compiler principles classes at school (ashamed = =|||), so I skipped it outright. In the end I decided to use Method 1.
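For readers curious about the DFA option that was skipped, here is a minimal sketch of the usual idea: build a character trie from the word list (its nodes act as the automaton's states) and walk it from every start position of the text. The class name `TrieBanWords` is hypothetical, and this is not a full Aho-Corasick automaton, which would add failure links to avoid restarting at each position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a character trie used as a simple state machine. */
final class TrieBanWords {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<Character, Node>();
        boolean terminal; // true if a sensitive word ends at this state
    }

    private final Node root = new Node();

    TrieBanWords(List<String> words) {
        for (String w : words) {
            Node node = root;
            for (char c : w.toCharArray()) {
                Node child = node.next.get(c);
                if (child == null) {
                    child = new Node();
                    node.next.put(c, child);
                }
                node = child;
            }
            node.terminal = true;
        }
    }

    /** Returns every sensitive word found, scanning from each start position. */
    List<String> find(String text) {
        List<String> hits = new ArrayList<String>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int end = start; end < text.length(); end++) {
                node = node.next.get(text.charAt(end));
                if (node == null) {
                    break; // no word continues with this character
                }
                if (node.terminal) {
                    hits.add(text.substring(start, end + 1));
                }
            }
        }
        return hits;
    }
}
```

With the words "ab", "abc" and "bc", calling `find("xabcx")` reports all three, including the overlapping ones.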
In fact, Method 1 can still be improved in many ways. Following the method in reply #12 of this post, I combined an indexed array with associative arrays (hash maps), which makes the lookup faster and even removes the word-breaking step. The entire implementation is as follows.
    import org.apache.commons.io.FileUtils;
    import org.apache.commons.lang.StringUtils;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /**
     * User: eternity
     * Date: 2014/8/11
     * Time: 16:17
     * Sensitive word detection class.
     *
     * Initialization rules:
     * Load the sensitive words from the word list and build one hash table per
     * word length (2, 3, 4 and 5 characters in this example). The tables are
     * stored in the array banWordsList, whose subscript is the word length.
     * (The sample words are English glosses of Chinese words; lengths are
     * counted in characters of the original words.)
     * banWordsList[2] = {A horse: true, Masked: true, La La: true};
     * banWordsList[3] = {Some horse: true, Three words: true, La La la: true, small ads: true};
     * banWordsList[4] = {Some bad silver: true, four characters: true, haha haha: true, love Fengjie: true};
     * banWordsList[5] = {Some Dafa is good: true, five sensitive words: true};
     *
     * From these tables the following index is generated automatically.
     * The key is the first character of a sensitive word and the value is an int.
     * In the binary form of that int, bit i (counting from 0 at the lowest bit)
     * is 1 if some sensitive word of length i starts with this character, and 0
     * otherwise. For example, 0x10 is binary 10000: a word of length 4 exists.
     * wordIndex = {Two: 0x04, three: 0x08, four: 0x10, five: 0x20, Some: 0x3c,
     *              La: 0x0c, ha: 0x10, small: 0x08, just: 0x10};
     *
     * Checking rules:
     * 1. Go through the text character by character and look each one up in wordIndex.
     * 2. If it is not in the index, move on to the next character.
     * 3. If it is in the index, then for every word length recorded in its value,
     *    take this character plus the following ones and look the substring up
     *    in banWordsList[word length].
     *
     * Example: check whether "I will just post small ads to annoy the moderator"
     * contains sensitive words.
     * - Check "I": not in the index.
     * - Check "just": in the index with value 0x10, so sensitive words of
     *   length 4 start with "just". Take "just" and the following characters,
     *   4 in total; the result is not in the 4-character table, so continue.
     * - Check "post": not in the index.
     * - Check "small": in the index with value 0x08, so sensitive words of
     *   length 3 start with "small". Take "small" and the following characters,
     *   3 in total, which gives "small ads".
     * - "small ads" is in the 3-character table, so the text contains a
     *   sensitive word and must not be published.
     */
    public class BanWordsUtil {
        // public Logger logger = Logger.getLogger(this.getClass());
        public static final int WORDS_MAX_LENGTH = 10;
        public static final String BAN_WORDS_LIB_FILE_NAME = "BanWords.txt";

        // Sensitive word tables: banWordsList[n] holds the words of length n
        public static Map<String, String>[] banWordsList = null;

        // Index: first character of a word -> bit mask of the word lengths starting with it
        public static Map<String, Integer> wordIndex = new HashMap<String, Integer>();

        /** Initializes the sensitive word tables and the index. */
        @SuppressWarnings("unchecked")
        public static void initBanWordsList() throws IOException {
            if (banWordsList == null) {
                banWordsList = new Map[WORDS_MAX_LENGTH];
                for (int i = 0; i < banWordsList.length; i++) {
                    banWordsList[i] = new HashMap<String, String>();
                }
            }
            // Location of the word list: a plain text file with one sensitive word per line
            String path = BanWordsUtil.class.getClassLoader()
                    .getResource(BAN_WORDS_LIB_FILE_NAME).getPath();
            System.out.println(path);
            List<String> words = FileUtils.readLines(FileUtils.getFile(path));
            for (String w : words) {
                if (StringUtils.isNotBlank(w)) {
                    String word = w.trim().toLowerCase();
                    if (word.length() >= WORDS_MAX_LENGTH) {
                        continue; // skip words that are too long for the tables
                    }
                    // Store the word in the table for its length
                    banWordsList[word.length()].put(word, "");
                    // Set the bit for this length in the index entry of the first character
                    String first = word.substring(0, 1);
                    Integer index = wordIndex.get(first);
                    if (index == null) {
                        index = 0;
                    }
                    index = index | (1 << word.length());
                    wordIndex.put(first, index);
                }
            }
        }

        /**
         * Searches the text for sensitive words.
         *
         * @param content the text to check
         * @return the sensitive words found (in lower case), in order of appearance
         */
        public static List<String> searchBanWords(String content) {
            if (banWordsList == null) {
                try {
                    initBanWordsList();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
            List<String> result = new ArrayList<String>();
            String text = content.toLowerCase(); // the tables store lower-case words
            for (int i = 0; i < text.length(); i++) {
                Integer index = wordIndex.get(text.substring(i, i + 1));
                int p = 0;
                while ((index != null) && (index > 0)) {
                    p++;
                    index = index >> 1;
                    if ((i + p) > text.length()) {
                        break; // not enough characters left for a word of length p
                    }
                    String sub = text.substring(i, i + p);
                    // The lowest bit is now the flag for words of length p
                    if (((index % 2) == 1) && banWordsList[p].containsKey(sub)) {
                        result.add(sub);
                    }
                }
            }
            return result;
        }

        public static void main(String[] args) throws IOException {
            String content = "A test statement containing a sensitive word.";
            BanWordsUtil.initBanWordsList();
            List<String> banWordList = BanWordsUtil.searchBanWords(content);
            for (String s : banWordList) {
                System.out.println("Found sensitive word: " + s);
            }
        }
    }
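The bit mask stored in the index is the least obvious part of the code above, so here is a small standalone sketch of how such a mask is built and read back. The class `LengthMask` and its method names are made up for illustration and are not part of the implementation above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper showing how word lengths are packed into one int. */
final class LengthMask {
    /** Sets bit number `length` in the mask. */
    static int addLength(int mask, int length) {
        return mask | (1 << length);
    }

    /** Lists every word length recorded in the mask, smallest first. */
    static List<Integer> lengths(int mask) {
        List<Integer> result = new ArrayList<Integer>();
        for (int len = 0; mask != 0; len++, mask >>>= 1) {
            if ((mask & 1) == 1) {
                result.add(len);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int mask = 0;
        for (int len : new int[] {2, 3, 4, 5}) {
            mask = addLength(mask, len);
        }
        System.out.println(Integer.toHexString(mask)); // 3c
        System.out.println(lengths(mask));             // [2, 3, 4, 5]
        System.out.println(lengths(0x10));             // [4]
    }
}
```

A single int per first character is enough here because word lengths stay below 32; the search only has to try the substring lengths whose bits are set.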
The test sentence in `main` actually contains no sensitive words (I am afraid of this post being blocked too XD); add a few sensitive words to it when testing and they will be detected. This gives a simple and fast sensitive word check. Of course, if you need more sophisticated detection logic (for a sentence such as "guitar mother is really beautiful", plain substring matching is not enough), you still have to use a word breaker to split the text into words.
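As a rough sketch of the word-breaker variant: split the text into tokens first, then look up whole tokens only. The JDK's `java.text.BreakIterator` stands in for a real word breaker here; for Chinese text a dedicated segmentation library would likely be needed. The class name `TokenBanWords` and the sample words are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: match whole tokens instead of substrings. */
final class TokenBanWords {
    private final Set<String> banned = new HashSet<String>();

    TokenBanWords(List<String> words) {
        for (String w : words) {
            banned.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the tokens of the text that are sensitive words. */
    List<String> find(String text) {
        List<String> hits = new ArrayList<String>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (banned.contains(token.toLowerCase(Locale.ROOT))) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

With "spam" as the only sensitive word, `find("spammer sends Spam daily")` reports only "Spam": the token "spammer" is left alone, whereas a substring search would flag it.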
This is my first time writing with Markdown, haha :)
PS: Thanks to everyone who answered so patiently in the discussion thread. :) http://www.oschina.net/question/1010578_164557
Some summaries about detection of sensitive words in Java