Analysis of a DFA-Based Sensitive Word Query Algorithm
Copyright of this article is shared by the author, Li Xiaohui, and Blog Park. If you reproduce it, please clearly indicate the source: http://www.cnblogs.com/naaoveGIS/
1. Background
To filter sensitive words in a project, the following solutions are available:
A. Concatenate the sensitive words into a String and search it with the indexOf method.
B. Store the sensitive words in a database and query them with SQL, the traditional approach.
C. Use Lucene to build a word segmentation index and query it.
D. Use the DFA algorithm.
First, the project has collected thousands of sensitive words, so Solution A would certainly perform poorly. Second, to ease future extension and minimize dependence on the database, Solution B was abandoned. Third, Lucene acts as a local index, so adding a sensitive word requires triggering an index update; in keeping with the lightweight principle we also did not want to introduce more libraries, so Solution C was dropped. We therefore chose Solution D as the subject of study.
2. Introduction to the DFA Algorithm
DFA stands for Deterministic Finite Automaton. It has a finite set of states and a number of edges that lead from one state to another, each edge labeled with a symbol. One state is the initial state and some states are final states. Unlike a nondeterministic finite automaton, a DFA never has two edges that leave the same state with the same symbol.
Simply put, it obtains the next state from the event and the current state, that is, event + state = nextstate. You can think of the system as a set of nodes: the incoming event determines which route is taken to another node, and the number of nodes is finite.
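The relation event + state = nextstate can be illustrated with a minimal sketch. The class name TinyDfa, the state names S0, S1, S2 and the accepted input "ab" are assumptions made up for this illustration; they do not come from the project.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy DFA for illustration only: it accepts exactly the input "ab".
class TinyDfa {
    // transitions.get(state).get(event) yields the next state.
    private final Map<String, Map<Character, String>> transitions = new HashMap<>();
    private final String initialState = "S0";
    private final String finalState = "S2";

    TinyDfa() {
        addEdge("S0", 'a', "S1");
        addEdge("S1", 'b', "S2");
    }

    private void addEdge(String from, char event, String to) {
        // put() overwrites, so one (state, event) pair can never have two targets.
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(event, to);
    }

    // event + state = nextstate; returns null when the state has no edge for the event.
    String next(String state, char event) {
        Map<Character, String> edges = transitions.get(state);
        return edges == null ? null : edges.get(event);
    }

    // Feeds the input one character at a time and checks that a final state is reached.
    boolean accepts(String input) {
        String state = initialState;
        for (int i = 0; i < input.length() && state != null; i++) {
            state = next(state, input.charAt(i));
        }
        return finalState.equals(state);
    }

    public static void main(String[] args) {
        TinyDfa dfa = new TinyDfa();
        System.out.println(dfa.accepts("ab"));  // true
        System.out.println(dfa.accepts("aa"));  // false
    }
}
```

Because each (state, event) pair maps to at most one next state, a lookup never has to choose between alternatives, which is the deterministic property described above.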
3. The DFA Algorithm in Sensitive Word Search
3.1 Description of Sensitive Dictionary Construction
The description uses two sensitive words as an example: Wang babao and Wang babao. First, a sensitive word dictionary named SensitiveMap is constructed. The binary tree structure of the two words is as follows:
[Figure: tree structure of the two sensitive words]
The hash table is constructed as follows:
[Figure: hash table structure of SensitiveMap]
3.2 Description of the Algorithm Based on the Sensitive Dictionary
The SensitiveMap constructed in the preceding example serves as the sensitive dictionary. Assume that the input text is: wang ba is not good. The flowchart is as follows:
[Figure: flowchart of the query against SensitiveMap]
4. Code Implementation
4.1 Sensitive Dictionary Construction Code
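A minimal sketch of how such a dictionary could be built is given here; it is not the project's original code. The class name SensitiveDictionary, the Node type and the isEnd flag are assumptions, and "abc" and "abd" are placeholder words chosen only because they share a prefix.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: the names below are illustrative assumptions, not the article's code.
class SensitiveDictionary {
    // One DFA state: outgoing edges keyed by character, plus an end-of-word flag.
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd;
    }

    // Builds the dictionary; words that share a prefix share the same path of nodes.
    static Node build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            if (word.isEmpty()) {
                continue;  // never mark the root itself as the end of a word
            }
            Node node = root;
            for (char c : word.toCharArray()) {
                // Reuse the existing edge for c, or create a new state for it.
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.isEnd = true;  // a complete sensitive word ends at this node
        }
        return root;
    }

    public static void main(String[] args) {
        // "abc" and "abd" are placeholder words, not taken from the article.
        Node sensitiveMap = build(Arrays.asList("abc", "abd"));
        Node b = sensitiveMap.next.get('a').next.get('b');
        System.out.println(b.next.keySet());        // the two branches after "ab"
        System.out.println(b.next.get('c').isEnd);  // true
    }
}
```

Each node holds its own hash table of outgoing characters, so following one character costs a single hash lookup regardless of how many words the dictionary contains.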
4.2 Sensitive Word Query Code
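The query step can be sketched as follows, again as an illustration rather than the original code. The class SensitiveWordQuery and its methods addWord, matchLength and findAll are hypothetical names, and the choice to report the longest match at each position is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: names and the longest-match policy are illustrative assumptions.
class SensitiveWordQuery {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    void addWord(String word) {
        if (word.isEmpty()) {
            return;
        }
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.isEnd = true;
    }

    // Length of the longest sensitive word that starts at 'start', or 0 if there is none.
    int matchLength(String text, int start) {
        Node node = root;
        int matched = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));  // event + state = nextstate
            if (node == null) {
                break;  // no edge for this character, so the walk stops
            }
            if (node.isEnd) {
                matched = i - start + 1;  // remember the longest complete word so far
            }
        }
        return matched;
    }

    // Scans the text from left to right and collects every sensitive word found.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                hits.add(text.substring(i, i + len));
                i += len;  // continue after the matched word
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SensitiveWordQuery query = new SensitiveWordQuery();
        query.addWord("abc");  // placeholder words, not taken from the article
        query.addWord("abd");
        System.out.println(query.findAll("xxabcxxabxabd"));  // [abc, abd]
    }
}
```

The "ab" in the middle of the sample text is followed along the shared prefix but never reaches a node marked as a word end, so it is not reported. An input that contains only the beginning of a sensitive word is treated the same way.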
5. Optimization Ideas
5.1 Meaningless Characters in the Middle of Sensitive Words
Some input, such as "King * 8 & egg", has meaningless characters inserted in the middle of a sensitive word to disguise it. Alongside the sensitive words, we should therefore collect a list of meaningless characters and skip them when the matching loop encounters one, so that they do not interfere with the match.
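One way this skipping could look is sketched here. The class NoiseTolerantQuery, the filler set "*&" and the placeholder word "abc" are all assumptions for illustration. The sketch also assumes that a match must begin on a real character rather than on a filler.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: the class name and the filler characters are illustrative assumptions.
class NoiseTolerantQuery {
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();
    // Hypothetical set of meaningless characters; a real list would be collected separately.
    private final Set<Character> noise = new HashSet<>();

    NoiseTolerantQuery(String noiseChars) {
        for (char c : noiseChars.toCharArray()) {
            noise.add(c);
        }
    }

    void addWord(String word) {
        Node node = root;
        boolean any = false;
        for (char c : word.toCharArray()) {
            if (noise.contains(c)) {
                continue;  // keep the dictionary itself free of fillers
            }
            node = node.next.computeIfAbsent(c, k -> new Node());
            any = true;
        }
        if (any) {
            node.isEnd = true;
        }
    }

    // Number of text characters (fillers included) covered by the longest match at 'start', or 0.
    int matchLength(String text, int start) {
        if (start >= text.length() || noise.contains(text.charAt(start))) {
            return 0;  // assumption: a match has to begin on a real character
        }
        Node node = root;
        int matched = 0;
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (noise.contains(c)) {
                continue;  // skip the filler and stay in the current state
            }
            node = node.next.get(c);
            if (node == null) {
                break;
            }
            if (node.isEnd) {
                matched = i - start + 1;
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        NoiseTolerantQuery query = new NoiseTolerantQuery("*&");
        query.addWord("abc");  // placeholder word, not taken from the article
        String text = "a*b&c!";
        int len = query.matchLength(text, 0);
        System.out.println(text.substring(0, len));  // a*b&c
    }
}
```

The returned length counts the skipped fillers, so the caller can mask or remove the disguised word exactly as it appears in the input.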
5.2 Pinyin Used in Place of Sensitive Words
There are two solutions. The first, and the simplest, is to enrich the sensitive word lexicon whenever such a case is encountered, which resolves it quickly. The second is to convert sensitive words into pinyin for comparison and judgment.
However, neither solution solves this problem completely, and further research is needed.