Analysis of a DFA-Based Sensitive Word Query Algorithm



Copyright of this article is shared by the author, Li Xiaohui, and Blog Park (cnblogs). If you reproduce it, please clearly cite the source: http://www.cnblogs.com/naaoveGIS/

1. Background

To filter sensitive words in a project, the following solutions are available:

A. Concatenate the sensitive words into a single String and query it with the indexOf method.

B. Store the sensitive words in a database and query them with SQL, the traditional approach.

C. Use Lucene to build a word segmentation index and query it.

D. Use the DFA algorithm.

First, the project has collected thousands of sensitive words, so Solution A would clearly perform poorly. Second, to ease future extension and minimize dependence on the database, Solution B was abandoned. Third, Lucene maintains a local index, which would have to be updated whenever a sensitive word is added; in keeping with a lightweight design, we also did not want to introduce additional libraries, so Solution C was dropped. We therefore chose Solution D as the subject of study.

2. Introduction to DFA Algorithms

DFA stands for Deterministic Finite Automaton. It has the following features: a finite set of states and a set of edges, each leading from one state to another and each labeled with a symbol. One state is the initial state, and some states are final states. Unlike a nondeterministic finite automaton, a DFA never has two edges that leave the same state with the same symbol.

 

Simply put, a DFA derives the next state from the current state and an event, that is, event + state = nextstate. You can picture the system as a finite number of nodes: each incoming event determines which edge to follow to reach another node.
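The rule event + state = nextstate can be illustrated with a minimal Java sketch. The class `TinyDfa`, its state numbers, and the accepted string "ab" are all hypothetical, chosen only for illustration; they do not come from the original article.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: a tiny DFA that accepts exactly the string "ab".
final class TinyDfa {
    private static final int START = 0;
    private static final int ACCEPT = 2;

    // transitions.get(state).get(symbol) is the single next state, if one exists.
    private final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();

    TinyDfa() {
        addEdge(0, 'a', 1);
        addEdge(1, 'b', 2);
    }

    private void addEdge(int from, char symbol, int to) {
        // A Map keyed by symbol guarantees at most one edge per (state, symbol) pair.
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(symbol, to);
    }

    boolean accepts(String input) {
        int state = START;
        for (char symbol : input.toCharArray()) {
            Map<Character, Integer> edges = transitions.get(state);
            Integer next = (edges == null) ? null : edges.get(symbol);
            if (next == null) {
                return false; // no edge for this symbol: reject
            }
            state = next; // event + state = nextstate
        }
        return state == ACCEPT;
    }

    public static void main(String[] args) {
        TinyDfa dfa = new TinyDfa();
        System.out.println(dfa.accepts("ab"));
        System.out.println(dfa.accepts("aa"));
    }
}
```

Because each state keeps its outgoing edges in a map keyed by symbol, the determinism property (no two edges with the same symbol from one state) holds by construction.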

3. Applying the DFA algorithm to sensitive word search

3.1 Building the sensitive word dictionary

The description uses two sensitive words that share the prefix "wang ba". First, a sensitive word dictionary named SensitiveMap is constructed. The tree structure of the two words is as follows:

 

The hash table is constructed as follows:

 

3.2 Description of the search algorithm based on the sensitive word dictionary

The SensitiveMap constructed in the preceding example serves as the sensitive word dictionary. Assume that the input text is: wang ba is not good. The search proceeds as shown in the following flowchart:

4. Code

4.1 Code for building the sensitive word dictionary
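The following is a minimal Java sketch of the dictionary construction described in section 3.1; it is not the author's original listing. The names `SensitiveDictionary`, `Node`, `build`, and `isEnd` are hypothetical, and the placeholder words "abc" and "abd" stand in for real sensitive words. Each node holds a hash table from the next character to the child node, so words with a common prefix share the same path.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a nested hash-table dictionary (prefix tree) of sensitive words.
final class SensitiveDictionary {

    // One state of the automaton: outgoing edges keyed by character.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd; // true if a sensitive word ends at this node
    }

    static Node build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            if (word == null || word.isEmpty()) {
                continue; // ignore empty entries
            }
            Node current = root;
            for (int i = 0; i < word.length(); i++) {
                // Reuse the existing child for a shared prefix, or create a new one.
                current = current.children.computeIfAbsent(word.charAt(i), c -> new Node());
            }
            current.isEnd = true;
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = build(Arrays.asList("abc", "abd"));
        Node ab = root.children.get('a').children.get('b');
        System.out.println(ab.children.keySet()); // the two words branch after "ab"
    }
}
```

A dedicated `Node` class is used here instead of raw nested maps only to keep the end-of-word flag type-safe; a plain `HashMap` with a reserved key for the flag would work as well.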

 

4.2 Code for querying sensitive words
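The query step from section 3.2 can be sketched as follows. This is an illustrative Java sketch rather than the author's original listing; the class `SensitiveWordFilter` and the methods `matchLength` and `findAll` are hypothetical names, and the sketch assumes that the longest match at each position is the one to report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: scan a text against a prefix-tree dictionary of sensitive words.
final class SensitiveWordFilter {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd; // true if a sensitive word ends at this node
    }

    private final Node root = new Node();

    SensitiveWordFilter(List<String> words) {
        for (String word : words) {
            Node current = root;
            for (char c : word.toCharArray()) {
                current = current.children.computeIfAbsent(c, k -> new Node());
            }
            if (current != root) {
                current.isEnd = true; // skip empty words
            }
        }
    }

    // Length of the longest sensitive word that starts at text[start], or 0 if none.
    int matchLength(String text, int start) {
        Node current = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            current = current.children.get(text.charAt(i));
            if (current == null) {
                break; // no edge for this character: stop walking
            }
            if (current.isEnd) {
                longest = i - start + 1;
            }
        }
        return longest;
    }

    // All sensitive words found in the text, in order of appearance.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int length = matchLength(text, i);
            if (length > 0) {
                hits.add(text.substring(i, i + length));
                i += length; // continue after the matched word
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SensitiveWordFilter filter = new SensitiveWordFilter(Arrays.asList("abc", "abd"));
        System.out.println(filter.findAll("xxabdyyabczab"));
    }
}
```

Each character of the text is looked up with a single hash-table access per step, so the cost of a scan depends on the text length and the longest word, not on the number of words in the dictionary.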

 

5. Optimization ideas

5.1 Meaningless characters inserted into sensitive words

Some texts pad a sensitive word with meaningless characters to disguise it, for example "King * 8 & egg". Such characters should be stripped when sensitive words are collected, and during matching the loop should skip over them so that they do not interfere.
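One way to skip filler characters during matching is sketched below in Java. The class `NoiseTolerantMatcher` and the helper `isNoise` are hypothetical, and treating every character that is not a letter or digit as noise is an assumption made for this sketch; a real project would need its own definition. The dictionary words are assumed to contain no noise characters themselves.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: match sensitive words while skipping filler characters.
final class NoiseTolerantMatcher {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    static Node build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            Node current = root;
            for (char c : word.toCharArray()) {
                current = current.children.computeIfAbsent(c, k -> new Node());
            }
            if (current != root) {
                current.isEnd = true;
            }
        }
        return root;
    }

    // Assumption: anything that is not a letter or digit counts as noise.
    static boolean isNoise(char c) {
        return !Character.isLetterOrDigit(c);
    }

    // Length of the matched span starting at text[start], noise included, or 0 if none.
    static int matchLength(Node root, String text, int start) {
        Node current = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (current != root && isNoise(c)) {
                continue; // skip filler, but only after a word has started
            }
            current = current.children.get(c);
            if (current == null) {
                break;
            }
            if (current.isEnd) {
                longest = i - start + 1;
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        Node root = build(Arrays.asList("abc"));
        System.out.println(matchLength(root, "a*b&c", 0));
    }
}
```

Noise is skipped only after the first character of a candidate word has matched, so a match never begins on a filler character.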

5.2 Sensitive words written in pinyin instead of Chinese characters

There are two possible solutions. The simplest is to extend the sensitive word lexicon whenever such a case is encountered, which resolves it quickly. The second is to convert sensitive words into pinyin and compare on the pinyin form.

However, neither solution solves the problem completely, and further research is needed.

 


 
