Sensitive Word Filtering Algorithm

Source: Internet
Author: User

Use DFA for text filtering

DFA and text filtering

Text filtering is an essential feature for large websites and is required by many text-heavy sites, so designing an efficient text filtering system is important.

The requirement, briefly: given a set A and a set B, determine which elements of A belong to B. Taking JavaEye as an example, when a user posts an article (set A), we need to determine whether any words in that article belong to set B, where B is generally a list of prohibited words.

Those unfamiliar with the problem may first think of using contains or regular expressions, but these methods do not hold up well as the word list and the traffic grow. A better fit is a DFA (deterministic finite automaton).
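For comparison, the contains approach mentioned above can be sketched as follows. This is only an illustrative baseline; the class name NaiveFilter and the method findBanned are made up for this sketch and are not from the original article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NaiveFilter {
    // Returns every banned word that occurs somewhere in the text.
    // Each contains() call rescans the text, so the work grows with
    // (text length) x (number of banned words).
    static List<String> findBanned(String text, List<String> banned) {
        List<String> hits = new ArrayList<>();
        for (String word : banned) {
            if (text.contains(word)) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> banned = Arrays.asList("DFA", "disgusting", "da");
        System.out.println(findBanned("Hello DFA world", banned));
    }
}
```

With a handful of words this is perfectly adequate; the repeated rescans only become a concern when the banned list is long and many texts must be checked.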

1. DFA Introduction

Those who have studied compilers will know that, in the lexical analysis phase, the source text is split into tokens by a deterministic finite automaton (DFA). However, DFAs are not used only in lexical analysis, and they are not limited to the computer field.

The basic function of a DFA is to derive the next state from the current state and an event, that is, event + state = nextstate.

Let's look at a state chart that can be found everywhere:

Uppercase letters are states, while lowercase letters are actions. We can see that S + a = U, U + a = Q, S + b = V, and so on. In general, a matrix can represent the entire state transition process:

State/character   a   b
---------------------------
S                 U   V
U                 Q   V
V                 U   Q
Q

However, a state chart can be represented by many data structures; the matrix above is just a simple example for ease of understanding. The text filtering system described next uses a different data structure for its automaton model.
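The transition matrix above can be sketched in Java as a two-dimensional array. The class name TransitionTable, the state numbering, and the NONE marker are assumptions made for this sketch; in particular, the row for Q is empty in the matrix, so Q is assumed here to have no outgoing transitions.

```java
final class TransitionTable {
    // States S, U, V, Q are numbered 0..3; inputs a and b are numbered 0..1.
    static final int S = 0, U = 1, V = 2, Q = 3;
    static final int NONE = -1; // assumption: marks "no transition defined"

    // TABLE[state][input] = next state, copied from the matrix in the text.
    static final int[][] TABLE = {
        /* S */ {U, V},
        /* U */ {Q, V},
        /* V */ {U, Q},
        /* Q */ {NONE, NONE},
    };

    // Feeds the input (only 'a' and 'b' are valid) through the table,
    // starting in S. Returns the final state, or NONE if it gets stuck.
    static int run(String input) {
        int state = S;
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') {
                return NONE;
            }
            state = TABLE[state][c - 'a'];
            if (state == NONE) {
                return NONE;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println(run("aa") == Q); // S + a = U, U + a = Q
        System.out.println(run("ba") == U); // S + b = V, V + a = U
    }
}
```

Each step is a single array lookup, which is the "event + state = nextstate" rule in its most direct form.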

2. Text Filtering

In a text filtering system that must cope with high concurrency, it is important to reduce the amount of computation as much as possible. A DFA involves almost no computation, only state transitions. However, building a state machine from a list of prohibited words is troublesome to implement with a matrix. The following describes a simpler implementation: a tree structure.

All prohibited words are essentially sequences of ASCII codes, and the text to be filtered is likewise a sequence of ASCII codes, for example:

Input: a = [101, 102, 105, 112, 110]

List of prohibited words:

[102,105]

[98,112]

Our task is to build a DFA from the two prohibited words above, so that the input a can be searched for prohibited words by following transitions on this DFA.

Tree structure: the basic way to implement the DFA relies on the relationship between array indexes and array values (the double-array trie is based on the same idea).

So 102 can be seen as an array index, and 105 is an index into the next array, which the entry at index 102 points to. If nothing follows the entry at index 105, the prohibited word ends there.

In this way, the DFA can be built as a tree.
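The index-based tree described above might look like the following sketch. The names ArrayTrie, Node, insert, and isProhibited are hypothetical, and the fixed size of 128 slots assumes every code is a 7-bit ASCII value. An explicit end flag is used instead of "no value after the index" so that one prohibited word may also be a prefix of another.

```java
final class ArrayTrie {
    // One node is one array of 128 slots; an ASCII code is used as the index.
    static final class Node {
        final Node[] next = new Node[128];
        boolean end; // true when a prohibited word ends at this node
    }

    final Node root = new Node();

    // Adds one prohibited word, given as ASCII codes in the range 0..127.
    void insert(int[] word) {
        Node node = root;
        for (int code : word) {
            if (node.next[code] == null) {
                node.next[code] = new Node();
            }
            node = node.next[code];
        }
        node.end = true;
    }

    // Follows the codes from the root. Returns true only if the whole path
    // exists and a prohibited word ends exactly at its last node.
    boolean isProhibited(int[] codes) {
        Node node = root;
        for (int code : codes) {
            if (code < 0 || code >= 128) {
                return false;
            }
            node = node.next[code];
            if (node == null) {
                return false;
            }
        }
        return node.end;
    }

    public static void main(String[] args) {
        ArrayTrie trie = new ArrayTrie();
        trie.insert(new int[] {102, 105});
        trie.insert(new int[] {98, 112});
        System.out.println(trie.isProhibited(new int[] {102, 105}));
        System.out.println(trie.isProhibited(new int[] {102}));
    }
}
```

The array lookup makes each transition constant-time at the cost of 128 slots per node, which is the trade-off that more compact layouts such as the double-array trie try to soften.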

Next, traverse each byte of the input text and perform the state transitions in the DFA to determine whether the input contains a prohibited word.

import java.util.ArrayList;
import java.util.List;

public class DFA {
    // Prohibited words to load into the tree.
    private String[] arr = {"DFA", "disgusting", "da"};
    // Root of the tree; its character is a placeholder and is never matched.
    private Node rootNode = new Node('R');
    // Text to be searched.
    private String content = "Hello DFA world DFA, haha! disgusting";
    // Every prohibited word found in content.
    private List<String> words = new ArrayList<String>();
    // Characters of the candidate word currently being matched.
    private List<String> word = new ArrayList<String>();
    // Current position in content.
    int a = 0;

    private void searchWord() {
        char[] chars = content.toCharArray();
        Node node = rootNode;
        while (a < chars.length) {
            node = findNode(node, chars[a]);
            if (node == null) {
                // Mismatch: rewind to where this attempt started, so that
                // a++ below resumes one character after that start.
                node = rootNode;
                a = a - word.size();
                word.clear();
            } else if (node.flag == 1) {
                // A prohibited word ends here: record it.
                word.add(String.valueOf(chars[a]));
                StringBuilder sb = new StringBuilder();
                for (String str : word) {
                    sb.append(str);
                }
                words.add(sb.toString());
                // Rewind to the start of the match; a++ below moves on by one.
                a = a - word.size() + 1;
                word.clear();
                node = rootNode;
            } else {
                // Partial match: remember the character and keep going.
                word.add(String.valueOf(chars[a]));
            }
            a++;
        }
    }

    // Builds the tree from all prohibited words.
    private void createTree() {
        for (String str : arr) {
            char[] chars = str.toCharArray();
            if (chars.length > 0) {
                insertNode(rootNode, chars, 0);
            }
        }
    }

    // Inserts cs[index..] below node, reusing existing children.
    private void insertNode(Node node, char[] cs, int index) {
        Node n = findNode(node, cs[index]);
        if (n == null) {
            n = new Node(cs[index]);
            node.nodes.add(n);
        }
        if (index == (cs.length - 1)) {
            n.flag = 1;
        }
        index++;
        if (index < cs.length) {
            insertNode(n, cs, index);
        }
    }

    // Returns the child of node that holds character c, or null if none.
    private Node findNode(Node node, char c) {
        List<Node> nodes = node.nodes;
        Node rn = null;
        for (Node n : nodes) {
            if (n.c == c) {
                rn = n;
                break;
            }
        }
        return rn;
    }

    public static void main(String[] args) {
        DFA dfa = new DFA();
        dfa.createTree();
        dfa.searchWord();
        System.out.println(dfa.words);
    }

    private static class Node {
        public char c;
        public int flag; // 1: a word ends here, 0: continuation
        public List<Node> nodes = new ArrayList<Node>();

        public Node(char c) {
            this.c = c;
            this.flag = 0;
        }

        public Node(char c, int flag) {
            this.c = c;
            this.flag = flag;
        }
    }
}
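The listing above only collects the words it finds. A variant that actually masks them could be sketched as follows; this is not the article's code. The class name MaskingFilter and its methods are hypothetical, children are kept in a HashMap rather than a list so that each transition is a single lookup, and the longest match at each start position is masked, whereas the listing above stops at the first word end it reaches.

```java
import java.util.HashMap;
import java.util.Map;

final class MaskingFilter {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean end; // true when a prohibited word ends at this node
    }

    private final Node root = new Node();

    // Adds one prohibited word to the tree.
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.end = true;
    }

    // Returns a copy of text in which every character of each matched
    // word is replaced with '*'. Matching always reads the original text,
    // so earlier masking never hides a later match.
    String mask(String text) {
        char[] out = text.toCharArray();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            int matchEnd = -1; // last index of the longest match from start
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break;
                }
                if (node.end) {
                    matchEnd = i;
                }
            }
            for (int i = start; i <= matchEnd; i++) {
                out[i] = '*';
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        MaskingFilter filter = new MaskingFilter();
        filter.add("DFA");
        filter.add("disgusting");
        filter.add("da");
        System.out.println(filter.mask("Hello DFA world DFA, haha! disgusting"));
    }
}
```

Matching is case-sensitive here, as in the listing above; a real filter would likely normalize case and width before the lookup.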

 
