Multi-pattern exact string matching (dirty-word/sensitive-word search): theory of the ttmp algorithm

Source: Internet
Author: User

What is the ttmp algorithm? Sorry, before I published this article you probably couldn't find it anywhere else, because I just created it myself.
What does ttmp mean? Terminator Triggered Multi-Pattern: a multi-pattern algorithm triggered by terminators.
-_-! Hard to parse? It doesn't matter; it should make sense once you've read the rest.

However, this home-grown algorithm is a bit involved. To make sure everyone can follow it, please take a little test first:
Take out your watch or another timer and measure how long it takes you to read the passage below.
The criteria are as follows:
If you take less than 15 seconds, you don't need my article at all; you could create a stronger algorithm yourself;
If you take less than 30 seconds, we can talk as equals;
If you take less than 45 seconds, read on carefully; it may be helpful;
If you take less than 60 seconds, there is treasure here for you to dig up;
If none of the above applies, I suggest you not bother reading on. It is a bit difficult.

Do you raelly know Engilsh?
At laest in egnlish, wehn pepole RAED, Tehy
Usaully wlil not noitce taht the charcatres bewteen
The frist ltteer and the LSAT leettr are not in
The corrcet oredr. In fcat, hmuan Brian does recongize
Wrods by seeknig the fsirt ltteer and the LSAT leettr,
And tehn fnidnig whcih charatcers are insdie of tehm.
See! All the wrods hree wtih mroe tahn 3 leettrs are
All wirtten in a worng way! Do you niotice taht?

Hey! Actually, my instruction in the test above was a bit sneaky: the point was to make you read quickly rather than carefully. Interesting, isn't it?
I didn't make this up; it is the research result of a famous university (Cambridge, reportedly). I don't have time to dig up the original article. I don't know how the text above made you feel; I found it shocking and interesting.

Indeed, according to automata theory, if you read word by word you can in principle untangle the syntactic structure and work out the meaning of a sentence (in theory; in practice no machine achieves real human-level comprehension). But first, if you read every word carefully and look up a syntax table each time, you will be slow; second, doing so requires a great deal of space. The human brain is smarter: after years of training it learns to discard detail automatically. For example, when I read the word "cerroct", I notice that it starts with C, ends with T, and has some mixture of E, O, R and C in the middle. Looking that up, I know it must be the word "cerroct", no matter whether the spelling is right -- oh, sorry, I wrote it wrong again. It should be "correct"!

Hmm? What does this have to do with multi-pattern exact string matching, the topic of this article?
It does! Of course it does. But before I explain the connection, let's analyze the efficiency of multi-pattern exact matching. Before writing, I should warn you that my descriptions below may not be very rigorous, because sometimes rigor itself gets in the way of understanding -- all those "let X = y ..." formalities. Anyway, I went looking for materials on this and they made me dizzy.

What does multi-pattern exact string matching mean? "String" means we search strings for other strings. Single pattern looks like this:
string S = "xxx";
string T = "xx";
S.IndexOf(T);
That is, finding the position (or mere existence) of another string T inside the string S. This is called single-pattern matching: there is only one string T to search for -- a single search pattern.
string S = "xxx";
string[] T = new string[] { "x1", "x2", "x3", ... };
S.Scan(T);
This is called multi-pattern matching, because we want to find a whole group of T's inside S -- for example, checking whether an article contains any of the sensitive words in a dirty-word table. (Scan here is a made-up method name.)
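For illustration, here is the most straightforward way to turn single-pattern search into multi-pattern search -- simply looping the single-pattern call once per entry. This is a Python sketch with names of my own; `str.find` plays the role of IndexOf:

```python
def naive_multi_scan(text, patterns):
    """Naive multi-pattern search: one full pass over the text per pattern."""
    hits = []
    for p in patterns:                # one scan of `text` per pattern
        start = 0
        while True:
            i = text.find(p, start)   # the equivalent of S.IndexOf(p, start)
            if i < 0:
                break
            hits.append((i, p))
            start = i + 1             # keep going to allow overlapping matches
    return sorted(hits)

# Two patterns mean the text is scanned twice:
print(naive_multi_scan("abxab", ["ab", "xa"]))  # [(0, 'ab'), (2, 'xa'), (3, 'ab')]
```

As written this touches the whole text once per pattern, which is exactly the inefficiency the rest of the article is about.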

There are many existing algorithms for multi-pattern matching. I did not study them carefully; I only looked at one that may have been WM, and there is probably also the grep/agrep family. A reminder, though: many algorithms deal with fuzzy matching -- tolerating, say, one wrong character -- and those are not what I want to discuss here. I want to discuss exact search: finding exactly the word itself, not a near-miss variant.

Is multi-pattern exact matching hard? No, it's easy: just loop -- first S.IndexOf(T1), then S.IndexOf(T2) ...... But if you do it this way, efficiency will be very low, because you will scan the text many times. As you can imagine, our goal is to scan the whole article once and find out every sensitive word it contains. That goal is obviously not easy to reach exactly, but at least we can try to approach "scan only once". Before analyzing further, I recommend reading another article:
(Repost) A .NET dirty-word filtering algorithm
The algorithm in that article (call it the xdmp algorithm) scans quickly and its ideas are easy to grasp, so it makes sense to base the discussion on it. First, let's lay out its idea:
1. Scan each character of the article. Only when a character is the first character of some dirty word in the table (call it a "start character") do we try to see whether a dirty word is present (this triggers a search).
2. But rather than blindly looping over every entry in the dirty-word table:
2.1. Look one character further. Check whether that character appears anywhere in the dirty-word table at all. If not, the text cannot match any entry and we can bail out.
2.2. If it does, take the substring from the first detected character up to the character currently scanned, compute its hash value, and see whether a dirty word comes up in the hash table.
If one does, we are done. Otherwise take the next character (repeating 2.1 and 2.2) until no match is possible, or the maximum entry length of the dirty-word table is exceeded.
2.3. If nothing is found, or the substring gets too long, go back to the character after the previous "start character" and continue scanning (repeating 1 and 2) until the end of the article.
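The steps above can be sketched in Python as follows. All names are mine, and a plain set lookup stands in for the hash table discussed later; it is an illustration of the control flow, not the original implementation:

```python
def build_tables(words):
    """Precompute what the scan needs from the dirty-word table."""
    return {
        "starts": {w[0] for w in words},         # "start characters"
        "chars": {c for w in words for c in w},  # every character used by any word
        "words": set(words),                     # stands in for the hash table
        "maxlen": max(len(w) for w in words),
    }

def xdmp_like_scan(text, tables):
    hits = []
    for i, ch in enumerate(text):
        if ch not in tables["starts"]:
            continue                              # step 1: only start chars trigger
        for j in range(i + 1, min(i + tables["maxlen"], len(text)) + 1):
            if j > i + 1 and text[j - 1] not in tables["chars"]:
                break                             # step 2.1: char in no entry, bail
            if text[i:j] in tables["words"]:      # step 2.2: look up the substring
                hits.append((i, text[i:j]))
                break
        # step 2.3: fall through and keep scanning at the next character
    return hits

t = build_tables(["bad", "badge", "dog"])
print(xdmp_like_scan("a bad dog", t))  # [(2, 'bad'), (6, 'dog')]
```

Note how every occurrence of a start character ('b' or 'd' here) triggers a probe, even when no dirty word follows -- the cost the article analyzes next.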

Three important concepts appear here:
1. Scanning: walking through the article to see whether anything needs to be compared against the dirty-word table;
2. Searching: the process of comparing the text against the dirty-word table once a possible match has been found;
3. Start character: the first character of an entry in the dirty-word table.

If we could finish the job by scanning alone, never searching, that would surely be fastest -- but so far, in my ignorance, I have not found such an algorithm.
Or if one scan could retrieve everything, that would be nice too. Unfortunately, I haven't seen that either.
Clearly, scanning should not happen more than once, or efficiency certainly suffers. Searching is therefore the key to the algorithm! There are several ways to improve it:
1. Trigger searches as rarely as possible;
2. When a search must be triggered, traverse as few characters as possible per search;
3. Reduce the work done in each comparison against the dirty-word table.

Measured against these points, the xdmp algorithm above:
1. Scans once (good, nothing to say);
2. Triggers a search whenever any "start character" is found;
3. During a search, traverses 1 + 2 + 3 + ... + n characters, where n is the length of the dirty word it hits (or the nearest length tried);
4. Recomputes the hashcode on every probe. Don't forget that computing a hashcode also means walking the string -- again traversing 1 + 2 + 3 + ... + n characters in total.

So I had some questions:
1. Must a search be triggered every time a "start character" is encountered? Good grief -- the dirty-word table might run to megabytes!
2. Must each triggered search probe substrings of length 1, 2, 3 ...... until either a match succeeds or a character outside the table appears?
3. Must we extract a substring of a specific length for every probe?
4. Must the hash value be computed from scratch for every probe? Can't the hash value from the previous probe of the same triggered search be reused, to cut out redundant computation?

These four problems are basically what I set out to solve. The first two form one class of problem and the last two another. First, the first class:
Well, look back at that scrambled English passage. Any inspiration? Yes! The condition for triggering a search is too loose!
If we haven't even finished reading a word, why start pondering it?
Besides, many of the probes made after triggering are unnecessary. Suppose the dirty-word table contains only "caot mom" and "caon mom": whenever an article contains the character "c" but not a full entry, the xdmp algorithm above will still probe onward for several characters, even though there is no need.

So how do we cut out these unnecessary operations? First, stop triggering a search every time a "start character" is met. What do we do when we scan past a start character, then? Record its position and related information, and keep scanning. Only when we encounter an ending character -- that is, the last character of some entry in the dirty-word table (a "terminator") -- do we consider triggering a search. And during that search, the candidate lengths need not be 1, 2, 3 ...... : because we recorded the various start positions, we may only need to check, say, the length-1 and length-3 cases, or just the length-5 case.
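The terminator-triggered idea can be sketched like this. It is a simplification under my own naming (the real ttmp also reuses hash values, which comes next), but it shows the two changes: start characters are merely recorded, and only terminators trigger a search over the recorded candidates:

```python
def ttmp_like_scan(text, words):
    """Record start-character positions; only a terminator triggers a search."""
    starts = {w[0] for w in words}       # start characters
    ends = {w[-1] for w in words}        # terminators (ending characters)
    table = set(words)
    maxlen = max(len(w) for w in words)

    hits = []
    pending = []                         # recorded start-character positions
    for i, ch in enumerate(text):
        if ch in starts:
            pending.append(i)
        # forget starts that are now too far back to begin any entry
        pending = [p for p in pending if i - p < maxlen]
        if ch not in ends:
            continue                     # no terminator, no search triggered
        for p in pending:                # only the recorded candidates are checked
            if text[p:i + 1] in table:
                hits.append((p, text[p:i + 1]))
    return hits

print(ttmp_like_scan("xx caot mom xx", ["caot mom", "caon mom"]))
```

Between a recorded 'c' and the triggering 'm', no probing happens at all; the article's "cao"-with-no-match case costs nothing beyond the bookkeeping.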

Next, the second class of problem:
The algorithm above uses a hash table to speed up checking whether a substring is in the dirty-word table. To query the table you must produce a hash value, and this causes two performance losses:
1. The hash value is recomputed for every candidate substring;
2. A substring object is extracted every time.
To avoid this, we first need to know how a hash table works:
A hash table derives, from the current string content, a hash value with a reasonably uniform distribution (so that collisions -- different values hashing the same -- are rare), finds the first entry in the table with that hash value, and compares contents. If the contents match, the entry is found; otherwise it moves on to the next entry, until no entries with an equal hash value remain.

So we can attack the problems above as follows:
1. First, design the hash computation so that each result can be derived from the previous one.
For example, XOR in one character at a time (with the advantage that XOR is direction-insensitive), or specify that computation starts from the back of the string.
Why compute from the back? Because ttmp triggers its search on a terminator. Take the text:
abcde
If e is the terminator, the search will probe abcde, bcde, cde, de and e (checking which of these begin at a recorded start character). Computing from the back, the hash of "de" can be derived from the hash of "e" plus the character d, with no need to touch the character e again.
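A toy illustration of this reusable back-to-front hash, using plain XOR of character codes as the text suggests (a real hash would mix bits more thoroughly; the names are mine):

```python
def xor_hash(s):
    """Toy hash: XOR of all character codes (direction-insensitive)."""
    h = 0
    for c in s:
        h ^= ord(c)
    return h

# Terminator-side extension over "abcde" with terminator 'e':
# probe e, de, cde, bcde, abcde, reusing the running hash each step.
text = "abcde"
h = 0
hashes = {}
for j in range(len(text) - 1, -1, -1):
    h ^= ord(text[j])         # a single XOR extends the previous hash
    hashes[text[j:]] = h

# Identical to recomputing each substring from scratch:
assert all(xor_hash(s) == v for s, v in hashes.items())
print(sorted(hashes))         # ['abcde', 'bcde', 'cde', 'de', 'e']
```

Five candidate hashes cost five XORs in total, instead of 1 + 2 + 3 + 4 + 5 character visits.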
2. Second, we can construct the hash table like this:
Dictionary<int, List<string>> hash;
The key is the hash value we just computed. From that value we obtain the list of dirty words under it, and then compare each against the text being checked, character by character. This may look odd: why not look up the word directly by its hash value?
Don't forget that hash values are inherently collision-prone; all I have done is pull the collision handling out into the open. The actual number of comparisons does not increase (inside a hash table, keys must likewise be compared one by one to confirm that the key values are truly equal, not merely that their hashes are equal). The benefit is that we no longer need to extract a substring just so the hash table can compute its hash value (which would traverse every character from the start).
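The exposed-bucket table just described might look like this in Python (a sketch with my own names; the toy XOR hash stands in for the real incremental one):

```python
def xor_hash(s):
    """Toy hash: XOR of all character codes."""
    h = 0
    for c in s:
        h ^= ord(c)
    return h

def build_buckets(words):
    """hash value -> list of dirty words sharing that hash (the collision list)."""
    buckets = {}
    for w in words:
        buckets.setdefault(xor_hash(w), []).append(w)
    return buckets

def lookup(buckets, candidate, h):
    """h is the incrementally computed hash of `candidate`."""
    for w in buckets.get(h, []):   # walk the collision list
        if w == candidate:         # final character-by-character confirmation
            return w
    return None

buckets = build_buckets(["ab", "ba", "cd"])   # "ab" and "ba" collide under XOR
print(lookup(buckets, "ba", xor_hash("ba")))  # 'ba'
print(lookup(buckets, "ps", xor_hash("ps")))  # None: same hash as 'ab', wrong word
```

The lookup takes a precomputed hash, so no substring ever has to be extracted just to feed a hash function.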
With these measures, a triggered search over a candidate of length n traverses at most n characters to obtain all the hash values it needs (each character is touched at most once), whereas the original xdmp algorithm traverses 1 + 2 + ... + n characters.

Of course, the effect of these measures is not dramatic, for three reasons:
1. Our texts are usually normal text with at most the occasional sensitive word, so the performance-loss points above are not hit often;
2. Dirty-word tables are usually not enormous, and the start characters and terminators tend to be concentrated in a limited set of characters, so most of the time the start-character and terminator tables already filter out most of the work;
3. Even when a search really is triggered, dirty words are usually short, or mostly short, so the gain from the improvement above is limited. With two characters, for example, the original algorithm's hash computations traverse 3 characters while ttmp traverses 2 ...... ahem.
With five characters, the original algorithm traverses 15 characters while ttmp traverses only 5 -- now there's a visible gap.
Unfortunately, five-character sensitive words are rare, and an article rarely contains many sensitive words anyway.
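The "3 vs 2" and "15 vs 5" counts above can be checked directly (helper names are mine): for a candidate of length n, the original scheme traverses 1 + 2 + ... + n characters across its hash computations, while ttmp touches each character once.

```python
def xdmp_traversals(n):
    """Characters touched computing hashes of all prefixes of length 1..n."""
    return sum(range(1, n + 1))   # equals n * (n + 1) // 2

def ttmp_traversals(n):
    """Each character is folded into the running hash exactly once."""
    return n

print(xdmp_traversals(2), ttmp_traversals(2))  # 3 2   (the two-character case)
print(xdmp_traversals(5), ttmp_traversals(5))  # 15 5  (the five-character case)
```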

At present the ttmp algorithm is not yet optimized, and it already beats the xdmp algorithm's elapsed time by a factor of roughly 1.5-2.5, which is quite good. Of course, xingd later developed a new algorithm that tested fast; but at the time of my test it was still unstable and missed some matches, so I won't comment on it for now.
As for my ttmp algorithm, there is still plenty of potential to mine. For example, it currently searches forward with precomputed hash values; switching to backward search, computing the hash during the search, should perform better. But I don't plan to keep mining for now; I'm going to put it into practice first.

Er, as I warned at the beginning, this article really is a bit difficult, and I'm not very good at describing things. I wonder whether you've followed it?
Source code? Heh, that's private stock; bookmark this page and wait a while. Of course, if you happen to have source code that can legally mint real RMB -- compilable in VS2005, deployable with a few mouse clicks, no babysitting required while it runs -- and you're willing to trade, I'll think about it ...... The real situation is that I still want to make the algorithm more stable; I can't release half-baked code, can I?
Privately, I'll admit this program is much more complicated than the xdmp algorithm; if you want to understand it someday, you'll need to spend some time.

Oh, a reply to someone, in advance:
The KMP algorithm is a single-pattern matching algorithm, and BM is also said to be single-pattern.
The WM algorithm is multi-pattern. I found an algorithm said to be WM and had a look:
http://blog.chinaunix.net/u/21158/showart_228430.html
I don't know whether that is the one you meant.
I found that its idea is actually similar to KMP/BM: it improves performance mainly through a "skip" technique. However, that article also mentions the following:
If one of the patterns is very short, with a length of only 2, then the distance we can shift never exceeds 2; so short patterns reduce the algorithm's efficiency.

The problem is that dirty-word table entries are generally one to two characters long, so the vast majority of jumps are not very effective. In ttmp, by contrast, even a five-character entry will often never trigger a search at all, because no "terminator" is met within the length limit. WM also requires a shift table, which must be compressed to save space, which means a compression computation for every scanned unit. To sum up, it is not certain who would win a pk between ttmp and WM on the dirty-word search task. And by the way, even WM does not truly scan only once: whenever it fails to skip, it must examine additional characters.

ttmp efficiency summary:
Time: Ot = O(text length) + O[(occurrences of start characters and terminators within the scan window) * avg(entries sharing the same hash value under the same terminator)]
= O(n) + O[f * avg(h)]

Memory: Om = O(character-type table) + O(terminator table) + O{total entries * [hash-table variable overhead + list overhead + avg(entry length)]}
= 256K + 256K + O{N * [12 + 12 + avg(k)]}
= 512K + O[N * (C + k)]
