Multi-pattern exact string matching (dirty word / sensitive word search algorithms), Prequel II

Source: Internet
Author: User
I really did not expect this to turn into a Star Wars-style series of prequels. The reason is simple: having come back to my old line of work, algorithms, I keep finding interesting things in this field, even ones as small as sesame seeds. After the previous article was finished I thought there was nothing left to write, but today several problems turned up all at once, enough for another long post.

What triggered it: in the morning I saw a correction posted by xingd. I checked it, fixed a few other problems along the way, and still found a mismatch. It turned out that the xingd algorithm does not handle case at all, i.e. it is case sensitive. The obvious fix is to apply ToLower to both the dirty word table and the text to be checked. That is when I discovered that if dirty word filtering is an Earth-sized problem, ToLower is a Sun-sized one: ToLower consumed roughly 500 times as much time as the matching algorithm itself. Theory is theory and practice is practice; things have to be taken onto the battlefield and tested repeatedly before problems like this surface, problems you would never think up no matter how hard you racked your brain.

The point is that .NET's task is not the same as ours. .NET has to comply strictly with the standards; otherwise people in the hostile camp would no doubt jump up and claim that Microsoft is out to undermine them again. What do we need? Only that the dirty words can be found. As for how a character gets lowercased, any mapping is correct as long as we define it ourselves and apply it consistently. Because the two tasks are different, the .NET code has to be much more complex: it spends a lot of effort on whether different cultures require different transformations. Our task is comparatively simple, so code we write ourselves can be a great deal faster. That is the first reason.
Second, ToLower inevitably produces a new string copy, so every call pays for a memory allocation. That cost can be short or long; if memory is tight and virtual memory has to be used, the gap could easily exceed the 500 times mentioned above.

So how do we ignore case? Simply write a FastLower(char c) function, then build a StringCompare function (and whatever other helpers are needed) on top of it; every function that must be case insensitive calls FastLower. It looks like a clumsy approach that generates a lot of calls, but do not forget that even if you use ToLower, the same amount of computation, or more, happens inside it anyway. I have two candidate implementations for FastLower:
1. Build a table such as byte[] lowerAdd = new byte[char.MaxValue + 1]: for an uppercase letter the entry holds the increment to its lowercase form, and for every other character it is 0. FastLower then never has to examine the character at all; it just adds the offset directly (see the sketch after this list).
2. Write a pile of logical comparisons.
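To make option 1 concrete, here is a minimal sketch of the offset-table variant, assuming the table is filled once at startup; the LowerAdd and FastLowerByTable names are mine, not from the original code, and only the A-Z range is shown:

    // Sketch of approach 1 (illustrative names): a per-character offset table,
    // non-zero only where case folding is needed.
    static readonly byte[] LowerAdd = new byte[char.MaxValue + 1];

    static void InitLowerAdd()
    {
        for (char c = 'A'; c <= 'Z'; c++)
        {
            LowerAdd[c] = 0x20;              // 'A'..'Z' -> 'a'..'z'
        }
        // ...the other ranges handled by the comparison version below
        // would be filled in the same way.
    }

    static char FastLowerByTable(char ch)
    {
        // No per-character branching: just add the precomputed offset.
        return (char)(ch + LowerAdd[ch]);
    }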

Which of the two is faster? The answer is the second one, which is not what I originally expected; I will not go into why, and anyone interested can investigate it themselves. The total running time of the filtering system using method 1 is about 50-60 ms, while the system using method 2 runs in only about … ms. Below is the comparison-based FastLower function (note that because my TTMP algorithm needs to compute a hash code from the result, the return type is uint; adjust as needed):

static public uint FastLower(char ch)
{
    uint val = (uint)ch;

    // Fast path: most characters (including almost all CJK) need no folding.
    if ((val > 0x2179 && val < 0xFF21) || val < 0x41)
    {
        return val;
    }

    if ((val >= 0x41 && val <= 0x5A) ||                 // A-Z
        (val >= 0xC0 && val <= 0xDE && val != 0xD7) ||  // accented Latin capitals, excluding the multiplication sign
        (val >= 0x391 && val <= 0x3A9))                 // Greek capitals Α-Ω
    {
        val |= 0x20;
    }
    else if (val >= 0x400 && val <= 0x42F)              // Cyrillic capitals
    {
        if (val >= 0x410)
        {
            val += 0x20;                                // А-Я -> а-я
        }
        else
        {
            val += 0x50;                                // Ѐ-Џ -> ѐ-џ
        }
    }
    else if (val >= 0x2160 && val <= 0x2169)            // Roman numerals Ⅰ-Ⅹ
    {
        val += 0x10;                                    // lowercase forms start at 0x2170
    }
    else if (val >= 0xFF21 && val <= 0xFF3A)            // fullwidth Ａ-Ｚ
    {
        val += 0x20;
    }

    return val;
}

This function is not completely rigorous, but for this purpose I think it is good enough. It folds the following uppercase characters to lowercase:
Halfwidth English A-Z
Fullwidth English Ａ-Ｚ
Accented Latin capitals (A with a diaeresis and similar symbols)
Greek capitals Α-Ω
Cyrillic capitals (the 0x400-0x42F range)
Roman numerals Ⅰ-Ⅹ

That basically covers everything. If you feel something is missing, it does not matter; just change it yourself, it should be very simple. One deliberate twist in the function above is that the very first test excludes most Chinese characters right away: for characters that occur constantly and need no processing at all, minimizing the number of comparisons is what improves performance.
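As a concrete illustration of how FastLower plugs into the comparison helpers mentioned earlier, here is a minimal sketch of a case-insensitive comparison; the CaseInsensitiveStartsAt name is my own and is not part of the original code:

    // Sketch only: compare a keyword against the text starting at a given
    // position, folding each character with FastLower instead of calling
    // ToLower on whole strings, so no new string is allocated.
    static bool CaseInsensitiveStartsAt(string text, int start, string keyword)
    {
        if (start + keyword.Length > text.Length)
        {
            return false;
        }
        for (int i = 0; i < keyword.Length; i++)
        {
            if (FastLower(text[start + i]) != FastLower(keyword[i]))
            {
                return false;
            }
        }
        return true;
    }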

Besides ToLower, another question caught my attention: what kinds of entries actually get matched? A little analysis showed that for normal text, entries of 1-2 characters account for more than 95% of all hits. I have two somewhat larger texts, both around …K characters long; their 1-2 character hit rates are 99.5% and 98% respectively, and single-character hits make up less than 20% of that. In other words, the overwhelming majority of hits are two-character entries. This leads to several points worth thinking about:

1. Is a space-for-time strategy for the 3rd through nth characters worth it as a way to improve performance? If the algorithm is sound, it is not: unless there is a problem with the algorithm, a scan that is not going to hit should rarely get as far as the nth character in the first place. From this perspective, the TTMP algorithm should already be playing to its normal strength:
if it cannot find a terminator character, it does not blindly start a search. Given the statistics above, hit entries are mostly within 2 characters, so TTMP can usually reject the vast majority of pointless searches.
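The following is only an illustrative sketch of that idea, not the actual TTMP implementation; IsTerminator and CountCandidatePositions are hypothetical names. It shows how a single per-character flag lets the scanner reject most positions before any expensive work is done:

    // Sketch only, assuming IsTerminator[c] has been set for every character
    // that can end some keyword (after FastLower folding).
    static readonly bool[] IsTerminator = new bool[char.MaxValue + 1];

    static int CountCandidatePositions(string text)
    {
        int candidates = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // Cheap rejection: in normal text the vast majority of positions
            // fail this single array lookup and no further work happens.
            if (!IsTerminator[FastLower(text[i])])
            {
                continue;
            }
            // Only here would the expensive steps run (record the start
            // position, compute the hash, query the keyword table, and so on).
            candidates++;
        }
        return candidates;
    }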

2. Another point worth considering is that optimizing the 1-2 character cases gives the best return. I analyzed this result in detail in a reply to one of xingd's articles: from the 3rd character onward, the impact any optimization can have on the algorithm is already only about a tenth. Of course, that is the theoretical, ideal case, i.e. you never search for hits that cannot happen. The later improvement xingd made raises performance for another reason as well: it reduces the probability of starting a search at all. Originally all the characters went into a single table, so the probabilities K2 and K3 were relatively high. The later improvement splits the characters by position, so the character set at the second position shrinks, K2 and K3 drop, and the amount of computation drops with them. This should also be where xingd's improvement comes from; the performance gain is not due to a single factor alone.

3. Looking back, TTMP may not have reached its maximum efficiency yet, because TTMP gains little on 1-2 character keywords; the longer the keyword, the more computation it can save. So anyone wanting more performance should look at improving TTMP for the 1-2 character cases. The single-character case is simple: give one-character keywords their own hit handling, without the heavier steps of recording the start position, computing the hash, and querying the table. For two-character keywords I have no idea yet. At an already very high level of efficiency, every optimization needs careful thought, because it is easy to end up with an "optimization" that costs more than it saves. For example, one could consider an encoded, compressed, larger character-classification table (indicating whether a character is a direct hit, a start character, the maximum keyword length it can begin, or some combination of these). In my view that particular optimization is very likely not worth the candle, because the overwhelming majority of positions are misses: with an encoding-compression scheme the decoding work has to be done on every scanned character, and that total could end up several orders of magnitude larger than what the "don't execute what you don't need" approach spends. Maybe once I have thought it through I will publish my own scheme; in any case it will certainly be in the TTMP-B mode, which should be more efficient than the TTMP-F mode I am using now, and there will be a later article to explain it.

4. WM and similar algorithms probably have no room to shine in our "dirty word filtering" task. The reason is simple: first, our pattern set contains single-character patterns; second, even if we handle those separately, the two-character entries that dominate the matches give WM no advantage (relative to TTMP). Because the fundamental logic of the WM algorithm is to construct a shift table keyed on a prefix of 2 to 3 characters, this implies the following:
A. Every pattern must be at least two characters long, and in most cases the scan can only advance by one character anyway;
B. Two characters are 32 bits, so without encoding compression the shift table takes at least {[(2^32)/8] * 2} bytes: 2^32 entries at one bit each is 512 MB, and doubling that gives 1 GB, which is a luxury for an ordinary server. And if we do apply encoding compression, then, as in point 3 above, it is almost certainly not worth it. In my two normal texts, for example, there are only about 1000 hits, i.e. the number of non-hit characters is roughly 140 times the number of hit characters; even if the WM algorithm could skip all of them, whatever performance advantage that brings may not cover the loss the encoding compression adds.

Summary (this applies whether or not you are writing the core algorithm, as long as you write code):
1. Always pay attention to your own task objective. If it differs from the framework's objective, consider writing the piece yourself, even if the functionality looks similar or even identical;
2. When optimizing, always consider the most common case. Here, for example, non-hits are the most common outcome, scan misses are the most common event, and when there is a hit, very short entries are the most common. Only by working these out can you know where the performance bottleneck really is, instead of optimizing blindly. Much of the time an excellent algorithm is only effective under particular conditions; it is powerless against problems such as scan misses or case handling, or rather those are simply outside its scope. If you pick a core algorithm capable of finishing in 10 ms but wrap it in ToLower, your effort may well be wasted.

Like Star Wars, the prequels may well keep coming.
