String multi-pattern exact matching (dirty word / sensitive word search algorithm): the algorithm prequel

In the previous article I sketched the ideas behind my home-grown TTMP algorithm. It looks quite good and powerful; it is probably not the strongest, but I find it satisfactory, and it has at least reached the level of being usable in practice. So what else is there to write about? I have not written technical articles of this kind for a long time, and I am full of ideas:
1. Apart from the algorithm itself, where else do we lose efficiency?
2. The TTMP algorithm was only described roughly last time, far from the level of detail found in a paper; I could explain it in depth when I find the time.
3. There are actually at least two ways to implement the TTMP algorithm mentioned last time, namely forward scanning and backward scanning. What are the differences between them?

Well, today's topic is the first one: leave the algorithm itself aside for a moment and look at what else affects our performance.

When we filter sensitive words, three elements are always involved:
1. The dirty word table
2. The text to be checked
3. The search process

First, the dirty word table. Generally we did not invent this table ourselves, and we would rather have nothing to do with it. Why? Because nobody does this for fun: usually someone asks you to do the filtering, so you have to do it, and that same someone hands you a dirty word table. However, the table usually does not come from a single source, and the file's formatting tends to be messy, so you end up with duplicate entries. Someone might compile a deduplicated table by hand, but I think manual sorting is usually too time-consuming, and letting the program do it has two advantages:
1. We do not have to worry about duplicate entries.
2. When a new batch of dirty words arrives, it can simply be pasted onto the end of the original file, which saves the manual cost of sorting it.

So when we start the dirty word filtering process, the first thing to do is to remove duplicates. Of course, this deduplication pass is itself a performance cost, but it runs only once and its running time is acceptable. Moreover, one such pass typically serves the scanning of at least several hundred articles, so the amortized cost is close to zero. The process itself is very simple; the following code is all you need (the same can be done with a HashSet):

// Deduplicate the dirty word table, ignoring case.
Dictionary<string, string> fixKeys = new Dictionary<string, string>(StringComparer.InvariantCultureIgnoreCase);
foreach (string key in scanKeys)
{
    fixKeys[key] = key;   // later duplicates simply overwrite earlier ones
}
scanKeys = new string[fixKeys.Count];
fixKeys.Values.CopyTo(scanKeys, 0);
return scanKeys;

Note: the code above assumes that your algorithm can ignore case in the text to be checked.
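For completeness, here is one way the HashSet variant mentioned above might look, wrapped in a small helper; the method name DeduplicateKeys and the class KeyPreparation are my own invention for this sketch, not part of the original code.

using System;
using System.Collections.Generic;

static class KeyPreparation
{
    // Same deduplication as the Dictionary version above, but with a HashSet.
    // The comparer again assumes the search itself will be case-insensitive.
    static string[] DeduplicateKeys(string[] scanKeys)
    {
        HashSet<string> uniqueKeys = new HashSet<string>(scanKeys, StringComparer.InvariantCultureIgnoreCase);
        string[] result = new string[uniqueKeys.Count];
        uniqueKeys.CopyTo(result, 0);
        return result;
    }
}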
Someone once asked me: why is my filter still slow even though I used a good algorithm? I think this is one possible reason. What we call "algorithms" usually only address the core problem; an issue like this one is generally not discussed in the papers on the WM/grep family of algorithms. Whichever algorithm we use, this deduplication step is the same, and you may consider it necessary or unnecessary. But if you really want to improve efficiency, it is worth your attention. Put another way: before you start on the core task, you have to ask whether the preparatory work has been done properly.

From this perspective, another possible optimization is to reduce the dirty word table to a minimal set. For example, when both XYZ and X appear in the original table, would you remove XYZ? That depends on your task. If the task is only to decide whether an article contains any sensitive word, and to hand it over for manual review if it does, then the minimal set is enough and maximum matching is not required. If, however, the task is to cover every dirty word with asterisks, then you should not reduce the table to a minimal set; instead you have to strengthen the algorithm (even at some cost in performance). I remember a book saying something like: do not shoot without a target. First know where the target is so that you can aim at it; only then is it worth discussing how to hit it. Quite often you can judge a developer's level, at least in some respects, by whether a piece of code actually aims at its purpose. My task is the former, so I do not need to reduce to the minimal set. By the way, for the latter kind of task the usual algorithms have a defect. For example:
Dirty word table:
XYZ
X
Text to be checked:
Wxyza
After masking it becomes:
W*yza
instead of what we expect:
W***a
Such a result can be unsatisfactory, because part of the indecent content can still be made out.
In this case, even after X has been found you cannot exit the loop; you have to keep looking for a longer match. That process can be optimized, but in any case it costs somewhat more than the original. A rough sketch of this longest-match masking is shown below.
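The following is only a brute-force illustration of the idea, not the TTMP algorithm itself; the names MaskDemo, DirtyWords and MaskLongest are hypothetical. At every position it masks the longest dirty word starting there, so "Wxyza" comes out as "W***a" rather than "W*yza".

using System;

class MaskDemo
{
    // Hypothetical dirty word table containing both "XYZ" and "X".
    static readonly string[] DirtyWords = { "XYZ", "X" };

    // Brute force: at every position, mask the LONGEST dirty word that starts there.
    static string MaskLongest(string text)
    {
        char[] buffer = text.ToCharArray();
        for (int i = 0; i < text.Length; i++)
        {
            int best = 0;
            foreach (string word in DirtyWords)
            {
                if (word.Length > best &&
                    i + word.Length <= text.Length &&
                    string.Compare(text, i, word, 0, word.Length,
                                   StringComparison.InvariantCultureIgnoreCase) == 0)
                {
                    best = word.Length; // remember the longest match at this position
                }
            }
            for (int j = 0; j < best; j++)
            {
                buffer[i + j] = '*';
            }
        }
        return new string(buffer);
    }

    static void Main()
    {
        Console.WriteLine(MaskLongest("Wxyza")); // prints W***a
    }
}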

Of course, if you want the best of both, you could write a program that takes the original dirty word table, reduces it to the minimal set, and then generates a cleaned-up table file for the core algorithm to use. For now I have not chosen to do this: writing such a program would itself take a fair amount of time, and there is another problem: if the task later turns into the second kind, the minimal set can no longer be used and the work may turn out to be wasted.
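If you did decide to build that reduction step, the idea is simply to drop every entry that contains another, shorter entry as a substring; one rough sketch is below. ReduceToMinimalSet and DirtyTableTools are hypothetical names, and the input is assumed to have been deduplicated already.

using System;
using System.Collections.Generic;

static class DirtyTableTools
{
    // Keep only entries that do NOT contain a shorter entry as a substring.
    // For a "does this article contain any sensitive word?" task, the smaller
    // table flags exactly the same set of articles.
    public static string[] ReduceToMinimalSet(string[] keys)
    {
        List<string> minimal = new List<string>();
        foreach (string key in keys)
        {
            bool redundant = false;
            foreach (string other in keys)
            {
                if (other.Length < key.Length &&
                    key.IndexOf(other, StringComparison.InvariantCultureIgnoreCase) >= 0)
                {
                    redundant = true; // e.g. "XYZ" is redundant when "X" is also present
                    break;
                }
            }
            if (!redundant)
            {
                minimal.Add(key);
            }
        }
        return minimal.ToArray();
    }
}

The double loop is O(n^2) over the table, but like deduplication it only runs once per table, so the cost is negligible compared with scanning articles.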

The next performance loss point is reading the text to be checked. Anyone who has read a book on operating system design knows that while we wait on disk I/O the CPU sits idle. The text to be checked lives either in a database or in separate files, and in both cases the CPU will spend time waiting on I/O. Besides, a server may have more than one CPU and more than one core, so to raise the overall search throughput, using multiple threads is another necessary means. Many articles on the net already discuss this, so I will only add one reminder: do not squeeze performance to the absolute limit, or you may hurt ordinary users of the same machine. Another thing to watch, whether or not you use multiple threads, is to make sure that only one process does this job at a time. If several processes run the same check simultaneously, they will most likely all be scanning the same articles, and that kind of meaningless loss cannot be won back no matter how fast the algorithm is. There are many workable approaches: turn the scanner into a service so that there is always exactly one process doing the work, with a synchronization lock inside that forbids re-entry; or use a database-level lock so that nobody else can start another scan from a web page. A rough sketch of the single-process guard combined with multithreading follows.
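This is only a sketch under stated assumptions: a machine-wide named Mutex guarantees one scanning process, and Parallel.ForEach spreads the work across cores. LoadPendingArticles and ScanArticle are hypothetical placeholders for "read from the database or from files" and "run the search"; they are not part of the original code.

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class ScannerHost
{
    static void Main()
    {
        // A machine-wide named mutex: if another scanning process already
        // owns it, exit instead of re-checking the same articles.
        bool createdNew;
        using (Mutex singleInstance = new Mutex(true, "Global\\DirtyWordScanner", out createdNew))
        {
            if (!createdNew)
            {
                Console.WriteLine("Another scan is already running; exiting.");
                return;
            }

            IEnumerable<string> articles = LoadPendingArticles();

            // Let the thread pool overlap CPU work with I/O waits and use
            // every core, instead of scanning articles one by one.
            Parallel.ForEach(articles, article => ScanArticle(article));

            singleInstance.ReleaseMutex();
        }
    }

    static IEnumerable<string> LoadPendingArticles() { yield break; }

    static void ScanArticle(string article) { }
}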

Is it necessary to go for a distributed architecture? In general we are not a search engine, and the amount of our own data is fairly limited; I suspect a single server is already more than capable of this kind of job, and squeezing it with the multithreading described above is usually enough. Distribution brings plenty of technical difficulty, and development and debugging become troublesome, so I suggest treating it as the last resort. Before that, consider configuring a server with more and faster CPUs, or simply trading space for time. Frankly, I find it hard to imagine users uploading text faster than one server can process it; by the time that happens, the first thing that needs to go distributed will probably not be the dirty word filter, but the programs that serve your site's users.

Everything above can loosely be called an algorithm. Strictly speaking it is not, but it is still code and still part of the solution to the problem, and that is what the "prequel" in my title refers to. What remains is the search process itself. I hope this article has given you one more angle on improving efficiency: do the necessary preparation so that your algorithm can run at its best. Next I plan to take apart the concepts of TTMP-F and TTMP-B, their advantages and disadvantages, and a rough implementation; the detailed code walkthrough is still brewing.
