In settings such as forums and chat rooms, we often need to block a large number of bad words to protect the user experience. For a single keyword, a native indexOf call or a regular expression is efficient enough. With many keywords, however, repeatedly calling indexOf or running a regular expression over the full text for each word becomes very expensive. Because the target string is usually large, it is important to obtain all results in a single pass over the text.
Given this requirement, a natural approach is to match each character of the full text in sequence. Take this text as an example: 'Mike Jordan had said "Just do it", so Mark has been a coder.' If our keywords are "Mike" and "Mark", we iterate through the sentence; when we find "M", we check whether the next character is "i" or "a". If we can keep matching all the way to the end of a keyword, we have found it; otherwise we continue traversing. The keywords should therefore be structured like this:
var keywords = {
    M: {
        i: {
            k: {
                e: {end: true}
            }
        },
        a: {
            r: {
                k: {end: true}
            }
        }
    }
};
As the data above shows, this is a tree structure. Building the tree from the keyword list is relatively time-consuming, but the keywords are known in advance, so the structure can be created once before any matching takes place. The code is as follows:
function buildTree(keywords) {
    var tblCur = {},
        key, str_key, length, j, i;
    var tblRoot = tblCur;
    for (j = keywords.length - 1; j >= 0; j -= 1) {
        str_key = keywords[j];
        length = str_key.length;
        for (i = 0; i < length; i += 1) {
            key = str_key.charAt(i);
            if (tblCur.hasOwnProperty(key)) {
                // Reuse the existing node for this character
                tblCur = tblCur[key];
            } else {
                // Create a new node and step into it
                tblCur = tblCur[key] = {};
            }
        }
        tblCur.end = true; // the last character of a keyword
        tblCur = tblRoot; // back to the root for the next keyword
    }
    return tblRoot;
}
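As a quick check of the idea, the sketch below builds the tree for the two example keywords and prints it. The compact buildTree here is a restatement written for illustration (the local names root, node and ch are not from the original code), so treat it as a sketch rather than the exact implementation.

```javascript
// Compact restatement of the tree-building step, for illustration only.
function buildTree(keywords) {
    const root = {};
    for (const word of keywords) {
        let node = root;
        for (let i = 0; i < word.length; i += 1) {
            const ch = word.charAt(i);
            // Reuse the child node if it exists, otherwise create it.
            if (!Object.prototype.hasOwnProperty.call(node, ch)) {
                node[ch] = {};
            }
            node = node[ch];
        }
        node.end = true; // marks the last character of a keyword
    }
    return root;
}

const tree = buildTree(["Mike", "Mark"]);
console.log(JSON.stringify(tree));
// {"M":{"i":{"k":{"e":{"end":true}}},"a":{"r":{"k":{"end":true}}}}}
```

The shared prefix "M" is stored only once, which is what lets a single pass over the text test both keywords at the same time.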
This code uses a chained assignment: tblCur = tblCur[key] = {}. Pay attention to the order of evaluation here. Assignment associates from right to left, and the member access tblCur[key] is resolved against the object that tblCur refers to before anything is reassigned. So the new empty object is first stored as the key property of the current node, and only then does tblCur move to that new node. Together with tblRoot = tblCur, the effect is equivalent to:
var tblCur = {};
var tblRoot = tblCur; // both names refer to the same object
tblCur['key'] = {}; // now tblRoot = {key: {}}
tblCur = tblCur['key']; // tblCur moves down to the new node
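The evaluation order is easy to verify directly. The following small sketch (the variable names root, cur and key are chosen just for this demonstration) shows that the new node is attached to the object the variable pointed to before the assignment:

```javascript
// `cur = cur[key] = {}` attaches the new node to the object that `cur`
// referred to BEFORE the assignment, then moves `cur` down to the new node.
const root = {};
let cur = root;
const key = "M";

cur = cur[key] = {};

console.log(Object.keys(root)); // [ 'M' ]
console.log(cur === root.M);    // true
console.log(cur === root);      // false
```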
With the code above, the data needed for the query is ready. Next, let's look at the query interface.
For each character of the target string, we start matching from the top of the keyword tree. First we look up keywords[a]; if it exists, we look up keywords[a][b], and so on. If we reach keywords[a][b]...[x].end === true, the match succeeds. If keywords[a][b]...[x] is undefined, the attempt fails and matching restarts from keywords[a] at the next position.
function search(content, tblRoot) { // tblRoot: the tree returned by buildTree
    var tblCur,
        p_star = 0,
        n = content.length,
        p_end,
        match_end, // index just past the longest match, or -1 if none
        match_key,
        arrMatch = []; // stores the results
    while (p_star < n) {
        tblCur = tblRoot; // back to the root
        p_end = p_star;
        match_end = -1;
        while (p_end < n) {
            match_key = content.charAt(p_end);
            if (!tblCur.hasOwnProperty(match_key)) { // this attempt has ended
                break;
            }
            tblCur = tblCur[match_key];
            p_end += 1;
            if (tblCur.end) { // matched to the tail of a keyword
                match_end = p_end;
            }
        }
        if (match_end !== -1) { // longest match
            arrMatch.push({
                key: content.substring(p_star, match_end),
                begin: p_star,
                end: match_end
            });
            p_star = match_end; // continue after the matched keyword
        } else {
            p_star += 1; // no keyword starts here; try the next position
        }
    }
    return arrMatch;
}
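To see the query in action, here is a self-contained sketch that runs a search over the example sentence from the beginning of the article. Both functions are compact restatements written for this demonstration; passing the tree as a second argument and the local names (root, node, matchEnd) are assumptions of the sketch, not part of the original interface.

```javascript
// Build the keyword tree (compact restatement, for illustration only).
function buildTree(keywords) {
    const root = {};
    for (const word of keywords) {
        let node = root;
        for (let i = 0; i < word.length; i += 1) {
            const ch = word.charAt(i);
            if (!Object.prototype.hasOwnProperty.call(node, ch)) {
                node[ch] = {};
            }
            node = node[ch];
        }
        node.end = true;
    }
    return root;
}

// Scan the text once, reporting the longest keyword found at each position.
function search(content, root) {
    const results = [];
    let start = 0;
    while (start < content.length) {
        let node = root;
        let pos = start;
        let matchEnd = -1; // index just past the longest match so far
        while (pos < content.length &&
               Object.prototype.hasOwnProperty.call(node, content.charAt(pos))) {
            node = node[content.charAt(pos)];
            pos += 1;
            if (node.end) {
                matchEnd = pos;
            }
        }
        if (matchEnd !== -1) {
            results.push({ key: content.slice(start, matchEnd), begin: start, end: matchEnd });
            start = matchEnd;
        } else {
            start += 1;
        }
    }
    return results;
}

const text = 'Mike Jordan had said "Just do it", so Mark has been a coder.';
const found = search(text, buildTree(["Mike", "Mark"]));
console.log(found.map((m) => m.key)); // [ 'Mike', 'Mark' ]
```

Each result also carries begin and end, so a caller could use them to replace the matched range with asterisks.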
The above is the core of the whole keyword matching system. It makes good use of the characteristics of the JS language, and it is very efficient. I tested it on a 500,000-character text ("Search God"), looking for a given set of 300 idioms, and matching took about 1 second. Importantly, the target text is traversed in a single pass, so query time scales roughly linearly with its length. The number of keywords has a greater influence on the query time, because every character of the target text is looked up in the keyword tree.
Simple analysis
After reading the above you may wonder: every character is checked against all the keywords, and even though some keywords share prefixes, surely a full traversal is very time-consuming? But object properties in JS are organized as a hash table, which is very different from a plain array traversal and far more efficient than scanning an array in order. Some readers may not be familiar with data structures, so here is a brief introduction to hash tables.
First, let's look at how data is stored.
Data in memory has two parts: the value and the address. Think of memory as the Xinhua Dictionary: the definitions of the characters are the values, and the index is the addresses. The dictionary is sorted by pronunciation, so characters with the same pronunciation, such as "ni", sit together in one place. That is like an array, laid out neatly in one contiguous area of memory, so you can ask directly for the 1st or the 10th "ni". The structure is shown below:
[Figure: array structure]
The advantage of an array is that traversal is simple and any item can be accessed directly by its subscript. Adding or deleting a particular item, however, is expensive. For example, to delete the 6th item, every item after it must move forward one position. To remove the first item, the whole array has to move, which is very costly.
To solve the problem of additions and deletions in arrays, the linked list appeared. If we split each value into two parts, one storing the original value and the other storing an address that points to another structure of the same kind, and so on, the result is a linked list. The structure is shown below:
[Figure: linked list structure]
As the diagram above shows, deleting from a linked list is very simple: just rewrite the next of the item before the target so that it skips the target. Looking up the value of a particular item, however, is hard, because you have to walk the list in order until you reach the target position.
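The trade-off between the two structures can be sketched in a few lines. The node shape { value, next } and the helper nth are assumptions made for this illustration:

```javascript
// Array: removing an item shifts every later item one slot forward.
const arr = [10, 20, 30, 40];
arr.splice(1, 1);   // remove the 2nd item; 30 and 40 move forward
console.log(arr);   // [ 10, 30, 40 ]

// Linked list: each node stores a value and the address of the next node.
const c = { value: 30, next: null };
const b = { value: 20, next: c };
const a = { value: 10, next: b };

// Delete b by rewriting the `next` of the node in front of it.
a.next = b.next;

// Reading the n-th value still requires walking from the head.
function nth(head, n) {
    let node = head;
    for (let i = 0; i < n && node; i += 1) {
        node = node.next;
    }
    return node ? node.value : undefined;
}

console.log(nth(a, 1)); // 30
```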
To combine the advantages of these two structures, you, smart reader, have surely thought of the following structure.
[Figure: hash table structure]
This data structure is the hash table. The array stores the head addresses of linked lists, forming a two-dimensional table. How the data is distributed over it is decided by the hash algorithm. There are many such algorithms, but the principle is the same: pass the key through a function, then place the data according to the result. In other words, the key is mapped to an actual address, so we no longer access data by array subscript or by plain traversal; instead, we locate it through the hash function. An object in JS is a hash structure. Suppose we define obj: after hashing, obj.name might land at position 90 in the figure above. When we operate on obj.name, the engine automatically uses the hash algorithm to navigate to position 90 for us, which means we start searching the linked list directly at item 12 of the array rather than traversing the whole block of memory from 0.
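To make the "array of linked-list heads" picture concrete, here is a toy hash table. The hash function, the bucket count of 16 and the class name ToyHashTable are arbitrary choices for illustration; real JS engines use far more sophisticated schemes.

```javascript
// Map a string key to a bucket index in the range [0, size).
function hash(key, size) {
    let h = 0;
    for (let i = 0; i < key.length; i += 1) {
        h = (h * 31 + key.charCodeAt(i)) % size;
    }
    return h;
}

// An array of buckets; each bucket is the head of a linked list of entries.
class ToyHashTable {
    constructor(size = 16) {
        this.buckets = new Array(size).fill(null);
    }

    set(key, value) {
        const i = hash(key, this.buckets.length);
        for (let n = this.buckets[i]; n; n = n.next) {
            if (n.key === key) {
                n.value = value; // key already present: overwrite
                return;
            }
        }
        // Prepend a new node to this bucket's linked list.
        this.buckets[i] = { key, value, next: this.buckets[i] };
    }

    get(key) {
        const i = hash(key, this.buckets.length);
        // Only the one bucket selected by the hash is walked.
        for (let n = this.buckets[i]; n; n = n.next) {
            if (n.key === key) {
                return n.value;
            }
        }
        return undefined;
    }
}

const table = new ToyHashTable();
table.set("name", "obj");
table.set("age", 3);
console.log(table.get("name"));    // obj
console.log(table.get("missing")); // undefined
```

A lookup jumps straight to one bucket and walks only the short list stored there, instead of scanning every stored entry.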
When JS defines an object obj = {key: value}, the key is converted to a string and hashed to obtain a memory address, and the value is stored there. This explains why we can add and delete properties freely, why an array in JS can also be given properties, and why arrays have no so-called out-of-bounds access.
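These behaviours are easy to observe. The short sketch below (the names obj and list are chosen for the demonstration) shows the key conversion, adding and deleting properties, and an array carrying an extra property:

```javascript
const obj = {};
obj[1] = "one";                // the key 1 is converted to the string "1"
console.log(obj["1"]);         // one
console.log(Object.keys(obj)); // [ '1' ]

obj.name = "demo";             // add a property
delete obj.name;               // and remove it again
console.log("name" in obj);    // false

const list = [1, 2, 3];
list.note = "arrays are objects too"; // an array can carry extra properties
console.log(list.length);      // 3 (named properties do not change length)
console.log(list[10]);         // undefined, not an out-of-bounds error
```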
With large amounts of data, a hash table has a very clear advantage, because hashing avoids a great deal of unnecessary computation. Performance optimization really means making the computer do less work; the biggest optimization is not computing at all!
Optimization of the algorithm
Now that we understand how the algorithm works underneath, we can look back and consider optimizing it. Before optimizing, though, one point must be stressed: do not pursue performance blindly! In this case, for example, we match at most 5,000 words; the existing algorithm is sufficient and any optimization is unnecessary. The reason to discuss optimization anyway is to deepen our understanding of the algorithm and the program, not to actually chase that 1 ms improvement.
Our keywords are never a single character, so traversing them one character at a time is clearly wasteful. The optimization is to determine the maximum and minimum keyword length in advance and to search in units of the minimum length. For example, the keywords in my test case are idioms, the shortest of which has 4 characters, so each lookup takes 4 characters at once; if that hits, the search continues deeper, one character at a time, up to the maximum length. In other words, the first level of the tree is built from keys of the minimum length, and the levels below it grow one character at a time.
A simple calculation based on our test case of 300 idioms: with the optimized tree, a four-character idiom is matched with a single lookup, while the character-by-character approach needs four lookups, each of which accesses the tree structure. That is avoidable overhead. More importantly, the comparison here is not a string comparison. Our keywords exist as keys, so the lookup works the same way as key in obj: the key is hashed and the corresponding address is accessed. So there is no need to worry about the difference between comparing one character and comparing four characters, because we are not comparing strings at all!
That is all on matching multiple keywords. I will not post the optimized version of the code, because it is generally not needed.
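That said, for readers curious about the idea, a minimal sketch might look like the following. It is an illustration written under assumptions, not the version used in the test above: the names buildTreeMin and searchMin, returning minLen together with the root, and the requirement that the keyword list is non-empty with no empty strings are all choices made for this sketch.

```javascript
const has = (o, k) => Object.prototype.hasOwnProperty.call(o, k);

// First level: keys of the minimum keyword length. Deeper levels: one character each.
function buildTreeMin(keywords) {
    const minLen = Math.min(...keywords.map((w) => w.length));
    const root = {};
    for (const word of keywords) {
        const head = word.slice(0, minLen);
        if (!has(root, head)) {
            root[head] = {};
        }
        let node = root[head];
        for (let i = minLen; i < word.length; i += 1) {
            const ch = word.charAt(i);
            if (!has(node, ch)) {
                node[ch] = {};
            }
            node = node[ch];
        }
        node.end = true;
    }
    return { root, minLen };
}

function searchMin(content, tree) {
    const { root, minLen } = tree;
    const results = [];
    let start = 0;
    while (start + minLen <= content.length) {
        // One lookup covers the first minLen characters.
        const head = content.slice(start, start + minLen);
        if (!has(root, head)) {
            start += 1;
            continue;
        }
        let node = root[head];
        let pos = start + minLen;
        let matchEnd = node.end ? pos : -1;
        // Then extend one character at a time to find the longest keyword.
        while (pos < content.length && has(node, content.charAt(pos))) {
            node = node[content.charAt(pos)];
            pos += 1;
            if (node.end) {
                matchEnd = pos;
            }
        }
        if (matchEnd === -1) {
            start += 1;
            continue;
        }
        results.push({ key: content.slice(start, matchEnd), begin: start, end: matchEnd });
        start = matchEnd;
    }
    return results;
}

const minTree = buildTreeMin(["Mike", "Mark", "Marking"]);
const hits = searchMin("Mike likes Marking, Mark too", minTree);
console.log(hits.map((m) => m.key)); // [ 'Mike', 'Marking', 'Mark' ]
```

Here the shortest keyword has 4 characters, so "Mike" and "Mark" are each found with a single first-level lookup, and only "Marking" needs further character-by-character steps.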