A String Matching Algorithm for Special Purposes
Ji Fuquan, Zhu Zhanli (Computer College, Xi'an Petroleum University, Xi'an 710065, China)
Abstract
Existing string matching algorithms compare the pattern string sequentially, from left to right or from right to left. This paper proposes a special-purpose string matching algorithm, zzl. For main strings and pattern strings that are matched frequently, the zzl algorithm matches very quickly.
Keywords
String; pattern matching; algorithm
String matching means finding one or all occurrences of a pattern string within a string. It is widely used: spell checking, language translation, data compression, search engines, network intrusion detection, computer virus signature matching, and DNA sequence matching all require it. Many string matching algorithms have been proposed. Traditional ones include the BF algorithm [2, 3] and the KMP algorithm [2, 3]; more recently proposed ones include the BM algorithm [2, 4] and the Sunday algorithm [2, 3]. This paper proposes a special-purpose string matching algorithm, zzl.
1 Analysis of related algorithms
String pattern matching means the following: starting from a given position in the main string S, determine whether S contains a substring equal to the pattern string T. If such a substring is found in S, the pattern string matches the main string; if no substring equal to T is found in S, it does not match [1]. We first make the following assumptions: the main string is S = S[1..n], of length n; the pattern string is T = T[1..m], of length m; and n ≥ m.
1.1 BF algorithm
The core idea of the BF (brute force) algorithm is as follows. First compare S[1] with T[1]. If they are equal, compare S[2] with T[2], and so on up to T[m]. If some pair of characters differs, T is moved one character to the right and the comparison starts again from T[1]. If there is a k, 0 ≤ k ≤ n - m, such that S[k + 1..k + m] = T[1..m], the match succeeds; otherwise the match fails. In the worst case m * (n - m + 1) comparisons are made, so the time complexity is O(m * n).
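The procedure above can be sketched in C as follows. This is a minimal illustration, not the authors' code: the function name `bf_search` and the comparison counter are assumptions introduced here, and indices are 0-based as is usual in C, whereas the text counts from 1.

```c
#include <stddef.h>

/* Hypothetical helper: returns the 0-based index of the first occurrence of
   t[0..m-1] in s[0..n-1], or -1 if there is none. If comparisons is not
   NULL, it receives the number of character comparisons performed. */
long bf_search(const char *s, size_t n, const char *t, size_t m,
               size_t *comparisons) {
    size_t count = 0;
    long found = -1;
    if (m > 0 && m <= n) {
        /* k is the current alignment of t against s */
        for (size_t k = 0; k + m <= n && found < 0; ++k) {
            size_t j = 0;
            while (j < m) {
                ++count;                      /* one character comparison */
                if (s[k + j] != t[j]) break;  /* mismatch: shift t by one */
                ++j;
            }
            if (j == m) found = (long)k;      /* all m characters matched */
        }
    }
    if (comparisons) *comparisons = count;
    return found;
}
```

Searching "aaaa" for "ab" with this sketch fails after 6 comparisons, which is exactly the worst-case figure m * (n - m + 1) = 2 * 3.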
1.2 KMP algorithm
The core idea of the KMP (Knuth-Morris-Pratt) algorithm is that, on a mismatch, the main string pointer does not backtrack; instead, the "partial match" already obtained is used to shift the pattern string as far to the right as possible before comparison continues. Note that the pattern string does not necessarily shift by just one character, nor does comparison necessarily restart from the beginning of the pattern string: the pattern string may shift several characters at once, and after the shift comparison may resume from some position after its start. Suppose a mismatch occurs with S[i] ≠ T[j], 1 ≤ i ≤ n, 1 ≤ j ≤ m. In the next round of comparison, S[i] is aligned with T[next[j]] and comparison continues from there. For example, take T = "abaabcac".
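One common way to build the next[] array and use it is sketched below. The function names `kmp_next` and `kmp_search` are assumptions made for this illustration; the arrays are treated as 1-based to agree with the text, so C index j - 1 holds T[j], and next[1] = 0 is taken to mean "advance i and restart at T[1]".

```c
/* Hypothetical helper: fill next[1..m] for the pattern t (T[j] is t[j-1]).
   next must have room for m + 1 ints; next[0] is unused. */
void kmp_next(const char *t, int m, int *next) {
    int j = 1, k = 0;
    next[1] = 0;
    while (j < m) {
        if (k == 0 || t[j - 1] == t[k - 1]) {
            ++j;
            ++k;
            next[j] = k;   /* T[1..k-1] is a prefix that is also a suffix of T[1..j-1] */
        } else {
            k = next[k];   /* fall back to a shorter prefix */
        }
    }
}

/* Hypothetical helper: returns the 1-based start of the first occurrence
   of T in S, or 0 if there is none. i never moves backwards. */
int kmp_search(const char *s, int n, const char *t, int m, const int *next) {
    int i = 1, j = 1;
    while (i <= n && j <= m) {
        if (j == 0 || s[i - 1] == t[j - 1]) {
            ++i;
            ++j;
        } else {
            j = next[j];   /* align S[i] with T[next[j]] */
        }
    }
    return (j > m) ? i - m : 0;
}
```

For T = "abaabcac" this construction gives next[1..8] = 0, 1, 1, 2, 2, 3, 1, 2.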
1.3 BM algorithm
Given a pattern string T = T[1] T[2] ... T[m] and a main string S = S[1] S[2] ... S[n], the BM algorithm needs an auxiliary array BM[]. The array is indexed by character value, so its size depends on the number of possible characters and is independent of the length of the pattern string. To match Chinese keywords, the extended ASCII character set must be covered, so the array size is 256. For any x in the character set Σ = {0, 1, ..., 255}, BM[x] records the position at which character x last occurs in the pattern string, and is 0 if x does not occur in it.
The idea of the BM algorithm is as follows. The segment of the main string that ends at position i is compared with the pattern string T character by character, from right to left. If every character of T matches, the match succeeds; otherwise the pattern string is shifted right and a new round of matching begins. Suppose the mismatch occurs at position j of the pattern string. The mismatched main-string character S[i - m + j] is looked up in the auxiliary array to obtain BM[S[i - m + j]], the last position of that character in T.
If BM[S[i - m + j]] is zero, the character S[i - m + j] does not occur in T, so the pattern string skips past it and is aligned immediately after it. If BM[S[i - m + j]] is less than j, the last occurrence of the character in T lies to the left of j, so T is shifted right until that occurrence is aligned with S[i - m + j]. If BM[S[i - m + j]] is greater than j, the last occurrence lies to the right of j; the pattern string cannot move left, so it is shifted right by one position. In summary, shift = max(1, j - BM[S[i - m + j]]).
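The rule described above can be sketched in C as below. This covers only the last-occurrence ("bad character") table discussed in this section, not the full Boyer-Moore algorithm with its second shift rule; the names `bm_table` and `bm_search` are assumptions for illustration. Positions are 1-based as in the text, and i is the main-string position aligned with T[m].

```c
#include <string.h>

#define ALPHABET 256

/* Hypothetical helper: bm[c] = 1-based position of the last occurrence of
   byte c in T, or 0 if c does not occur in T. */
void bm_table(const char *t, int m, int bm[ALPHABET]) {
    memset(bm, 0, ALPHABET * sizeof bm[0]);
    for (int j = 1; j <= m; ++j)
        bm[(unsigned char)t[j - 1]] = j;   /* later positions overwrite earlier ones */
}

/* Hypothetical helper: returns the 1-based start of the first occurrence
   of T in S, or 0 if there is none. */
int bm_search(const char *s, int n, const char *t, int m) {
    int bm[ALPHABET];
    if (m == 0 || m > n) return 0;
    bm_table(t, m, bm);
    int i = m;                              /* S[i] is aligned with T[m] */
    while (i <= n) {
        int j = m;
        /* compare from right to left; S[i - m + j] faces T[j] */
        while (j >= 1 && s[i - m + j - 1] == t[j - 1]) --j;
        if (j == 0) return i - m + 1;       /* every character matched */
        int last = bm[(unsigned char)s[i - m + j - 1]];
        int shift = j - last;               /* negative when last > j */
        i += (shift > 1) ? shift : 1;       /* shift = max(1, j - BM[...]) */
    }
    return 0;
}
```

Casting each character to `unsigned char` before indexing keeps the lookup inside the 256-entry table for bytes above 127, which matters for the Chinese text mentioned above.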
1.4 Sunday algorithm
The Sunday algorithm, proposed by Daniel M. Sunday in 1990, is faster than the BM algorithm. Its core idea is that the pattern string need not be compared strictly from left to right or from right to left; when a mismatch is found, the algorithm skips as many characters as possible before the next round of matching, which improves matching efficiency. Suppose a mismatch occurs with S[i] ≠ T[j], 1 ≤ i ≤ n, 1 ≤ j ≤ m. At this point the matched part is u, and the length of u is L, as shown in Figure 1. Clearly S[L + i + 1] must take part in the next round of matching, and T[m] must move at least to this position (that is, the pattern string T moves at least one character to the right).
Figure 1 Sunday algorithm: mismatch
There are two cases. (1) S[L + i + 1] does not appear in the pattern string T. In this case T[0] is moved to the character position just after S[L + i + 1], as shown in Figure 2.
Figure 2 Sunday algorithm: case 1
(2) S[L + i + 1] appears in the pattern string. In this case S[L + i + 1] is searched for from the right end of the pattern string T, that is, in the order T[m - 1], T[m - 2], ..., T[0]. If S[L + i + 1] is found to equal some character of T, record that position as k, 0 ≤ k ≤ m - 1, with T[k] = S[L + i + 1]. The pattern string T is then moved m - k character positions to the right, so that T[k] is aligned with S[L + i + 1], as shown in Figure 3.
Figure 3 Sunday algorithm: case 2
Matching continues in this way. If the whole pattern string matches, the match succeeds; otherwise the next round begins, until the right end of the main string S is reached. The worst-case time complexity of this algorithm is O(n * m). The algorithm is faster when matching short pattern strings.
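The two cases can be folded into a single shift table, as in the sketch below. It is an illustration under assumptions rather than a reference implementation: `sunday_search` is a hypothetical name, indices are 0-based, and each window is checked with `memcmp` because the order of comparison inside the window does not affect the shift.

```c
#include <string.h>

#define ALPHABET 256

/* Hypothetical helper: returns the 0-based index of the first occurrence of
   t[0..m-1] in s[0..n-1], or -1 if there is none. */
int sunday_search(const char *s, int n, const char *t, int m) {
    int shift[ALPHABET];
    if (m == 0 || m > n) return -1;

    /* Case (1): the character just past the window is not in T, so the
       whole pattern jumps past it (m + 1 positions). */
    for (int c = 0; c < ALPHABET; ++c) shift[c] = m + 1;
    /* Case (2): the character equals T[k]; moving m - k positions aligns
       T[k] with it. The rightmost k wins because it is written last. */
    for (int k = 0; k < m; ++k) shift[(unsigned char)t[k]] = m - k;

    int p = 0;                               /* window is s[p .. p+m-1] */
    while (p + m <= n) {
        if (memcmp(s + p, t, (size_t)m) == 0) return p;
        if (p + m >= n) break;               /* no character after the window */
        p += shift[(unsigned char)s[p + m]];
    }
    return -1;
}
```

Because the shift depends only on the character just past the current window, it can reach m + 1, one more than a shift keyed on a character inside the window.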
2 zzl algorithm
Existing string matching algorithms compare the characters of the pattern string directly in order, from left to right or from right to left. The core idea of the zzl algorithm is different: first search the main string S for the first character of the pattern string T, storing each position at which it is found; then take these positions in turn and continue matching the rest of T from each one. For main strings and pattern strings that are matched frequently, matching is very fast, because every position of the pattern string's first character in the main string has been saved in advance.
2.1 Preprocessing
Preprocessing finds every position in the main string at which the first character of the pattern string occurs and saves these positions in an array. The algorithm for locating the first character of the pattern string is:
    k = 0;
    for (i = start; i <= S.length - T.length; i++) {
        if (S.str[i] == T.str[0]) {
            x[k] = i;   /* save this position */
            k++;        /* k counts how often the first character of T occurs in S */
        }
    }
2.2 Matching
Building on the preprocessing, matching starts from each saved position of the pattern string's first character in the main string and compares the remaining characters of the pattern string, those after the first. The BF algorithm can be used for this step, with a counter recording the number of comparisons. The matching algorithm is:
    v = 0;
    for (m = 0; m < k; m++) {
        for (j = 1; j < T.length; j++) {
            v++;                              /* one character comparison */
            if (S.str[x[m] + j] != T.str[j])
                break;                        /* mismatch: try the next saved position */
        }
        /* if j == T.length here, T occurs in S at position x[m] */
    }
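Put together, the two phases might look like the following self-contained C sketch. It is one reading of the description above, not the authors' implementation: the names `zzl_preprocess` and `zzl_match`, the `hits` output array, and the use of plain character arrays instead of the S and T structures are all assumptions made here.

```c
#include <stddef.h>

/* Hypothetical helper (preprocessing): store in x[] every 0-based index i
   of s with s[i] == t[0] at which t could still fit. Returns the number of
   stored positions k. x must have room for n entries. */
size_t zzl_preprocess(const char *s, size_t n, const char *t, size_t m,
                      size_t *x) {
    size_t k = 0;
    if (m == 0 || m > n) return 0;
    for (size_t i = 0; i + m <= n; ++i)
        if (s[i] == t[0]) x[k++] = i;
    return k;
}

/* Hypothetical helper (matching): at each saved position compare
   t[1..m-1] by brute force. Start indices of full occurrences go to hits[]
   (room for k entries) and their count is returned. *v receives the number
   of character comparisons made in this phase. x[] is not modified, so the
   saved positions can be reused. */
size_t zzl_match(const char *s, const char *t, size_t m,
                 const size_t *x, size_t k, size_t *hits, size_t *v) {
    size_t found = 0;
    *v = 0;
    for (size_t a = 0; a < k; ++a) {
        size_t j = 1;                        /* t[0] is already known to match */
        while (j < m) {
            ++*v;
            if (s[x[a] + j] != t[j]) break;
            ++j;
        }
        if (j == m) hits[found++] = x[a];
    }
    return found;
}
```

Keeping x[] read-only during matching is a deliberate choice in this sketch: the stored positions are what makes repeated matching cheap, so they should survive a call.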
3 Algorithm performance analysis and experimental results
3.1 Algorithm performance analysis
Ignoring preprocessing, if the first character of the pattern string occurs k times in the main string, the zzl algorithm makes at most k * (m - 1) < k * m comparisons in the worst case. If preprocessing is included, n further comparisons must be added, giving a bound of k * m + n.
3.2 Experimental results
To evaluate the performance of the algorithm, a text and a pattern string were chosen at random and matched on the same computer using the different algorithms. The test text string is S = "from automated teller machines and atomic clocks to mammograms and semiconductors, innumerable products and services rely in some way on technology, measurement, and standards provided by the National Institute of Standards and Technology" and the pattern string is T = "products and services". The BF, KMP, BM, Sunday, and zzl algorithms were each run on the same computer, and the total number of character comparisons made by each algorithm was counted. The results are shown in Table 1.
Table 1 Experimental results of the matching algorithms
Algorithm | BF | KMP | BM | Sunday | zzl
Total number of character comparisons for one match | 116 | 95 | 108 | 110 | 23
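As a rough cross-check, the BF and zzl entries can be reproduced by hand under a few assumptions: the pattern is compared in lowercase ("products and services", m = 21), it first occurs at 0-based offset 95 of S, BF stops at the first match, and the zzl count covers only the matching phase. The letter "p" occurs twice in S (in "products" and in "provided"), so k = 2.

```latex
\begin{align*}
\text{BF:}\quad  & 95 \times 1 + 21 = 116
  && \text{(95 failed alignments, then one full match)} \\
\text{zzl:}\quad & 20 + 3 = 23 \;\le\; k(m-1) = 2 \times 20 = 40
  && \text{(20 at ``products'', 3 at ``provided'')}
\end{align*}
```

The zzl figure stays well below the worst-case bound from Section 3.1 because the second candidate position fails on its third comparison.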
4 Conclusion
For main strings and pattern strings that are matched frequently, the zzl algorithm matches very quickly, because every position in the main string at which the pattern string's first character occurs is saved in advance.
References:
1. Zhu Zhanli (ed.). Data Structures: Using the C Language (3rd edition) [M]. Xi'an: Xi'an Jiaotong University Press, 2004.
2. Wang Cheng, Liu Jingang. An improved string matching algorithm [J]. Computer Engineering, (2): 62-64.
3. Christian Charras, Thierry Lecroq. Exact string matching algorithms [Z]. http://www-igm.univ-mlv.fr/~lecroq/string/
4. Huang Zhongqing, Wang Wenyong, Huang Xiaosheng. Implementation of a network information audit system based on WinPcap and an improved BM algorithm [Z]. http://www.ahcit.com/lanmuyd.asp?id=1180
Receipt date: January 1, March 17
Modified on: February 1, April 1
Author profile: Ji Fuquan (1981-), male, master's student; main research interests: artificial intelligence and expert systems, neural networks. Zhu Zhanli, professor, master's supervisor.