ZZL String Matching Algorithm

Source: Internet
Author: User

This is a reprint of a paper on the ZZL string matching algorithm. Some of the figures did not survive the reprint, but the paper is still worth a look. The algorithm is actually very simple: it first records every position in the main string at which the first character of the pattern string occurs, and each matching attempt then starts from one of the stored positions. It is shared here because, for its intended use, it is still very effective.

 
A String Matching Algorithm for Special Purposes

Ji Fuquan, Zhu Zhanli (Computer College, Xi'an Petroleum University, Xi'an 710065, China)

Abstract

Existing string matching algorithms compare the pattern string sequentially, either from left to right or from right to left. This paper proposes a special-purpose string matching algorithm, ZZL. For main strings and pattern strings that are matched frequently, the ZZL algorithm matches very quickly.

Keywords

String, pattern matching, algorithm

String matching means finding one or all occurrences of a pattern string within a string. It is widely used: spelling check, language translation, data compression, search engines, network intrusion detection, computer virus signature matching, and DNA sequence matching all require it. Many string matching algorithms have been proposed, including the traditional BF algorithm [2, 3] and KMP algorithm [2, 3], and more recently the BM algorithm [2, 4] and the Sunday algorithm [2, 3]. This paper proposes a special-purpose string matching algorithm, ZZL.

1 Related algorithm analysis

String pattern matching means the following: starting from a given position in the main string S, determine whether the pattern string T occurs. If a substring identical to T is found in S, the pattern string matches the main string; if no such substring is found in S, it does not match [1].

First, make the following assumptions: the main string is S = S[1..n], with length n; the pattern string is T = T[1..m], with length m; and n ≥ m.

1.1 BF Algorithm
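The brute-force scan covered in this subsection can be sketched in C as follows. This is only a minimal sketch under stated assumptions: the function name bf_search is hypothetical, the strings are NUL-terminated C strings, indices are 0-based (the text uses 1-based positions), and an empty pattern is treated as "not found".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Brute-force search (hypothetical helper, not taken from the paper).
 * Returns the 0-based index of the first occurrence of t in s,
 * or -1 if t does not occur (an empty t also yields -1 here). */
long bf_search(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);
    if (m == 0 || m > n)
        return -1;

    /* Try every alignment k of the pattern against the text. */
    for (size_t k = 0; k + m <= n; k++) {
        size_t j = 0;
        /* Compare left to right until a mismatch or the pattern ends. */
        while (j < m && s[k + j] == t[j])
            j++;
        if (j == m)
            return (long)k;   /* all m characters matched */
    }
    return -1;
}
```

For example, bf_search("abcabd", "abd") returns 3, and bf_search("abc", "x") returns -1.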
BF (Brute Force): the core idea of the algorithm is to compare S[1] with T[1] first. If they are equal, S[2] is compared with T[2], and so on up to T[m]. If a pair of characters differs, T is moved one character to the right and the comparison starts again from T[1]. If there is a k, 0 ≤ k ≤ n - m, such that S[k+1..k+m] = T[1..m], the match succeeds; otherwise the match fails. In the worst case this algorithm needs m * (n - m + 1) comparisons, so its time complexity is O(m * n).

1.2 KMP Algorithm
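As a concrete reference for this subsection, the next table and the search loop of KMP can be sketched in C as follows. This is a minimal sketch under assumptions, not the paper's own code: the names kmp_build_next and kmp_search are hypothetical, and indices are 0-based with next[0] = -1, so every value is one less than in the 1-based convention where next[1] = 0.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fill next[0..m-1] for pattern t (0-based). next[0] = -1; for j > 0,
 * next[j] is the length of the longest proper prefix of t[0..j-1]
 * that is also a suffix of it. */
void kmp_build_next(const char *t, int m, int *next)
{
    assert(m >= 1);
    int j = 0, k = -1;
    next[0] = -1;
    while (j < m - 1) {
        if (k == -1 || t[j] == t[k]) {
            j++;
            k++;
            next[j] = k;
        } else {
            k = next[k];      /* fall back to a shorter prefix */
        }
    }
}

/* Returns the 0-based index of the first occurrence of t in s, or -1. */
long kmp_search(const char *s, const char *t)
{
    int n = (int)strlen(s), m = (int)strlen(t);
    if (m == 0 || m > n)
        return -1;
    int *next = malloc((size_t)m * sizeof *next);
    if (next == NULL)
        return -1;
    kmp_build_next(t, m, next);

    int i = 0, j = 0;
    while (i < n && j < m) {
        if (j == -1 || s[i] == t[j]) {
            i++;              /* the text index i never moves backwards */
            j++;
        } else {
            j = next[j];      /* slide the pattern, keep i where it is */
        }
    }
    free(next);
    return j == m ? (long)(i - m) : -1;
}
```

For the pattern "abaabcac", kmp_build_next produces -1, 0, 0, 1, 1, 2, 0, 1.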
The core idea of the KMP (Knuth-Morris-Pratt) algorithm is: when a mismatch occurs, the main string does not backtrack. Instead, the "partial match" already obtained is used to shift the pattern string as far to the right as possible before the comparison continues. Two points deserve emphasis. First, the pattern string does not necessarily move by only one character; it may shift by several characters at a time. Second, after the shift the comparison does not necessarily restart from the beginning of the pattern string; it can resume somewhere after the beginning. Suppose a mismatch occurs with S[i] ≠ T[j], 1 ≤ i ≤ n, 1 ≤ j ≤ m. In the next round of comparison, S[i] is aligned with T[next[j]] and the comparison continues from there. For example, for T = "abaabcac" the values next[1..8] are 0, 1, 1, 2, 2, 3, 1, 2.

1.3 BM Algorithm
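The simplified BM variant that this subsection describes uses only a table of last positions (the bad-character rule). It can be sketched in C as follows. This is a hedged sketch, not a full Boyer-Moore implementation: there is no good-suffix rule, the names bm_bad_char_search and last are hypothetical, the table stores 1-based pattern positions with 0 meaning "absent", and the returned index is 0-based.

```c
#include <assert.h>
#include <string.h>

#define BM_ALPHABET 256

/* Bad-character-only search. Returns the 0-based index of the first
 * occurrence of t in s, or -1 if there is none. */
long bm_bad_char_search(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    long n = (long)strlen(s), m = (long)strlen(t);
    if (m == 0 || m > n)
        return -1;

    /* last[c] = 1-based position of the last occurrence of c in t,
     * or 0 if c does not occur in t. */
    long last[BM_ALPHABET] = {0};
    for (long j = 1; j <= m; j++)
        last[(unsigned char)t[j - 1]] = j;

    long i = m;   /* 1-based text position aligned with the last pattern char */
    while (i <= n) {
        long j = m;
        /* Compare from right to left; text position i-m+j faces t[j]. */
        while (j >= 1 && s[i - m + j - 1] == t[j - 1])
            j--;
        if (j == 0)
            return i - m;             /* 0-based start of the match */
        long shift = j - last[(unsigned char)s[i - m + j - 1]];
        i += shift > 1 ? shift : 1;   /* never shift by less than one */
    }
    return -1;
}
```

For example, bm_bad_char_search("here is a simple example", "example") returns 17.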
For a pattern string T = T1 T2 ... Tm and a main string S = S1 S2 ... Sn, the BM algorithm needs an auxiliary array BM[]. The array is indexed by character value, so its size depends on the number of possible characters and is independent of the length of the pattern string. To match Chinese keywords, the extended ASCII character set is needed, so the array size is 256. For any character x in the alphabet Σ, the entry BM[x] records the position at which x last occurs in the pattern string, and is zero if x does not occur in it.

The idea of the BM algorithm is as follows. The segment of the main string that ends at position i is compared with the pattern string T from right to left. If every character of T matches, the match succeeds; otherwise the pattern string is shifted right and a new round of matching begins. Suppose the comparison fails at position j of the pattern string. The unmatched character of the main string is S[i-m+j], and the auxiliary array gives its last position in T, namely BM[S[i-m+j]]. If BM[S[i-m+j]] is zero, the character S[i-m+j] does not occur in T, so the pattern string skips over it and is aligned at the position just after it. If BM[S[i-m+j]] is less than j, the last occurrence of the character in T lies to the left of j, and T is shifted right until that occurrence is aligned with S[i-m+j]. If BM[S[i-m+j]] is greater than j, the last occurrence lies to the right of j; the pattern string cannot move left, so it is shifted right by one position. In summary, shift = max(1, j - BM[S[i-m+j]]).

1.4 Sunday Algorithm

The Sunday algorithm was proposed by Daniel M. Sunday in 1990 and is faster than the BM algorithm. Its core idea is that the pattern string does not have to be compared strictly from left to right or from right to left during matching. When a mismatch is found, the algorithm skips as many characters as possible before the next attempt, which improves matching efficiency. Suppose a mismatch occurs with S[i] ≠ T[j], 1 ≤ i ≤ n, 1 ≤ j ≤ m. Let the part already matched be u, and let the length of u be L, as shown in Figure 1. Clearly, S[L+i+1] must take part in the next round of matching, and T[m] must move at least to this position (that is, the pattern string T moves at least one character to the right).

Figure 1: Sunday algorithm mismatch

There are two cases:

(1) S[L+i+1] does not occur in the pattern string T. In this case T[0] is moved to the character position just after S[L+i+1], as shown in Figure 2.

Figure 2: Sunday algorithm, case 1

(2) S[L+i+1] occurs in the pattern string. Here S[L+i+1] is searched for from the right end of the pattern string T, that is, in the order T[m-1], T[m-2], ..., T[0]. When a character of T equal to S[L+i+1] is found, its position is recorded as k, 0 ≤ k ≤ m - 1, with T[k] = S[L+i+1]. The pattern string T is then moved right by m - k character positions, so that T[k] is aligned with S[L+i+1], as shown in Figure 3.
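The two cases above can be sketched in C as follows. This is a minimal sketch under assumptions: the name sunday_search is hypothetical, indices are 0-based, the character examined after a failed attempt is the one just past the current window (the role played by S[L+i+1] in the text), and each window is compared with memcmp for brevity.

```c
#include <assert.h>
#include <string.h>

#define SUNDAY_ALPHABET 256

/* Returns the 0-based index of the first occurrence of t in s, or -1. */
long sunday_search(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    long n = (long)strlen(s), m = (long)strlen(t);
    if (m == 0 || m > n)
        return -1;

    /* Case (1): a character absent from t moves the pattern past it,
     * a shift of m + 1.
     * Case (2): a character whose rightmost occurrence in t is at
     * index k gives a shift of m - k. */
    long shift[SUNDAY_ALPHABET];
    for (int c = 0; c < SUNDAY_ALPHABET; c++)
        shift[c] = m + 1;
    for (long k = 0; k < m; k++)
        shift[(unsigned char)t[k]] = m - k;   /* later k overwrites earlier */

    long pos = 0;                             /* start of the current window */
    while (pos + m <= n) {
        if (memcmp(s + pos, t, (size_t)m) == 0)
            return pos;
        if (pos + m >= n)
            break;                            /* no character after the window */
        pos += shift[(unsigned char)s[pos + m]];
    }
    return -1;
}
```

For example, sunday_search("substring searching", "search") returns 10 after only three window comparisons, at positions 0, 7, and 10.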

Figure 3: Sunday algorithm, case 2

The process continues in this way. If a complete match is found, the match succeeds; otherwise the next round begins, until the right end of the main string S is reached. The worst-case time complexity of this algorithm is O(n * m). The algorithm is faster for short pattern strings.

2 ZZL Algorithm

Existing string matching algorithms, whether they scan the pattern string from left to right or from right to left, start comparing directly. The core idea of the ZZL algorithm is: first find the first character of the pattern string T in the main string S and store its position each time it is found; then take these positions in turn and match the pattern string T starting from each of them. For a main string and a pattern string that are matched frequently, all positions of the pattern string's first character in the main string are saved in advance, so matching is very fast.

2.1 Preprocessing
Preprocessing finds every position in the main string at which the first character of the pattern string occurs and saves these positions in an array. The algorithm for locating the first character of the pattern string is:

    k = 0;
    for (i = start; i <= S.length - T.length; i++) {
        if (S.str[i] == T.str[0]) {   /* first character of the pattern string found */
            x[k] = i;                 /* store its position */
            k++;                      /* k is the number of times the first character
                                         of the pattern string occurs in the main string */
        }
    }

2.2 Matching
Building on the preprocessing step, matching starts at each stored position of the pattern string's first character in the main string and compares the rest of the pattern string, that is, the characters after the first one. The BF algorithm can be used for this step, with a counter that records the number of character comparisons. The matching algorithm is:

    v = 0;                                      /* v counts character comparisons */
    for (p = 0; p < k; p++) {                   /* try each stored position x[p] */
        for (j = 1; j < T.length; j++) {
            v++;
            if (S.str[x[p] + j] != T.str[j])    /* x[p] itself is left unchanged */
                break;
        }
        /* if j == T.length here, the pattern string occurs at x[p] */
    }

3 Algorithm Performance Analysis and Experimental Results

3.1 Algorithm performance analysis
If the preprocessing step is not counted and the first character of the pattern string occurs k times in the main string, the ZZL algorithm needs at most k * (m - 1) < k * m comparisons in the worst case. If the preprocessing step is counted, n further comparisons must be added, giving a bound of k * m + n.

3.2 Experimental results
To evaluate the performance of the algorithm, a text and a pattern string were chosen at random and matched with the different algorithms on the same computer. The test main string is S = "from automated teller machines and atomic clocks to mammograms and semiconductors, innumerable products and services rely in some way on technology, measurement, and standards provided by the National Institute of Standards and Technology", and the pattern string is T = "products and services". The BF, KMP, BM, Sunday, and ZZL algorithms were each used for matching, and the total number of character comparisons was counted. The results are shown in Table 1.

Table 1: Experimental results of the matching algorithms

Algorithm                                    BF     KMP    BM     Sunday   ZZL
Total character comparisons for one match    116    95     108    110      23
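The ZZL figure in Table 1 can be checked with a short program. The following C sketch is an illustration under stated assumptions, not the authors' implementation: the names zzl_preprocess, zzl_match, and zzl_demo_comparisons are hypothetical, indices are 0-based, the test sentence is typed on one line in the form assumed here, only the comparisons of the matching step are counted, and every stored position is tried even after a match has been found.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Preprocessing: store in x[] every 0-based index i where s[i] equals
 * the first character of t and a full-length match could still start.
 * Returns the number of stored positions (k in the text). */
size_t zzl_preprocess(const char *s, const char *t, size_t *x, size_t cap)
{
    size_t n = strlen(s), m = strlen(t), k = 0;
    assert(m >= 1);
    for (size_t i = 0; i + m <= n && k < cap; i++)
        if (s[i] == t[0])
            x[k++] = i;
    return k;
}

/* Matching: at each stored position compare t[1..m-1] with the text.
 * *comparisons receives the number of character comparisons (v in the
 * text). Returns the first matching start index, or -1. */
long zzl_match(const char *s, const char *t, const size_t *x, size_t k,
               size_t *comparisons)
{
    size_t m = strlen(t), v = 0;
    long first = -1;
    for (size_t p = 0; p < k; p++) {
        size_t j = 1;
        while (j < m) {
            v++;                          /* one character comparison */
            if (s[x[p] + j] != t[j])
                break;
            j++;
        }
        if (j == m && first < 0)
            first = (long)x[p];
    }
    *comparisons = v;
    return first;
}

/* Runs both steps on the test strings and returns the comparison count. */
size_t zzl_demo_comparisons(void)
{
    const char *s =
        "from automated teller machines and atomic clocks to mammograms "
        "and semiconductors, innumerable products and services rely in "
        "some way on technology, measurement, and standards provided by "
        "the National Institute of Standards and Technology";
    const char *t = "products and services";
    size_t x[16], v = 0;
    size_t k = zzl_preprocess(s, t, x, 16);
    (void)zzl_match(s, t, x, k, &v);
    return v;
}
```

Under these assumptions the letter "p" is stored twice, at the start of "products" and of "provided". The first position costs 20 comparisons and the second costs 3, so zzl_demo_comparisons() returns 23, the same figure as the ZZL column of Table 1.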
4 Conclusion

For main strings and pattern strings that are matched frequently, all positions of the pattern string's first character in the main string are saved in advance, so the matching speed of the ZZL algorithm is very fast.

References:

1. Zhu Zhan (ed.). Data Structure: Using the C Language (3rd edition) [M]. Xi'an: Xi'an Jiao Tong University Press, 2004.

2. Wang Cheng, Liu Jingang. An Improved String Matching Algorithm [J]. Computer Engineering, (2): 62-64.

3. Christian Charras, Thierry Lecroq. Exact String Matching Algorithms [Z]. http://www-igm.univ-mlv.fr/~lecroq/string/

4. Huang Zhongqing, Wang Wenyong, Huang Xiaosheng. Implementation of Network Information Audit System Based on Winpcap and Improved BM Algorithm [Z]. http://www.ahcit.com/lanmuyd.asp?id=1180
