A string matching algorithm that can be used for special purposes.

Source: Internet
Author: User
Ji Fuquan, Zhu Zhanli
(Computer College, Xi'an Petroleum University, Xi'an 710065, China)

Abstract: Existing string matching algorithms compare the pattern string sequentially, either from left to right or from right to left. This paper proposes a special-purpose string matching algorithm, zzl. When the same main string and pattern string are matched frequently, the zzl algorithm matches very quickly.

Keywords: string, pattern matching, algorithm

String matching means finding one or all occurrences of a pattern string in a string. It is widely used: spelling checks, language translation, data compression, search engines, network intrusion detection, computer virus signature matching, and DNA sequence matching all require it. Many string matching algorithms have been proposed. Traditional ones include the BF algorithm [2, 3] and the KMP algorithm [2, 3]; later ones include the BM algorithm [2, 4] and the Sunday algorithm [2, 3]. This paper proposes a special-purpose string matching algorithm, zzl.

1 Related algorithm analysis

String pattern matching means searching the main string S, from a given start position, for a pattern string T. If a substring identical to T is found in S, the pattern string matches the main string; if no such substring is found, it does not match [1].

The following assumptions are made first: the main string is S[1..n], with length n; the pattern string is T[1..m], with length m; n ≥ m.

1.1 BF Algorithm

The core idea of the BF (brute force) algorithm is: first compare S[1] with T[1]. If they are equal, compare S[2] with T[2], and so on up to T[m]. If a pair of characters differs, T is moved one character to the right and the comparison starts again from T[1].
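The brute-force comparison just described can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `bf_search` and the comparison counter are assumptions added here, and indices are 0-based, whereas the paper's notation is 1-based.

```python
def bf_search(s: str, t: str) -> tuple[int, int]:
    """Return (0-based index of the first match or -1, character comparisons made)."""
    n, m = len(s), len(t)
    comparisons = 0
    for start in range(n - m + 1):       # every alignment of t against s
        j = 0
        while j < m:
            comparisons += 1
            if s[start + j] != t[j]:
                break                    # mismatch: move t one position to the right
            j += 1
        if j == m:
            return start, comparisons    # all m characters matched
    return -1, comparisons


print(bf_search("abcabd", "abd"))  # (3, 8)
```

The counter makes the cost visible: every alignment restarts from the first character of the pattern, which is where the worst-case bound for BF comes from.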
If there is a k, 0 ≤ k ≤ n-m, such that S[k+1..k+m] = T[1..m], the match succeeds; otherwise, the match fails. In the worst case, m*(n-m+1) comparisons are made, and the time complexity is O(m*n).

1.2 KMP Algorithm

The core idea of the KMP (Knuth-Morris-Pratt) algorithm is: when a mismatch occurs, the main string does not need to backtrack; instead, the "partial match" already obtained is used to shift the pattern string as far right as possible before the comparison continues. It should be emphasized that the pattern string does not necessarily move only one character to the right, nor must matching restart from the beginning of the pattern string: the pattern string may shift several characters at once, and after the shift, matching may resume from some position after its start. Suppose that a mismatch occurs with s[i] ≠ t[j], 1 ≤ i ≤ n, 1 ≤ j ≤ m. In the next round of comparison, s[i] is aligned with t[next[j]] and the comparison continues from there. The next values depend only on the pattern string, for example T = "abaabcac".

1.3 BM Algorithm

For a pattern string T = T1 T2 ... Tm and a main string S = S1 S2 ... Sn, the BM algorithm requires an auxiliary array BM[]. The array is indexed by character value, so its size depends on the number of possible characters and not on the length of the pattern string. To match Chinese keywords, the extended ASCII character set is needed, and the array size is 256. For any character x in the alphabet Σ, BM[x] records the last position at which x occurs in the pattern string, and is zero if x does not occur in it.

The idea of the BM algorithm is: the segment of the main string that ends at position i is compared with the pattern string T from right to left; if every character of T matches, the match is successful.
Otherwise, the pattern string must be shifted right to start a new round of matching. Suppose the mismatch occurs at position j of the pattern string. The main-string character that failed to match, S[i-m+j], is looked up in the auxiliary array to obtain BM[S[i-m+j]], the last position of that character in T. (1) If BM[S[i-m+j]] is zero, the character does not occur in T; the pattern string skips the character and is aligned immediately after it. (2) If BM[S[i-m+j]] is smaller than j, the last occurrence of the character lies to the left of position j; T is shifted right until that occurrence is aligned with S[i-m+j]. (3) If BM[S[i-m+j]] is greater than j, the last occurrence lies to the right of position j; aligning it would require moving the pattern string left, which is not allowed, so the pattern string is shifted right by one position. In summary, shift = max(1, j - BM[S[i-m+j]]).

1.4 Sunday Algorithm

The Sunday algorithm, proposed by Daniel M. Sunday in 1990, is faster than the BM algorithm. Its core idea is: during matching, the pattern string does not have to be compared in left-to-right or right-to-left order, and when a mismatch is found the algorithm skips as many characters as possible before the next round, which improves matching efficiency. Suppose that a mismatch occurs with s[i] ≠ t[j], 1 ≤ i ≤ n, 1 ≤ j ≤ m. At this point the matched part is U, and the length of U is L. Clearly, s[L+i+1] must take part in the next round of matching, and T[m] must move at least to this position (that is, the pattern string T moves at least one character to the right).

Figure 1 Sunday algorithm: mismatch

There are two cases:

(1) s[L+i+1] does not appear in the pattern string T. In this case, T[0] is moved to the character position immediately after s[L+i+1].

Figure 2 Sunday algorithm: case 1

(2) s[L+i+1] appears in the pattern string. T is searched for s[L+i+1] from the right, that is, in the order T[m-1], T[m-2], ..., T[0]. If a character of T equal to s[L+i+1] is found, its position is recorded as k, 0 ≤ k ≤ m-1, with T[k] = s[L+i+1]. The pattern string T is then moved m-k positions to the right, that is, to the position where T[k] and s[L+i+1] are aligned.
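The two shift cases above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the name `sunday_search` and the `last` lookup table are introduced here, indices are 0-based, and the window is compared with a slice for brevity, since the text does not prescribe a comparison order.

```python
def sunday_search(s: str, t: str) -> int:
    """Return the 0-based index of the first occurrence of t in s, or -1."""
    n, m = len(s), len(t)
    # Rightmost position of each character in t (later positions overwrite earlier ones).
    last = {ch: k for k, ch in enumerate(t)}
    start = 0
    while start + m <= n:
        if s[start:start + m] == t:
            return start
        if start + m == n:
            break                        # no character follows the window
        nxt = s[start + m]               # character just after the current window
        # Case (1): nxt is not in t, so k = -1 and the shift is m + 1.
        # Case (2): nxt occurs at rightmost position k in t, so the shift is m - k.
        start += m - last.get(nxt, -1)
    return -1


print(sunday_search("hello world", "world"))  # 6
```

Precomputing the rightmost positions replaces the right-to-left scan of T at every mismatch with a single table lookup; the resulting shifts are the same.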

Figure 3 Sunday algorithm: case 2

This process is repeated. If the whole pattern string matches, the match succeeds; otherwise the next round begins, until the right end of the main string S is reached. The worst-case time complexity of this algorithm is O(n*m). The algorithm is faster when matching short pattern strings.

2 zzl Algorithm

Existing string matching algorithms compare the pattern string directly, in order, from left to right or from right to left. The core idea of the zzl algorithm is: first, search the main string S for the first character of the pattern string T. Each position where it is found is stored; these positions are then taken in turn, and the rest of T is matched from each of them. When the same main string and pattern string are matched frequently, matching is very fast, because all positions of the pattern string's first character in the main string have been saved in advance.

2.1 Preprocessing

Preprocessing finds every position at which the first character of the pattern string occurs in the main string and saves these positions in an array. The algorithm for finding the first character of the pattern string is:

    k = 0;
    for (i = start; i <= S.length - T.length; i++) {
        if (S.str[i] == T.str[0]) {
            x[k] = i;   // record this occurrence
            k++;        // k counts how often the first character of T occurs in S
        }
    }

2.2 Match

After preprocessing, matching starts from each stored position in the main string and compares the rest of the pattern string, after its first character. The BF algorithm can be used for this, with a counter that records the number of comparisons. The matching algorithm is:

    v = 0;                                   // number of character comparisons
    for (p = 0; p < k; p++) {
        for (j = 1; j < T.length; j++) {
            v++;
            if (S.str[x[p] + j] != T.str[j])
                break;                       // mismatch: go on to the next stored position
        }
        if (j == T.length) {
            // T occurs in S at position x[p]
        }
    }

3 Algorithm Performance Analysis and Experiment Result Analysis

3.1 Algorithm performance analysis

If preprocessing is not counted and the first character of the pattern string occurs k times in the main string, the zzl algorithm makes at most k*(m-1) < k*m comparisons in the worst case. If preprocessing is counted, n further comparisons must be added, so the total is below k*m + n.

3.2 Experiment results

To evaluate the performance of the algorithm, a text and a pattern string were chosen at random and matched on the same computer using different algorithms. The test text string is S = "from automated teller machines and atomic clocks to mammograms and semiconductors, innumerable products and services rely in some way on technology, measurement, and standards provided by the National Institute of Standards and Technology", and the pattern string is T = "products and services". The BF, KMP, BM, Sunday, and zzl algorithms were each run on the same computer, and the total number of character comparisons made by each algorithm was counted. The results are shown in Table 1.

Table 1 Matching algorithm experiment results
Algorithm                                    BF     KMP    BM     Sunday   zzl
Total character comparisons for one match    116    95     108    110      23
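The two zzl phases from Section 2 can be sketched as follows. This is a minimal sketch, assuming 0-based indices and a lowercase pattern; the function name `zzl_search` and its return shape are illustrative and do not come from the paper. As in Section 3.1, the counter covers only the match phase, not preprocessing.

```python
def zzl_search(s: str, t: str, start: int = 0) -> tuple[list[int], int]:
    """Return (all 0-based match positions, comparisons made in the match phase)."""
    n, m = len(s), len(t)
    if m == 0 or m > n:
        return [], 0
    # Preprocessing: positions where the first character of t occurs in s.
    first = [i for i in range(start, n - m + 1) if s[i] == t[0]]
    matches, comparisons = [], 0
    # Matching: compare the rest of t at every stored position.
    for pos in first:
        j = 1
        while j < m:
            comparisons += 1
            if s[pos + j] != t[j]:
                break                    # mismatch: go on to the next stored position
            j += 1
        if j == m:
            matches.append(pos)          # every character after the first matched
    return matches, comparisons


S = ("from automated teller machines and atomic clocks to mammograms and "
     "semiconductors, innumerable products and services rely in some way on "
     "technology, measurement, and standards provided by the National "
     "Institute of Standards and Technology")
T = "products and services"
positions, count = zzl_search(S, T)
print(positions, count)
```

On these strings the sketch counts 23 comparisons: 20 at "products", where the whole pattern matches, and 3 at "provided", where the third comparison fails. This agrees with the zzl entry in Table 1.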
4 Conclusion

When the same main string and pattern string are matched frequently, the zzl algorithm matches very quickly, because all positions of the pattern string's first character in the main string have been saved in advance.

References:
1. Zhu Zhan (ed.). Data Structures: Using C Language (3rd edition) [M]. Xi'an: Xi'an Jiaotong University Press, 2004.
2. Wang Cheng, Liu Jingang. An Improved String Matching Algorithm [J]. Computer Engineering, (2): 62-64.
3. Christian Charras, Thierry Lecroq. Exact String Matching Algorithms [Z]. http://www-igm.univ-mlv.fr/~lecroq/string/
4. Huang Zhongqing, Wang Wenyong, Huang Xiaosheng. Implementation of the Network Information Audit System Based on WinPcap and Improved BM Algorithm [Z]. http://www.ahcit.com/lanmuyd.asp?Id=1180

Receipt date: January 1, March 17. Modified on: February 1, April 1.

Author profile: Ji Fuquan (1981-), male, master's student, main research interests: artificial intelligence and expert systems, neural networks; Zhu Zhanli, professor, master's supervisor.

The principle is quite simple. I don't know how well it works in practice.

Address: http://wendell07.blog.hexun.com/14112681_d.html
