Strings: KMP, extend-KMP, AC automaton, trie graph, trie, suffix tree, suffix array

Source: Internet
Author: User

String problems rely on a small set of algorithms and data structures: KMP, extend-KMP, the trie, the AC automaton, the trie graph, the suffix tree, and the suffix array, together with their applications. These are relatively advanced. The most common and familiar of them is KMP, and even so, many people do not fully understand KMP, let alone the others. For ordinary string problems a simple brute-force algorithm is usually enough, and if brute force is too slow, hashing is the next resort. Hashing is also a frequent topic in interviews, while the algorithms and data structures covered here are rarely asked about. When they do apply, however, they generally yield algorithms with good, often linear, complexity.

To be honest, I have always found string problems quite complicated: once brute force and hashing are ruled out, it is hard to think of other methods, although some problems can be solved with dynamic programming. To get past this difficulty I studied these algorithms and data structures carefully and wrote these notes so that I do not forget them over time. Most string problems you encounter fall within the range of these methods. The figure below describes the relationships among these algorithms and data structures; the yellow parts mark their key points.

 

As the figure shows, the relationships are as follows. Extend-KMP is an extension of KMP. The AC automaton is the multi-pattern form of KMP and is a finite automaton, while the trie graph is a deterministic finite automaton. The AC automaton, the trie graph, and the suffix tree are all tries. The suffix array and the suffix tree are both data structures built on the set of suffixes of a string. The suffix pointer in the trie graph plays the same role as the suffix link in the suffix tree.

Next we will explain these algorithms and data structures separately, and analyze and explain the key issues involved.

KMP

The main idea of KMP is to make full use of the result of the previous comparisons: when a match fails, the pattern string is shifted forward as far as is safe, and that shift is determined by the next array value at the current matching position of the pattern. Let a[1..n] be the pattern, let P_i denote the prefix a[1..i], and let A_j denote a[1..j]. Then next[i] = max{ j : j < i and A_j is a suffix of P_i }. Computing the next array is a self-matching process of the pattern against itself, in which next[i] is computed from the already known values next[1..i-1]. If a[i] = a[next[i-1] + 1], then next[i] = next[i-1] + 1; otherwise the pattern is shifted forward, that is, k = next[i-1] is replaced by next[k] and the comparison is repeated.
The entire process is as follows:
/* a[1..n] is the pattern (1-indexed); the result is written to next[1..n]. */
void next_comp(const char *a, int n, int *next) {
    int k = 0;
    next[1] = 0;
    /* Loop invariant: at the start of each iteration, k == next[i-1]. */
    for (int i = 2; i <= n; i++) {
        /* While the current position does not match and we have not yet
           fallen back to the start of the pattern, keep falling back. */
        while (k != 0 && a[k + 1] != a[i]) {
            k = next[k];
        }
        if (a[k + 1] == a[i]) k++;
        next[i] = k;
    }
}
Complexity analysis: the inner loop repeatedly executes k = next[k], and each execution decreases k by at least 1. On the other hand, k starts at 0, increases by at most 1 per iteration of the outer loop, and is never negative, so the total decrease cannot exceed the total increase, which is at most n. The whole process therefore runs in O(n).

The above is the calculation process of the next array, and the matching process of the entire KMP is similar to this.
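
The matching phase can be sketched in the same way. The following is a minimal C++ sketch, not the original author's code; it assumes 0-indexed strings, and the names build_next and kmp_search are made up for this illustration. Here fail[i] corresponds to the article's next[i+1].

```cpp
#include <string>
#include <vector>

// Prefix function, 0-indexed: fail[i] is the length of the longest proper
// prefix of p[0..i] that is also a suffix of it (the article's next[i+1]).
std::vector<int> build_next(const std::string& p) {
    std::vector<int> fail(p.size(), 0);
    int k = 0;
    for (std::size_t i = 1; i < p.size(); ++i) {
        while (k > 0 && p[k] != p[i]) k = fail[k - 1];
        if (p[k] == p[i]) ++k;
        fail[i] = k;
    }
    return fail;
}

// Returns the 0-based start positions of all occurrences of p in t.
std::vector<int> kmp_search(const std::string& t, const std::string& p) {
    std::vector<int> hits;
    if (p.empty()) return hits;
    std::vector<int> fail = build_next(p);
    int k = 0;  // number of pattern characters currently matched
    for (std::size_t i = 0; i < t.size(); ++i) {
        while (k > 0 && p[k] != t[i]) k = fail[k - 1];
        if (p[k] == t[i]) ++k;
        if (k == (int)p.size()) {
            hits.push_back((int)i - k + 1);
            k = fail[k - 1];  // fall back so overlapping matches are found
        }
    }
    return hits;
}
```

The text pointer i never moves backwards; only k falls back, which is why the same amortized argument as for the next array gives O(n + m) for matching.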

Extend-KMP

Why is it called extend-KMP? First look at what it computes: for every suffix of a string B, the longest common prefix of that suffix with a string A. extend[i] denotes the length of the longest common prefix of B[i..len_b] and A, and extend[] is the array to be computed.

KMP can decide whether A is a substring of B and find the first matching position. The extend[] array solves the matching problem directly: just check whether some element of extend[] equals len_a. Clearly this array stores more information than KMP produces, namely the matching length between A and every position of B.

The extend[] array is computed by a process similar to KMP. First, the longest common prefix lengths between A and each of its own suffixes are computed; call this array next[]. Its meaning differs from that of the next array in KMP, but it is likewise computed by using the known values next[1..i-1] to obtain next[i], so the overall idea is the same.

Specifically, first find the position k in 2..i-1 (position 1 is excluded because next[1] = len_a trivially) that maximizes k + next[k] - 1, that is, the k whose matched segment reaches furthest to the right.

The proof of linear complexity below uses the fact that the maximum of k + next[k] - 1 never decreases during the computation and has an obvious upper bound. As shown in the figure, suppose such a k has been found; how is next[i] computed? Let len = k + next[k] - 1 (in the figure, Ak marks the segment of length next[k] starting at position k). There are the following cases:

If len < i, that is, position i is not covered by the matched segment, simply compare a[i..n] with a character by character to find their longest common prefix. Each successful comparison increases the maximum of k + next[k] - 1 by one (the new maximum becomes i + next[i] - 1).
If len >= i, which is the situation drawn in the figure, then a[i..len] equals a[i-k+1..len-k+1], because a[k..len] matches a[1..next[k]]. Let L = next[i-k+1]. There are two sub-cases:
If L >= len - i + 1, that is, L reaches the second dotted line in the figure, then next[i] is at least len - i + 1, and comparison continues from position len + 1. Each successful comparison again increases i + next[i] - 1 by one.
If L < len - i + 1, that is, L ends at the first dotted line, then a[k..len] still agrees with the prefix of a at that position, while a[i-k+1..n] and a disagree there. Consequently a[i..n] and a disagree at that position as well, so next[i] = L.

This gives next[i]. Either next[i] is obtained directly from k, or characters have to be compared one by one; in the latter case each successful comparison increases the maximum of k + next[k] - 1 by one. Throughout the process this value only increases, and it has the obvious upper bound k + next[k] - 1 < 2 * len_a, so the number of comparisons is bounded by this value and the total complexity is O(n).
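
The case analysis above can be condensed into a short routine. This is a minimal C++ sketch under some assumptions: strings are 0-indexed, the self-matching array is called z here (it is the article's extend-KMP next[]), and right is the exclusive end of the furthest-reaching matched segment, that is, the article's len + 1. The function names are made up for this illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// z[i] = length of the longest common prefix of a and a[i..] (0-indexed).
std::vector<int> z_function(const std::string& a) {
    int n = (int)a.size();
    std::vector<int> z(n, 0);
    if (n == 0) return z;
    z[0] = n;
    int k = 0, right = 0;  // a[k, right) matches a[0, right - k)
    for (int i = 1; i < n; ++i) {
        // Covered case: reuse z[i - k], capped at the end of the segment.
        if (i < right) z[i] = std::min(z[i - k], right - i);
        // Extend by direct comparison (stops at once if the cap was not hit).
        while (i + z[i] < n && a[z[i]] == a[i + z[i]]) ++z[i];
        if (i + z[i] > right) { k = i; right = i + z[i]; }
    }
    return z;
}

// ext[i] = length of the longest common prefix of b[i..] and a.
std::vector<int> extend_kmp(const std::string& a, const std::string& b) {
    int n = (int)a.size(), m = (int)b.size();
    std::vector<int> z = z_function(a), ext(m, 0);
    int k = 0, right = 0;  // b[k, right) matches a[0, right - k)
    for (int i = 0; i < m; ++i) {
        if (i < right) ext[i] = std::min(z[i - k], right - i);
        while (ext[i] < n && i + ext[i] < m && a[ext[i]] == b[i + ext[i]]) ++ext[i];
        if (i + ext[i] > right) { k = i; right = i + ext[i]; }
    }
    return ext;
}
```

With a = "aab" and b = "aaabaab", the positions where ext[i] equals the length of a (1 and 4) are exactly the occurrences of a in b.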

Trie tree

A trie is a character search tree built from a set of strings. Edges are labeled with the characters of the strings, so whether a string str belongs to the set can be decided in O(len(str)) time. The child branches of a node can be stored in a linked list or an array; each has its advantages and disadvantages.

Each edge of a simple trie represents one character. To save space, an edge can instead represent a whole string of characters; this is the compressed representation of a trie. With compression, the space complexity of the trie is proportional to the number of words.
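
As an illustration of the array-based variant, here is a minimal C++ sketch of an uncompressed trie. It assumes the alphabet is the lowercase letters 'a'..'z'; the struct and member names are made up for this example.

```cpp
#include <array>
#include <string>
#include <vector>

// Array-based trie over 'a'..'z' (an assumption of this sketch).
struct Trie {
    struct Node {
        std::array<int, 26> child;  // index of the child node, or -1
        bool is_word = false;       // true if a stored string ends here
        Node() { child.fill(-1); }
    };
    std::vector<Node> nodes;
    Trie() : nodes(1) {}  // node 0 is the root

    void insert(const std::string& s) {
        int cur = 0;
        for (char ch : s) {
            int c = ch - 'a';
            if (nodes[cur].child[c] == -1) {
                nodes[cur].child[c] = (int)nodes.size();
                nodes.emplace_back();
            }
            cur = nodes[cur].child[c];
        }
        nodes[cur].is_word = true;
    }

    // O(len(s)) membership test.
    bool contains(const std::string& s) const {
        int cur = 0;
        for (char ch : s) {
            cur = nodes[cur].child[ch - 'a'];
            if (cur == -1) return false;
        }
        return nodes[cur].is_word;
    }
};
```

The array layout gives O(1) child lookup at the cost of 26 slots per node; a linked list of children trades lookup time for space.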

AC automaton

The AC automaton can be seen as the extension of KMP to multiple pattern strings, and it is used for multi-pattern matching. Build a trie from the pattern strings and then create a failure pointer for each node, analogous to the next function in KMP, so that when a match fails we know where to resume matching. AC stands for the initials of the two inventors' surnames, Aho and Corasick.

Recall that KMP builds the next array from front to back: next[1..i-1] is built first and then used to compute next[i]. The same holds here, except that the order is BFS order over the trie. The failure pointer of the AC automaton serves the same function as next: when the text is being matched on the trie and the current node has no child for the next character, matching continues from the node that the failure pointer of the current node points to. The string spelled from the root to that node is the longest proper suffix of the current node's path string that is also a path in the trie.

The process is as follows (quoted from "AC (Aho-Corasick) automaton algorithm", http://hi.baidu.com/luyade1987/blog/item/5ba280828dcb9eb96d811972.html):

It is similar to the self-matching of the pattern in KMP. Starting from the root and proceeding in BFS order, for each node: let k be the character on the edge into the node, and walk along the failure pointers, starting from the failure pointer of the parent node, until a node with a child for character k is found or the walk has reached the root and the root has no such child. In the latter case the failure pointer is set to the root; in the former case it is set to the child for character k of the node that was found. (The children of the root always fail to the root.)

We can also observe that if the AC automaton contains only one pattern string, this process is exactly the KMP next computation.

Next comes text matching. Let pointer p1 point to the root of the trie (built from the pattern set) and pointer p2 point to the first character of the text. The following steps are similar to KMP: let k be the letter that p2 points to. If the node that p1 points to has a child for k, then p2++ and p1 moves to that child. Otherwise p1 moves up along the failure pointer of the current node until p1 has a child for k or p1 points to the root (if the root has no child for k either, p2 simply advances). If p1 passes through a node marked as the end of a pattern string, that pattern has been matched. Likewise, if the end node of a pattern can be reached from the node where p1 is located by following failure pointers, the pattern ending at that node has also been matched.
Related material can be found at the following link:
www.cs.uku.fi/~kilpelai/bsa05/lectures/slides04.pdf

It constructs three functions from the pattern strings: goto (g), fail (f), and output (out). Matching a text T[1..m] then proceeds as follows:

q := 0;  // initial state (root)
for i := 1 to m do
    while g(q, T[i]) = ∅ do
        q := f(q);  // follow a fail
    q := g(q, T[i]);  // follow a goto
    if out(q) ≠ ∅ then print i, out(q);
endfor;
----------------------------------------- End of the quote -------------------------------------------------------------------------------------------
Taking ababa as an example, its KMP next array is 0 0 1 2 3. The following figure shows the corresponding AC automaton and trie graph:
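
The construction of failure pointers in BFS order and the matching loop can be sketched as follows. This is a minimal C++ sketch, not the quoted author's code; it assumes lowercase patterns and text, and the struct and member names are made up for this illustration. Each state's out list inherits the list of its failure state, so matching does not need to walk the failure chain to report patterns.

```cpp
#include <array>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct AhoCorasick {
    struct Node {
        std::array<int, 26> next;  // trie edges, -1 if absent
        int fail = 0;              // failure pointer (root by default)
        std::vector<int> out;      // ids of patterns recognized at this state
        Node() { next.fill(-1); }
    };
    std::vector<Node> nodes;
    AhoCorasick() : nodes(1) {}  // node 0 is the root

    void add(const std::string& p, int id) {
        int cur = 0;
        for (char ch : p) {
            int c = ch - 'a';
            if (nodes[cur].next[c] == -1) {
                nodes[cur].next[c] = (int)nodes.size();
                nodes.emplace_back();
            }
            cur = nodes[cur].next[c];
        }
        nodes[cur].out.push_back(id);
    }

    // Computes failure pointers level by level (BFS), like next[] in KMP.
    void build() {
        std::queue<int> q;
        for (int c = 0; c < 26; ++c)
            if (nodes[0].next[c] != -1) q.push(nodes[0].next[c]);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int c = 0; c < 26; ++c) {
                int v = nodes[u].next[c];
                if (v == -1) continue;
                int f = nodes[u].fail;
                while (f != 0 && nodes[f].next[c] == -1) f = nodes[f].fail;
                if (nodes[f].next[c] != -1) f = nodes[f].next[c];
                nodes[v].fail = f;
                nodes[v].out.insert(nodes[v].out.end(),
                                    nodes[f].out.begin(), nodes[f].out.end());
                q.push(v);
            }
        }
    }

    // Returns (end index in t, pattern id) for every occurrence.
    std::vector<std::pair<int, int>> search(const std::string& t) const {
        std::vector<std::pair<int, int>> hits;
        int q = 0;
        for (int i = 0; i < (int)t.size(); ++i) {
            int c = t[i] - 'a';
            while (q != 0 && nodes[q].next[c] == -1) q = nodes[q].fail;
            if (nodes[q].next[c] != -1) q = nodes[q].next[c];
            for (int id : nodes[q].out) hits.push_back({i, id});
        }
        return hits;
    }
};
```

For the single pattern ababa, where node i is the prefix of length i, the failure pointers of nodes 1 to 5 come out as 0 0 1 2 3, matching the KMP next array given above.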

Trie graph

A trie graph is a deterministic finite automaton; it adds determinism to the AC automaton. When an AC automaton meets a character that the current node cannot match, it may have to follow failure pointers several times before the next match. In a trie graph, every step is a single transition: each state has a definite successor state for every input character.

From the figure above we can also see that the suffix nodes of the trie graph are essentially the same as the failure pointers of the AC automaton. The difference is that the root of the trie graph is given an edge for every character of the alphabet, and likewise every other node is given an edge for every character. This edge-completion process is really a process of finding suffix nodes: the targets of the missing edges are virtual nodes, and instead of adding them to the graph we find their equivalent nodes, namely their suffix nodes, and point the edges there. (For example, the black node c in the figure does not appear in the initial trie, but we can treat it as a virtual node and point the edge to its suffix node.)

The trie graph relies on two concepts. The first is the suffix node: the node corresponding to the string obtained by removing the first character from a node's path string. It is computed through the suffix node of the parent node. The path string of the parent's suffix node differs from that of the node's own suffix node only in one missing tail character, call it c. Therefore the c child of the parent's suffix node is the suffix node of the node. Sometimes, however, the parent's suffix node has no c child, and then we must find a node equivalent to that missing c child. This raises the problem of finding equivalent nodes.

The second is the edge-filling operation. The target of an edge for a character that does not exist at a node can be regarded as a virtual node; we need to find an existing node equivalent to it and point the edge there. So this, too, comes down to finding equivalent nodes.

How do we find the equivalent of a node? Two nodes are equivalent when their danger status is the same. A node is dangerous exactly when its path string is itself a dangerous word or the node corresponding to the suffix of its path string is dangerous. Hence, if the path string of a node is not a dangerous word, the node is equivalent to its suffix node, and when filling an edge we can point it to the suffix node of the virtual node.

The trie graph improves on the trie by adding extra information, which makes multi-pattern matching easy. Like KMP, the trie graph uses existing matching information to guide future matching. Some new concepts are needed. In a trie, the string formed by concatenating the characters on the edges along the path from the root to a node is called the path string of that node. If the path string of a node ends with a dangerous word, the node is a dangerous node, meaning that reaching it indicates a match; otherwise it is a safe node. How do we determine whether a node is dangerous?

The root is obviously a safe node. A node is dangerous when its path string is itself a dangerous word, or when the node corresponding to the suffix of its path string is dangerous. Here the suffix means what remains of the string after removing its first character, and the node corresponding to a string is the node reached by starting from the root of the trie graph and following the edges for its characters in turn.

How is the suffix node of each node found? We can use previously computed information, namely the suffix node of the parent node. Let c be the last character of the current node's path string; the c child of the suffix node of the parent is the required suffix node. We define the suffix node of the root to be the root itself, and the suffix node of every first-level node is also the root. In this way the suffix nodes of all nodes can be found level by level. One problem may arise: the suffix node of the parent may have no c child. What then?

As shown in the figure, let w be the suffix node of the parent of the current node, and suppose w has no c child. Consider the whole virtual c subtree of w: because no real c edge leads to it, the path strings of its nodes cannot themselves be dangerous words, so the danger status of each of these nodes equals that of its suffix node. The suffix node of the virtual c child of w is the c child of the suffix node of w. This step is repeated until an existing node is found, which happens at the latest at the root.

-------------------- Reference: http://huangwei.host7.meyu.net/?paged=7

A trie graph is in fact a deterministic finite automaton (DFA): each vertex of the graph is a state, and transitions between states are directed edges. The trie graph is obtained by adding edges to a trie, and it derives from the AC automaton. The AC automaton stores only the suffix node of each state and, during matching, jumps along suffix nodes, in the manner of KMP, until the required transition is found. The referenced article can be consulted for details.

The trie graph stores at each node the result of the AC automaton's transition computation, so suffix nodes need not be followed iteratively during matching. Every node of the trie graph therefore has |Σ| transitions (Σ denotes the alphabet). The construction is described in the WC2006 paper "Trie graph construction, utilization and improvement". Briefly:
(1) Build the trie and make sure the root has |Σ| children.
(2) Traverse the trie level by level; for each node compute its suffix node, mark the node, and fill in the edges for any of the |Σ| children it lacks.
Suffix node computation:
(1) The suffix node of the root is the root itself.
(2) The suffix node of every node on the second level of the trie is also the root.
(3) The suffix node of any other node is the target of the corresponding transition from the suffix node of its parent (similar to the iteration in the AC automaton).
Node marking (a node is marked if either condition holds):
(1) The node itself is marked as the end of a word.
(2) Its suffix node is marked.
Edge filling:
For each missing transition of the current node, use the corresponding transition of its suffix node.
In the end every node of the trie graph has a transition for every character, and these transitions are used for dynamic programming.
Let DP[i][j] denote the minimum number of changes relative to the DNA sequence when j characters have been generated and the automaton is in state i.
With the root of the trie graph as state 0, initialize DP[0][0] = 0 and all other entries to infinity.
Then traverse the graph in BFS fashion: being at level j means that a string of length j has been generated. Only transitions into unmarked (safe) states are allowed.
DP[0][0] = 0;
for i = 1 to m do
    for each edge (s1, ch, s2) of the graph with s2 unmarked do
        DP[s2][i] = min{DP[s2][i], DP[s1][i-1] + (txt[i-1] != ch)};
for each node x do
    ans = min{ans, DP[x][m]};

----------------------------------------------- End of the quote -----------------------------------------------------------------------------------
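
The construction steps and the dynamic program quoted above can be put together in a short program. This is a minimal C++ sketch under several assumptions: the alphabet is A, C, G, T; the names TrieGraph and min_changes are made up for this illustration; the DP uses a rolling one-dimensional array instead of the two-dimensional DP[state][length] table; and the result is -1 when no safe string of the required length exists.

```cpp
#include <algorithm>
#include <array>
#include <climits>
#include <queue>
#include <string>
#include <vector>

struct TrieGraph {
    static int code(char ch) { return ch == 'A' ? 0 : ch == 'C' ? 1 : ch == 'G' ? 2 : 3; }
    struct Node {
        std::array<int, 4> next;  // after build(): a transition for every character
        int suffix = 0;           // suffix node
        bool danger = false;      // marked (dangerous) node
        Node() { next.fill(-1); }
    };
    std::vector<Node> nodes;
    TrieGraph() : nodes(1) {}  // node 0 is the root

    void add(const std::string& word) {
        int cur = 0;
        for (char ch : word) {
            int c = code(ch);
            if (nodes[cur].next[c] == -1) {
                nodes[cur].next[c] = (int)nodes.size();
                nodes.emplace_back();
            }
            cur = nodes[cur].next[c];
        }
        nodes[cur].danger = true;
    }

    // Level-order pass: suffix nodes, node marking, edge filling.
    void build() {
        std::queue<int> q;
        for (int c = 0; c < 4; ++c) {
            int v = nodes[0].next[c];
            if (v == -1) nodes[0].next[c] = 0;  // missing root edges loop to the root
            else q.push(v);                     // first-level suffix node is the root
        }
        while (!q.empty()) {
            int u = q.front(); q.pop();
            int s = nodes[u].suffix;
            if (nodes[s].danger) nodes[u].danger = true;
            for (int c = 0; c < 4; ++c) {
                int v = nodes[u].next[c];
                if (v == -1) nodes[u].next[c] = nodes[s].next[c];  // edge filling
                else { nodes[v].suffix = nodes[s].next[c]; q.push(v); }
            }
        }
    }
};

// Minimum number of characters of txt to change so that no dangerous word
// occurs in it; -1 if this is impossible.
int min_changes(const TrieGraph& g, const std::string& txt) {
    const int INF = INT_MAX / 2;
    int n = (int)g.nodes.size();
    std::vector<int> dp(n, INF), nxt;
    dp[0] = 0;
    for (char ch : txt) {
        nxt.assign(n, INF);
        for (int s1 = 0; s1 < n; ++s1) {
            if (dp[s1] == INF) continue;
            for (int c = 0; c < 4; ++c) {
                int s2 = g.nodes[s1].next[c];
                if (g.nodes[s2].danger) continue;  // never enter a marked state
                nxt[s2] = std::min(nxt[s2], dp[s1] + (TrieGraph::code(ch) != c ? 1 : 0));
            }
        }
        dp.swap(nxt);
    }
    int ans = *std::min_element(dp.begin(), dp.end());
    return ans == INF ? -1 : ans;
}
```

For the dangerous words AAA and AAG and the text AAAG, one change (for example to AACG) is enough, and the sketch returns 1.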

Suffix tree

A suffix tree is a trie built from all suffixes of a string. If the trie is stored uncompressed, the total number of internal and external nodes can reach O(n^2). Such storage cannot be used, because the complexity would then have a lower bound of O(n^2). The compressed representation must be used to reduce it to O(n).

Before construction, we append to the string a character that does not occur in it, such as "$". Why? To prevent a suffix from ending at an internal node. With "$" appended, no suffix can end inside the tree, which can be shown by contradiction: if some suffix ended at an internal node, it would be a proper prefix of another suffix, so "$" would have to occur in the middle of the string as well as at the end. That is impossible, because "$" occurs only at the end.

What does the construction look like in general? In the straightforward construction, let the string be a[1..n]. We insert the suffixes into the trie one at a time, starting with the suffix beginning at a[1]. During insertion we compare step by step until the first mismatch, split the existing edge there, and attach a new leaf. The key step is finding the longest prefix that suffix[i] shares with any of the previously inserted suffixes suffix[1], ..., suffix[i-1]. Once it is found, the insertion itself takes O(1), so the time is dominated by the search for this longest common prefix, called head[i]. Formally, head[i] is the longest common prefix of W(i, n) and W(j, n) over all positive integers j < i, and tail[i] is defined by head[i] + tail[i] = W(i, n).

The key is therefore the computation of head[i]. Once again, consider how to use head[1], ..., head[i-1] to compute head[i]. To speed up the search for head[i], an auxiliary structure is needed: the suffix link.

Definition of the suffix link (McCreight's algorithm):
Let head[i-1] = az, where a is the (i-1)-th character of the string W. Then z occurs at least twice in W: since az is the longest common prefix of a[i-1..n] and some earlier suffix, that earlier suffix also starts with az, so its successor suffix starts with z, and the common prefix of a[i..n] with that successor contains z. (This same property is used when computing the LCP of the suffix array.) Hence |head[i]| >= |z|, and z is a prefix of head[i]. The suffix link of head[i-1] is a pointer from the node of head[i-1] to the node d corresponding to z, written link[head[i-1]]. Of course, z may be the empty string, in which case the link from head[i-1] points to the root.

Comparing this with the failure pointer of the AC automaton and the suffix pointer of the trie graph, we find that z is exactly head[i-1] with its first character removed, so the suffix link is a link to the node of that suffix of head[i-1]. This agrees with where the suffix pointer in the trie graph points, which makes it clear how the suffix link of a head is established.

Creation rules:
1) The suffix link of the root points to the root itself.
2) The suffix link of any non-leaf node is created immediately after the node appears.

The main framework of the algorithm is as follows:
for i = 1 to n do
    Step 1. Search down from link[head[i-1]] to locate the node head[i] (the find function)
    Step 2. Add the leaf node leaf[i]
    Step 3. Create the suffix link of head[i] (the down function)
end

Suffix tree performance analysis:
Consider the pseudocode above. For a given i, Step 2 costs O(1), but the number of nodes between link[head[i-1]] and head[i] cannot be bounded in advance, so a single Step 1 may take linear time. Since a local estimate fails, look at the whole instead: i + |head[i]| never decreases as i increases. Therefore each character of W is traversed only once by the find function, and the overall complexity is O(n).

This analysis is similar to the extend-KMP complexity analysis.

Suffix Array

A suffix array is obtained by sorting all suffixes of a string in lexicographic order and storing the sorted order in an array SA, whose elements are the starting indices of the suffixes in the original string. From it we can easily derive another array rank[], where rank[i] is the position of the suffix a[i..n] in SA.

This data structure involves two main issues. One is how to sort the suffixes quickly. There are many methods; here we present only the prefix doubling algorithm, which is easy to understand and quite clever.

Once the suffix array is built, making full use of it also requires the longest common prefix (LCP) of suffixes, so the computation of LCP is the other key point.

First, sorting. An ordinary comparison sort needs O(n log n) comparisons, and each comparison of two suffixes costs O(n), so the total complexity is O(n^2 log n).

The doubling algorithm works as follows: the comparisons in round i use the sorting result of round i-1, so each comparison takes O(1) time.
We first sort the substrings of length 1 starting at every position of the original string, then the substrings of length 2, and so on with length 2^i, until 2^i >= n. In total about log n rounds of sorting are needed. Now consider how round i uses the result of round i-1.

For example, to compare the substrings of length 2^i starting at a[j] and at a[k], split each into two halves:
substring of length 2^i at a[j] = substring of length 2^(i-1) at a[j] + substring of length 2^(i-1) at a[j + 2^(i-1)]
substring of length 2^i at a[k] = substring of length 2^(i-1) at a[k] + substring of length 2^(i-1) at a[k + 2^(i-1)]
To compare the two substrings, first compare their first halves and, only if these are equal, compare their second halves. All halves were sorted in the previous round, so each has a rank value, and comparing rank values alone gives the order; a comparison thus takes O(1) time. If, in addition, each round uses an O(n) sort (such as radix sort), the whole algorithm runs in O(n log n).
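
The doubling idea can be sketched as follows. This is a minimal C++ sketch with two assumptions that differ from the text: indices are 0-based, and each round uses std::sort with the O(1) rank-pair comparison instead of an O(n) radix sort, which gives O(n log^2 n) overall rather than O(n log n). The function name suffix_array is made up for this illustration.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Returns SA (0-indexed): sa[r] is the start of the r-th smallest suffix.
std::vector<int> suffix_array(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; ++i) rank[i] = (unsigned char)s[i];
    for (int len = 1;; len *= 2) {
        // Order by (rank of first half, rank of second half); a second half
        // that runs past the end of the string counts as smallest (-1).
        auto by_pair = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + len < n ? rank[a + len] : -1;
            int rb = b + len < n ? rank[b + len] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), by_pair);
        // Re-rank: equal pairs share a rank, a larger pair gets the next rank.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (by_pair(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: sorted
    }
    return sa;
}
```

For "banana" the result is 5 3 1 0 4 2, corresponding to the sorted suffixes a, ana, anana, banana, na, nana.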

Now the LCP. To compute the LCP of any two suffixes, we have the following conclusion:

LCP theorem: for i < j, LCP(i, j) = min{ LCP(k-1, k) | i+1 <= k <= j }, where i and j are positions in SA, that is, LCP(i, j) is the longest common prefix length of the suffixes SA[i] and SA[j].

To prove this, first prove the following: for any 1 <= i < j < k <= n, LCP(i, k) = min{ LCP(i, j), LCP(j, k) }.

In other words, the longest common prefix length of the suffixes ranked i and j is the minimum LCP of adjacent suffixes between i and j. So we only need the LCP of adjacent suffixes in SA, and the query becomes a range minimum query (RMQ), which can be answered in O(1) after preprocessing. The remaining problem is how to compute the LCP of adjacent suffixes in SA in O(n) time.
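
One common way to answer such range minimum queries is a sparse table. The following minimal C++ sketch assumes 0-based ranks and an array height in which height[r] is the LCP of the suffixes ranked r-1 and r (with height[0] = 0); the struct name LcpRmq is made up for this illustration. For brevity the query finds the power of two with a small loop, so it is O(log n) as written; a precomputed logarithm table would make it O(1).

```cpp
#include <algorithm>
#include <vector>

struct LcpRmq {
    // table[k][i] = min of height[i .. i + 2^k - 1]
    std::vector<std::vector<int>> table;

    explicit LcpRmq(const std::vector<int>& height) {
        int n = (int)height.size();
        table.push_back(height);
        for (int k = 1; (1 << k) <= n; ++k) {
            std::vector<int> row(n - (1 << k) + 1);
            for (int i = 0; i + (1 << k) <= n; ++i)
                row[i] = std::min(table[k - 1][i],
                                  table[k - 1][i + (1 << (k - 1))]);
            table.push_back(row);
        }
    }

    // LCP of the suffixes ranked i and j in SA, with i < j:
    // the minimum of height[i + 1 .. j].
    int lcp(int i, int j) const {
        int lo = i + 1, len = j - i;
        int k = 0;
        while ((1 << (k + 1)) <= len) ++k;  // largest k with 2^k <= len
        return std::min(table[k][lo], table[k][j - (1 << k) + 1]);
    }
};
```

With the height values 0 1 3 0 0 2 of "banana", lcp(1, 2) is 3 (ana and anana) and lcp(0, 2) is 1 (a and anana).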

To achieve O(n), define a one-dimensional array height with height[i] = LCP(i-1, i) for 1 < i <= n and height[1] = 0. How can the height array be computed as efficiently as possible?

For ease of description, let h[i] = height[rank[i]], that is, height[i] = h[SA[i]]. The h array satisfies the following property:

For i > 1 and rank[i] > 1, h[i] >= h[i-1] - 1.

Why does this hold? It matches the suffix link of the suffix tree discussed above. h[i] = height[rank[i]] is the longest common prefix of the suffix a[i..n] with the suffix immediately before it in SA, and h[i-1] is the same for a[i-1..n]. Removing the first character of a[i-1..n] gives a[i..n]. Suppose the suffix adjacent to a[i-1..n] in SA is xyyyyyy, so that their LCP has length h[i-1]. Removing x leaves yyyyyyy, which is itself a suffix, precedes a[i..n] in SA, and shares h[i-1] - 1 characters with it. If yyyyyyy is the immediate neighbor of a[i..n], then h[i] = h[i-1] - 1; if the neighbor is some other suffix lying between them, h[i] can only be larger than h[i-1] - 1, never smaller.

In this way all h[i] can be computed in O(n) total time, because h never exceeds n and decreases by at most 1 from one step to the next.

After that, the height array follows from height[i] = h[SA[i]], which gives the LCP of adjacent suffixes in SA.
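
The property h[i] >= h[i-1] - 1 leads directly to a linear-time loop. This minimal C++ sketch assumes 0-based indices and takes the suffix array as input; the function name height_array is made up for this illustration. The running value h plays the role of h[i], and it is written straight into height[rank[i]].

```cpp
#include <string>
#include <vector>

// height[r] = LCP of the suffixes sa[r-1] and sa[r]; height[0] = 0.
std::vector<int> height_array(const std::string& s, const std::vector<int>& sa) {
    int n = (int)s.size();
    std::vector<int> rank(n), height(n, 0);
    for (int r = 0; r < n; ++r) rank[sa[r]] = r;
    int h = 0;
    for (int i = 0; i < n; ++i) {          // suffixes in text order
        if (rank[i] == 0) { h = 0; continue; }
        int j = sa[rank[i] - 1];           // predecessor of suffix i in SA
        // Start from the guaranteed lower bound and extend by comparison.
        while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h;
        height[rank[i]] = h;
        if (h > 0) --h;                    // h[i+1] >= h[i] - 1
    }
    return height;
}
```

For "banana" with SA = 5 3 1 0 4 2 the result is 0 1 3 0 0 2.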

Summary:
All of the algorithms above have one thing in common: they use the results already computed to obtain the next result, exploiting existing information as far as possible to reduce the amount of computation.

Reprinted from: http://duanple.blog.163.com/blog/static/709717672009825004092/
