Clear explanation of Ukkonen's suffix tree algorithm

Source: Internet
Author: User

There is a translated article on this site that clearly explains Ukkonen's suffix tree algorithm. The article is well written, but it contains a mistake.

I implemented Ukkonen's algorithm by following the instructions in that article, but my tests failed. At first I was sure the article could not be at fault, so I spent a long time looking for the bug in my own code. It took me two days, and I only solved the problem once I set aside the assumption that the article was correct.

The following is Ukkonen's algorithm as I have reorganized it. The example string is "abcabxabcd".

The C++ source code for building the suffix tree and plotting it is at http://www.oschina.net/code/snippet_593413_38384

When 'a', 'b', and 'c' are inserted in sequence, the suffix tree is as follows:


In the figures, the green node is the active node and the yellow node is the root node. If there is no active edge, the active length is zero.

A black arrow in the figures is an edge. A label of the form x:(n,#) marks an edge that points to a leaf node: x is the first character of the edge, n is the start position of the edge in the text, and # stands for the open end of the edge, which grows with the text. Ordinary edges list all the characters they contain.

The value of a leaf node is the starting position of the suffix that the leaf represents.
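To make the label notation concrete, here is a minimal sketch of how such labels could be produced. The function name `edge_label` and the convention that `end=None` means an open end are my own assumptions for illustration; they are not taken from the C++ source linked above.

```python
def edge_label(text, start, end=None):
    """Format an edge the way the figures label it.

    end=None is assumed to mean an open ('#') edge that points to a leaf.
    """
    if end is None:
        return f"{text[start]}:({start},#)"
    return text[start:end]          # ordinary edge: list every character


text = "abcabxabcd"
print(edge_label(text, 2))          # c:(2,#)
print(edge_label(text, 0, 2))       # ab
```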

The example above shows the basic rule: when the active length is zero and the active node has no edge beginning with the character being inserted, a new edge and leaf node are inserted. The number of remaining suffixes is incremented by 1 before each scan and decremented by 1 after each insertion, so it is 0 at this point.
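The first three scans can be replayed with a few lines of Python. This is only a sketch of the bookkeeping for this one case: the dictionary `root_edges` (first character mapped to leaf value) is a hypothetical stand-in for the real tree.

```python
text = "abcabxabcd"
root_edges = {}      # first character of an edge -> leaf value (hypothetical stand-in)
remaining = 0        # number of remaining suffixes

for pos in range(3):                 # scan 'a', 'b', 'c'
    remaining += 1                   # +1 before scanning
    ch = text[pos]
    if ch not in root_edges:         # no edge starts with ch at the active node
        root_edges[ch] = pos         # new edge ch:(pos,#) with leaf value pos
        remaining -= 1               # -1 after the insertion

print(root_edges, remaining)         # {'a': 0, 'b': 1, 'c': 2} 0
```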

Next, insert 'a' (current scan position 3):

Because the character 'a' already begins an existing edge, we set the active triple to (root, 'a', 1), update the number of remaining suffixes, and finish scanning this character. In practice, the active edge is stored not as 'a' but as 3, meaning that the active edge begins with the character at text index 3. The active triple is therefore (root, 3, 1). The number of remaining suffixes was incremented by 1 before the scan and no suffix was inserted this time, so it is now 1.
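A small sketch of this representation, assuming a named tuple called `ActivePoint` (the name and its field names are mine, chosen only for illustration). Storing the edge as a text index still lets us recover the character when we need it.

```python
from typing import NamedTuple


class ActivePoint(NamedTuple):
    node: str      # label of the active node, e.g. "root" (hypothetical naming)
    edge: int      # text index of the first character of the active edge
    length: int    # number of characters already matched on that edge


def active_edge_char(text: str, ap: ActivePoint) -> str:
    """Look up the character that identifies the active edge."""
    return text[ap.edge]


text = "abcabxabcd"
ap = ActivePoint("root", 3, 1)       # state after scanning the second 'a'
print(active_edge_char(text, ap))    # a
```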

In the figures, a green arrow marks the active edge, and the number after the colon at the end of the edge label is the active length.

Next, insert 'b' (current scan position 4):

Because the character 'b' is the next character on the active edge, we set the active triple to (root, 3, 2), the number of remaining suffixes becomes 2, and the scan of this character is complete.

Next, insert 'x' (current scan position 5):

To insert 'x', note that the active edge currently reads "abcabx" and the active length is 2. The next character on this edge is 'c', not 'x'. We therefore split the edge into the edge "ab" and the edge c:(2,#), with a new internal node between them, and attach a new edge x:(5,#) with leaf node 3 to that node.

Why is the leaf node 3? The number of remaining suffixes is 3, so the suffix being inserted is "abx", and the text before it is "abc". Because "ab" is already implicit in the existing suffix tree, the edge actually inserted starts at 'x', but the suffix itself starts at the second 'a', whose index is 3. As a formula: leaf value = current scan position + 1 - number of remaining suffixes, here 5 + 1 - 3 = 3.
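The formula is easy to check in code. The helper name `leaf_start` is an assumption made for this sketch.

```python
def leaf_start(scan_pos, remaining):
    """Leaf value = current scan position + 1 - number of remaining suffixes."""
    return scan_pos + 1 - remaining


text = "abcabxabcd"
start = leaf_start(5, 3)             # scanning 'x' with 3 suffixes remaining
print(start, text[start:6])          # 3 abx
```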

After the new edge is inserted, the number of remaining suffixes is decremented by 1. Because it is still greater than 0, we keep inserting suffixes until the number of remaining suffixes reaches 0 or the suffix is already implicit in the tree.

After inserting a new edge and node, the active triple must be updated. Because the active node is the root node, the rule is: active edge (index) + 1, active length - 1. The new triple is (root, 4, 1).

The character at text index 4 is 'b', so the new active edge is b:(1,#).
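As a quick sketch, the root rule is a one-line update. The function name `advance_at_root` is hypothetical.

```python
def advance_at_root(edge, length):
    """Root rule from the walkthrough: active edge (index) + 1, active length - 1."""
    return edge + 1, length - 1


text = "abcabxabcd"
edge, length = advance_at_root(3, 2)     # triple was (root, 3, 2)
print(edge, length, text[edge])          # 4 1 b
```

Applying the same rule once more gives (5, 0), which is the triple the walkthrough reaches after the second split.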

Continuing the loop, we split this edge and insert the new edge x:(5,#) with leaf node 4.

At the same time, the rules require us to add a suffix pointer. Suffix pointers are added while a single character is being scanned, between the internal nodes newly created by edge splits, from the older node to the newer one. If the nodes created during one scan are A, B, and C in that order, we add the suffix pointers A -> B and B -> C.
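The chaining rule can be sketched as follows. The `Node` class and the function `link_in_creation_order` are illustrative assumptions; a real implementation would set each pointer at the moment the newer node is created rather than afterwards.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.suffix_pointer = None


def link_in_creation_order(new_nodes):
    """Point each internal node created in one scan at the next one created."""
    for older, newer in zip(new_nodes, new_nodes[1:]):
        older.suffix_pointer = newer


a, b, c = Node("A"), Node("B"), Node("C")
link_in_creation_order([a, b, c])
print(a.suffix_pointer.name, b.suffix_pointer.name, c.suffix_pointer)   # B C None
```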

In the figures, a suffix pointer is drawn as a red arrow.

The updated triple is now (root, 5, 0), and the number of remaining suffixes is 1.

The next insertion follows the same rule that was used for the first three characters: a new edge x:(5,#) with leaf node 5 is added to the root node (the active node).

The number of remaining suffixes is 0, and this scan ends.

Next, insert 'a' (current scan position 6):

Update the active triple to (root, 6, 1); the number of remaining suffixes is 1.

Next, insert 'b' (current scan position 7):

Update the active triple to (root, 6, 2) and the number of remaining suffixes to 2. Because the active point has now reached a node, the active node is reset to that node (shown in green).

Next, insert 'c' (current scan position 8):

Update the active triple to (green, 8, 1) and the number of remaining suffixes to 3.

Next, insert 'd' (current scan position 9):


A series of suffixes is inserted:

The number of remaining suffixes is incremented by 1.

Split the edge c:(2,#) => 0. Because the active node has a suffix pointer, the active node is reset to the node that pointer leads to (the node reached from the root through edge b). The active edge and active length stay at 8 and 1.

Split the edge c:(2,#) => 1, which produces a new suffix pointer. At this point the active node has no suffix pointer and is not the root node. There are two ways to locate the next active point.

1) The simple way: we know the number of remaining suffixes and the current scan position, which means we know exactly which suffix must be inserted next, so we can search for it from the root node.

The current scan position is 9 and the number of remaining suffixes is 2, so the suffix to insert starts at index 8 and covers [8, 9], that is, "cd". The active triple is reset to (root, 8, 1), following the pattern (root node, current scan position - number of remaining suffixes + 1, number of remaining suffixes - 1).

Then walk down the tree, adjusting the active triple, until the active length is 0 or less than the length of the active edge.

This locates the active point: it lies on the edge c that leaves the root node.

2) Update the active triple to (current active node, current scan position - current active length, current active length).

From this active node, move up toward the parent node. Each time you move to a parent, subtract the length of the edge you passed from the active edge (index) and add it to the active length.

Continue until you reach the root node, where the update is the same as when the active node is the root: active edge (index) + 1, active length - 1.

Alternatively, stop as soon as the node you have moved to has a suffix pointer: move the active node along that suffix pointer and leave the active edge and active length unchanged.

After reaching the root node or following a suffix pointer, you must again walk down the suffix tree, adjusting the triple, until the active length is 0 or less than the length of the active edge.

Both methods arrive at the same point. In general, the first is simpler when the tree is small, and the second is better when the tree is complex.
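The two methods can be compared on the situation described above (scan position 9, two suffixes remaining, active node reached from the root through edge b). The sketch below is not the author's C++ code: the `Node` class, the helper names, and the hand-built partial tree are all assumptions, and only the nodes that the two walks actually touch are constructed.

```python
TEXT = "abcabxabcd"


class Node:
    def __init__(self, parent=None, start=0, end=None):
        self.parent = parent
        self.start = start           # incoming edge starts at this text index
        self.end = end               # one past its last index; None is the open end '#'
        self.children = {}           # first character of an edge -> child node
        self.suffix_pointer = None


def add_child(parent, start, end=None):
    child = Node(parent, start, end)
    parent.children[TEXT[start]] = child
    return child


def edge_length(node, scan_pos):
    end = scan_pos + 1 if node.end is None else node.end
    return end - node.start


def walk_down(node, edge, length, scan_pos):
    """Descend until the active length is 0 or shorter than the active edge."""
    while length > 0:
        child = node.children[TEXT[edge]]
        span = edge_length(child, scan_pos)
        if length < span:
            break
        node, edge, length = child, edge + span, length - span
    return node, edge, length


def method_1(root, scan_pos, remaining):
    """Restart from the root with the suffix that must be inserted next."""
    return walk_down(root, scan_pos - remaining + 1, remaining - 1, scan_pos)


def method_2(node, length, scan_pos):
    """Climb toward the root, stopping early at a node with a suffix pointer."""
    edge = scan_pos - length
    while node.parent is not None and node.suffix_pointer is None:
        span = edge_length(node, scan_pos)
        edge, length = edge - span, length + span
        node = node.parent
    if node.parent is None:                      # reached the root
        edge, length = edge + 1, length - 1
    else:                                        # found a suffix pointer
        node = node.suffix_pointer
    return walk_down(node, edge, length, scan_pos)


# Partial tree at scan position 9, after the second split.
root = Node()
ab = add_child(root, 0, 2)       # internal node reached through "ab"
b = add_child(root, 1, 2)        # internal node reached through "b"
ab.suffix_pointer = b
add_child(root, 2)               # leaf edge c:(2,#)
add_child(root, 5)               # leaf edge x:(5,#)

m1 = method_1(root, scan_pos=9, remaining=2)
m2 = method_2(b, length=1, scan_pos=9)
print(m1[0] is root, m1[1:], m1 == m2)           # True (8, 1) True
```

Both calls end at the triple (root, 8, 1), as the text says they should.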

Split the edge c:(2,#) => 2, which produces a new suffix pointer. The active triple is updated to (root, 9, 0), and the number of remaining suffixes is 1.

Insert the new edge d:(9,#) and the new leaf node 9.

This scan ends.

............

This process can continue until a termination identifier is accepted or the end operation is performed.


Because the number of remaining suffixes has dropped to 0 after scanning 'd', the end operation only has to mark the root node as a suffix-end node, representing the empty suffix.

In fact, because 'd' appears only once in the string, it behaves just like a termination identifier. If we do not actually insert a termination identifier (and do not even need to compare it with the next character at the active position), the standard end operation is to replace every step that would add an edge containing only the termination identifier with setting a leaf attribute on the node.

In summary, I made only one modification: the update rule for the active triple.

Specifically, it is the update rule for the case where the active node is not the root node and has no suffix pointer.
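To tie the rules together, here is a compact Python sketch of the whole construction. It is my own reading of the walkthrough, not a port of the C++ source: all names are hypothetical, the walk down the tree is done lazily at the start of each insertion rather than as soon as a node is reached, and a suffix pointer is only set between internal nodes created by consecutive splits in the same scan. When the active node is not the root and has no suffix pointer, the sketch uses the first method (restart from the root).

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start = start            # incoming edge covers text[start:end]
        self.end = end                # None marks an open end ('#')
        self.children = {}            # first edge character -> child node
        self.suffix_pointer = None
        self.suffix_start = None      # leaf value; stays None on internal nodes


def build_suffix_tree(text):
    root = Node()
    node, edge, length = root, 0, 0   # active triple
    remaining = 0

    def span(child, pos):
        return (pos + 1 if child.end is None else child.end) - child.start

    for pos, ch in enumerate(text):
        remaining += 1
        previous_split = None
        while remaining > 0:
            if length == 0:
                edge = pos
            child = node.children.get(text[edge])
            if child is not None and length >= span(child, pos):
                step = span(child, pos)           # walk down past a whole edge
                node, edge, length = child, edge + step, length - step
                continue
            if child is None:                     # no edge: hang the leaf here
                target = node
                previous_split = None
            elif text[child.start + length] == ch:
                length += 1                       # suffix already implicit
                break
            else:                                 # split the edge
                target = Node(child.start, child.start + length)
                node.children[text[edge]] = target
                child.start += length
                target.children[text[child.start]] = child
                if previous_split is not None:
                    previous_split.suffix_pointer = target
                previous_split = target
            leaf = Node(pos, None)
            leaf.suffix_start = pos + 1 - remaining
            target.children[ch] = leaf
            remaining -= 1
            if node is root:
                if length > 0:
                    edge, length = edge + 1, length - 1
            elif node.suffix_pointer is not None:
                node = node.suffix_pointer
            else:                                 # the modified rule (method 1)
                node, edge, length = root, pos - remaining + 1, remaining - 1
    return root


def leaf_starts(node):
    if not node.children:
        return [node.suffix_start]
    found = []
    for child in node.children.values():
        found.extend(leaf_starts(child))
    return found


def contains(root, text, pattern):
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return False
        end = len(text) if child.end is None else child.end
        label = text[child.start:end]
        chunk = pattern[i:i + len(label)]
        if not label.startswith(chunk):
            return False
        i += len(chunk)
        node = child
    return True


text = "abcabxabcd"
root = build_suffix_tree(text)
print(sorted(leaf_starts(root)))                  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(contains(root, text, "bxab"), contains(root, text, "abd"))   # True False
```

Restarting from the root is always correct but can cost extra steps on large inputs, which is the trade-off between the two methods mentioned above.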

All figures in this article were generated with Graphviz.
