I want to tell you a story about suffix arrays. For a while I was interviewing at a company in Seattle, at a time when I was curious about the most efficient way to create diffs of executable binary files. My research led me to suffix arrays and suffix trees. A suffix array simply takes all the suffixes of a string and stores them in a sorted list. A suffix tree is similar, but is more of a BSTree than a list. These algorithms are fairly simple, and once you have done the sort operation they perform quickly. The problem they solve is finding the longest common substring between two strings (or, in this case, two lists of bytes).
You can easily create a suffix array in Python:
>>> magic = "abracadabra"
>>> magic_sa = []
>>> for i in range(0, len(magic)):
...     magic_sa.append(magic[i:])
...
>>> magic_sa
['abracadabra', 'bracadabra', 'racadabra', 'acadabra', 'cadabra', 'adabra', 'dabra', 'abra', 'bra', 'ra', 'a']
>>> magic_sa = sorted(magic_sa)
>>> magic_sa
['a', 'abra', 'abracadabra', 'acadabra', 'adabra', 'bra', 'bracadabra', 'cadabra', 'dabra', 'ra', 'racadabra']
>>>
As you can see, I simply take the suffixes of the string in order and then sort that list. But what does this do for me? Once I have the sorted list, I can binary search it for any suffix I'm looking for. This example is very primitive, but in real code you can do this very quickly, and you would keep track of all the original indexes so you can refer back to where each suffix sits in the original string. It is very fast compared to other search algorithms and is useful for things like DNA analysis.
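To make the index tracking and the binary search concrete, here is a minimal sketch. It is not the code from the session above: the function names `build_suffix_array` and `find_prefix` are made up for illustration, and it assumes you only want the position of the first sorted suffix that starts with the query.

```python
def build_suffix_array(text):
    """Return the start index of every suffix of text, ordered by suffix."""
    # Sorting indexes instead of the suffix strings keeps the original positions.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_prefix(text, suffix_array, query):
    """Binary search for the first sorted suffix that starts with query.

    Returns its start index in text, or -1 if no suffix matches
    (the -1 convention is an assumption of this sketch).
    """
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        # Compare only as many characters as the query has.
        if text[start:start + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(suffix_array) and text.startswith(query, suffix_array[lo]):
        return suffix_array[lo]
    return -1


magic = "abracadabra"
magic_sa = build_suffix_array(magic)
print(magic_sa)                             # start indexes in sorted suffix order
print(find_prefix(magic, magic_sa, "cad"))  # position of "cad" in magic
```

Building the array this way is still slow for large inputs because the sort key copies every suffix. The point of the sketch is only that the search itself needs about log2(n) comparisons once the array exists.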
Back to the interview in Seattle. I was in a cold room being interviewed by C++ programmers for a job involving Java. You can guess that this was not going to be a fun interview, and I never thought I would actually get the job. I hadn't written any C++ in years, and the job was for Java, which I was an expert in at the time. The next interviewer came in and asked me, "How would you find a substring in a string?"
Awesome! I had been studying exactly this problem in my spare time. Of course I knew this one! I jumped up, went to the whiteboard, and explained to the guy how to make a suffix tree, how it improves search performance, how a modified heap sort makes it faster, how the suffix tree works, why it's better than a ternary search tree, and how to implement it in C. I figured that if I showed I knew how to do it in C, it would prove I wasn't just some Java code monkey.
The guy was stunned, as if I had just opened a bag of fresh durian in the interview room. He looked at the board and stammered, "Uh, well, I was looking for something about the Boyer-Moore search algorithm? Do you know that?" I frowned and said, "Yeah, like 10 years ago." He shook his head, picked up his things, stood up, and said, "Well, I'll let everyone know what I think."
A few minutes later, the next interviewer came in. He looked up at the whiteboard, laughed at me, and then asked me yet another C++ template metaprogramming question I couldn't answer. I didn't get the job.
Challenge Practice
In this exercise, you will take my little Python session and create your own suffix array search class. The class will take a string, split it into a list of suffixes, and then support the following operations:
find_shortest
Find the shortest substring that starts with the given string. In the example above, if I search for abra, it should return abra rather than abracadabra.
find_longest
Find the longest substring that starts with the given string. If I search for abra, it returns abracadabra.
find_all
Find all the substrings that start with the given string. This means a search for abra returns both abra and abracadabra.
You will need good automated tests for this, along with some performance measurements. We will use them in later exercises. When you're done, work through the research section below to complete the exercise.
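One way to approach the testing and measurement is to compare your class against a slow but obviously correct reference. The sketch below is only that reference, not a solution to the challenge: the function names are hypothetical, it scans the whole string instead of binary searching, and returning None when nothing matches is an assumption you are free to change.

```python
import timeit


def suffixes_starting_with(text, prefix):
    """Brute-force reference: every suffix of text that starts with prefix, sorted."""
    return sorted(text[i:] for i in range(len(text)) if text.startswith(prefix, i))


def reference_find_shortest(text, prefix):
    """Shortest matching suffix, or None if nothing matches (assumed convention)."""
    matches = suffixes_starting_with(text, prefix)
    return min(matches, key=len) if matches else None


def reference_find_longest(text, prefix):
    """Longest matching suffix, or None if nothing matches (assumed convention)."""
    matches = suffixes_starting_with(text, prefix)
    return max(matches, key=len) if matches else None


magic = "abracadabra"

# Expected results taken from the operation descriptions above.
assert reference_find_shortest(magic, "abra") == "abra"
assert reference_find_longest(magic, "abra") == "abracadabra"
assert suffixes_starting_with(magic, "abra") == ["abra", "abracadabra"]

# A baseline timing to compare your suffix array class against.
elapsed = timeit.timeit(lambda: suffixes_starting_with(magic, "abra"), number=10_000)
print(f"brute force: {elapsed:.4f}s for 10,000 searches")
```

Your own tests can call your class and these reference functions with the same inputs and assert that the results agree. The same `timeit` call, pointed at your class, gives you a number to compare against the baseline.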
Research Learning
- Once your tests are working, rewrite the class to use your BSTree to sort and search the suffixes. You can also use the value of each BSTreeNode to track where the substring occurs in the original string. You could then just keep the original string around.
- How does the BSTree change your code for the different search operations? Does it make them simpler or harder?
Further Study
Thoroughly study suffix arrays and their applications. They are very useful, but not well known by most programmers.