You can easily create a suffix array in Python

Source: Internet
Author: User

I want to tell you a story about the suffix array. A while back, I was interviewing at a company in Seattle, at a time when I was curious about the most efficient way to create a diff of executable binary files. My research led me to suffix arrays and suffix trees. A suffix array simply sorts all the suffixes of a string and stores them in an ordered list. A suffix tree is similar, but is more like a BSTree than a list. These algorithms are fairly simple, and once you have done the sort operation, they perform quickly. The problem they solved was finding the longest common substring between two strings (or, in this case, two lists of bytes).

You can easily create a suffix array in Python:

>>> magic = "abracadabra"
>>> magic_sa = []
>>> for i in range(0, len(magic)):
...     magic_sa.append(magic[i:])
...
>>> magic_sa
['abracadabra', 'bracadabra', 'racadabra', 'acadabra', 'cadabra', 'adabra', 'dabra', 'abra', 'bra', 'ra', 'a']
>>> magic_sa = sorted(magic_sa)
>>> magic_sa
['a', 'abra', 'abracadabra', 'acadabra', 'adabra', 'bra', 'bracadabra', 'cadabra', 'dabra', 'ra', 'racadabra']
>>>

As you can see, I just take each suffix of the string in order, and then sort the list. But what does this do for me? Once I have this list, I can find any suffix I want by binary searching it. This example is very primitive, but in real code you can do this quickly, and you can track the original index of every suffix so that you can refer back to where it sits in the original string. It is very fast compared to other search algorithms and is useful for things like DNA analysis.
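The index tracking and binary search described above could look something like the following. This is only a minimal sketch, not the author's code: the function names build_suffix_array and find_positions are made up for illustration, and it stores every suffix in full, which is fine for short strings but wasteful for large inputs.

```python
from bisect import bisect_left


def build_suffix_array(text):
    """Return a sorted list of (suffix, start_index) pairs for text."""
    return sorted((text[i:], i) for i in range(len(text)))


def find_positions(suffix_array, pattern):
    """Return the sorted start indexes where pattern occurs in the text."""
    # Binary search for the first entry that is >= pattern.
    # The 1-tuple (pattern,) sorts before any (pattern, index) pair.
    start = bisect_left(suffix_array, (pattern,))
    positions = []
    # All suffixes that begin with pattern form one contiguous run.
    for suffix, index in suffix_array[start:]:
        if not suffix.startswith(pattern):
            break
        positions.append(index)
    return sorted(positions)


magic_sa = build_suffix_array("abracadabra")
print(find_positions(magic_sa, "bra"))  # [1, 8]
print(find_positions(magic_sa, "xyz"))  # []
```

Because every substring of a string is a prefix of one of its suffixes, finding the suffixes that start with a pattern is the same as finding every place the pattern occurs.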

Back to the interview in Seattle. I was being interviewed by C++ programmers in a cold room for a job in Java. As you can guess, it was not a very fun interview, and I never thought I would get the job. I hadn't written any C++ in years, and the job was for Java, which I was an expert in at the time. The next interviewer came in and asked me, "How would you find a substring in a string?"

That's great! I had been studying exactly this problem in my spare time. Of course I knew! I jumped up, went to the whiteboard, and explained to the guy how to make a suffix tree, how it improves search performance, how a modified heap sort is faster, how the suffix tree works, why it's better than a ternary search tree, and how to implement it in C. I figured that if I showed I could write it in C, it would prove I wasn't just some Java code monkey.

The guy was shocked, as if I had just opened a bag of fresh durian in the interview room. He looked at the board and stammered, "Well, I was looking for something about the Boyer-Moore search algorithm. Do you know it?" I frowned and said, "Yes, from about 10 years ago." He shook his head, gathered his things, got up, and said, "Well, I'll let everyone know what I think."

A few minutes later, the next interviewer came in. He looked up at the whiteboard, snickered at me, and then asked me another C++ template metaprogramming question I couldn't answer. I didn't get the job.

Challenge Practice

In this exercise, you will take my little Python session and create your own suffix array search class. The class will take a string, split it into a list of suffixes, and then support the following operations:

find_shortest

Find the shortest substring that starts with the search string. In the example above, if I search for abra, it should return abra rather than abracadabra.

find_longest

Find the longest substring that starts with the search string. If I search for abra, it should return abracadabra.

find_all

Find all the substrings that start with the search string. This means that searching for abra returns both abra and abracadabra.
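One possible shape for such a class is sketched below. Treat it as a starting point under assumptions rather than the intended solution: the class name SuffixArray, the choice to return None when nothing matches, and the use of the standard bisect module are all my own choices and are not specified by the exercise.

```python
from bisect import bisect_left


class SuffixArray:
    """Hypothetical suffix array search class for the three operations."""

    def __init__(self, text):
        self.text = text
        # Keep (suffix, start_index) pairs so matches map back to the text.
        self.suffixes = sorted((text[i:], i) for i in range(len(text)))

    def find_all(self, prefix):
        """Return every suffix that starts with prefix, in sorted order."""
        # (prefix,) sorts before any (prefix, index) pair, so this lands
        # on the first suffix that could start with prefix.
        start = bisect_left(self.suffixes, (prefix,))
        matches = []
        for suffix, _ in self.suffixes[start:]:
            if not suffix.startswith(prefix):
                break
            matches.append(suffix)
        return matches

    def find_shortest(self, prefix):
        """Return the shortest matching suffix, or None if there is none."""
        matches = self.find_all(prefix)
        return min(matches, key=len) if matches else None

    def find_longest(self, prefix):
        """Return the longest matching suffix, or None if there is none."""
        matches = self.find_all(prefix)
        return max(matches, key=len) if matches else None


sa = SuffixArray("abracadabra")
print(sa.find_shortest("abra"))  # abra
print(sa.find_longest("abra"))   # abracadabra
print(sa.find_all("abra"))       # ['abra', 'abracadabra']
```

Note that find_shortest uses min by length instead of taking the first match: in sorted order the first matching suffix is not always the shortest (for "aab" and the search string "a", the suffix "aab" sorts before "ab").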

You will need to write good automated tests for this and take some performance measurements. We will use them in future exercises. When you're done, work through the Research Learning items below to complete this exercise.
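As a rough idea of what a performance measurement might look like, here is a small sketch using the standard timeit module. The random DNA-like test string, the helper names linear_search and binary_search, and the repetition count are all assumptions chosen for illustration; your own measurements should target your own class.

```python
import random
import timeit
from bisect import bisect_left

random.seed(0)  # fixed seed so the test data is the same on every run
text = "".join(random.choice("acgt") for _ in range(2000))
suffixes = sorted(text[i:] for i in range(len(text)))
pattern = text[1000:1012]  # a pattern that is known to occur in text


def linear_search(pattern):
    """Check every suffix in turn."""
    return [s for s in suffixes if s.startswith(pattern)]


def binary_search(pattern):
    """Jump to the first candidate with bisect, then collect the run."""
    start = bisect_left(suffixes, pattern)
    end = start
    while end < len(suffixes) and suffixes[end].startswith(pattern):
        end += 1
    return suffixes[start:end]


# Both approaches must agree before their timings mean anything.
assert linear_search(pattern) == binary_search(pattern)

for func in (linear_search, binary_search):
    seconds = timeit.timeit(lambda: func(pattern), number=200)
    print(f"{func.__name__}: {seconds:.4f}s for 200 searches")
```

The exact timings depend on your machine, so compare the two numbers against each other and not against any fixed value.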

Research Learning

    • Once your tests are working, rewrite the class to use your BSTree for sorting and searching the suffixes. You can also use the value of each BSTreeNode to track where the substring exists in the original string. Then you can keep the original string around.
    • How does the BSTree change your code for the different search operations? Does it make them simpler or harder?

Further Study

Thoroughly study suffix arrays and their applications. They are very useful, but not well known by most programmers.
