The KMP Algorithm and a Classic Probability Problem

Last Update:2018-12-04 Source: Internet

Author: User


Consider an event with two equally likely outcomes. For example, when a coin is tossed, heads and tails are equally likely. Now we want to know how long it takes to obtain a specific sequence if I keep tossing the coin.

Sequence 1: tails, heads, tails
Sequence 2: tails, heads, heads

First, I toss a coin repeatedly until the last three results form sequence 1, and I write down how many tosses it took. Repeating this process, I can compute the average number of tosses needed to get sequence 1. Likewise, tossing a coin repeatedly until sequence 2 appears also takes some average number of tosses. Which of the two averages do you think is larger? In other words, does sequence 1 or sequence 2 take fewer tosses on average?
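The experiment described above can be sketched as a small simulation. This is only an illustration: the function names, the number of trials, and the fixed seed are our own choices, and the sequences are written with T for tails and H for heads.

```python
import random

def tosses_until(pattern, rng):
    """Toss a fair coin until the most recent tosses spell `pattern`.

    Returns the total number of tosses used in this round.
    """
    recent = ""
    count = 0
    while not recent.endswith(pattern):
        # Keep only as many recent tosses as the pattern is long.
        recent = (recent + rng.choice("HT"))[-len(pattern):]
        count += 1
    return count

def average_tosses(pattern, trials=20000, seed=0):
    """Estimate the average number of tosses over many independent rounds."""
    rng = random.Random(seed)
    return sum(tosses_until(pattern, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("sequence 1 (THT):", average_tosses("THT"))
    print("sequence 2 (THH):", average_tosses("THH"))
```

With enough trials, the estimate for sequence 2 should come out noticeably smaller than the one for sequence 1.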

Most people will think that the two sequences appear equally fast, because among the eight possible triples of heads and tails, "tails, heads, tails" (THT, writing T for tails and H for heads) and "tails, heads, heads" (THH) each account for 1/8, so their probabilities are equal. In fact, we will see that it takes fewer tosses to produce sequence 2. Consider the following question: among the n-bit 0/1 sequences made up of heads and tails, how many end with sequence 1 without sequence 1 having appeared earlier? How many end with sequence 2 without sequence 2 having appeared earlier? When n is small, the two answers are the same (for example, for n = 3 there is exactly one such sequence in each case), but as n grows the two counts move apart: for moderate n, the latter is larger than the former. Let's take a look at n = 6. For sequence 1, only the following five sequences meet the requirement:

TTTTHT
THHTHT
HHHTHT
HTTTHT
HHTTHT

But for sequence 2, there are seven conforming sequences:

TTTTHH
THTTHH
TTHTHH
HTTTHH
HHTTHH
HHHTHH
HTHTHH

You can write a program to enumerate the sequences and compute the counts for other values of n. For small n the result agrees with what we just saw: among the n-bit 0/1 sequences, the number that end with sequence 2 without containing it earlier is no less than the number that end with sequence 1 without containing it earlier (by direct enumeration this holds up to n = 11; from n = 12 on the order reverses). In other words, for small n the probability that sequence 2 first appears exactly at the nth toss is no less than the corresponding probability for sequence 1. Both probabilities tend to decrease as n grows, and each set of probabilities must sum to 1 over all n, so sequence 2 has more of its probability concentrated at small n, while sequence 1 is left with a heavier tail at large n. Therefore, overall, the expected number of tosses needed to produce sequence 2 is smaller.
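A minimal sketch of such an enumeration is shown here. The function name `first_hit_sequences` is ours, and brute force over all 2^n sequences is only practical for small n.

```python
from itertools import product

def first_hit_sequences(pattern, n):
    """Return all length-n strings over T/H that end with `pattern`
    and do not contain `pattern` anywhere earlier."""
    result = []
    for tosses in product("TH", repeat=n):
        s = "".join(tosses)
        # str.find gives the FIRST occurrence; it must be the one at the end.
        if s.find(pattern) == n - len(pattern):
            result.append(s)
    return result

if __name__ == "__main__":
    for n in range(3, 13):
        ones = len(first_hit_sequences("THT", n))
        twos = len(first_hit_sequences("THH", n))
        print(n, ones, twos)
```

For n = 6 this yields 5 sequences for sequence 1 and 7 for sequence 2, matching the lists given earlier.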
Although a series of observations supports this conclusion, and we believe it is correct (even without a rigorous proof), it is still hard to accept. It runs against our intuition and does not match everyday experience. At this point we badly need an explanation of what causes this unexpected anomaly.

Unless you run a few experiments yourself, this subtle gap is hard to appreciate. If we think through how the game actually plays out, the "tails, heads, heads" sequence clearly tends to appear earlier. Suppose that at some point we have just obtained "tails, heads". If we are waiting for "tails, heads, tails", then a tail on the next toss ends the round, while a head wastes all the progress and we must start again from scratch. If we are waiting for "tails, heads, heads", then a head on the next toss ends the round, while a tail at least does not send us all the way back to zero: it already gives us one tail as a new start, and we only need two more heads. Seen this way, "tails, heads, heads" is more likely to turn up early.
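This argument can be turned into a short calculation, given here as a sketch. The symbols E_0, E_1, E_2 are our own notation, not the original article's: E_j is the expected number of further tosses when the most recent tosses match the first j results of the target sequence. Each toss costs 1 and moves to one of two states with probability 1/2 each.

```latex
% Sequence 1: tails, heads, tails
\begin{aligned}
E_0 &= 1 + \tfrac12 E_0 + \tfrac12 E_1, \\
E_1 &= 1 + \tfrac12 E_1 + \tfrac12 E_2, \\
E_2 &= 1 + \tfrac12 \cdot 0 + \tfrac12 E_0
\end{aligned}
\quad\Longrightarrow\quad E_2 = 6,\; E_1 = 8,\; E_0 = 10.

% Sequence 2: tails, heads, heads
\begin{aligned}
E_0 &= 1 + \tfrac12 E_0 + \tfrac12 E_1, \\
E_1 &= 1 + \tfrac12 E_1 + \tfrac12 E_2, \\
E_2 &= 1 + \tfrac12 \cdot 0 + \tfrac12 E_1
\end{aligned}
\quad\Longrightarrow\quad E_2 = 4,\; E_1 = 6,\; E_0 = 8.
```

The two systems differ only in the last line: a miss after "tails, heads" sends sequence 1 back to E_0 but sequence 2 only back to E_1.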
Readers who know the KMP algorithm and think this over will suddenly realize that this is exactly the basic idea of KMP. Consider the following problem: in a given text, we look for the first position where the substring "tails, heads, heads" appears. Suppose we have already matched the first two characters of the pattern, "tails, heads". If the next character of the text is "heads", the match succeeds; if the next character of the text is "tails", the current match position in the pattern falls back to the first character. Consider a more complex example: we want to find the substring abbaba in the text, and we have already found abbab. If the next character of the text is a, the match succeeds. If the next character of the text is b, the furthest position the pattern can match falls back to the third character: we only need to continue matching from abb instead of starting over from scratch.
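The fallback behaviour described here can be sketched with the standard KMP failure function. The names `build_automaton` and `fail` are illustrative, and storing one transition per matched length and character is one possible layout, not necessarily the one the original program used.

```python
def build_automaton(pattern, alphabet):
    """Return c where c[i][ch] is the number of pattern characters matched
    after reading ch, given that i characters were matched before."""
    m = len(pattern)
    # fail[i]: length of the longest proper prefix of pattern[:i]
    # that is also a suffix of pattern[:i].
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k

    c = []
    for i in range(m + 1):
        row = {}
        for ch in alphabet:
            j = i
            # Fall back until ch can extend the current partial match.
            while j > 0 and (j == m or pattern[j] != ch):
                j = fail[j]
            if j < m and pattern[j] == ch:
                j += 1
            row[ch] = j
        c.append(row)
    return c

if __name__ == "__main__":
    c = build_automaton("abbaba", "ab")
    print(c[5]["a"], c[5]["b"])
```

For the pattern abbaba with five characters matched, reading a completes the match (6) and reading b falls back to three matched characters, as in the example.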
We can use the KMP algorithm to solve the problem above neatly. First, precompute an array c: c[i, 0] is the position the pattern's match falls back to when the pattern has been matched up to its ith character and the next character of the text is 0 (tails); similarly, c[i, 1] is the new match position when the pattern has been matched up to its ith character and the next character of the text is 1 (heads). Let f[i, j] be the number of ways in which, after i tosses, exactly the first j characters of the pattern are matched. Then f[i, j] = Σ f[i-1, k] + Σ f[i-1, l], where k ranges over the positions with c[k, 0] = j and l over the positions with c[l, 1] = j. Dividing f[i, j] by 2^i gives the corresponding probability. Or, more directly, let P[i, j] be the probability that, after the ith toss, the furthest matched position in the pattern is j; then P[i, j] = Σ (P[i-1, k] / 2) + Σ (P[i-1, l] / 2), with k and l as before. Note that we also need a special value P[i, *], the probability that the pattern has already been matched before the ith character of the text, so that each column of the table below sums to 1.
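The recurrence for P[i, j] can be sketched as follows. This is not the original program: `next_state` and `match_probabilities` are hypothetical names, and the transition is computed by a direct suffix check rather than a precomputed array c, which gives the same positions for short patterns.

```python
def next_state(pattern, j, ch):
    """Matched length after reading ch with j characters matched:
    the longest prefix of pattern that is a suffix of pattern[:j] + ch."""
    s = pattern[:j] + ch
    for k in range(min(len(s), len(pattern)), -1, -1):
        if s.endswith(pattern[:k]):
            return k

def match_probabilities(pattern, n, alphabet="AB"):
    """Return one (P, found) pair per text position 1..n.

    P[j] is the probability that exactly j characters are matched at this
    position (P[m] means the first full match ends here); found is P[i, *],
    the probability that a full match occurred at an earlier position.
    """
    m = len(pattern)
    P = [1.0] + [0.0] * m
    found = 0.0
    columns = []
    for _ in range(n):
        new = [0.0] * (m + 1)
        for j in range(m):  # a completed match (j == m) is not extended
            for ch in alphabet:
                new[next_state(pattern, j, ch)] += P[j] / len(alphabet)
        found += P[m]       # earlier completions accumulate in P[i, *]
        P = new
        columns.append((P, found))
    return columns

if __name__ == "__main__":
    for pattern in ("ABA", "ABB"):
        print(pattern)
        for i, (P, found) in enumerate(match_probabilities(pattern, 10), 1):
            print(i, " ".join(f"{p:.4f}" for p in P), f"{found:.4f}")
```

Each printed line is one column of the table: the matched-length probabilities followed by the already-found probability, which together sum to 1.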

Let's take a look at the output results of the program:
Pattern 1: s[] = "ABA" (A = tails, B = heads)
Text position i     1       2       3       4       5       6       7       8       9       10
Matched to s[0]     0.5000  0.2500  0.2500  0.2500  0.2188  0.1875  0.1641  0.1445  0.1270  0.1113
Matched to s[1]     0.5000  0.5000  0.3750  0.3125  0.2813  0.2500  0.2188  0.1914  0.1680  0.1475
Matched to s[2]     0.0000  0.2500  0.2500  0.1875  0.1563  0.1406  0.1250  0.1094  0.0957  0.0840
Matched to s[3]     0.0000  0.0000  0.1250  0.1250  0.0938  0.0781  0.0703  0.0625  0.0547  0.0479
Already found       0.0000  0.0000  0.0000  0.1250  0.2500  0.3438  0.4219  0.4922  0.5547  0.6094

Pattern 2: s[] = "ABB" (A = tails, B = heads)
Text position i     1       2       3       4       5       6       7       8       9       10
Matched to s[0]     0.5000  0.2500  0.1250  0.0625  0.0313  0.0156  0.0078  0.0039  0.0020  0.0010
Matched to s[1]     0.5000  0.5000  0.5000  0.4375  0.3750  0.3125  0.2578  0.2109  0.1719  0.1396
Matched to s[2]     0.0000  0.2500  0.2500  0.2500  0.2188  0.1875  0.1563  0.1289  0.1055  0.0859
Matched to s[3]     0.0000  0.0000  0.1250  0.1250  0.1250  0.1094  0.0938  0.0781  0.0645  0.0527
Already found       0.0000  0.0000  0.0000  0.1250  0.2500  0.3750  0.4844  0.5781  0.6563  0.7207

Now we can clearly see that sequence 2 is much more likely to appear early. Note that, by our definition of the probabilities, the numbers in each column of the table sum to 1. The numbers in the second-to-last row (an infinite series) should also sum to 1: the last row is the running total of the second-to-last row, and by the definition of the last row, the probability that a match has been found in an infinitely long text is 1. We can therefore read the second-to-last row as the probability that the pattern matches for the first time at position i of the text. From this we can compute the expected number of tosses.
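As a rough sketch of that last step, the expectation is the sum of i times the first-match probability at position i. The code below truncates the infinite sum once the unmatched probability mass drops below a tolerance; the names `next_state` and `expected_tosses` and the tolerance value are our own assumptions.

```python
def next_state(pattern, j, ch):
    """Longest prefix of pattern that is a suffix of pattern[:j] + ch."""
    s = pattern[:j] + ch
    for k in range(min(len(s), len(pattern)), -1, -1):
        if s.endswith(pattern[:k]):
            return k

def expected_tosses(pattern, alphabet="AB", tol=1e-12):
    """Approximate sum over i of i * Pr(first match ends at position i)."""
    m = len(pattern)
    P = [1.0] + [0.0] * (m - 1)  # distribution over matched lengths 0..m-1
    expected = 0.0
    i = 0
    while sum(P) > tol:          # stop when almost all mass has matched
        i += 1
        new = [0.0] * m
        hit = 0.0                # probability of a first match exactly at i
        for j, pj in enumerate(P):
            for ch in alphabet:
                k = next_state(pattern, j, ch)
                if k == m:
                    hit += pj / len(alphabet)
                else:
                    new[k] += pj / len(alphabet)
        expected += i * hit
        P = new
    return expected

if __name__ == "__main__":
    print("ABA:", expected_tosses("ABA"))
    print("ABB:", expected_tosses("ABB"))
```

Under these assumptions the sums converge to about 10 tosses for pattern 1 (ABA) and about 8 for pattern 2 (ABB).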

Original article by Matrix67
Please credit the source when reposting

This article is an English version of an article originally published in Chinese on aliyun.com.
