All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return: ["AAAAACCCCC", "CCCCCAAAAA"].
Problem: given a string that represents a DNA sequence, find every substring of length 10 that occurs more than once.
The repeated substrings in the problem's example do not overlap, but overlapping occurrences count as well. For example, the 11-letter string "AAAAAAAAAAA" contains two overlapping occurrences of the 10-letter substring "AAAAAAAAAA", so that substring is a repeat. The problem statement does not make this clear.
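To make the overlapping case concrete, here is a minimal sketch that lists every 10-letter window of a string. The helper name tenLetterWindows is hypothetical and used only for illustration; it is not part of the original solution.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): returns every contiguous
// 10-letter window of s, in order of its starting index.
std::vector<std::string> tenLetterWindows(const std::string& s) {
    const std::size_t len = 10;
    std::vector<std::string> windows;
    // i + len <= s.size() keeps the window inside the string and
    // yields no windows at all when s is shorter than 10 letters.
    for (std::size_t i = 0; i + len <= s.size(); ++i) {
        windows.push_back(s.substr(i, len));
    }
    return windows;
}
```

Called on the 11-letter string "AAAAAAAAAAA", this returns two windows, starting at indices 0 and 1, and both are equal to "AAAAAAAAAA".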
With the problem clarified, the idea is fairly simple to implement:
- Put every contiguous substring of length 10 in s into a Map<string, int> ss_cnt that records how many times each substring occurs.
- Treat [0, 9] as a window and subtract 1 from the window string's count in ss_cnt. If the count is still greater than 0, the same string also occurs somewhere else, so the window string is a repeat. Then add 1 back to restore the count.
- Move the window one position to the right and repeat the second step until the window reaches the right end of the string.
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>
    using namespace std;

    /**
     * Repeated substrings can overlap.
     */
    vector<string> findRepeatedDnaSequences(string s) {
        unordered_set<string> res;
        unordered_map<string, int> ss_cnt;
        const int len = 10;

        // Step 1: count every contiguous substring of length 10.
        for (int i = 0; i + len - 1 < (int)s.size(); i++) {
            string str = s.substr(i, len);
            ss_cnt[str]++;
        }

        // Steps 2 and 3: slide the window from left to right.
        int i = 0;
        while (i + len - 1 < (int)s.size()) {
            string cur = s.substr(i, len);
            ss_cnt[cur]--;              // set aside the current occurrence
            if (ss_cnt[cur] > 0) {      // another occurrence exists
                res.insert(cur);
            }
            ss_cnt[cur]++;              // restore the count
            i++;
        }

        // Copy the distinct repeated substrings into the result vector.
        vector<string> result;
        unordered_set<string>::iterator s_iter;
        for (s_iter = res.begin(); s_iter != res.end(); s_iter++) {
            result.push_back(*s_iter);
        }
        return result;
    }
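
As an alternative to the counting approach, the same result can probably be obtained in a single pass with two sets: one for windows seen so far and one for windows seen at least twice. This is a sketch of a different technique, not the method used in this write-up, and the function name findRepeatedSinglePass is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical single-pass variant (illustration only).
// Overlapping occurrences are counted, as in the approach above.
std::vector<std::string> findRepeatedSinglePass(const std::string& s) {
    const std::size_t len = 10;
    std::unordered_set<std::string> seen;      // windows met at least once
    std::unordered_set<std::string> repeated;  // windows met at least twice
    for (std::size_t i = 0; i + len <= s.size(); ++i) {
        std::string cur = s.substr(i, len);
        // insert().second is false when cur was already in the set.
        if (!seen.insert(cur).second) {
            repeated.insert(cur);
        }
    }
    // The order of the returned strings is unspecified.
    return std::vector<std::string>(repeated.begin(), repeated.end());
}
```

Because the result comes from an unordered_set, callers that need a stable order should sort the returned vector.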
[LeetCode] 187. Repeated DNA Sequences: solution approach