All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return: ["AAAAACCCCC", "CCCCCAAAAA"].
A very good idea: encode the letters as bits so that the problem reduces to bit operations.
Algorithm analysis
First, consider a binary encoding of A, C, G, and T:
A-00
C-01
G-10
T-11
With this encoding, every 10-character string corresponds to a number: 10 characters at 2 bits each take 20 bits. An int is generally 4 bytes (32 bits), so a single int can hold one 10-character string. For example:
ACGTACGTAC-00011011000110110001
AAAAAAAAAA-00000000000000000000
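As a rough illustration of the packing described above, the following sketch turns a string into its number using the A-00, C-01, G-10, T-11 table. The names encodeNucleotide and packSequence are made up for this example and are not part of the original solution. Treating any unexpected character as A is also an assumption of the sketch.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: 2-bit code from the table (A-00, C-01, G-10, T-11).
// Any other character is treated as A, which is an assumption of this sketch.
std::uint32_t encodeNucleotide(char c) {
    switch (c) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;  // 'A'
    }
}

// Hypothetical helper: packs up to 16 letters into one 32-bit value,
// with the first letter ending up in the highest used bits.
std::uint32_t packSequence(const std::string& seq) {
    std::uint32_t value = 0;
    for (char c : seq)
        value = (value << 2) | encodeNucleotide(c);
    return value;
}
```

Under these assumptions, packSequence("ACGTACGTAC") gives 0x1B1B1, which is 00011011000110110001 in binary, and packSequence("AAAAAAAAAA") gives 0.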
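A 20-bit binary number has at most 2^20 possible values, so the hash table needs 2^20 entries, that is 1024 * 1024, and can be declared as bool hashtable[1024 * 1024]; The code below stores int counters instead of bool flags, so that a sequence occurring three or more times is still reported only once.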
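To make the table idea concrete, here is a minimal sketch of a flag-per-key table combined with a sliding 20-bit key. It only answers whether any 10-letter sequence repeats; it does not list the sequences. The names encodeNucleotide and hasRepeatedSequence are hypothetical, and std::vector<bool> stands in for the plain bool array purely as a convenience.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: 2-bit code (A-00, C-01, G-10, T-11); other characters count as A.
std::uint32_t encodeNucleotide(char c) {
    switch (c) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;  // 'A'
    }
}

// Returns true if some 10-letter sequence occurs more than once in s.
bool hasRepeatedSequence(const std::string& s) {
    const std::uint32_t mask = (1u << 20) - 1;  // keeps the low 20 bits (0xFFFFF)
    std::vector<bool> seen(1u << 20, false);    // one flag per possible key
    std::uint32_t key = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Shift in the new letter; the mask drops the letter that left the window.
        key = ((key << 2) | encodeNucleotide(s[i])) & mask;
        if (i < 9) continue;                    // window is not full yet
        if (seen[key]) return true;             // this key was produced before
        seen[key] = true;
    }
    return false;
}
```

Each step costs one shift, one OR, and one AND, so the whole string is scanned in a single pass without comparing substrings.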
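#include <string>
#include <vector>
using namespace std;

vector<string> findRepeatedDnaSequences(string s) {
    // One counter per possible 20-bit key. Allocated on the heap, because
    // 2^20 ints (4 MB) as a local array can overflow the stack.
    vector<int> hashmap(1048576, 0);
    vector<string> ans;
    int len = s.size(), hashNum = 0;
    if (len < 11) return ans;  // too short to contain two 10-letter windows
    // Preload the first 9 letters; (c - 'A' + 1) % 5 gives each of
    // A, C, G, T its own 2-bit code.
    for (int i = 0; i < 9; ++i)
        hashNum = hashNum << 2 | (s[i] - 'A' + 1) % 5;
    for (int i = 9; i < len; ++i) {
        // Shift in the next letter and keep only the low 20 bits.
        hashNum = (hashNum << 2 | (s[i] - 'A' + 1) % 5) & 0xfffff;
        // Report a sequence exactly once, on its second occurrence.
        if (hashmap[hashNum]++ == 1)
            ans.push_back(s.substr(i - 9, 10));
    }
    return ans;
}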
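The expression (s[i] - 'A' + 1) % 5 in the code is worth unpacking. Assuming ASCII and input made only of uppercase A, C, G, and T, it yields A=1, C=3, G=2, T=0. These are four distinct 2-bit codes, though not the same assignment as the A-00, C-01, G-10, T-11 table. The function name codeByModulo below is invented for this illustration.

```cpp
// Hypothetical helper isolating the arithmetic trick.
// Assumes ASCII and that c is one of 'A', 'C', 'G', 'T'.
int codeByModulo(char c) {
    // 'A' -> 1 % 5 = 1, 'C' -> 3 % 5 = 3, 'G' -> 7 % 5 = 2, 'T' -> 20 % 5 = 0
    return (c - 'A' + 1) % 5;
}
```

Because the four values are distinct and each fits in two bits, the choice of assignment does not change which sequences are detected as repeats.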
This LeetCode solution to Repeated DNA Sequences is very enjoyable to read!