Question
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", return: ["AAAAACCCCC", "CCCCCAAAAA"].
Solution--Bit manipulation
The original idea was to use a set to store each substring. The time complexity is O(n) and the space cost is O(n). Looking at the space cost in detail, though, a char is 2 bytes, so we need 20 bytes to store each 10-letter substring, and therefore roughly 20n bytes in total.
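For comparison, the set-of-substrings idea can be sketched as follows. This is only an illustration of that first approach, not the solution used below; the class name NaiveRepeatedDna and the method name findRepeated are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class name, used only for this sketch.
class NaiveRepeatedDna {
    static List<String> findRepeated(String s) {
        Set<String> seen = new HashSet<>();      // every 10-letter window seen so far
        Set<String> reported = new HashSet<>();  // windows already added to the result
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            // Set.add returns false when the element was already present
            if (!seen.add(window) && reported.add(window)) {
                result.add(window);
            }
        }
        return result;
    }
}
```

Every window is stored as a full String here, which is where the 20 bytes per substring come from.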
If we represent each DNA substring as an integer instead, the space is cut to roughly 4n bytes, since an int is 4 bytes.
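To make the integer representation concrete, here is a minimal sketch of how a 10-letter sequence could be packed into an int, assuming the 2-bit codes A=0, C=1, G=2, T=3. The class name DnaEncoding and the helpers code and encode are hypothetical names introduced for illustration.

```java
// Hypothetical helper class, used only for this sketch.
class DnaEncoding {
    // Map a nucleotide to its 2-bit code: A=0, C=1, G=2, T=3 (assumed mapping)
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Pack a sequence of up to 10 letters into the low 20 bits of an int,
    // 2 bits per letter, first letter in the highest position.
    static int encode(String letters) {
        int hash = 0;
        for (int i = 0; i < letters.length(); i++) {
            hash = (hash << 2) | code(letters.charAt(i));
        }
        return hash;
    }
}
```

Ten letters at 2 bits each need 20 bits, so every 10-letter sequence maps to a distinct value that fits comfortably in a 32-bit int. For instance, encode("AAAAACCCCC") gives 341 (binary 0101010101) under this mapping.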
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public List<String> findRepeatedDnaSequences(String s) {
    List<String> result = new ArrayList<String>();

    int len = s.length();
    if (len < 10) {
        return result;
    }

    // 2-bit code for each nucleotide
    Map<Character, Integer> map = new HashMap<Character, Integer>();
    map.put('A', 0);
    map.put('C', 1);
    map.put('G', 2);
    map.put('T', 3);

    Set<Integer> temp = new HashSet<Integer>();   // hashes seen so far
    Set<Integer> added = new HashSet<Integer>();  // hashes already in the result

    int hash = 0;
    for (int i = 0; i < len; i++) {
        if (i < 9) {
            // Each of A, C, G, T fits in 2 bits, so left shift by 2
            hash = (hash << 2) + map.get(s.charAt(i));
        } else {
            hash = (hash << 2) + map.get(s.charAt(i));
            // Keep only the low 20 bits (10 letters x 2 bits)
            hash = hash & ((1 << 20) - 1);

            if (temp.contains(hash) && !added.contains(hash)) {
                result.add(s.substring(i - 9, i + 1));
                added.add(hash); // track what has been added
            } else {
                temp.add(hash);
            }
        }
    }

    return result;
}