Tuesday, August 18, 2015

Repeated DNA Sequences | Leetcode

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].
---

Solution: Go through the string one character at a time. For each position i, take the current 10-character substring starting at position i. Maintain a map to remember whether we have seen the string (seen.get(t) == true), or we have already added it to our answer (seen.get(t) == false). If we have not seen the string yet, just add it to the map. The encode() method to convert a string into an integer is just to save memory to pass leetcode's submission.

The solution given here is somewhat optimal in memory use, but not optimal in speed. We could have chosen to update the last encode() value given the old lost character and the newly-gained character, for each next 10-character string. But that would make the code much more complicated and not easily readable. Therefore I have favored readability over performance in the code below, since it passes leetcode's submission anyway.
public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Map<Integer, Boolean> seen = new HashMap<>();
        List<String> answer = new ArrayList<>();
        
        for (int i = 0, last = s.length() - 10; i <= last; i++) {
            String t = s.substring(i, i + 10);
            int code = encode(t);
            
            if (seen.containsKey(code)) {
                if (seen.get(code)) {
                    answer.add(t);
                    seen.put(code, false);
                }
            } else {
                seen.put(code, true);
            }
        }
        
        return answer;
    }
    
    private int encode(String s) {
        int code = 0;
        for (int i = 0; i < s.length(); i++) {
            switch (s.charAt(i)) {
                case 'A': code++;
                case 'C': code++;
                case 'G': code++;
                case 'T': code++;
            }
            code *= 4;
        }
        return code;
    }
}

No comments: