Saving Bits

A biologist wants to encode a large genetic sequence consisting of the letters C, G, A, and T.

In his data, C and G were the most common letters, so he chose the encoding C -> 1, G -> 0, A -> 00, T -> 01. For example, CCGT is encoded as 11001.

Later, the biologist noticed 11001 in his codebase. He concluded that it represented CCGT. Is he correct?

The following code allows you encode a DNA string. Try running the code, and then change the string in line 12 to see if you can find other strings with the same encoding!

encoding = {'C' : '1', 'G' : '0', 'A' : '00', 'T': '01'}

def encode(sequence):
    #Create an empty string for the encoding of the sequence
    encoded_string = ""
    
    #For each letter, append its encoding to the encoded string
    for letter in range(len(sequence)):
        encoded_string += encoding[sequence[letter]]
    return encoded_string

print(encode('CCGT'))
Python 3


Bonus 1: Can you describe a better encoding scheme?

Bonus 2: Can you write code that returns all possible genetic sequences from a given encoding?

×

Problem Loading...

Note Loading...

Set Loading...