# Saving Bits

A biologist wanted to encode a large genetic sequence consisting of the letters C, G, A, and T.

In his data, C and G were the most common letters, so he chose the encoding C -> 1, G -> 0, A -> 00, T -> 01. For example, CCGT is encoded as 11001.

Later, the biologist noticed 11001 in his codebase. He concluded that it represented CCGT. Is he correct?

The following code lets you encode a DNA string. Try running it, then change the string in line 12 (the `print` call) to see if you can find other strings with the same encoding!

```python
encoding = {'C': '1', 'G': '0', 'A': '00', 'T': '01'}

def encode(sequence):
    # Create an empty string for the encoding of the sequence
    encoded_string = ""

    # For each letter, append its encoding to the encoded string
    for letter in sequence:
        encoded_string += encoding[letter]
    return encoded_string

print(encode('CCGT'))
```

Bonus 1: Can you describe a better encoding scheme?
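One possible direction, offered as a sketch rather than the intended answer: use a prefix-free code, in which no codeword is the beginning of another, so a bit string can be read back in only one way. The table `prefix_encoding` and the function names below are hypothetical, and the codewords assume C is the most common letter, followed by G.

```python
# Hypothetical prefix-free code: no codeword is the start of another one.
prefix_encoding = {'C': '0', 'G': '10', 'A': '110', 'T': '111'}

def encode_prefix(sequence):
    # Concatenate the codeword of each letter
    return ''.join(prefix_encoding[letter] for letter in sequence)

def decode_prefix(bits):
    # Map each codeword back to its letter
    reverse = {code: letter for letter, code in prefix_encoding.items()}
    decoded = []
    current = ''
    for bit in bits:
        current += bit
        # Because the code is prefix-free, the first match is the only match
        if current in reverse:
            decoded.append(reverse[current])
            current = ''
    if current:
        raise ValueError('leftover bits do not form a codeword')
    return ''.join(decoded)

print(encode_prefix('CCGT'))                 # 0010111
print(decode_prefix(encode_prefix('CCGT')))  # CCGT
```

This costs more bits for A and T than the biologist's scheme, but every encoded string now decodes to exactly one sequence. A fixed two-bit code (for example 00, 01, 10, 11) would also be unambiguous.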

Bonus 2: Can you write code that returns all possible genetic sequences from a given encoding?
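Here is a minimal sketch of one way to approach this, assuming the biologist's original encoding table. The function name `decode_all` is hypothetical. It tries every codeword that matches the start of the bit string and then recursively decodes whatever remains.

```python
encoding = {'C': '1', 'G': '0', 'A': '00', 'T': '01'}

def decode_all(bits):
    # An empty bit string has exactly one decoding: the empty sequence
    if bits == '':
        return ['']
    results = []
    for letter, code in encoding.items():
        # If this codeword matches the front, decode the rest recursively
        if bits.startswith(code):
            for rest in decode_all(bits[len(code):]):
                results.append(letter + rest)
    return results

print(sorted(decode_all('11001')))  # ['CCAC', 'CCGGC', 'CCGT']
```

The number of decodings can grow very quickly with the length of the input, so this sketch is only practical for short bit strings.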
