### Computational Biology

Forensics is one of the oldest fields of science. Crime has plagued humanity since the very beginning, and science has been enlisted to analyze and interpret the evidence left behind.

No field of science has contributed as much to forensic science as DNA sequencing and profiling. Nearly every cell in your body contains a complete copy of your genome. We leave some of these cells behind everywhere we go without even realizing it: hairs, flakes of skin, and drops of saliva or blood contain millions of cells. The DNA from these cells can be used to uniquely link criminals to the crimes they commit.

Television shows famously depict DNA sequencing as a quick and simple method for finding a criminal and bringing them to justice. In this course, you’ll put that depiction to the test.

# DNA Fingerprints

DNA profiling is often referred to as DNA fingerprinting. Fingerprints are well known to be unique to each person. Most countries scan the fingerprints of visitors to generate a unique identifier, and many mobile phones use fingerprints as a quick biometric password.

A fingerprint is an arrangement of ridges and valleys that develops before birth and is a pseudo-random function of the exact activity, orientation, and movements of the fetus. Millions of cells make up the fingerprint, and their relative arrangement persists throughout a person's life. Fingerprint scanners don't capture a complete dataset of the arrangement of each of these cells, because that is far more information than is necessary.

Instead, they usually capture a fingerprint as an image that is divided into 400 non-overlapping sites. Rare features on the ridges, including islands, crossovers, spurs, and bifurcations, are identified and assigned one of four directions based on their orientation, while all other sites are ignored. A sample fingerprint is considered a match if the rare feature sites (usually 10) have matching orientations and types.

Based on these parameters, how many different fingerprint configurations are there?

Note: $n \choose m$ is the combination function, often called "$n$ choose $m$."
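One way to count the configurations is sketched below. It assumes exactly 10 feature sites chosen from the 400, each with one of four orientations, and it ignores the feature type, so it is best read as a lower bound. The function name and its parameters are illustrative, not part of the lesson.

```python
from math import comb

def fingerprint_configurations(sites=400, features=10, orientations=4):
    """Count the ways to place `features` oriented features on `sites` sites.

    Assumes exactly `features` sites carry a feature and that only the
    position and orientation of each feature matter (feature type ignored).
    """
    placements = comb(sites, features)         # which sites hold a feature
    orientation_choices = orientations ** features  # direction of each feature
    return placements * orientation_choices

total = fingerprint_configurations()
world_population = 7_000_000_000  # figure used in the lesson

print(f"Configurations: {total:.3e}")
print(f"Configurations per person: {total / world_population:.3e}")
```

Including the feature type as well would multiply the total by a further factor for each of the 10 sites.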

Even using this fuzzy-matching method which only requires a minimum number of matching features, there are still billions of times more possible fingerprint configurations than there are people on Earth. Capturing the position and orientation of the millions of cells on a fingerprint isn't necessary to generate a unique identifier.

To get an idea of the uniqueness of a DNA fingerprint, let's estimate the number of possible genome arrangements. The genetic code is written in a four-letter alphabet of nucleotides. There are three billion nucleotides in the human genome, split across 23 pairs of chromosomes. That means there are an astounding $4^{3000000000}$ possible genomes, a number with over one billion digits. To uniquely identify each of the 7 billion people on Earth, we only need to capture a minuscule fraction of that diversity.

If each genetic site (or locus) in a genome is random and can be one of four nucleotides, what is the minimum number of sites we would need to match to uniquely identify every human being?
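The reasoning can be checked with a short sketch. It assumes a population of 7 billion and four equally likely nucleotides per site, as in the text. The result is a lower bound: with fewer sites there are not enough distinct profiles to go around, while random profiles could still collide by chance even at this number. Both function names are illustrative.

```python
import math

def min_loci(population, variants_per_locus=4):
    """Smallest number of loci whose distinct profiles can cover `population`."""
    loci = 0
    profiles = 1
    while profiles < population:
        profiles *= variants_per_locus
        loci += 1
    return loci

def decimal_digits_of_power(base, exponent):
    """Approximate number of decimal digits in base**exponent, via logarithms."""
    return math.floor(exponent * math.log10(base)) + 1

print("Minimum loci for 7 billion people:", min_loci(7_000_000_000))
print("Digits in 4**3,000,000,000:", decimal_digits_of_power(4, 3_000_000_000))
```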

The Federal Bureau of Investigation (FBI) maintains a database and testing protocol for DNA profiling known as the Combined DNA Index System (CODIS). Every DNA profile must include at least 13 sequenced loci, each of which can appear in one of 20 different nucleotide sequences. Recently the FBI added 7 more loci to make profiles even more discriminating. This set of loci captures only a tiny fraction of human diversity, but it is more than enough to make the probability of a false positive match extremely small. That probability is the number usually quoted in criminal trials and on television.
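To get a feel for these numbers, here is a rough sketch that counts the possible profiles under a deliberately idealized assumption: every locus has 20 equally likely variants and the loci are independent. Real variant frequencies are not uniform, so treat the output as an order-of-magnitude illustration only. The function name is illustrative.

```python
def profile_count(loci, variants_per_locus=20):
    """Number of distinct profiles if every locus varies independently."""
    return variants_per_locus ** loci

world_population = 7_000_000_000  # figure used in the lesson

for loci in (13, 20):
    profiles = profile_count(loci)
    print(f"{loci} loci: {profiles:.3e} profiles, "
          f"{profiles / world_population:.3e} per person on Earth")
```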

The CODIS profile and the methods used to reduce three billion nucleotides down to 20 loci will be investigated in the DNA Forensics quiz.

In the past 10 years, the 1000 Genomes Project has far surpassed its mandate and sequenced thousands of human genomes. It has found that humans are 99.9% genetically identical. Most of the differences between individuals are single-nucleotide polymorphisms (SNPs) that have resulted from random mutations throughout human history. On average, the DNA of two random people differs by about one SNP per 1000 base pairs.
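As a quick back-of-the-envelope check, the sketch below combines the two figures quoted in this lesson (three billion nucleotides and about one SNP per 1000 base pairs) to estimate how many positions differ between two people. The variable names are illustrative.

```python
GENOME_LENGTH = 3_000_000_000   # nucleotides, as quoted in the lesson
BASE_PAIRS_PER_SNP = 1000       # about one difference per 1000 base pairs

expected_differences = GENOME_LENGTH // BASE_PAIRS_PER_SNP
identical_fraction = 1 - 1 / BASE_PAIRS_PER_SNP

print(f"Expected differing positions: {expected_differences:,}")
print(f"Fraction identical: {identical_fraction:.1%}")
```

Even at 99.9% identity, that leaves millions of differing positions between any two people.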

The tiny 0.1% difference between individual genomes is responsible for all the variance in human morphology, health, and ethnicity. It is only the amazing ability of the human brain to notice and amplify differences, and to ignore the unchanging, unifying characteristics, that makes us feel so different from a person we run into on the street.

When we investigate the central dogma of molecular biology and the genetic code in the next chapter, we’ll see why the genetic code is surprisingly error-tolerant to single nucleotide mutations (SNPs).

Consider these two DNA sequences taken from different organisms. Short DNA strands like these are very easy to sequence, and most DNA profiles only target the sequence of interest, rather than tackling an entire gene or genome. Comparing the two strings position by position (`seq1` and `seq2` in the code below) shows 7 positions that differ, so there must have been at least 7 mutation events or DNA copying errors.
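Counting the positions at which two equal-length strings differ is known as the Hamming distance. A minimal sketch, using the two sequences from this lesson's translation code (the helper names are illustrative):

```python
seq1 = 'TCTGCTTTAACTTAT'
seq2 = 'AGTGCGCTGACCTAC'

def differing_positions(a, b):
    """Return the 0-based positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return len(differing_positions(a, b))

print(differing_positions(seq1, seq2))  # [0, 1, 5, 6, 8, 11, 14]
print(hamming_distance(seq1, seq2))     # 7
```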

But mistakes at the DNA level don't always translate to mistakes in protein blueprints. The instructions that cells follow to turn a DNA sequence into a protein are often very robust to SNPs. Use the translation function in the Python environment below to find the protein encoded in these strings.

For now, just click Run code in the environment below and take a look at the code if you're interested. You’ll build this function and many others yourself in the next chapter.

How many mutations propagate from the DNA to the protein?

```python
seq1 = 'TCTGCTTTAACTTAT'
seq2 = 'AGTGCGCTGACCTAC'

# A function that translates DNA into a protein sequence.
def translate(DNA):
    # A Python dictionary mapping each DNA codon to its amino acid (the genetic code).
    dna_to_pro = {'ATG': 'M', 'GCG': 'A', 'TCA': 'S', 'GAA': 'E', 'GGG': 'G', 'GGT': 'G', 'AAA': 'K', 'GAG': 'E', 'AAT': 'N', 'CTA': 'L',
                  'CAT': 'H', 'TCG': 'S', 'TAG': 'STOP', 'GTG': 'V', 'TAT': 'Y', 'CCT': 'P', 'ACT': 'T', 'TCC': 'S', 'CAG': 'Q', 'CCA': 'P',
                  'TAA': 'STOP', 'AGA': 'R', 'ACG': 'T', 'CAA': 'Q', 'TGT': 'C', 'GCT': 'A', 'TTC': 'F', 'AGT': 'S', 'ATA': 'I', 'TTA': 'L',
                  'CCG': 'P', 'ATC': 'I', 'TTT': 'F', 'CGT': 'R', 'TGA': 'STOP', 'GTA': 'V', 'TCT': 'S', 'CAC': 'H', 'GTT': 'V', 'GAT': 'D',
                  'CGA': 'R', 'GGA': 'G', 'GTC': 'V', 'GGC': 'G', 'TGC': 'C', 'CTG': 'L', 'CTC': 'L', 'CGC': 'R', 'CGG': 'R', 'AAC': 'N',
                  'GCC': 'A', 'ATT': 'I', 'AGG': 'R', 'GAC': 'D', 'ACC': 'T', 'AGC': 'S', 'TAC': 'Y', 'ACA': 'T', 'AAG': 'K', 'GCA': 'A',
                  'TTG': 'L', 'CCC': 'P', 'CTT': 'L', 'TGG': 'W'}
    protein = []
    start = 0
    # Step through the DNA sequence one codon (three letters) at a time and translate.
    while start + 2 < len(DNA):
        codon = DNA[start:start + 3]
        protein.append(dna_to_pro[codon])
        start += 3
    return ''.join(protein)

# Print the translated sequences.
print("Seq1: ", seq1, "Translated: ", translate(seq1))
print("Seq2: ", seq2, "Translated: ", translate(seq2))
```

The code above works in six steps:

1. Define two DNA sequences to be translated into a protein sequence.
2. Define a function that takes a DNA sequence as an argument and outputs the translated protein sequence.
3. Build a dictionary that matches each DNA triplet to a single protein letter. This dictionary is the genetic code!
4. Iterate through the DNA sequence three letters at a time until the end.
5. For each DNA triplet, append the corresponding protein letter to a growing protein sequence, following the genetic code.
6. Return the translated protein sequence.
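To see where the DNA differences land, it can help to compare the two sequences codon by codon. The sketch below is not the lesson's own code. It rebuilds the standard genetic code from the compact 64-letter listing (bases ordered T, C, A, G, with `*` marking a stop codon rather than `'STOP'`), and the helper names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codons(dna):
    """Split a DNA string into complete three-letter codons."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]

def compare_codons(a, b):
    """For each codon pair: (codon_a, codon_b, nucleotide differences, aa_a, aa_b)."""
    rows = []
    for ca, cb in zip(codons(a), codons(b)):
        nt_diffs = sum(x != y for x, y in zip(ca, cb))
        rows.append((ca, cb, nt_diffs, CODON_TABLE[ca], CODON_TABLE[cb]))
    return rows

seq1 = 'TCTGCTTTAACTTAT'
seq2 = 'AGTGCGCTGACCTAC'

rows = compare_codons(seq1, seq2)
for ca, cb, nt_diffs, aa_a, aa_b in rows:
    status = "same amino acid" if aa_a == aa_b else "amino acid changed"
    print(f"{ca} -> {cb}: {nt_diffs} nucleotide difference(s), {aa_a} vs {aa_b} ({status})")

print("Nucleotide differences:", sum(r[2] for r in rows))
print("Amino acid changes:", sum(r[3] != r[4] for r in rows))
```

Mutations that leave the amino acid unchanged are usually called synonymous, or silent, mutations.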

But the genetic code isn't robust to all SNP mutations. If it were, we wouldn't see the diversity we observe in physical appearance and proclivities, or the unfortunate genetic diseases. The traits that result from genetics are called phenotypes. In humans, some common phenotypes are eye color or hair curliness, but genes also encode complex traits like the likelihood of developing cancer, or even behavioral characteristics like drug addiction and dependence.

Polydactyly is a phenotype in which an individual has extra fingers or toes. Cats typically have 18 toes (five on each front foot and four on each back foot), but this cat with polydactyly has four extras, for a total of 22 toes.

By comparing the genetic sequences of five control cats and five polydactyl cats, we can track down the exact position where an SNP can result in extra toes.

Use the Python environment below to compare the case and control sequences with a $\chi$-squared statistical test. Again, just click Run code and interpret the results; you'll learn how to do this analysis in the Genomics chapter.

Which position likely correlates with polydactyly in cats?

```python
## Load in the sequences of control cats (18 toes)
## and the sequences of case cats (22 toes)
import numpy as np
import scipy.stats as stats
import warnings
# Positions where a nucleotide never appears give 0/0 in the test below;
# silence those warnings and skip the resulting NaNs later.
warnings.filterwarnings("ignore")

cases=   ['ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGGCTAATTTGTCTCAGGCCTGCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTGGA','ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAACGTGGTCTAGA','GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGAAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGAAATGAGTAGGAAGTCCACCGTGGTCTAGA','ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGGCTAATTTGTCTCAGGCCTGCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTGGA','ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAACGTGGTCTAGA','GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGAAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGAAATGAGTAGGAAGTCCACCGTGGTCTAGA']
controls=['ACCTTGTACTGTATATTATGACCAAATGACTTTTTCCCCCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACAGAAATGAGTAGGAAGTCCACCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACAGTAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCATTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACAGTAATGAGTAGGAGGTCCAGCGTGGTCTAGA','GCCTTGTACTGTATTTTATGACCAAATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTGCGTCTTAAAGAGACACAGTAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACAGAAATGAGTAGGAAGTCCAACGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAAATGACTTTTTCCCCCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACAGAAATGAGTAGGAAGTCCACCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACAGTAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCATTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACAGTAATGAGTAGGAGGTCCAGCGTGGTCTAGA','GCCTTGTACTGTATTTTATGACCAAATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTGCGTCTTAAAGAGACACAGTAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACAGAAATGAGTAGGAAGTCCAACGTGGTCTAGA']
length = len(cases[1])

## Find the frequency of each nucleotide in each position, to find the positions
## where the sequences are different.

def freq_lists(dna_list):
    n = len(dna_list[0])
    A = [0]*n
    T = [0]*n
    G = [0]*n
    C = [0]*n
    for dna in dna_list:
        for index, base in enumerate(dna):
            if base == 'A':
                A[index] += 1
            elif base == 'C':
                C[index] += 1
            elif base == 'G':
                G[index] += 1
            elif base == 'T':
                T[index] += 1
    # Rows are returned in the order A, C, G, T.
    return A, C, G, T

# Per-position nucleotide counts for each group, divided by 5.
freqcases = np.array(freq_lists(cases))/5
freqcontrols = np.array(freq_lists(controls))/5

## Compare the nucleotide frequencies in the normal cats, and the many-toed cats
result = []
for nuc in range(4): # For each possible marker nucleotide, run a statistical test for each position
    test = []
    for i in range(length):
        n,p = stats.chisquare([freqcases[nuc,i], freqcontrols[nuc,i]])
        test.append([n,p])
    test = np.array(test)
    # Keep the position with the smallest p-value, ignoring NaN entries.
    result.append([np.nanargmin(test[:,1]),np.nanmin(test[:,1])])

## Output the statistical test results
print("Position, P-value")
print("A nucleotide:", result[0])
print("C nucleotide:", result[1])
print("G nucleotide:", result[2])
print("T nucleotide:", result[3])
```

The analysis above proceeds as follows:

1. Define two sets of DNA sequences: 10 from "control" cats with a normal number of toes, and 10 from "case" cats with 22 toes.
2. Define a function to collect frequency statistics on the DNA nucleotides in each position of the sequence. Do some parts of the sequence tend to be different for polydactyl cats?
3. For each nucleotide at each position, run a statistical test to determine whether there is a significant difference in frequency between the control and case cats.
4. Compare the frequency statistics between case and control using a chi-squared statistical test. If the two frequencies are very different, the "p-value" will be low.
5. Print the positions and nucleotides that showed the greatest statistical difference between case cats and control cats. A low "p-value" indicates that these differences are unlikely to arise by chance.

Note: In statistical tests, the $p$-value is the probability of seeing a result at least as extreme as the one observed if only random chance were at work.
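To make the test less of a black box, here is a standard-library sketch of a chi-squared test on two values, where the expected value is taken to be their mean. This is intended to mirror what `stats.chisquare` does by default for a two-element list, but it is a simplified illustration rather than SciPy's implementation, and the function name is illustrative.

```python
import math

def chisquare_two_groups(case_value, control_value):
    """Chi-squared test of two observed values against their shared mean.

    Returns (statistic, p_value). With two groups there is one degree of
    freedom, for which the p-value equals erfc(sqrt(statistic / 2)).
    """
    expected = (case_value + control_value) / 2
    if expected == 0:
        # The nucleotide appears in neither group: the test is undefined.
        return math.nan, math.nan
    statistic = ((case_value - expected) ** 2
                 + (control_value - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

print(chisquare_two_groups(1, 1))    # identical values: statistic 0, p-value 1
print(chisquare_two_groups(2, 0))    # present only in the cases
print(chisquare_two_groups(10, 0))   # same pattern with larger counts
```

The last two calls describe the same pattern, a nucleotide seen only in the case group, but the larger counts give a much smaller p-value. The p-value therefore depends on how the counts are scaled, not only on the proportions.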

If a polydactyl cat with 22 toes were to breed with a normal control cat, how many toes would the kittens have? Offspring tend to share a mixture of their parents' mutations. In the Sequencing chapter, we'll investigate how mutations and the traits they cause have mixed and evolved throughout human history.

Consider four villages in the ancient world. Since the only way to travel is on foot, the long distance between the villages limits population mixing, and only rarely will an individual or family move to the next village over. Due to their isolation, the people living in each village tend to have their own unique set of mutations called a gene pool.

If you could make a list of all the mutations present in each person, what pattern do you think you would find?

We'll generate datasets from the 1000 Genomes Project to capture hundreds of thousands of mutations between thousands of individual genomes. Armed with this big data, we can extract a startling amount of geographic and ancestral information. Just like well-known genealogy and ancestry services, we will use principal component analysis to connect ethnic groups in East Africa, the Caribbean, and the USA.

The millions of unique features in a human genome can serve as far more than a fingerprint: given enough data, we can gain insight into human health and history, and even trace life back to its origins.
