This is part of the "DNA Fingerprint" section of the Computational Biology course. The mechanics of how you'd apply the chi-square test in the scenario aren't explained, and I would have liked more info on it since it's such a useful tool in statistical analysis.
The scenario is:
- You are given 2 arrays: one with gene sequences from 10 cats with polydactyly (cases) and the other with gene sequences from 10 normal cats (controls). Each gene sequence is 113 bases long. How would you go about conducting a chi-square test to find the position and nucleotide change that is the most likely cause of polydactyly? This is what I'd like to discuss.
Read on if you'd like to understand the mechanics of the code from Brilliant.
- Sum up the number (frequency) of A, T, G and C nucleotides at each position for the cases and for the controls. This produces 2 separate arrays, one for cases and another for controls. Each is a 2D NumPy array with 4 rows, one per nucleotide, and each row holds 113 frequencies, one per position. Each array is of the form:
[[113 frequencies of Nucleotide A] , [113 frequencies of Nucleotide T] , [113 frequencies of Nucleotide G] , [113 frequencies of Nucleotide C]]
- Take the values for the same nucleotide and position from the cases array and from the controls array, and pass them to the "chisquare" function (from "scipy.stats") as a single array, as follows:
n,p = chisquare([cases[nucleotide,position], controls[nucleotide,position]])
- It returns the chi-square statistic as "n" and the p-value as "p".
- After looking at the documentation for the "chisquare" function, I found that when only one array of observed values is passed, the function takes the mean of those values and uses it as the expected value for every category. Surely this cannot be correct, as you would want to use cases - controls, or observed - expected, to calculate the chi-square statistic?
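To make the steps above concrete, here is a minimal sketch on toy data. It is not Brilliant's code: the helper `count_nucleotides`, the 4-base toy sequences and all variable names are assumptions made for illustration. It builds the two frequency arrays in the A, T, G, C row order shown above, calls "chisquare" on one nucleotide and position, and then recomputes the statistic by hand using the mean of the two counts as the expected value.

```python
import numpy as np
from scipy.stats import chisquare

NUCLEOTIDES = "ATGC"  # row order of the frequency arrays (assumed from the form above)

def count_nucleotides(sequences):
    """Hypothetical helper: return a 4 x L array of counts,
    where row i holds the per-position counts of NUCLEOTIDES[i]."""
    seqs = np.array([list(s) for s in sequences])  # shape (n_sequences, L)
    return np.array([(seqs == n).sum(axis=0) for n in NUCLEOTIDES])

# Toy data (made up): 4 sequences of 4 bases instead of 10 sequences of 113.
case_seqs = ["ATGA", "ATGA", "ATGC", "ATGA"]
control_seqs = ["ATGC", "ATGC", "ATGC", "ATGA"]

cases = count_nucleotides(case_seqs)
controls = count_nucleotides(control_seqs)

# Compare the count of 'A' (row 0) at the last position (index 3).
nucleotide, position = 0, 3
observed = np.array([cases[nucleotide, position],
                     controls[nucleotide, position]])

# With only observed counts given, chisquare assumes equal expected
# frequencies, i.e. the mean of the observed values.
stat, p = chisquare(observed)

# The same statistic computed by hand: sum((observed - expected)^2 / expected).
expected = np.full(observed.shape, observed.mean())
manual_stat = ((observed - expected) ** 2 / expected).sum()

print(observed, stat, manual_stat, p)
```

One way to read this: the two "categories" are cases and controls, and the mean is what each group would be expected to contribute if the nucleotide were unrelated to polydactyly. That reading relies on both groups being the same size (10 cats each in the scenario). So the observed - expected difference is still there; the expected value is just the equal split of the combined count.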