How much memory would it take to store my DNA?
A few kilobytes? A megabyte? A gigabyte? I found this nifty exercise on an MIT problem set.
Problem 1: It’s in the Genes
DNA consists of a sequence of codons, each having three nucleotides (any of the four nucleotides, A, G, C, or T). Consider a 500-codon-long sample of DNA and the corresponding protein of 500 amino acids. You wonder about the information associated with this sample of DNA and the information associated with the resulting protein.
- How many bits do you need to specify a single codon?
- If you represent the information as a sequence of 500 codons, how many bits do you need to encode the sample of DNA?
- There are twenty amino acids. How many bits do you need to specify an amino acid?
- If you represent the information as a sequence of 500 amino acids, how many bits do you need to encode the protein?
- Note that the number of bits you need is different using the two approaches. If you were designing a system to store the information needed to manufacture a protein, which approach would you use? Discuss the advantages and disadvantages of the two.
- You have the opportunity to design a new life form which uses the same set of nucleotides and the same set of 20 amino acids, but uses longer codons. Your first design decision is how many nucleotides to include in one codon. If you include only three, then you can specify only one amino acid per codon. If you include five, you can specify a sequence of two amino acids. Consider all cases of 11 or fewer nucleotides per codon, and calculate for each case the “information compression” which we will define here as one minus the ratio between the number of bits per amino acid for the new form of DNA and the number of bits per amino acid in Nature’s original code, namely 6.
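Here's a rough Python sketch of the arithmetic the problem asks for. The function names are my own, and it rests on two assumptions: each nucleotide costs 2 bits (one of four choices), and a codon of n nucleotides carries the largest number k of amino acids such that 20^k <= 4^n. That second assumption matches the problem's own examples (three nucleotides give one amino acid, five give two), but it is my reading, not an official solution.

```python
import math

NUCLEOTIDES = 4    # alphabet size for one nucleotide
AMINO_ACIDS = 20   # alphabet size for one amino acid
NATURE_BITS = 6    # bits per amino acid in the 3-nucleotide code (from the problem)


def bits_per_codon(n):
    """Bits needed to specify a codon of n nucleotides (2 bits each)."""
    return n * int(math.log2(NUCLEOTIDES))


def amino_acids_per_codon(n):
    """Largest k with 20**k <= 4**n, computed with exact integers."""
    k = 0
    while AMINO_ACIDS ** (k + 1) <= NUCLEOTIDES ** n:
        k += 1
    return k


def compression(n):
    """One minus (bits per amino acid with n-nucleotide codons) / 6."""
    k = amino_acids_per_codon(n)
    if k == 0:
        return None  # 1 or 2 nucleotides cannot distinguish 20 amino acids
    return 1 - (bits_per_codon(n) / k) / NATURE_BITS


if __name__ == "__main__":
    # First four questions: fixed-length encodings of the 500-unit sample.
    aa_bits = math.ceil(math.log2(AMINO_ACIDS))
    print("bits per codon:", bits_per_codon(3))
    print("bits for 500 codons:", 500 * bits_per_codon(3))
    print("bits per amino acid:", aa_bits)
    print("bits for 500 amino acids:", 500 * aa_bits)

    # Last question: codon lengths 1 through 11.
    for n in range(1, 12):
        print(n, amino_acids_per_codon(n), compression(n))
```

Under these assumptions a 9-nucleotide codon holds four amino acids at 4.5 bits each, for a compression of 0.25, while a 4-nucleotide codon still holds only one amino acid and so comes out negative (worse than Nature's code).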
The codon compression is interesting. What else do I need to know? I need to know how many amino acids/codons there are in human DNA. The Human Genome Project gives the count of base pairs in human DNA as 3 billion (20,000-25,000 genes).
So, we have 3 billion base pairs, 3 base pairs per codon. That's 1 billion codons. Each codon encodes one of, at most, 20 amino acids. Picking one of 20 amino acids takes about 4.3 bits (log base 2 of 20). That's 4.3 billion bits, or about 513 megabytes (/8/1024/1024). You can get 512 MB flash memory for about $35. Cost of sequencing entire genome: $3 million. Randall Parker at FuturePundit has a nice piece on Helicos and Solexa proposing to bring down the cost to about $5K.
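For anyone who wants to check the back-of-the-envelope numbers, here is a minimal Python sketch of the same estimate. The variable names are mine. The last figure, storing every base directly at 2 bits each, is a comparison I added based on the 6-bits-per-codon figure in the problem set.

```python
import math

BASE_PAIRS = 3_000_000_000   # Human Genome Project figure quoted above
BASES_PER_CODON = 3
BITS_PER_BYTE = 8
BYTES_PER_MB = 1024 * 1024   # the /1024/1024 convention used in the post


def megabytes(bits):
    """Convert a bit count to megabytes (1 MB = 1024 * 1024 bytes here)."""
    return bits / BITS_PER_BYTE / BYTES_PER_MB


codons = BASE_PAIRS // BASES_PER_CODON

# One amino acid per codon, at the rounded 4.3 bits used in the post.
as_amino_acids_rounded = megabytes(codons * 4.3)

# Same estimate with the unrounded log2(20).
as_amino_acids_exact = megabytes(codons * math.log2(20))

# Added comparison: store every base directly at 2 bits each.
as_raw_bases = megabytes(BASE_PAIRS * 2)

print(f"amino acids, 4.3 bits each: {as_amino_acids_rounded:.0f} MB")
print(f"amino acids, log2(20) bits: {as_amino_acids_exact:.0f} MB")
print(f"raw bases, 2 bits each:     {as_raw_bases:.0f} MB")
```

With the rounded 4.3 bits this lands on about 513 MB, the unrounded log2(20) gives about 515 MB, and storing the raw bases comes to roughly 715 MB.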
Posted by torque at February 14, 2006 12:49 AM