DNA sequencing contest
Per Ian Sample at The Guardian, Peter Diamandis of the X-Prize Foundation is considering offering $10m for the first company to sequence the genetic code of 100 people in weeks.
"The value of having the human genome doesn't really occur until you have it for tens or hundreds of thousands of people, so the prize will make that happen," Dr Diamandis said. "To say this gene correlates with adult onset diabetes, that this gene reacts badly with that drug, you need a huge statistical database."
DNA sequencing startups
The Scientist has a nice short list:
How much memory would it take to store my DNA?
A few kilobytes? A megabyte? A gigabyte? I found this nifty exercise on an MIT problem set.
Problem 1: It’s in the Genes
DNA consists of a sequence of codons, each having three nucleotides (any of the four nucleotides, A, G, C, or T). Consider a 500-codon-long sample of DNA and the corresponding protein of 500 amino acids. You wonder about the information associated with this sample of DNA and the information associated with the resulting protein.
- How many bits do you need to specify a single codon?
- If you represent the information as a sequence of 500 codons, how many bits do you need to encode the sample of DNA?
- There are twenty amino acids. How many bits do you need to specify an amino acid?
- If you represent the information as a sequence of 500 amino acids, how many bits do you need to encode the protein?
- Note that the number of bits you need is different using the two approaches. If you were designing a system to store the information needed to manufacture a protein, which approach would you use? Discuss the advantages and disadvantages of the two.
- You have the opportunity to design a new life form which uses the same set of nucleotides and the same set of 20 amino acids, but uses longer codons. Your first design decision is how many nucleotides to include in one codon. If you include only three, then you can specify only one amino acid per codon. If you include five, you can specify a sequence of two amino acids. Consider all cases of 11 or fewer nucleotides per codon, and calculate for each case the “information compression” which we will define here as one minus the ratio between the number of bits per amino acid for the new form of DNA and the number of bits per amino acid in Nature’s original code, namely 6.
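Here is a rough sketch of how the arithmetic in the exercise could be worked out. It is my own reading of the problem, not the official MIT solution. The function names are made up for illustration, and I am assuming that "bits needed" means whole bits per symbol and that a codon of n nucleotides carries as many amino acids as will fit (the largest k with 20^k <= 4^n).

```python
NUCLEOTIDES = 4                  # four possible nucleotides, so 2 bits each
AMINO_ACIDS = 20
NATURE_BITS_PER_AMINO_ACID = 6   # 3 nucleotides x 2 bits, per the problem

def bits_to_specify(choices):
    """Whole bits needed to pick one of `choices` options."""
    return (choices - 1).bit_length()

def amino_acids_per_codon(nucleotides):
    """Largest k such that 20**k sequences fit in 4**nucleotides codons."""
    k = 0
    while AMINO_ACIDS ** (k + 1) <= NUCLEOTIDES ** nucleotides:
        k += 1
    return k

def compression(nucleotides):
    """One minus (bits per amino acid) / 6, or None if no amino acid fits."""
    k = amino_acids_per_codon(nucleotides)
    if k == 0:
        return None
    bits_per_amino_acid = 2 * nucleotides / k
    return 1 - bits_per_amino_acid / NATURE_BITS_PER_AMINO_ACID

print("bits per codon:", bits_to_specify(NUCLEOTIDES ** 3))
print("bits per amino acid:", bits_to_specify(AMINO_ACIDS))
print("500 codons:", 500 * bits_to_specify(NUCLEOTIDES ** 3), "bits")
print("500 amino acids:", 500 * bits_to_specify(AMINO_ACIDS), "bits")

for n in range(1, 12):
    c = compression(n)
    k = amino_acids_per_codon(n)
    label = "too short for one amino acid" if c is None else f"{c:.3f}"
    print(f"{n:2d} nucleotides, {k} amino acids per codon, compression {label}")
```

Under these assumptions the odd codon lengths (5, 7, 9, 11) are the ones that gain an extra amino acid, and the compression creeps up slowly toward the limit set by log base 2 of 20.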
The codon compression is interesting. What else do I need to know? I need to know how many amino acids/codons there are in human DNA. The Human Genome Project gives the count of base pairs in human DNA as 3 billion (20,000-25,000 genes).
So, we have 3 billion base pairs, 3 base pairs per codon. That's 1 billion codons. Each codon encodes one of, at most, 20 amino acids. Picking one of 20 amino acids takes about 4.3 bits (log base 2 of 20). That's 4.3 billion bits = 513 megabytes (/8/1024/1024). You can get 512 MB flash memory for about $35. Cost of sequencing entire genome: $3 million. Randall Parker at FuturePundit has a nice piece on Helicos and Solexa proposing to bring down the cost to about $5K.
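For anyone who wants to check the back-of-the-envelope numbers, here is a small sketch of the same estimate. The names are my own. The raw two-bits-per-base figure is an extra comparison that I added on the assumption that each base is simply stored as one of four values.

```python
import math

BASE_PAIRS = 3_000_000_000      # Human Genome Project figure quoted above
CODONS = BASE_PAIRS // 3        # 3 base pairs per codon

def megabytes(bits):
    """Convert bits to megabytes using the /8/1024/1024 rule from the text."""
    return bits / 8 / 1024 / 1024

# Amino-acid encoding with the rounded 4.3 bits used in the text.
rounded_mb = megabytes(CODONS * 4.3)

# Same encoding with the unrounded log base 2 of 20.
exact_mb = megabytes(CODONS * math.log2(20))

# Assumption: storing every base directly at 2 bits each.
raw_mb = megabytes(BASE_PAIRS * 2)

print(f"4.3 bits per codon:      {rounded_mb:.1f} MB")
print(f"log2(20) bits per codon: {exact_mb:.1f} MB")
print(f"2 bits per base:         {raw_mb:.1f} MB")
```

The rounded figure lands at roughly 513 MB, the unrounded one a few megabytes higher, and the raw two-bit encoding at a bit over 700 MB, so either way it is in the range of a cheap flash card.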
Setting the sector size - why 4 KB?
Microsoft gives the scoop in Windows NT Boot Process and Hard Disk Constraints. Here's the skinny. The partition table which Windows uses to track hard disk partitions has a field size of 32 bits. The number of sectors in a partition is thus limited to 2^32 = 4,294,967,296. The rest is multiplication: a 4 KB sector size permits a maximum partition size of 4*1,024 bytes*4,294,967,296 = 17,592,186,044,416 bytes = 16 TB. The next sector size down, 2 KB, permits an 8 TB partition. Hmm, I think I could have chosen a 2 KB sector size.
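The multiplication is easy to tabulate. This is a quick sketch based only on the 2^32 sector limit described above. The list of candidate sector sizes (512 bytes up to 4 KB) and the helper names are my own assumptions, and TB here means 1,024^4 bytes, as in the calculation above.

```python
MAX_SECTORS = 2 ** 32           # 32-bit sector count field
TB = 1024 ** 4                  # binary terabyte, matching the math above
SECTOR_SIZES = (512, 1024, 2048, 4096)   # assumed candidate sizes, in bytes

def max_partition_bytes(sector_bytes):
    """Largest partition addressable with the given sector size."""
    return sector_bytes * MAX_SECTORS

def smallest_sector_size(partition_bytes):
    """Smallest candidate sector size that can address the partition."""
    for size in SECTOR_SIZES:
        if max_partition_bytes(size) >= partition_bytes:
            return size
    return None

for size in SECTOR_SIZES:
    limit = max_partition_bytes(size)
    print(f"{size:5d}-byte sectors: {limit:,} bytes = {limit // TB} TB")

print("Smallest sector size for 7.5 TB:", smallest_sector_size(7.5 * TB))
```

By this reckoning the default 512-byte sectors top out at 2 TB, and a 7.5 TB array fits under the 2 KB limit with a little room to spare.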
A few tips on setting up a 7.5 TB array - the Vtrak m500p
You read that correctly - that's 7.5 TB, as in terabyte. Yowsas... Configuration: Dell 1800, Adaptec 39320 Ultra320 SCSI card, Vtrak m500p. Here are the tips:
- When you initialize, select 4 kilobyte sectors, rather than the default 512 bytes. Windows can support huge disk sizes, but there is a limit to the number of sectors it can handle.
- Make sure to set Target ID and Lun ID in the WebPAM. I left it at the default - blank - and was unable to read the drive in Disk Manager.
- After setting the Target ID and Lun ID, you should be able to initialize the disk. Use GPT! (Otherwise, you'll only get a fraction of your storage.)
- You might time out trying to format the drive. If so, take the SCSI speed down to 160 MBps if you are at 320. That did the trick for me - then step it back up to 320 when you start using it*.
*Yes, that makes me nervous.