Earlier in the year I wrote an article about the growth curve of worldwide sequencing capacity based on current and expected placements of next-generation sequencing instruments. And while worldwide capacity is set to at least double every year for the next five years, I am equally excited about the progress that has been made on the analysis side of the industry.
Everyone from Eric Green at the NHGRI to David Dooling at Washington University’s Genome Center to Kevin Davies at Bio-IT World rightly acknowledges that the $1,000 genome is a misnomer if you decide to include the cost of computing, analyzing, storing, and querying the data. And if you don’t budget for those costs – well, you’ll be a touch over budget.
Nevertheless, recent innovations have changed the playing field. Just two years ago the debate was whether MAQ, the fastest read-mapping algorithm at the time, was accurate enough given its gapless strategy. Today, there are some electrifying algorithms out there based on the Burrows-Wheeler transform that move 10 times the number of reads through the same compute pipelines. And where Burrows-Wheeler doesn’t work – the long reads we’re seeing more and more of – there are fantastic new approaches like GASSST (which we use to get a 7x increase in mapping speed with arbitrarily long reads and arbitrarily large gaps).
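For the curious, here is a minimal, purely illustrative sketch of the Burrows-Wheeler transform itself: the naive sort-all-rotations construction, nothing like the FM-index machinery the production aligners build on top of it.

```python
# Naive Burrows-Wheeler transform, for illustration only. Real BWT-based aligners
# build an FM-index over the transform of the whole reference rather than
# sorting rotations in memory like this.

def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of `text`."""
    text = text + sentinel                                    # sentinel marks end-of-string
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)    # last column of the sorted rotations

print(bwt("GATTACA"))   # ACTGA$TA -- the transform groups similar characters, which makes the index compressible
```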
I gave a talk at ABRF this year entitled “The Bioinformatics Bottleneck,” where I challenged an audience full of bioinformaticians: if we keep getting stuck arguing about the fastest algorithm to map reads and whether to store image files, the industry will simply move on without us. (No, Steve Lincoln, I didn’t out-Steve you.) Illumina, Life Tech, Roche, and now PacBio are going to keep making machines. Beijing Genomics is going to keep buying them. In the heat of the moment I think I said something about not only throwing out the image files, but also throwing out the reads themselves – after all, it’s the representation of variation, not the reads, that researchers and clinicians will ultimately care about. As late as March 2010 this was met with shock and even disdain; but today great companies like Complete Genomics and important projects like the Personal Genome Project are doing just that – providing us with the variant calls only, not the reads. Unless of course you want them – your choice, by the way, and may your disk drives spin forever.
Now that’s not to say that we don’t need to keep mapping. And so we do. GenomeQuest is on track to have mapped 100 billion reads by the end of 2010 – about 11 million an hour. Of course if our systems were 100% utilized – an impossible dream for any data center – the number would be far higher. And next year we expect it will be at least five times that. Mapping goes on, powered by innovations in software and in hardware as well. Earlier this year we announced a partnership with SGI that enabled us to build a purpose-built architecture designed specifically for mapping. By ensuring that reference data is available on every compute node, we minimize network traffic; through significant redundancy in both compute and head nodes we ensure quality of service. By the end of 2011 we expect to be able to map 300 whole human genomes at 30x coverage per month, or 6,000 exomes per month.
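A quick back-of-envelope check on those numbers (the 3 Gb genome size and 100 bp read length below are my assumptions for the arithmetic, not production specs):

```python
# Rough sanity check on the throughput figures above.

reads_2010 = 100e9                          # reads mapped in 2010
hours_per_year = 365 * 24
print(reads_2010 / hours_per_year / 1e6)    # ~11.4 million reads per hour

bases_per_30x_genome = 3e9 * 30             # ~90 Gb of sequence per 30x whole genome
reads_per_genome = bases_per_30x_genome / 100
print(reads_per_genome / 1e6)               # ~900 million 100 bp reads per genome
```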
The idea that so many genomes can move through a uniform, industry-tested workflow is enchanting. And it sets our sights on the next realm of opportunity: analysis of thousands of genomes at the same time. Projects such as the 1000 Genomes Project (actually planning more like 2,000 – and do you really think Durbin and Altshuler will stop there?) are troves of undiscovered knowledge about the basis of disease. Being able to overlay that on top of your own 100-exome project provides critical background information. So using the same technology, GenomeQuest has an exciting new suite of products for multi-genome analysis, currently available for early access. As the year rolls on into 2011, these workflows for computing allelic frequency, performing large-scale tumor/normal studies, and asking population-scale questions across thousands of genomes will be further enhanced in close collaboration with our users.
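As a toy illustration of the simplest of those questions, here is a hypothetical allelic-frequency calculation over per-sample variant calls; the data layout is invented for the example and is not our actual multi-genome format.

```python
# Toy allelic-frequency calculation across a cohort of samples. Each sample is
# modeled (purely for illustration) as a dict of (chrom, pos) -> diploid genotype.

from collections import Counter, defaultdict

def allele_frequencies(samples):
    counts = defaultdict(Counter)
    for sample in samples:
        for site, genotype in sample.items():
            for allele in genotype.split("/"):      # count both alleles of the diploid call
                counts[site][allele] += 1
    return {site: {allele: n / sum(c.values()) for allele, n in c.items()}
            for site, c in counts.items()}

cohort = [
    {("chr1", 12345): "A/G"},
    {("chr1", 12345): "G/G"},
    {("chr1", 12345): "A/A"},
]
print(allele_frequencies(cohort))   # {('chr1', 12345): {'A': 0.5, 'G': 0.5}}
```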
The end game for healthcare is genomic medicine, enabled by the sequencing of individual patients and large-scale comparison against populations – from phenotype to clinical presentation to genotype to treatment to response. That’s why we’ve partnered with Beth Israel Deaconess Medical Center – to work closely on the development of whole genome reports that are actually usable by pathologists, like any other lab test. The idea is to map clinical actions to specific variations and present them in a way that lets a trained genomics physician ultimately guide the course of treatment. There is a long road ahead and plenty of stakeholders involved, but being in the clinic in 2010 drives our thinking about the research and clinical development applications that exist today in pharma and academia.
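In spirit (and only in spirit; the actual report design is being worked out with the clinicians), that mapping is a lookup from called variants to clinical notes, something like this sketch, in which every entry is a hypothetical placeholder:

```python
# Sketch of a variant-to-clinical-action mapping. All entries are hypothetical
# placeholders, not the content or format of any actual clinical report.

CLINICAL_ACTIONS = {
    ("chr7", 1000000, "G", "A"):  "hypothetical actionable variant: adjust dosing of drug X",
    ("chr17", 2000000, "C", "T"): "hypothetical variant of uncertain significance: refer to genetics",
}

def report(called_variants):
    """Pair each called variant with the clinical note on file, if any."""
    return [(v, CLINICAL_ACTIONS.get(v, "no annotation on file")) for v in called_variants]

for variant, note in report([("chr7", 1000000, "G", "A"), ("chr2", 5000, "T", "C")]):
    print(variant, "->", note)
```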
Genomics is still a bit of the wild west – in a single day I can have a meeting with a pharma executive, a seminar on an FDA position paper, a discussion with agbio researchers on the genomics of polyploid organisms, and a deep dive with a team of bioinformaticians talking about word-based hashing algorithms. When genomics is in the clinic, this’ll have to stop.
But in the meantime…