



The title of the article on the new $50,000 ION Torrent machine by Kevin Davies at Bio-IT World says it all: “Watson meets Moore” as Ion Torrent Introduces Semiconductor Sequencing




Ryan McBride from Xconomy posted this nice article on GQ yesterday. After our discussion, he asked me if the “Google of Genomics” metaphor applied, and I did not deny it, thus the title: “GenomeQuest Wants to Be the Google of DNA Data Searches”
GenomeWeb reports on the HHS advisory group’s proposal to “limit the ability of holders of gene patents to keep others from using those genes for diagnostic and research purposes.” The GLR recaps the evolution of the debate and sponsors an interesting dialog in a recent post, Up Next in Gene Patents: Waiting for a Ruling (Again) and SACGHS Meets (Again).
Gene patents were also a hot topic at the Molecular Medicine Tri-Conference last week in San Francisco. One of the talks was “Gene Patents in Molecular Diagnostics: Valuable Assets or Impediments?” The speaker, Frances Toneguzzo, Ph.D., Director of Corporate Research and Licensing at MGH, brought up an interesting perspective. She argued that limiting patents on genes is a slippery slope, since other forms of biomarker patents, such as “image biomarkers,” could also lose patent protection.
If you are a developer or a technical type, this one is for you.
Over at Depth-First there is a blog post about an application in the cheminformatics field: PubCouch: Streams aren’t just for Pipeline Pilot. The author illustrates how a well-abstracted Web service avoids the costly database Extract-Transform-Load operations so familiar to most life science development. In the example, the author streams the entire contents of the PubChem FTP server to PubCouch, a Web service based on CouchDB, a NoSQL-style document-oriented database. CouchDB doesn’t rely on a predefined relational schema; instead, it computes the PubChem relationships “on-the-fly” using an approach based on MapReduce.
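For readers who have not met MapReduce before, here is a minimal sketch of the idea in plain Python. It is not PubCouch or CouchDB code: the records, field names, and the build_view helper are all made up for illustration. It only shows how a map step that emits key/value pairs and a reduce step that aggregates them can derive a relationship from raw documents without a separate load-and-transform stage.

```python
from collections import defaultdict

# Hypothetical compound records. The field names are illustrative only
# and are not the PubChem or PubCouch schema.
documents = [
    {"cid": 1, "formula": "C2H6O", "name": "ethanol"},
    {"cid": 2, "formula": "C2H6O", "name": "dimethyl ether"},
    {"cid": 3, "formula": "H2O", "name": "water"},
]


def map_formula(doc):
    """Map step: emit one (key, value) pair per document."""
    yield doc["formula"], doc["cid"]


def reduce_ids(values):
    """Reduce step: collapse all values for one key into a sorted list."""
    return sorted(values)


def build_view(docs, map_fn, reduce_fn):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


# Compounds that share a molecular formula, derived straight from the documents.
view = build_view(documents, map_formula, reduce_ids)
print(view)
```

Swapping in a different map function (say, keyed on name instead of formula) gives a different relationship over the same documents, which is the appeal of the approach.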
So what, you say?
The vision is this: since modern Web-based programming (aka RESTful architecture) hides the details of massive data and computing resources, programmers can focus on “what to do” rather than “how to do it,” and that increases productivity.
GenomeQuest’s developers have thought deeply about what a scalable computational biology engine should look like in the cloud-based, MapReduce paradigm. If you want to read a primer on the GQ Engine, feel free to check it out.
Soon, we’ll publish the full-blown URL API so that large-scale biological data and computation can be assembled from any Internet-connected desktop, using the language of the Web. A command line interface to our Web services can be found here.
A final remark: Deepak Singh from business|bytes|genes|molecules wonders aloud what the role of Pipeline Pilot is in this new programming paradigm. I’m guessing that within a domain the value proposition might be limited, but across domains these tools will continue to solve even bigger problems by leveraging better-designed Web services.
A plot of the Evolution of Computer Capacity and Costs shows that compute power will be 1,000X cheaper in 10 years. How much lower can it go? As this happens, the relative cost of managing another computer approaches zero, regardless of whether it’s hosted internally or externally. I don’t think there is an economic argument that shows everyone belongs on the cloud based just on hardware and system administration cost.
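To put the 1,000X figure in annual terms, here is a quick back-of-the-envelope calculation. It assumes the decline is a steady exponential over the 10 years, which is my simplification and not something the plot itself guarantees.

```python
import math


def annual_factor(total_factor, years):
    """Yearly improvement factor implied by a total factor over a number of years."""
    return total_factor ** (1.0 / years)


def halving_time(total_factor, years):
    """Years needed for the cost to halve, assuming a steady exponential decline."""
    return years * math.log(2) / math.log(total_factor)


factor = annual_factor(1000, 10)
halving = halving_time(1000, 10)
print(f"Implied yearly improvement: about {factor:.2f}x")
print(f"Implied cost-halving time: about {halving:.2f} years")
```

In other words, 1,000X in 10 years works out to the cost of compute roughly halving every year.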
Dave Dooling at PolITiGenomics finds two good reasons for considering cloud options: when organizations have peak demands for compute power and when limitations on space/power/cooling preclude building a system in-house. These are two good reasons, but hardly enough to justify all the cloud computing hype.
So, what’s the argument for cloud computing?
Unlike computing, which gets cheaper every year, people cost more every year. So it makes sense to evaluate the annual software development and maintenance costs; the cost of managing the reference databases; the cost of integrating and maintaining new applications; the productivity of the end users; and how to change the ratio of end users to support programmers from 2-to-1 to 10-to-1 or 20-to-1. Cloud computing defined as “Infrastructure” (computers, networks, and storage) doesn’t alleviate these costs.
Variant reports are not the right deliverable for a re-sequencing study.
MassGenomics, a well-written technical blog by Dan Koboldt, illustrates why. Dan says, “What’s more, with the advent of next-generation sequencing, I hate to tell you, but people are going to be reporting a lot of false positives. I guarantee it. So when you filter all of the variants, you might actually remove the ones you’re looking for.”
It’s easy to see why researchers are not enthusiastic about tabular reports. They want to get into the data on their own, without intermediaries, and they want software to facilitate that, not get in the way.
Our concept, “Sequence Data Management” (SDM), doesn’t fit the primary/secondary/tertiary analysis informatics categories. Why? Because we’ve coupled the alignment step with the analysis step in one shot. Why is that better? Biologists can compute the data on their own and mine it in an easy-to-use web application. They are able to “finish the pipeline on their own”. Hopefully, this will lead to more interesting biological conclusions and more enthusiastic end users of NGS technology.
At the CHI NGS conference, I chaired a roundtable of key managers and influencers discussing the opportunities and challenges in adopting “cloud computing” for NGS applications. As a first observation, the session was well attended and people are thinking deeply about cloud issues. About 16 people participated, including representatives from major pharmaceutical companies, agroscience, major medical research core labs, and the NIH.
Here is a transcript of my notes from the roundtable:
My takeaways? Cloud computing is becoming viable in the minds of the industry. A few solvable roadblocks remain. With infinite computing and infinite data, managing the data and turning it into insight remains the challenge and the opportunity.
We’ve heard lots of requests from customers not only to provide them with powerful methods for detecting variants across multiple experiments (or phenotypes, or organisms, or lines), but also to unify all of this data to find knowledge that spans these experiments.
Of course we have our variant calling workflow, just as we integrate with other variant calling workflows. All of these produce GenomeQuest-native browsable, mineable, and queryable databases. And because of the GQ Engine, we can easily combine sets of 10s or 100s or even 1,000s of these variant databases into a single queryable entity with “web-speed query performance.”
Nevertheless, while our customers get the benefit of the combined data, they often ask for more. So today I jumped into the APIs of GenomeQuest and tried to address the simple problem of building a table of SNPs that span a series of experiments. Each SNP should have the specific allele called for each experiment in which it was found. A simple little table designed to be the input into any of a number of linkage disequilibrium mapping packages. I made a GQ Plug-in: 5 lines of code to make it accessible in the user interface, and another 100 lines of code (I’m wordy) on the back-end to build the table and present it. And so, the multi-experiment haplotype table is born. I might even convince the development team to include it in our next live push.
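To give a feel for the shape of that table, here is a minimal, self-contained sketch in plain Python. To be clear, this is not the plug-in code and it does not use the GenomeQuest APIs: the input layout, the build_snp_table function, and the "-" placeholder for a SNP not found in an experiment are all assumptions made for illustration.

```python
def build_snp_table(calls_by_experiment, missing="-"):
    """Build rows of [chrom, pos, allele per experiment...] from per-experiment calls.

    calls_by_experiment maps an experiment name to a dict of
    (chrom, pos) -> called allele. This layout is hypothetical.
    """
    experiments = sorted(calls_by_experiment)
    # Union of every SNP position seen in any experiment.
    positions = sorted(
        {snp for calls in calls_by_experiment.values() for snp in calls}
    )
    rows = [["chrom", "pos"] + experiments]
    for chrom, pos in positions:
        alleles = [
            calls_by_experiment[exp].get((chrom, pos), missing)
            for exp in experiments
        ]
        rows.append([chrom, pos] + alleles)
    return rows


# Made-up calls for three experiments.
calls = {
    "exp_A": {("chr1", 1500): "A", ("chr2", 300): "T"},
    "exp_B": {("chr1", 1500): "G"},
    "exp_C": {("chr1", 1500): "A", ("chr1", 90): "C"},
}

table = build_snp_table(calls)
for row in table:
    print("\t".join(str(cell) for cell in row))
```

Each row is one SNP and each column after the position is one experiment, which is the tab-delimited layout many downstream packages can be coaxed into reading.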
If you want to hear more or check out the code, drop me a line.
Richard J. Resnick
VP Software and Services