Upcoming Improvements to the GenomeQuest Engine

As the product manager at GenomeQuest, I’m excited to tell you about three new features in the GQ-Engine. They add to the growing library of high-quality NGS components available to GQ platform developers and end users.

Fast local alignments of NGS reads

NGS read mappers typically align reads by trying to fit the entire read into the reference sequence. This is referred to as a global alignment, or best fit, strategy. While this works well for short genomic reads, it is not always the best solution for longer reads: as a read gets longer, the chance that it matches the reference sequence over its entire length decreases.

The shortcomings of global alignment algorithms become readily apparent in RNA-seq studies where a single read can span multiple exons. These exons can be right next to each other in the mRNA, but separated by megabases of intronic sequence on the genome. The only way to align such reads is to use a local alignment strategy that can map different parts of a read to different positions on the reference sequence.
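To make the idea concrete, here is a minimal textbook sketch of local alignment (Smith-Waterman style) applied to a toy read that spans two exons. It is not the GASSST implementation, which this post does not describe. The function names `local_align` and `split_map`, the scoring values, and the `min_score` threshold are all assumptions chosen for illustration, and the intron is written as a run of `N` characters purely so it cannot match the read by chance.

```python
def local_align(read, ref, match=2, mismatch=-1, gap=-2):
    """Best local alignment of read against ref (illustrative scoring values).

    Returns (score, read_start, read_end, ref_start, ref_end), ends exclusive.
    """
    cols = len(ref) + 1
    # Each cell holds (score, read_start, ref_start) for the best alignment ending there.
    prev = [(0, 0, j) for j in range(cols)]
    best = (0, 0, 0, 0, 0)
    for i in range(1, len(read) + 1):
        curr = [(0, i, 0)]
        for j in range(1, cols):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            diag, up, left = prev[j - 1], prev[j], curr[j - 1]
            cell = max(
                (0, i, j),                           # restart: this is what makes it local
                (diag[0] + step, diag[1], diag[2]),  # match or mismatch
                (up[0] + gap, up[1], up[2]),         # read base aligned to a gap
                (left[0] + gap, left[1], left[2]),   # reference base aligned to a gap
                key=lambda c: c[0],
            )
            curr.append(cell)
            if cell[0] > best[0]:
                best = (cell[0], cell[1], i, cell[2], j)
        prev = curr
    return best


def split_map(read, ref, min_score=8):
    """Map the best local hit, then map any leftover read prefix or suffix separately."""
    _, r0, r1, g0, g1 = local_align(read, ref)
    hits = [(r0, r1, g0, g1)]
    for lo, hi in ((0, r0), (r1, len(read))):
        if hi > lo:
            score, a0, a1, b0, b1 = local_align(read[lo:hi], ref)
            if score >= min_score:
                hits.append((lo + a0, lo + a1, b0, b1))
    return sorted(hits)


# Toy data: two exons separated by a placeholder intron of N's.
reference = "AAACCCGGG" + "N" * 10 + "GATTACA"
spliced_read = "CCCGGG" + "GATTACA"  # end of exon 1 joined to exon 2, as in mRNA

for read_start, read_end, ref_start, ref_end in split_map(spliced_read, reference):
    print(f"read[{read_start}:{read_end}] -> reference[{ref_start}:{ref_end}]")
```

The two halves of the read land at separate reference positions, which a single end-to-end alignment could not report. A production mapper would also need seeding, traceback, and a check that the pieces are in a consistent order; none of that is attempted here.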

We have added local alignment capabilities to our GASSST read mapper while keeping the existing speed and scaling. This allows us to analyze NGS-sized data sets regardless of the read length or sample source and gets us ready for PacBio and Ion Torrent. It also supports our new RNA-seq workflow that maps the transcriptome directly to the genomic reference sequence.

Improved support for Paired End (PE) read handling

We have added new ways to work with PE reads in the GQ-Engine. By examining all possible alignment combinations for a read pair, we can keep the most likely alignment pairs for further analysis. This strategy to find “happy pairs” can be parameterized on the command line and takes into account the expected distance between the reads, the orientation of the alignments (fwd/rev strands), and the number of mismatches and indels needed to align the reads at those positions. All of this happens in memory while the alignments are computed, and is much more efficient and exhaustive than the post-alignment processing strategies typically implemented by other read mappers.
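The selection logic described above can be sketched roughly as follows. This is an illustration only, not the GQ-Engine code: the `Hit` record, the `best_happy_pair` function, and the parameters `insert_size`, `tolerance`, and `max_edits` are hypothetical names, and the ranking rule (fewest edits first, then closest to the expected insert size) is an assumption.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Hit:
    """One candidate alignment of a single mate (hypothetical record)."""
    pos: int      # leftmost reference coordinate
    strand: str   # "+" or "-"
    length: int   # aligned length on the reference
    edits: int    # mismatches plus indels


def best_happy_pair(hits1, hits2, insert_size=300, tolerance=50, max_edits=4):
    """Examine every combination of candidate hits and keep the most plausible pair."""
    best = None
    for a, b in product(hits1, hits2):
        # Orientation: the mates must be on opposite strands.
        if a.strand == b.strand:
            continue
        fwd, rev = (a, b) if a.strand == "+" else (b, a)
        # The reverse-strand mate must not start upstream of the forward-strand mate.
        if rev.pos < fwd.pos:
            continue
        # Distance: outer span from the forward mate's start to the reverse mate's end.
        span = rev.pos + rev.length - fwd.pos
        if abs(span - insert_size) > tolerance:
            continue
        # Alignment quality: total mismatches and indels across both mates.
        edits = a.edits + b.edits
        if edits > max_edits:
            continue
        key = (edits, abs(span - insert_size))
        if best is None or key < best[0]:
            best = (key, a, b)
    return None if best is None else (best[1], best[2])


mate1_hits = [Hit(1000, "+", 50, 0), Hit(5000, "+", 50, 1)]
mate2_hits = [Hit(1240, "-", 50, 1), Hit(9000, "-", 50, 0)]
print(best_happy_pair(mate1_hits, mate2_hits))
```

In this toy input only the combination at positions 1000 and 1240 satisfies all three criteria. A pair for which the function returns `None` is what the post calls a non-happy pair.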

Because the PE mapping strategy is fully integrated into the GQ-Engine, we have complete flexibility in working with the results. We can, for example, decide to also keep the single-end reads that are mapped with high confidence (we will). We can also write all non-happy pairs to a separate alignment database to look for interesting events such as copy number variation or structural variation.

For our web interface users, the PE read mapping strategy will be completely transparent. When you map a PE read database, we will ask you to confirm the expected insert size and read orientation. That’s all.

Interval Indexing and Position-Based Annotation

Interval indexing within the GQ-Engine adds the ability to very quickly find the overlap between different sets of intervals. Examples of use cases are: “find all alignments overlapping with exons of known genes”, or “find all SNPs in my data set that are already known in dbSNP”. This technology will support many use cases in the GenomeQuest platform. To start, it will speed up the existing variant annotation workflow and drive the RNA-seq workflow. More applications will follow soon.
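As a rough illustration of the kind of query involved, the sketch below uses a simple fixed-size binning scheme, one common way to index genomic intervals. It is not a description of the GQ-Engine's index. The class name `IntervalIndex`, the `bin_size` default, and the exon labels are made up for the example, and intervals are assumed to be half-open (`start` inclusive, `end` exclusive) with `end > start`.

```python
from collections import defaultdict


class IntervalIndex:
    """Toy interval index: each interval is filed under every bin it touches."""

    def __init__(self, bin_size=1000):
        self.bin_size = bin_size
        self.bins = defaultdict(list)

    def _bin_range(self, start, end):
        # Bins covered by the half-open interval [start, end).
        return range(start // self.bin_size, (end - 1) // self.bin_size + 1)

    def add(self, chrom, start, end, label):
        for b in self._bin_range(start, end):
            self.bins[(chrom, b)].append((start, end, label))

    def query(self, chrom, start, end):
        """Return all stored intervals overlapping [start, end) on chrom."""
        found = set()  # a set, because long intervals are stored in several bins
        for b in self._bin_range(start, end):
            for s, e, label in self.bins.get((chrom, b), ()):
                if s < end and start < e:
                    found.add((s, e, label))
        return sorted(found)


exons = IntervalIndex()
exons.add("chr1", 1000, 1500, "geneA_exon1")
exons.add("chr1", 3000, 3200, "geneA_exon2")
exons.add("chr2", 1000, 1500, "geneB_exon1")

# "Find all alignments overlapping with exons of known genes", for one alignment:
print(exons.query("chr1", 1450, 1550))
```

Only the bins touched by the query are inspected, so the cost of a lookup depends on the query size and local interval density rather than on the total number of stored intervals. The same structure would serve the dbSNP example by storing known SNPs as one-base intervals.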

The GenomeQuest 7.1 release is planned for Friday the 8th of July 2011. I hope to see you there.

Henk Heus, Ph.D.
VP Product Management & Services
GenomeQuest Inc