While DNA sequencing technology has undergone a major revolution, the read lengths afforded by most current-generation sequencing technologies remain short (in the range of hundreds of DNA base-pairs). These short reads are stitched together by algorithms that exploit the overlaps between reads in order to reconstruct the DNA. Repeated regions of the DNA longer than the read length, which are common, cannot be resolved with this technology; resolving them requires sequencers capable of accurate long reads. Nanopore sequencing promises to address this problem by increasing read lengths by orders of magnitude (up to 10K-100K bases). However, nanopore sequencers built to date have higher error rates than short-read technologies, which, if unresolved, could limit their applications. Realizing this potential poses many algorithmic challenges, because of the many transformations between the DNA nucleotide sequence and the nanopore current output. Impairments in nanopore sequencers include inter-symbol interference (multiple bases affect each observation), random channel variations, insertions/deletions and noise. In this talk we develop signal-level mathematical models for the nanopore channel, which allow us to derive information-theoretic bounds on its decoding capability. We apply these to experimental nanopore sequencer data to develop a preliminary understanding of the trade-offs in their performance. We also use the insights from this modeling to develop novel nanopore alignment techniques, which we evaluate on real datasets.
This talk is joint work with Wei Mao, Dhaivat Joshi, Sreeram Kannan and Shunfu Mao.
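
To make the impairments listed in the abstract concrete, the following is a minimal toy simulation of a nanopore-like channel. It is an illustrative sketch only and is not the signal-level model developed in the talk. The k-mer length `K`, the level mapping `kmer_level`, the function `simulate_channel`, and all parameter values (`p_del`, `max_dwell`, `fade_sd`, `noise_sd`) are assumptions made up for this example.

```python
import random

# Assumed k-mer length: each current level depends on K consecutive bases,
# which is a simple stand-in for inter-symbol interference.
K = 3
BASES = "ACGT"


def kmer_level(kmer):
    """Map a k-mer to a made-up mean current level in [0, 1).

    This is a base-4 encoding scaled to the unit interval, chosen only for
    illustration. It is not a measured pore model.
    """
    value = 0
    for base in kmer:
        value = value * 4 + BASES.index(base)
    return value / 4 ** len(kmer)


def simulate_channel(seq, rng, p_del=0.05, max_dwell=3,
                     fade_sd=0.05, noise_sd=0.02):
    """Return a list of noisy current samples for the DNA string `seq`."""
    samples = []
    for i in range(len(seq) - K + 1):
        # Deletion: with probability p_del this k-mer yields no observation.
        if rng.random() < p_del:
            continue
        # Inter-symbol interference: the level depends on K bases at once.
        level = kmer_level(seq[i:i + K])
        # Random channel variation: a per-k-mer multiplicative gain.
        gain = 1.0 + rng.gauss(0.0, fade_sd)
        # Random dwell time: repeated samples of one k-mer behave like
        # insertions from the decoder's point of view.
        dwell = rng.randint(1, max_dwell)
        for _ in range(dwell):
            # Additive measurement noise on every sample.
            samples.append(gain * level + rng.gauss(0.0, noise_sd))
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    trace = simulate_channel("ACGTACGTTGCA", rng)
    print(len(trace), "samples")
```

Setting `p_del=0`, `max_dwell=1`, `fade_sd=0` and `noise_sd=0` switches off every impairment except inter-symbol interference, so the output reduces to one clean level per k-mer. Turning the parameters back on one at a time shows how each impairment separately obscures the underlying sequence.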