Title: Information Theory in Next-Generation DNA Sequencing Technologies
The completion of the first human genome sequence in the early 2000s represented a major success for the scientific community. It also represented the return on a significant investment: billions of dollars and a little over a decade of work. DNA sequencing costs have since fallen at a rate even faster than the exponential improvements predicted by Moore's Law in computing. Today, a complete human genome can be sequenced for a few thousand dollars in about a week. A major factor in this improvement was the shift from the serial reads used in the Human Genome Project to massively parallel sequencing technologies. These new technologies, however, present new problems: how can we understand a genome given millions of short reads drawn randomly from the underlying sequence and potentially degraded by noise in the process? Bioinformatics boasts a plethora of algorithms that meet this challenge in practice, but it is not immediately obvious what an 'optimal' analysis might look like. This talk will focus on the work of David Tse and others who have used the perspective of information theory to establish fundamental bounds on our ability to interpret next-generation sequencing output. In particular, I will focus on the problems of assembling an unknown sequence de novo and of variant calling when the population of molecules being sequenced is not homogeneous.
Title: Towards a Compression Theory for Metric Embeddings
Embedding matrices are widely used in diverse domains such as NLP, recommendation systems, information retrieval, and computer vision. For large-scale datasets, embedding matrices can consume large amounts of memory, and compression of these objects is desirable. However, the distortion metrics relevant to embeddings make them significantly more compressible than the classical sources considered in information theory. Furthermore, related results such as the Johnson-Lindenstrauss theorem are formulated in terms of reduction in dimension rather than codelength. Additionally, embeddings must often satisfy query-time constraints, in the spirit of a succinct data structure. In this talk, we formulate the problem of metric embedding compression through an information-theoretic lens and review related literature in information theory, signal processing, and dimensionality reduction. We adapt existing results to establish lower and upper bounds in certain regimes. Finally, we cover some of the state-of-the-art algorithms used to compress embeddings in practice.