Stanford EE

SPLASH: Statistically Primary aLignment Agnostic Sequence Homing

Summary
Prof Julia Salzman (Stanford)
Packard 202
Date(s): Oct 27
Content

Abstract: Today, analyses of genomic data are conducted using a paradigm that originated decades ago, in a different era of data generation and availability. This paradigm requires specialized workflows that cannot be formally analyzed and that are a barrier to entry for mathematicians and statisticians. Further, it fundamentally limits the scope of biological discovery from genomic data. I will illustrate these biological problems in genomics with examples from detecting viral strain variation and variation in the immune system. I will then discuss a new unifying paradigm for multiple seemingly disparate areas of genomic analysis: SPLASH (Statistically Primary aLignment Agnostic Sequence Homing). SPLASH treats genomic data as it is: a discrete data analysis problem on strings of A/C/G/T. I will present the formulation of the problem SPLASH seeks to address, the basic (and simple!) idea behind the algorithm, and the simple new statistical tests, of independent interest, that were developed to give SPLASH appropriate statistical power. The novel component of these tests is their use of concentration inequalities, which provide exact p-values in finite samples. Time permitting, I will discuss new open problems generated by the SPLASH framework. This is joint work with many co-authors, including the group of Sebastian Deorowicz, which has implemented SPLASH.

Bio: Julia Salzman is an Associate Professor in the Departments of Biomedical Data Science, Biochemistry, and Statistics (by courtesy). She received her A.B. in Mathematics, magna cum laude, from Princeton University and her Ph.D. in Statistics from Stanford University, supervised by Dr. Persi Diaconis. As a postdoctoral scholar in Dr. Patrick Brown’s lab, Dr. Salzman developed statistical algorithms that led to the discovery of the ubiquitous expression of circular RNA, which other computational and experimental approaches had missed for decades. Her research spans the interface of statistical methodology and genomics, aiming to use data-driven experiments to uncover organizing principles of biological regulation, with a historical focus on RNA processing. Recently, her group introduced a new approach to sequencing analysis called SPLASH, which performs inference on raw sequencing data, bypassing genome alignment. This approach is providing new insights into genome regulation in several biological domains.