
DNA and RNA sequencing data analysis often begins with aligning sequencing reads to a reference genome, with the reference represented as a linear string of bases. But a linear reference leads to reference bias, a tendency to miss or misreport alignments containing non-reference alleles, which can confound downstream statistical and biological results. This is a major concern in human genomics; we do not want diagnostics and therapeutics to be differentially effective depending on a patient's genetic background.
Meanwhile, recent bioinformatics advances allow us to index and align sequencing reads to references that include many population variants. I will describe this journey, from the early days of efficient genome indexing through graph-shaped references and references that include many genomes. I will emphasize recent results, including results from my group and collaborators showing how to optimize simple and complex pan-genome representations to avoid reference bias effectively. Much of this work is in collaboration with Travis Gagie, Christina Boucher, Alan Kuhnle, and others.
Suggested Readings:
- Alignment of Next-Generation Sequencing Reads. (PDF available on web calendar).
- Computational pan-genomics: status, promises and challenges.
- FORGe: prioritizing variants for graph genomes.
- Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.
The Winter 2021 workshop will be held remotely via Zoom. Contact Katie Kanagawa (kkanagaw@stanford.edu) for Zoom dial-in details.
Bio: Benjamin Langmead, Associate Professor, Department of Computer Science, Johns Hopkins University