Stanford EE

GBZ File Format for Pangenome Graphs

Summary
Prof Jouni Sirén (UC Santa Cruz)
Packard 202
Date(s): Oct 6
Content

Abstract: Pangenome graphs representing aligned genome assemblies are being shared in the text-based Graphical Fragment Assembly format. As the number of assemblies grows, there is a need for a file format that can store the highly repetitive data space-efficiently. We propose the GBZ file format as a new interchange format for pangenome graphs. The format has three somewhat contradictory aims:

1) Store the graph and the paths representing the assemblies space-efficiently.

2) Enable loading the data quickly into useful in-memory data structures.

3) Keep the format simple to allow independent implementations optimized for different purposes.

The key part of the GBZ format is the GBWT data structure, which encodes the paths using the Burrows-Wheeler transform (BWT). The BWT is partitioned between the nodes of the graph, and the fragment stored in each node tells where each path visiting the node continues from that node.
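As a rough illustration of the idea above, the following Python sketch builds a toy per-node record for a handful of paths and then follows one path from node to node. It is not the GBZ or GBWT implementation and does not reflect the file layout: the names `build_records`, `follow`, and `extract`, the use of `0` as an endmarker, and the brute-force counting are all assumptions made for this example. Real implementations compress the records (for example with run-length encoding) and precompute the counts that this sketch recomputes on every step.

```python
from collections import defaultdict

ENDMARKER = 0  # hypothetical reserved id for path start/end; real node ids are > 0


def build_records(paths):
    """Return {node: [successor of each visit]} for non-empty paths.

    Visits to a node are sorted by the reversed prefix of the path that
    led to them (ties broken by path id), which is the BWT-style ordering.
    """
    visits = defaultdict(list)
    for pid, path in enumerate(paths):
        for j, node in enumerate(path):
            context = tuple(reversed(path[:j]))  # nodes before this visit, newest first
            successor = path[j + 1] if j + 1 < len(path) else ENDMARKER
            visits[node].append((context, pid, successor))
    # The endmarker record lists the first node of every path, in path order.
    records = {ENDMARKER: [path[0] for path in paths]}
    for node, items in visits.items():
        items.sort(key=lambda item: (item[0], item[1]))
        records[node] = [successor for _, _, successor in items]
    return records


def follow(records, node, offset):
    """Map visit number `offset` of `node` to the matching visit of its successor."""
    successor = records[node][offset]
    if successor == ENDMARKER:
        return None  # the path ends at this node
    # Visits that reach `successor` from a smaller predecessor sort first...
    from_smaller = sum(rec.count(successor) for u, rec in records.items() if u < node)
    # ...followed by earlier visits of `node` that also continue to `successor`.
    from_same = records[node][:offset].count(successor)
    return successor, from_smaller + from_same


def extract(records, path_id):
    """Recover the node sequence of path `path_id` by repeatedly calling follow()."""
    position = follow(records, ENDMARKER, path_id)
    path = []
    while position is not None:
        path.append(position[0])
        position = follow(records, *position)
    return path


paths = [[1, 2, 4], [1, 3, 4], [1, 2, 4, 5]]
records = build_records(paths)
```

Here `records[1]` holds the successors of the three visits to node 1, and `extract(records, 2)` walks from record to record to recover the third path. Paths that share long prefixes produce long runs of the same successor inside a record, which is what makes this representation compressible for highly repetitive collections.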

Bio: Jouni Sirén is an Associate Research Scientist at the UC Santa Cruz Genomics Institute, working at the intersection of algorithm engineering and bioinformatics. He received his PhD in Computer Science from the University of Helsinki in 2012. Prior to joining UC Santa Cruz in 2018, he was a postdoctoral researcher at the University of Chile in 2013-2014 and at the Wellcome Sanger Institute in 2015-2017. His research interests include data compression, space-efficient data structures, string algorithms, and pangenome graphs. Most of his current work is done in the context of the Human Pangenome Reference Consortium, which aims to sequence and assemble genomes from individuals from diverse populations in order to better represent the human genomic landscape.