EE information theory is guiding improved ways to model and compress data

November 2014

A team led by Stanford electrical engineers has compressed a completely sequenced human genome to just 2.5 megabytes, small enough to attach to an email. The engineers used what is known as reference-based compression, which relies on a human genome sequence that is already known and available, so that only the differences from that reference need to be stored. Their result improves on the previous record by 37 percent. The genome the team compressed was that of James Watson, who co-discovered the structure of DNA more than 60 years ago.
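The idea behind reference-based compression can be illustrated with a minimal Python sketch. It is not the team's algorithm: it assumes the two sequences have equal length and differ only by single-base substitutions, whereas real genomes also contain insertions and deletions. The function names and the short example sequences are hypothetical.

```python
def diff_against_reference(reference: str, target: str) -> list[tuple[int, str]]:
    """Record only the positions where the target differs from the reference.

    Simplifying assumption: equal-length sequences, substitutions only.
    """
    if len(reference) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def rebuild_from_reference(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Recover the target exactly by applying the stored substitutions."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)


# Made-up toy sequences, far shorter than any real genome.
reference = "ACGTACGTACGT"
target = "ACGTACCTACGA"

diffs = diff_against_reference(reference, target)
restored = rebuild_from_reference(reference, diffs)
print(diffs)
print(restored == target)
```

Because two human genomes agree at the vast majority of positions, the list of differences is far shorter than the sequence itself, and that is the source of the savings.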

"On the surface, this might not seem like a problem for electrical engineers," said Tsachy Weissman, an associate professor of Electrical Engineering. "But our work in information theory is guiding the development of new and improved ways to model and compress the incredibly voluminous genomic data the world is amassing." In addition to Weissman, the team included Golan Yona, a senior research engineer in Electrical Engineering, and Dmitri Pavlichin, a postdoctoral scholar in Applied Physics and Electrical Engineering.

The team is also working on quality scores, the values a DNA sequencer records to indicate its confidence in each base it reads. In recording these scores, sequencers introduce all sorts of imperfections that are collectively considered "noise," and different sequencers have different noise characteristics. Weissman and his team are developing theory and algorithms for processing the quality scores in a way that reduces the noise and at the same time results in significant compression. Counterintuitive as it might sound at first, they are using lossy compression as a mechanism not only for considerably reducing storage requirements, but also for enhancing the integrity of the data.

"But, in fact, it is quite intuitive," Weissman said. "Lossy compression, when done right, forces the compressor to discard the part of the signal which is hardest to compress, namely, the noise."
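A rough Python sketch shows why discarding fine detail in quality scores makes them cheaper to store. It uses plain uniform binning as a stand-in, which is an assumption made for illustration and not the method Weissman's team is developing. The scores are made up, and the sketch only measures how much information remains to be stored, not whether the data became more accurate.

```python
import math
from collections import Counter


def quantize(scores: list[int], bin_width: int = 10) -> list[int]:
    """Lossy step: replace each score with the midpoint of its bin."""
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


def empirical_entropy_bits(symbols: list[int]) -> float:
    """Average bits per symbol an ideal lossless coder would need."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical quality scores for twelve bases.
scores = [38, 40, 39, 37, 12, 40, 38, 36, 2, 39, 40, 35]
coarse = quantize(scores)

print(coarse)
print(empirical_entropy_bits(scores), empirical_entropy_bits(coarse))
```

Small fluctuations among the high scores collapse into a single value, so the quantized sequence has fewer distinct symbols and lower entropy, while the large drops that mark low-confidence bases survive.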

 

For the full story, visit engineering.stanford.edu/news/making-personalized-medicine-practical

 

Image credit: Rod Searcey