Memory systems pose many challenges in computer architecture. A fundamental issue is the worsening hardware variability and environmental sensitivity caused by manufacturing difficulties at nanometer-scale technology nodes. As a consequence, memories often limit the resiliency and energy efficiency of computing platforms ranging from embedded systems to cloud servers and supercomputers. To help address these challenges, I propose the design of opportunistic memory systems that exploit and cope with hardware variation to improve both resiliency and energy efficiency.
Most of my talk will focus on Software-Defined Error-Correcting Codes (SDECC), a new methodology that co-designs ECC hardware with system software. This approach can heuristically recover from detected-but-uncorrectable errors (DUEs) in memory up to roughly 90% of the time. It does so without any overhead in parity storage, decoding latency, or decoding energy in the common case when no DUE occurs. The software-based recovery policy leverages knowledge of the hardware ECC implementation and a small amount of side information about the application's instructions and data. With common codes such as SECDED and ChipKill, Software-Defined ECC could prevent many system crashes and time- and energy-intensive checkpoint rollbacks that would otherwise result from memory DUEs. The technique could be especially useful for improving the resiliency of supercomputers and safety-critical real-time embedded systems by increasing their mean time to failure.
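To make the recovery idea concrete, here is a minimal sketch under stated assumptions. It uses a toy (8,4) extended Hamming SECDED code instead of a production (72,64) SECDED or ChipKill code, and it models the application side information as a caller-supplied scoring function. When a DUE occurs, the sketch enumerates every codeword at Hamming distance 2 from the received word (the originals that a double-bit flip could have produced) and keeps the one whose data scores highest. The bit layout, all function names, and the "similar to a neighboring word" heuristic are hypothetical illustrations, not the author's implementation.

```python
from itertools import combinations

N = 8  # bit positions 0..7; position 0 holds the overall (SECDED) parity bit


def encode(data):
    """Encode 4 data bits into a toy (8,4) extended Hamming codeword."""
    d0, d1, d2, d3 = data
    c = [0] * N
    c[3], c[5], c[6], c[7] = d0, d1, d2, d3
    c[1] = d0 ^ d1 ^ d3  # parity over positions 3, 5, 7
    c[2] = d0 ^ d2 ^ d3  # parity over positions 3, 6, 7
    c[4] = d1 ^ d2 ^ d3  # parity over positions 5, 6, 7
    c[0] = sum(c[1:]) % 2  # overall parity makes the codeword weight even
    return c


def syndrome(word):
    """XOR of the indices of all set bits; 0 for every valid codeword."""
    s = 0
    for i, bit in enumerate(word):
        if bit:
            s ^= i
    return s


def parity(word):
    return sum(word) % 2


def data_bits(word):
    return [word[3], word[5], word[6], word[7]]


def classify(word):
    s, p = syndrome(word), parity(word)
    if s == 0 and p == 0:
        return "ok"
    if p == 1:
        return "ce"  # single-bit error: correctable by the hardware decoder
    return "due"  # even parity, nonzero syndrome: detected but uncorrectable


def candidate_codewords(word):
    """All valid codewords reachable from `word` by flipping exactly two bits."""
    out = []
    for i, j in combinations(range(N), 2):
        trial = list(word)
        trial[i] ^= 1
        trial[j] ^= 1
        if syndrome(trial) == 0 and parity(trial) == 0:
            out.append(trial)
    return out


def recover(word, score):
    """Hypothetical software policy: return the candidate data with the best score."""
    best = max(candidate_codewords(word), key=lambda c: score(data_bits(c)))
    return data_bits(best)


if __name__ == "__main__":
    original = [1, 0, 1, 1]
    received = encode(original)
    received[2] ^= 1
    received[5] ^= 1  # two flipped bits: a DUE for SECDED
    print(classify(received))  # prints "due"
    # Hypothetical side information: data in a neighboring word, assumed similar.
    neighbor = [1, 0, 1, 1]
    similarity = lambda d: -sum(a != b for a, b in zip(d, neighbor))
    print(recover(received, similarity))  # prints [1, 0, 1, 1]
```

For this toy code, every double-bit error leaves exactly four candidate codewords, so a blind guess would succeed only one time in four; the side information is what lets the policy do better.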
Mark Gottscho is a PhD candidate in the Electrical Engineering Department at UCLA, advised by Prof. Puneet Gupta. His current research interests in computer architecture focus on memory systems and hardware reliability. He received his BS in 2011 and his MS in 2014, both from the UCLA EE department. In 2014, his MS work earned an Honorable Mention from the NSF Graduate Research Fellowship Program as well as the department's Outstanding Master's Research Award. This year, he received a UCLA fellowship for his PhD work and the highly competitive Qualcomm Innovation Fellowship for his proposal on Software-Defined ECC. He is a student member of both the IEEE and the ACM.