Enabling Effective Error Mitigation in Systems that Use Memory Chips with On-Die Error Correcting Codes
Allen 101, Linvill Room
Abstract: Generational improvements to main memory storage density are central to enabling system support for modern data-intensive workloads. Unfortunately, aggressive density scaling seriously impacts memory reliability, requiring error-mitigation mechanisms to prevent increasingly frequent memory errors from causing system-level failures. However, naïvely combining components that use different error-mitigation mechanisms breeds insidious reliability problems that can be difficult to anticipate, diagnose, and address.
In this talk, we discuss our recent efforts to understand the system-level reliability implications of DRAM on-die error-correcting codes (on-die ECC), a self-contained error-mitigation mechanism prevalent within modern DRAM chips. Through a combination of real-chip experiments, statistical analyses, and simulation, we (i) show that on-die ECC obfuscates the statistical properties of main memory errors in a manner specific to the on-die ECC implementation used by a given chip and (ii) build a detailed understanding of how this obfuscation occurs, its consequences, and how those consequences can be overcome.
Throughout our studies, we develop two new testing techniques, EIN and BEER, which provide insight into (i) a given on-die ECC implementation’s behavior and (ii) the underlying raw bit errors when the implementation fails to correct errors. Using what we learn, we contribute three new error-profiling algorithms, REAPER, BEEP, and HARP, that together enable the system to quickly and safely identify bits that are at risk of error in memory chips that use on-die ECC. Finally, we conclude by discussing the critical need for transparency in DRAM reliability characteristics so that DRAM consumers can better understand commodity DRAM chips and adapt them to their system-specific needs. We hope that the analyses, techniques, and results we provide will enable the community to better understand and tackle current and future reliability challenges and to adapt commodity memory to new, advantageous applications.
Speaker Bio: Minesh Patel recently earned his Ph.D. from ETH Zürich, where his dissertation was recognized with the William Carter Award for its progress toward understanding and addressing reliability challenges in modern DRAM chips. During his graduate studies, Minesh contributed to various topics related to the performance, reliability, and security of memory systems. As part of that work, he received best paper awards from MICRO’19 and DSN’17 for pioneering new techniques to study memory errors in modern chips. Before graduate school, Minesh earned dual B.S. degrees in physics and electrical engineering from UT Austin. His current research interests center on reconciling the disconnect between component-level and system-level dependability goals by developing new architecture- and system-level policies, interfaces, and mechanisms that enable the system to quickly and efficiently identify and achieve its (possibly time-varying) dependability requirements.