System software support for CXL memory systems
CoDa E401
Abstract: Compute Express Link (CXL) memory allows multiple physically attached hosts to dynamically share memory at cacheline granularity. We call such a configuration a CXL pod. A pod is a hardware configuration that sits between a network of machines, each with its own private memory, and a shared-memory multiprocessor with a unified memory accessible to all machines. We believe this multi-host, shared CXL memory will be an attractive alternative to a small-scale distributed system, but such a hardware system requires software support to be successful.
We will present Tigon, a transactional database optimized for a CXL pod. Tigon synchronizes cross-host concurrent data accesses via atomic operations on CXL memory, which is more efficient than network-based approaches. However, Tigon must address the limitations of CXL memory: its higher latency and lower bandwidth relative to local memory, and its limited support for hardware cache coherence across hosts. On TPC-C and a variant of YCSB, Tigon achieves up to 2.8x higher throughput than an optimized shared-nothing database that uses CXL memory as a transport and up to 14.4x higher throughput than an RDMA-based distributed database.
We will also discuss the problem of partial failure, where processes accessing shared CXL memory can fail independently. System software should not block live threads while another process is dying or recovering. Lock-free data structures are a first step toward tolerating partial failures, but recoverability (the ability to determine whether an in-progress operation completed) is also needed.
Bio: Emmett Witchel is a professor of computer science at the University of Texas at Austin, where he has been on the faculty since 2004, after receiving his PhD from MIT. Prof. Witchel's research interests include operating systems, security, architecture, and concurrency. His recent work has been on system support for CXL memory, serverless computing, persistent memory, and trusted execution environments. He co-chaired Architectural Support for Programming Languages and Operating Systems (ASPLOS) in 2019 and was co-general chair of the Symposium on Operating Systems Principles (SOSP) in 2024. His publications have received best paper awards at both SOSP and Operating Systems Design and Implementation (OSDI), as well as IEEE Micro Top Picks and Research Highlights in Communications of the ACM (CACM). He is a Fellow of the ACM.