
Rethinking Cloud System Software and Abstractions for True Elasticity in the Cloud-Native Era
CoDa E160 Fortinet Seminar Room
Abstract: Resource elasticity is fundamental to cloud computing. The more quickly a cloud platform can allocate resources to match the demand of each user request as it arrives, the fewer resources need to be pre-provisioned to meet performance requirements. However, even serverless computing platforms, which can boot sandboxes in tens to hundreds of milliseconds, are not sufficiently elastic to avoid over-provisioning expensive resources (e.g., Knative over-provisions memory by 16x for the Azure Functions trace to keep warm sandboxes available and minimize cold starts). A key obstacle to true elasticity is that today’s cloud platforms are stuck retrofitting system software designed for a more traditional execution model of cloud computing, one based on long-running virtual machines that provide each user application with a POSIX-like interface.
While providing a POSIX interface was important in the early days of cloud computing to ease migration from on-premises clusters, today developers build cloud-native applications, in which user-provided computations interact with a variety of cloud services (e.g., storage, AI inference, data analytics engines) over REST APIs. This talk will explore redesigning the cloud-native application programming interface and how doing so enables co-designing a much more efficient and elastic execution system. I will present Dandelion, a new elastic cloud platform with a declarative programming model, in which applications are expressed as DAGs of pure compute functions and HTTP-based communication functions. This allows Dandelion to securely execute compute functions in lightweight sandboxes that can start in hundreds of microseconds, since pure functions do not rely on extra software environments such as a guest OS to provide a POSIX interface. Dandelion makes it practical to boot a sandbox on demand for every request, decreasing performance variability by two to three orders of magnitude compared to Firecracker and reducing committed memory by 96% on average when running the Azure Functions trace. I will discuss the implications of true elasticity for cloud applications such as interactive data analytics and emerging agentic AI workflows.
Bio: Ana Klimovic is an Assistant Professor in the Systems Group of the Computer Science Department at ETH Zurich. Her research interests span operating systems, computer architecture, and their intersection with machine learning. Ana's work focuses on computer system design for large-scale applications such as cloud computing services, data analytics, and machine learning. Before joining ETH in August 2020, Ana was a Research Scientist at Google Brain and completed her Ph.D. in Electrical Engineering at Stanford University.