We describe and demonstrate a novel foundation for datacenter communication: a new "event based" protocol that can dispense with conventional heartbeats and timeouts at the network layer, paving a new path to efficient recovery for distributed algorithms as they scale. We then show how this can be composed into arbitrary graph-based distributed communications for application infrastructures.
A modern datacenter has many thousands of servers connected to each other through many hundreds of switches. The switches are usually configured into a spanning tree with the servers at the leaves. While this approach simplifies routing, it has some serious shortcomings. For example, servers don't know when an interior link has failed, so they use timeouts as a way of guessing that they need to fail over to a new path. Since failovers generate lots of messages, they often result in other servers timing out. Sometimes, several minutes elapse before the system quiesces and normal operations can resume. These failover and latency storms would be tolerable if they were rare, but they have been observed to occur several times a day in modern datacenters.
The Earth Computing Network Fabric (ECNF) takes a different approach. An ECNF segment within a datacenter has no switches within its own fabric. Instead, each cell combines the compute functions of a server with the routing functions of a switch. Each cell has multiple ports (7±2), and each port of a cell is directly connected to a port of another cell via a link. Because the link is a dedicated channel between exactly two cells, we can use the Earth Computing Link Protocol (ECLP) instead of standard protocols, such as Ethernet or TCP/IP.
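As a rough illustration of the topology just described, the sketch below models cells whose ports are wired point to point, with no switches in between. It is not the ECLP itself, which is not specified here. All names (`Cell`, `Port`, `Link`, `connect`, `fail`, `on_link_down`) are hypothetical, and the idea that a link failure is reported directly to the two cells at its ends is an assumption made only for this example.

```python
MAX_PORTS = 9  # upper end of the "7±2" port count mentioned in the text


class Port:
    """One port on a cell; holds at most one link."""
    def __init__(self, owner, index):
        self.owner = owner
        self.index = index
        self.link = None


class Link:
    """A dedicated channel between exactly two ports."""
    def __init__(self, a, b):
        self.a = a
        self.b = b
        self.up = True

    def peer_of(self, port):
        return self.b if port is self.a else self.a


class Cell:
    """Combines compute and routing; hypothetical model, not the real design."""
    def __init__(self, name, num_ports=7):
        if not 1 <= num_ports <= MAX_PORTS:
            raise ValueError("port count out of range")
        self.name = name
        self.ports = [Port(self, i) for i in range(num_ports)]
        self.events = []  # local log of link events seen by this cell

    def free_port(self):
        for port in self.ports:
            if port.link is None:
                return port
        raise RuntimeError(f"cell {self.name} has no free port")

    def neighbors(self):
        # Names of cells reachable over links that are currently up.
        return [p.link.peer_of(p).owner.name
                for p in self.ports if p.link is not None and p.link.up]

    def on_link_down(self, port):
        # Assumption: the cell learns of the failure as an event on its own port.
        self.events.append(("link_down", port.index))


def connect(x, y):
    """Wire the first free port of x to the first free port of y."""
    if x is y:
        raise ValueError("a link joins two distinct cells")
    pa, pb = x.free_port(), y.free_port()
    link = Link(pa, pb)
    pa.link = pb.link = link
    return link


def fail(link):
    """Mark a link as down and notify only the two cells it joins."""
    link.up = False
    for port in (link.a, link.b):
        port.owner.on_link_down(port)
```

For example, with three cells wired in a line (`ab = connect(a, b)`, then `connect(b, c)`), calling `fail(ab)` records an event on `a` and `b` only, and `b.neighbors()` then lists just `c`. In this toy model every link has an endpoint cell on each side, so there is no interior link hidden from the cells that use it.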
In this talk, we'll explain the problems with the way datacenters are built today and show how the Earth Computing design avoids many difficult problems while providing additional functionality.
If you can, attend this talk live. Following the formal presentation, we plan to demonstrate and discuss "interesting things" off camera. For example, questions on the proprietary nature of the implementation will be addressed only during the extended session, when the camera is turned off.
The following background may be helpful to computer science students unfamiliar with the nature of time in physics: