We describe and demonstrate a novel foundation for datacenter communication: a new "event-based" protocol that can dispense with the need for conventional heartbeats and timeouts at the network layer, paving a new path to efficient recovery for distributed algorithms as they scale. We then show how this protocol can be composed into arbitrary graph-based distributed communications for application infrastructures.
A modern datacenter has many thousands of servers connected to each other through many hundreds of switches. The switches are usually configured into a spanning tree with the servers at the leaves. While this approach simplifies routing, it has some serious shortcomings. For example, servers don't know when an interior link has failed, so they use timeouts as a way of guessing that they need to fail over to a new path. Since failovers generate lots of messages, they often result in other servers timing out. Sometimes, several minutes elapse before the system quiesces and normal operations can resume. These failover and latency storms would be tolerable if they were rare, but they have been observed to occur several times a day in modern datacenters.
The Earth Computing Network Fabric (ECNF) takes a different approach. An ECNF segment within a datacenter has no switches within its own fabric. Instead, each cell combines the compute functions of a server with the routing functions of a switch. Each cell has multiple ports (7±2), and each port of a cell is directly connected to a port of another cell via a link. Because the link is a dedicated channel between exactly two cells, we can use the Earth Computing Link Protocol (ECLP) instead of standard protocols, such as Ethernet or TCP/IP.
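The topology described above can be sketched as a toy model. This is only an illustration of the structure (cells with ports, each port wired to exactly one peer over a dedicated link), not an implementation of ECLP, whose details this abstract does not give. The names `Cell`, `Link`, `on_link_event`, and `build_triangle` are hypothetical, and the sketch simply assumes that both endpoints of a link learn of its failure as a local event.

```python
class Cell:
    """Hypothetical cell: computes and forwards, and owns a small set of ports."""

    def __init__(self, name):
        self.name = name
        self.ports = {}    # port number -> Link attached to that port
        self.events = []   # link events this cell has observed locally

    def on_link_event(self, port, status):
        # Assumption: a link failure is delivered to the cell as an event,
        # so the cell does not have to infer it from a timeout.
        self.events.append((port, status))

    def live_neighbors(self):
        # Names of the cells reachable over links that are still up.
        return sorted(
            link.peer_of(self).name for link in self.ports.values() if link.up
        )


class Link:
    """Dedicated channel between exactly two cells (one port on each)."""

    def __init__(self, cell_a, port_a, cell_b, port_b):
        self.ends = ((cell_a, port_a), (cell_b, port_b))
        self.up = True
        cell_a.ports[port_a] = self
        cell_b.ports[port_b] = self

    def peer_of(self, cell):
        (a, _), (b, _) = self.ends
        return b if cell is a else a

    def fail(self):
        # Only the two cells attached to this link are notified.
        self.up = False
        for cell, port in self.ends:
            cell.on_link_event(port, "down")


def build_triangle():
    """Three cells, each directly linked to the other two; no switches."""
    a, b, c = Cell("a"), Cell("b"), Cell("c")
    ab = Link(a, 0, b, 0)
    Link(b, 1, c, 0)
    Link(c, 1, a, 1)
    return a, b, c, ab
```

In this model, calling `ab.fail()` records a "down" event on cells `a` and `b` only, and each can still reach the other through `c`. In the spanning-tree case described earlier, by contrast, a server at a leaf has no direct view of an interior link.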
In this talk, we'll explain the problems with the way datacenters are built today and show how the Earth Computing design avoids many difficult problems while providing additional functionality.
If you can, attend this talk live. Following the formal presentation, we are planning to demonstrate and discuss "interesting things" off camera. For example, questions on the proprietary nature of the implementation will be addressed only during the extended session, when the camera is turned off.
The following background may be helpful to computer science students unfamiliar with the nature of time in physics:
- 2014 EE380 Talk (an introduction to the physics of time for Computer Science).
- 2016 PWL Talk ("Lamport's Unfinished revolution", starts at timestamp 32:30).
Paul Borrill is founder and CEO of EARTH Computing, and is a leading industry expert on the foundations of IT infrastructures. He has served on Apple's Infrastructure team; as VP/CTO for VERITAS Software; as VP/Chief Architect for Storage Systems at Quantum Corporation; and as Distinguished Engineer, Director of Architecture & Performance, and Chief Scientist of Information Resources at Sun Microsystems. He founded the Storage Networking Industry Association (SNIA) and chaired it for its first two years, and he served as Vice President of Technical Activities, Vice President of Standards, and member of the Governing Board of the IEEE Computer Society. His lifelong interest in dependable computing came from working with NASA, designing computer systems and software for an experiment that performed extraordinarily well on flight 51F of the Space Shuttle. Paul earned his Ph.D. in physics from University College London and is a graduate of the Stanford Executive Program.
Alan Karp is Principal Architect at EARTH Computing, where he specializes in secure distributed systems and usable security. He was a Principal Scientist in the Office of the CTO of HP's Enterprise Services organization, where he was the Technical Architect for the Enterprise Services On-demand solution broker. Previously, he worked in HP Labs on usable security. Dr. Karp was Senior Technical Contributor and Chief Scientist at Hewlett-Packard's E-speak Operation, the group responsible for bringing HP's e-speak technology to market. He was also one of the architects of the chips in Intel's Itanium processor line. Dr. Karp received his Ph.D. in Astronomy from the University of Maryland, spent two years in the General Sciences Department at IBM Research, and spent one year as an assistant professor of physics at Dartmouth College before joining IBM's Palo Alto Scientific Center. He moved to Hewlett-Packard Laboratories when IBM closed its Scientific Center. He has published over 100 papers and conference proceedings and holds more than 70 patents.