As machine learning models are trained on ever-larger and more complex datasets, it has become standard to distribute this training across multiple physical computing devices. Such an approach offers a number of potential benefits, including reduced training time and storage needs due to parallelization. Distributed stochastic gradient descent (SGD) is a common iterative framework for training machine learning models: in each iteration, local workers compute parameter updates on a local dataset. These are then sent to a central server, which aggregates the local updates and pushes global parameters back to local workers to begin a new iteration. Distributed SGD, however, can be expensive in practice: training a typical deep learning model might require several days and thousands of dollars on commercial cloud platforms. Cloud-based services that allow occasional worker failures (e.g., locating some workers on Amazon spot or Google preemptible instances) can reduce this cost, but may also reduce the training accuracy. We quantify the effect of worker failure and recovery rates on the model accuracy and wall-clock training time, and show both analytically and experimentally that these performance bounds can be used to optimize the SGD worker configurations. In particular, we can optimize the number of workers that utilize spot or preemptible instances. Compared to heuristic worker configuration strategies and standard on-demand instances, we dramatically reduce the cost of training a model, with modest increases in training time and the same level of accuracy. Finally, we discuss implications of our work for federated learning environments, which use a variant of distributed SGD. Two major challenges in federated learning are unpredictable worker failures and a heterogeneous (non-i.i.d.) distribution of data across the workers, and we show that our characterization of distributed SGD's performance under worker failures can be adapted to this setting.
The ISL Colloquium meets weekly during the academic year. Seminars are each Thursday at 4:30pm PT, unless indicated otherwise.
Until further notice, the ISL Colloquium convenes exclusively via Zoom (on Thursdays at 4:30pm PT) due to the ongoing pandemic. To avoid "Zoom-bombing", we ask attendees to input their email address here https://stanford.zoom.us/meeting/register/tJckfuCurzkvEtKKOBvDCrPv3McapgP6HygJ to receive the Zoom meeting details via email.
Carlee Joe-Wong is an Assistant Professor of Electrical and Computer Engineering at Carnegie Mellon University. She received her A.B., M.A., and Ph.D. degrees from Princeton University in 2011, 2013, and 2016, respectively. Dr. Joe-Wong's research is in optimizing networked systems, particularly on applying machine learning and pricing to data and computing networks. From 2013 to 2014, she was the Director of Advanced Research at DataMi, a startup she co-founded from her Ph.D. research on mobile data pricing. She has received a few awards for her work, including the ARO Young Investigator Award in 2019, the NSF CAREER Award in 2018, and the INFORMS ISS Design Science Award in 2014.