Federated learning is a novel paradigm for learning from data samples distributed across a large network of clients while the data remains local. Federated learning is known, however, to face several system challenges, including system heterogeneity, where clients have different computation and communication capabilities. Such heterogeneity in clients' computation speeds harms the scalability of federated learning algorithms and significantly slows down their runtime due to slow devices (stragglers). In this paper, we propose FLANP, a novel straggler-resilient federated learning meta-algorithm that incorporates statistical characteristics of the clients' data to adaptively select clients and thereby speed up the learning procedure. The key idea of FLANP is to start the training procedure with the faster nodes and gradually involve the slower ones once the model reaches the statistical accuracy of the data held by the current participants, with the final model of each stage serving as a warm start for the next. Our theoretical results characterize the speedup of FLANP over standard federated benchmarks for strongly convex losses and i.i.d. samples. For particular instances, FLANP reduces the overall expected runtime by a factor of $\mathcal{O}(\ln(Ns))$, where $N$ and $s$ denote the total number of nodes and the number of samples per node, respectively. In experiments, FLANP achieves significant wall-clock speedups, up to $6\times$, over standard federated learning benchmarks.
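The staged participation idea described above can be illustrated with a minimal sketch. This is not the paper's actual algorithm or hyperparameters: the setting (least-squares regression with Gaussian features), the geometric doubling schedule, the statistical-accuracy threshold $c/\sqrt{ns}$ for $n$ participating clients, and the use of full-gradient steps as a stand-in for local updates plus averaging are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setting (not from the paper): N clients, s i.i.d. samples
# each, strongly convex least-squares loss; clients indexed fastest-first.
N, s, d = 16, 50, 5
w_true = rng.normal(size=d)
X = [rng.normal(size=(s, d)) for _ in range(N)]
y = [Xi @ w_true + 0.1 * rng.normal(size=s) for Xi in X]

def avg_grad(w, idx):
    # Average gradient of the least-squares loss over participating clients.
    g = np.zeros(d)
    for i in idx:
        g += X[i].T @ (X[i] @ w - y[i]) / s
    return g / len(idx)

def flanp_sketch(lr=0.05, c=1.0):
    """Sketch of adaptive node participation: begin with the fastest
    clients, train until the model reaches the statistical accuracy of
    their pooled data (assumed threshold c / sqrt(n * s)), then double
    participation and warm-start the next stage from the current model."""
    n = 2                       # start with the two fastest clients
    w = np.zeros(d)             # warm-start model carried across stages
    while True:
        idx = range(n)          # clients 0..n-1 are the n fastest
        tol = c / np.sqrt(n * s)
        while np.linalg.norm(g := avg_grad(w, idx)) > tol:
            w -= lr * g         # stand-in for local steps + server averaging
        if n == N:              # all clients have participated; done
            return w
        n = min(2 * n, N)       # admit slower clients for the next stage

w_hat = flanp_sketch()
```

Because the threshold tightens as $n$ grows, early stages run only on fast clients with a loose tolerance, and slow clients are consulted only once the extra statistical accuracy of their data is actually needed.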