The Capacity Region of Distributed Multi-User Secret Sharing

In this paper, we study the problem of distributed multi-user secret sharing, comprising a trusted master node, $N \in \mathbb{N}$ storage nodes, and $K$ users, where each user has access to the contents of a subset of the storage nodes. Each user has an independent secret message with a certain rate, defined as the size of the message normalized by the size of a storage node.
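
The underlying primitive is classical threshold secret sharing. As a concrete illustration, here is a minimal Shamir-style sketch; this is a generic construction, not the paper's capacity-achieving scheme, and the field size and parameters are chosen only for the demo.

```python
# Minimal Shamir-style secret sharing over a prime field; an illustrative
# sketch of the primitive studied, not the paper's scheme.
import random

P = 2**31 - 1  # a Mersenne prime; the field size is an assumption for illustration

def make_shares(secret, t, n):
    """Split `secret` into n shares such that any t of them reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    # The share stored at node x is the polynomial evaluated at x (x = 1..n).
    return [(x, sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at 0 recovers the secret from any t shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = make_shares(secret=123456789, t=3, n=5)
assert reconstruct(shares[:3]) == 123456789
```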

List-Decodable Coded Computing: Breaking the Adversarial Toleration Barrier

We consider the problem of coded computing, where a computational task is performed in a distributed fashion in the presence of adversarial workers. We propose techniques to break the adversarial toleration threshold barrier previously known in coded computing. More specifically, we leverage list-decoding techniques for folded Reed-Solomon codes and propose novel algorithms to recover the correct codeword using side information.
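
For intuition, consider the simplest polynomial-coded setting: worker results are evaluations of a low-degree polynomial, and the master keeps the candidate consistent with the most workers. The brute-force sketch below illustrates only this recovery principle inside the unique-decoding radius; the paper's contribution is going beyond that radius with efficient list decoding of folded Reed-Solomon codes plus side information. All parameters here are invented.

```python
# Toy polynomial-coded computation with adversarial workers.  Real systems use
# efficient (list-)decoders; this brute-force sketch only shows the principle.
from itertools import combinations

P = 97            # toy prime field
k, n = 3, 7       # degree-<k data polynomial, n workers

def lagrange_eval(pts, x):
    """Evaluate the unique degree-<k polynomial through `pts` at x (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(pts):
        num = den = 1
        for j, (xj, _) in enumerate(pts):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def decode(responses):
    """Keep the degree-<k candidate agreeing with the most responses
    (unique when the number of adversaries is at most (n-k)/2)."""
    best, best_votes = None, -1
    for subset in combinations(responses, k):
        votes = sum(lagrange_eval(subset, x) == y for x, y in responses)
        if votes > best_votes:
            best, best_votes = subset, votes
    return [lagrange_eval(best, x) for x in range(k)]  # data sits at x = 0..k-1

data = [11, 22, 33]
pts = list(enumerate(data))
responses = [(x, lagrange_eval(pts, x)) for x in range(n)]
responses[5] = (5, 1); responses[6] = (6, 2)   # two adversarial workers lie
assert decode(responses) == data
```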

ϵ-Approximate Coded Matrix Multiplication Is Nearly Twice as Efficient as Exact Multiplication

We study coded distributed matrix multiplication from an approximate recovery viewpoint. We consider a system of $P$ computation nodes where each node stores $1/m$ of each multiplicand via linear encoding. Our main result shows that the matrix product can be recovered with $\epsilon$ relative error from any $m$ of the $P$ nodes for any $\epsilon > 0$. We obtain this result through a careful specialization of MatDot codes, a class of matrix multiplication codes previously developed in the context of exact recovery ($\epsilon = 0$).
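
As background for the specialization, the sketch below shows exact MatDot coding: $A$ is split into $m$ column blocks and $B$ into $m$ row blocks, each worker multiplies evaluations of two matrix polynomials, and $AB$ appears as the coefficient of $x^{m-1}$, recoverable from any $2m-1$ of the $P$ nodes. The paper's approximate variant reduces this to $m$ nodes; only the exact baseline is sketched here, with sizes and evaluation points chosen for the demo.

```python
# Exact MatDot baseline: recovery from any 2m-1 of P nodes.  The paper's
# epsilon-approximate specialization needs only m; that part is not shown.
import numpy as np

m, Pn = 3, 8                        # m partitions, Pn worker nodes
rng = np.random.default_rng(0)
A, B = rng.standard_normal((6, 9)), rng.standard_normal((9, 6))

A_blocks = np.split(A, m, axis=1)   # column blocks of A
B_blocks = np.split(B, m, axis=0)   # row blocks of B
xs = np.linspace(-1, 1, Pn)         # distinct evaluation points, one per node

# Worker t stores pA(x_t) and pB(x_t) and returns their product.
pA = lambda x: sum(Ab * x**i for i, Ab in enumerate(A_blocks))
pB = lambda x: sum(Bb * x**(m - 1 - j) for j, Bb in enumerate(B_blocks))
outputs = {t: pA(xs[t]) @ pB(xs[t]) for t in range(Pn)}

# Master: interpolate the degree-(2m-2) matrix polynomial from any 2m-1
# responses; AB is the coefficient of x^(m-1).
fast = list(outputs)[:2 * m - 1]
V = np.vander(xs[fast], 2 * m - 1, increasing=True)
coeffs = np.linalg.solve(V, np.stack([outputs[t].ravel() for t in fast]))
AB = coeffs[m - 1].reshape(A.shape[0], B.shape[1])
assert np.allclose(AB, A @ B)
```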

Slow and Stale Gradients Can Win the Race

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous methods can alleviate stragglers, but they cause gradient staleness that can adversely affect the convergence error. In this work, we present a novel theoretical characterization of the speedup offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time).
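
A toy simulation conveys the runtime side of this trade-off: synchronous SGD pays the slowest worker's delay on every step, while asynchronous SGD proceeds at the average worker speed at the cost of applying stale gradients. The exponential service times and constants below are illustrative assumptions, not the paper's model.

```python
# Per-update time: sync waits for the straggler, async applies each gradient
# as it arrives (but stale by up to P-1 updates).  All constants are toy values.
import numpy as np

rng = np.random.default_rng(1)
P, steps = 8, 10_000
times = rng.exponential(1.0, size=(steps, P))   # per-step compute times

sync_per_update = times.max(axis=1).mean()      # wait for the slowest of P
async_per_update = times.mean() / P             # completions arrive continuously

print(f"sync:  {sync_per_update:.3f} per update")
print(f"async: {async_per_update:.3f} per update "
      f"(~{sync_per_update / async_per_update:.1f}x faster, "
      f"staleness up to {P - 1})")
```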

Coded Computing via Binary Linear Codes: Designs and Performance Limits

We consider the problem of coded distributed computing where a large linear computational job, such as a matrix multiplication, is divided into $k$ smaller tasks, encoded using an $(n,k)$ linear code, and performed over $n$ distributed nodes. The goal is to reduce the average execution time of the computational job. We provide a connection between the problem of characterizing the average execution time of a coded distributed computing system and the problem of analyzing the error probability of codes of length $n$ used over erasure channels.
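
The connection can be seen in a Monte-Carlo sketch: the job completes as soon as the set of finished tasks forms a decodable (non-erased) pattern. For an idealized MDS code this is simply the $k$-th order statistic of the $n$ task times; the paper analyzes binary linear codes, where decodability depends on the specific erasure pattern. The shifted-exponential task-time model and all parameters below are illustrative.

```python
# Job time = time until the completed-task set is decodable.  MDS is an
# idealization used here for brevity; the task-time model is an assumption.
import numpy as np

rng = np.random.default_rng(2)
k, n, trials = 10, 14, 100_000
task_times = 1.0 + rng.exponential(0.5, size=(trials, n))  # shift + exp tail

mds_job_time = np.sort(task_times, axis=1)[:, k - 1].mean()  # k-th to finish
uncoded_job_time = task_times[:, :k].max(axis=1).mean()      # all k tasks needed
print(f"uncoded (k of k): {uncoded_job_time:.3f}")
print(f"(n,k) MDS coded:  {mds_job_time:.3f}")
```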

Compressing Gradients by Exploiting Temporal Correlation in Momentum-SGD

An increasing bottleneck in decentralized optimization is communication. Bigger models and growing datasets mean that decentralization of computation is important and that the amount of information exchanged is quickly growing. While compression techniques have been introduced to cope with the latter, none has considered leveraging the temporal correlations that exist in consecutive vector updates. An important example is distributed momentum-SGD where temporal correlation is enhanced by the low-pass-filtering effect of applying momentum.
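
A minimal predictive-coding sketch illustrates the idea: when consecutive updates are correlated, compressing the residual against a prediction from the previous update loses far less than compressing each update from scratch. The AR(1)-style correlation model and top-$k$ compressor below are stand-ins, not the paper's scheme.

```python
# Exploiting temporal correlation: transmit top-k of the residual against a
# momentum-style prediction.  The receiver already holds the previous update,
# so only the compressed residual goes on the wire.  All constants are toys.
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries; a standard sparsifier."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(3)
d, k, beta = 1000, 50, 0.9
g_prev = rng.standard_normal(d)
g_now = beta * g_prev + 0.1 * rng.standard_normal(d)  # momentum-like correlation

direct = top_k(g_now, k)                    # compress the update itself
residual = g_now - beta * g_prev            # predict from the previous update
predictive = beta * g_prev + top_k(residual, k)

print("error, direct:    ", np.linalg.norm(g_now - direct))
print("error, predictive:", np.linalg.norm(g_now - predictive))
```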

Factored LT and Factored Raptor Codes for Large-Scale Distributed Matrix Multiplication

We propose two coding schemes for distributed matrix multiplication in the presence of stragglers. These coding schemes are adaptations of Luby Transform (LT) codes and Raptor codes to distributed matrix multiplication and are termed Factored LT (FLT) codes and Factored Raptor (FRT) codes. We show that all nodes in the Tanner graph of a randomly sampled code have a tree-like neighborhood with high probability. This ensures that the density evolution analysis gives a reasonable estimate of the average recovery threshold of FLT codes.
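
For intuition, the sketch below shows plain LT-coded matrix multiplication with a peeling decoder: each worker returns the sum of a random subset of row-block products, and the master iteratively resolves degree-one equations. The factored (FLT/FRT) constructions additionally code across partitions of both factors, and the degree distribution here is a toy choice, not the soliton-style design analyzed in the paper.

```python
# Plain LT-coded matrix multiplication with peeling decoding; a simplified
# stand-in for the factored constructions, with an invented degree distribution.
import numpy as np

rng = np.random.default_rng(4)
k, n = 4, 12                        # k row blocks of A, n worker nodes
A = rng.standard_normal((8, 5))
B = rng.standard_normal((5, 5))
blocks = np.split(A, k, axis=0)     # A_0, ..., A_{k-1}

# Encode: each worker sums the row-block products for a random small subset.
subsets = [set(rng.choice(k, size=rng.integers(1, 3), replace=False))
           for _ in range(n)]
outputs = [sum(blocks[i] @ B for i in s) for s in subsets]

# Peeling decoder: strip known blocks, then resolve degree-one equations.
decoded = {}
pending = [(set(s), y.copy()) for s, y in zip(subsets, outputs)]
progress = True
while progress and len(decoded) < k:
    progress = False
    for s, y in pending:
        for i in list(s & decoded.keys()):
            y -= decoded[i]
            s.remove(i)
        if len(s) == 1:
            decoded[s.pop()] = y
            progress = True

if len(decoded) == k:
    assert np.allclose(np.vstack([decoded[i] for i in range(k)]), A @ B)
else:
    print("peeling stalled; in practice, collect more worker outputs")
```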

SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization

In this paper, we propose and analyze SQuARM-SGD, a communication-efficient algorithm for decentralized training of large-scale machine learning models over a network. In SQuARM-SGD, each node performs a fixed number of local SGD steps using Nesterov’s momentum and then sends sparsified and quantized updates to its neighbors regulated by a locally computable triggering criterion.
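
A single-node sketch of the communication rule as described: $H$ local SGD steps with Nesterov momentum, then a sparsified and quantized difference is sent only when a locally computable trigger fires. The quadratic toy loss, compressor, and trigger threshold below are illustrative stand-ins, not the exact SQuARM-SGD specification.

```python
# One node's view: local Nesterov-momentum steps, event-triggered transmission
# of a compressed update.  x_view is what neighbors currently hold of this node.
import numpy as np

rng = np.random.default_rng(5)
d, H, lr, beta, k, thresh = 100, 5, 0.05, 0.9, 10, 0.5
x = rng.standard_normal(d)
v = np.zeros(d)
x_view = x.copy()

def grad(z):
    """Toy quadratic loss with noise, standing in for a minibatch gradient."""
    return z + 0.01 * rng.standard_normal(d)

def compress(u, k):
    """Top-k sparsification, then one shared magnitude per message (quantize)."""
    out = np.zeros_like(u)
    idx = np.argpartition(np.abs(u), -k)[-k:]
    out[idx] = np.sign(u[idx]) * np.abs(u[idx]).mean()
    return out

sent = 0
for rnd in range(50):
    for _ in range(H):              # H local steps with Nesterov momentum
        v = beta * v - lr * grad(x + beta * v)
        x = x + v
    delta = x - x_view
    if np.linalg.norm(delta) > thresh:   # locally computable trigger
        x_view += compress(delta, k)     # only this goes on the wire
        sent += 1
print(f"communicated in {sent}/50 rounds; residual {np.linalg.norm(x - x_view):.3f}")
```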

Coded Sequential Matrix Multiplication for Straggler Mitigation

In this work, we consider a sequence of $J$ matrix multiplication jobs that need to be distributed by a master across multiple worker nodes. For $i \in \{1,2,\ldots,J\}$, job-$i$ begins in round-$i$ and has to be completed by round-$(i+T)$. In order to provide resiliency against slow workers (stragglers), previous works focus on coding across workers, which is the special case of $T=0$. We propose here two schemes with $T > 0$, which allow for coding across workers as well as across time.
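
A toy availability model suggests why coding across time can help: spreading a job's coded tasks over $T+1$ rounds pools straggler randomness across $(T+1)n$ worker slots instead of $n$. This ignores scheduling and decodability details, which are exactly what the proposed schemes handle; the straggling probability and thresholds below are invented.

```python
# Toy model: each worker slot straggles independently w.p. q.  A job needs
# k results per round's worth of work; pooling over T+1 rounds smooths the
# per-round fluctuations.  Not the paper's scheme, just the averaging effect.
import numpy as np

rng = np.random.default_rng(6)
n, k, q, trials = 10, 7, 0.2, 200_000
for T in (0, 1, 3):
    slots = rng.random((trials, (T + 1) * n)) > q    # True = slot finishes
    ok = slots.sum(axis=1) >= (T + 1) * k
    print(f"T = {T}: job success rate {ok.mean():.4f}")
```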

Communication-Efficient and Byzantine-Robust Distributed Learning With Error Feedback

We develop a communication-efficient distributed learning algorithm that is robust against Byzantine worker machines. We propose and analyze a distributed gradient-descent algorithm that performs a simple thresholding based on gradient norms to mitigate Byzantine failures. We show that the (statistical) error rate of our algorithm matches that of Yin et al. (2018), which uses more complicated schemes (coordinate-wise median, trimmed mean).
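
A minimal sketch of norm-based thresholding as described: the master drops worker gradients whose norm exceeds a threshold and averages the survivors. The attack, noise levels, and median-based threshold rule below are illustrative; the paper derives its threshold from the statistical analysis.

```python
# Norm-thresholding aggregation: discard gradients with outsized norms,
# average the rest.  Threshold rule and attack model are toy assumptions.
import numpy as np

rng = np.random.default_rng(7)
d, m, byz = 50, 20, 4
true_grad = rng.standard_normal(d)
grads = true_grad + 0.1 * rng.standard_normal((m, d))   # honest gradients
grads[:byz] = 100.0 * rng.standard_normal((byz, d))     # Byzantine machines

norms = np.linalg.norm(grads, axis=1)
tau = 2.0 * np.median(norms)          # toy threshold; the paper's is analytic
keep = norms <= tau
agg = grads[keep].mean(axis=0)
print(f"kept {keep.sum()}/{m}; error {np.linalg.norm(agg - true_grad):.3f}")
```

Note that a Byzantine machine can still send a small-norm but misleading gradient; thresholding merely bounds the damage any single machine can do, which is what drives the matching error rate.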