
June 2021
Sequential, Active, and Reinforcement Learning
Guest editors
Vincent Y. F. Tan
Yao Xie

Sequential methods underpin many of the most powerful learning techniques, such as reinforcement learning, multi-armed bandits, online convex optimization, and active learning. Although many practical algorithms have been developed for sequential learning, there is a strong need to develop theoretical foundations and to understand fundamental limits. Herein lies an excellent opportunity for information theory to provide answers, given its vast arsenal of versatile techniques. At the same time, sequential learning has already started to motivate new problems and insights in information theory and has led to new perspectives. This special issue seeks to cultivate new topics at the intersection of information theory and sequential, active, and reinforcement learning, promoting synergy along the way.

Liyan Xie    Shaofeng Zou    Yao Xie    Venugopal V. Veeravalli

Online detection of changes in stochastic systems, referred to as sequential change detection or quickest change detection, is an important research topic in statistics, signal processing, and information theory, and has a wide range of applications. This survey starts with the basics of sequential change detection, and then moves on to generalizations and extensions of sequential change detection theory and methods. We also discuss some new dimensions that emerge at the intersection of sequential change detection with other areas, along with a selection of modern applications and remarks on open questions.
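
To make the basic building block concrete, here is a minimal sketch of the classical CUSUM recursion for known pre- and post-change Gaussian densities; the means, variance, threshold, and data stream below are illustrative assumptions, not values from the survey.

```python
import numpy as np

def cusum_detector(x, mu0=0.0, mu1=1.0, sigma=1.0, threshold=5.0):
    """CUSUM quickest change detection with known pre-/post-change
    Gaussian densities N(mu0, sigma^2) and N(mu1, sigma^2).

    Returns the first index at which the statistic crosses `threshold`,
    or None if no alarm is raised."""
    w = 0.0
    for t, xt in enumerate(x):
        # Log-likelihood ratio of the post-change vs. pre-change density.
        llr = (mu1 - mu0) * (xt - (mu0 + mu1) / 2.0) / sigma**2
        w = max(0.0, w + llr)           # CUSUM recursion
        if w >= threshold:
            return t                    # alarm time
    return None

# Toy stream: change from N(0,1) to N(1,1) at time 100.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1, 100)])
print(cusum_detector(x))
```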

Chengshuai Shi    Cong Shen

We study the notoriously difficult no-sensing adversarial multi-player multi-armed bandits (MP-MAB) problem from a new perspective. Instead of focusing on the hardness of multiple players, we introduce a new dimension of hardness, called attackability. All adversaries can be categorized by their attackability, and we introduce Adversary-Adaptive Collision-Communication (A2C2), a family of algorithms with forced-collision communication among players. Both attackability-aware and unaware settings are studied, and information-theoretic tools of the Z-channel model and error-correction coding are utilized to address the challenge of implicit communication without collision information in an adversarial environment. For the more challenging attackability-unaware problem, we propose a simple method to estimate the attackability enabled by a novel error-detection repetition code and randomized communication for synchronization. Theoretical analysis proves that asymptotic attackability-dependent sublinear regret can be achieved, with or without knowing the attackability. In particular, the asymptotic regret does not have an exponential dependence on the number of players, revealing a fundamental tradeoff between the two dimensions of hardness in this problem.
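
As a hedged illustration of the coding-theoretic ingredient, the toy sketch below models implicit one-bit signaling as a Z-channel (one input symbol is always received correctly, the other may be flipped with probability p) and shows how a simple repetition code drives the decoding error down exponentially in the repetition length; the channel parameters are illustrative assumptions and the sketch is not the A2C2 protocol itself.

```python
import numpy as np

def z_channel(bits, p_flip, rng):
    """Z-channel: a transmitted 0 is always received as 0;
    a transmitted 1 is flipped to 0 with probability p_flip."""
    noise = rng.random(len(bits)) < p_flip
    return np.where((bits == 1) & noise, 0, bits)

def send_bit_repetition(bit, r, p_flip, rng):
    """Encode one bit with an r-fold repetition code, pass it through
    the Z-channel, and decode: declare 1 if any received copy is 1
    (the channel can erase a 1 but never create one)."""
    codeword = np.full(r, bit)
    received = z_channel(codeword, p_flip, rng)
    return int(received.any())

rng = np.random.default_rng(1)
p_flip, trials = 0.3, 20_000
for r in (1, 3, 5):
    errors = sum(send_bit_repetition(1, r, p_flip, rng) != 1 for _ in range(trials))
    # The miss probability should be close to p_flip ** r.
    print(f"r={r}: empirical miss {errors / trials:.4f}, theory {p_flip**r:.4f}")
```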

Konstantinos E. Nikolakakis    Dionysios S. Kalogerias    Or Sheffet    Anand D. Sarwate

We study the best-arm identification problem in multi-armed bandits with stochastic rewards when the goal is to identify the arm with the highest quantile at a fixed, prescribed level. First, we propose a successive elimination algorithm for strictly optimal best-arm identification, show that it is $\delta$-PAC, and characterize its sample complexity. Further, we provide a lower bound on the expected number of pulls, showing that the proposed algorithm is essentially optimal up to logarithmic factors. Both upper and lower complexity bounds depend on a special definition of the associated suboptimality gap, designed in particular for the quantile bandit problem; as we show, when the gap approaches zero, best-arm identification is impossible. Second, motivated by applications where the rewards are private information, we provide a differentially private successive elimination algorithm whose sample complexity is finite even for distributions with infinite support, and characterize its sample complexity. Our algorithms do not require prior knowledge of either the suboptimality gap or other statistical information related to the bandit problem at hand.
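
For intuition, here is a minimal sketch of quantile-based successive elimination, bracketing each arm's quantile at target level tau with a DKW-style confidence band around the empirical CDF; the batch schedule, confidence radius, and stopping rule are simplified assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def quantile_bounds(samples, tau, delta):
    """Bracket the tau-quantile using a DKW band of width
    sqrt(log(2/delta) / (2n)) around the empirical CDF."""
    n = len(samples)
    rad = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    s = np.sort(samples)
    lo_level = max(tau - rad, 0.0)
    hi_level = min(tau + rad, 1.0)
    lower = s[int(np.floor(lo_level * (n - 1)))]
    upper = s[int(np.ceil(hi_level * (n - 1)))]
    return lower, upper

def successive_elimination(arms, tau=0.5, delta=0.05, batch=50, rounds=200):
    """arms: list of callables, each returning a batch of reward samples."""
    active = list(range(len(arms)))
    samples = {k: [] for k in active}
    for r in range(1, rounds + 1):
        for k in active:
            samples[k].extend(arms[k](batch))
        delta_r = delta / (len(arms) * r * (r + 1))   # union bound over rounds
        bounds = {k: quantile_bounds(np.array(samples[k]), tau, delta_r) for k in active}
        best_lower = max(lo for lo, _ in bounds.values())
        # Keep only arms whose quantile upper bound is still plausible.
        active = [k for k in active if bounds[k][1] >= best_lower]
        if len(active) == 1:
            return active[0]
    return max(active, key=lambda k: np.median(samples[k]))

arms = [lambda n, m=m: np.random.default_rng().normal(m, 1.0, n) for m in (0.0, 0.3, 1.0)]
print(successive_elimination(arms))   # should report the last arm
```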

Samarth Gupta    Gauri Joshi    Osman Yağan

In this paper we consider the problem of best-arm identification in multi-armed bandits in the fixed confidence setting, where the goal is to identify, with probability $1-\delta$ for some $\delta > 0$, the arm with the highest mean reward using the minimum possible number of samples from the set of arms $\mathcal{K}$. Most existing best-arm identification algorithms and analyses operate under the assumption that the rewards corresponding to different arms are independent of each other. We propose a novel correlated bandit framework that captures domain knowledge about correlation between arms in the form of upper bounds on the expected conditional reward of an arm, given a reward realization from another arm. Our proposed algorithm C-LUCB, which generalizes the LUCB algorithm, utilizes this partial knowledge of correlations to sharply reduce the sample complexity of best-arm identification. More interestingly, we show that the total number of samples obtained by C-LUCB is of the form $\mathrm{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$ as opposed to the typical $\mathrm{O}\left(\sum_{k \in \mathcal{K}} \log\left(\frac{1}{\delta}\right)\right)$ samples required in the independent reward setting. The improvement arises because the $\mathrm{O}(\log(1/\delta))$ term is summed only over the set of competitive arms $\mathcal{C}$, which is a subset of the original set of arms $\mathcal{K}$. The size of the set $\mathcal{C}$, depending on the problem setting, can be as small as 2, and hence using C-LUCB in the correlated bandits setting can lead to significant performance improvements. Our theoretical findings are supported by experiments on the Movielens and Goodreads recommendation datasets.
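
For context, here is a minimal sketch of the classical LUCB routine that C-LUCB generalizes, applied to independent Bernoulli arms; the confidence radius and stopping rule follow the standard LUCB recipe and do not include the correlated-bandit modifications described in the paper.

```python
import numpy as np

def lucb(pull, n_arms, delta=0.05, max_pulls=200_000):
    """Classical LUCB for fixed-confidence best-arm identification.
    `pull(k)` returns one reward in [0, 1] from arm k."""
    counts = np.zeros(n_arms, dtype=int)
    means = np.zeros(n_arms)
    for k in range(n_arms):              # initialize with one pull per arm
        means[k] = pull(k)
        counts[k] = 1
    t = n_arms
    while t < max_pulls:
        rad = np.sqrt(np.log(5 * n_arms * t**4 / (4 * delta)) / (2 * counts))
        ucb, lcb = means + rad, means - rad
        best = int(np.argmax(means))                  # empirical best arm
        rest = [k for k in range(n_arms) if k != best]
        challenger = max(rest, key=lambda k: ucb[k])  # highest UCB among the rest
        if lcb[best] >= ucb[challenger]:
            return best                               # confident: stop
        for k in (best, challenger):                  # pull both and update means
            r = pull(k)
            counts[k] += 1
            means[k] += (r - means[k]) / counts[k]
            t += 1
    return int(np.argmax(means))

rng = np.random.default_rng(0)
probs = [0.3, 0.5, 0.7]
print(lucb(lambda k: float(rng.random() < probs[k]), len(probs)))
```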

Anusha Lalitha    Andrea Goldsmith

We study a decentralized cooperative multi-agent multi-armed bandit (MAB) problem with K arms and N agents connected over a network. In this model, each arm's reward distribution is the same for every agent, and rewards are drawn independently across agents and over time steps. At each iteration, agents independently choose an arm to play and exchange at most poly(K) real-valued messages with their neighbors. The existing lower bound on the average per-agent regret over the network shows that cooperation with other agents can reduce the regret to an O(1/N) fraction of that incurred when playing in isolation. Motivated by this, we study a message-passing algorithm that can be combined with existing Bayesian MAB algorithms. Using this, we propose a decentralized Thompson Sampling (TS) algorithm and a decentralized Bayes-UCB algorithm. Under decentralized TS for bounded rewards, we establish a problem-dependent upper bound on the average per-agent regret in terms of the number of agents in the network and the network topology. For Bernoulli rewards, our upper bound asymptotically matches the lower bound. We empirically show that the proposed decentralized TS algorithm incurs significantly lower per-agent regret than previously proposed algorithms. Furthermore, we show that the proposed decentralized TS can be extended to general bandit problems, where the posterior distribution cannot be computed in closed form. We combine our decentralized TS algorithm with Variational Inference to apply our proposed algorithm to complex, realistic reward distributions in a computationally efficient manner. We implement our proposed decentralized TS under a gossip protocol and over time-varying networks, where each communication link has a fixed probability of failure, and show that it incurs logarithmic regret.
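
A hedged toy sketch of the decentralized Thompson Sampling idea for Bernoulli rewards: each agent keeps Beta posterior counts from its own pulls and, after every round, mixes its counts with its neighbors' through a doubly stochastic matrix. The mixing step and the three-agent line network are illustrative assumptions, not the paper's message-passing protocol.

```python
import numpy as np

def decentralized_ts(probs, W, horizon=2000, seed=0):
    """probs: true Bernoulli means of the K arms (shared by all agents).
    W: N x N doubly stochastic mixing matrix of the communication graph."""
    rng = np.random.default_rng(seed)
    n_agents, n_arms = W.shape[0], len(probs)
    alpha = np.ones((n_agents, n_arms))   # Beta posterior "success" counts
    beta = np.ones((n_agents, n_arms))    # Beta posterior "failure" counts
    regret = 0.0
    for _ in range(horizon):
        for i in range(n_agents):
            theta = rng.beta(alpha[i], beta[i])       # one posterior sample per arm
            k = int(np.argmax(theta))
            reward = float(rng.random() < probs[k])
            alpha[i, k] += reward
            beta[i, k] += 1.0 - reward
            regret += max(probs) - probs[k]
        alpha, beta = W @ alpha, W @ beta             # gossip-style mixing step
    return regret / n_agents

# Three agents on a line graph (Metropolis weights), three arms.
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])
print(decentralized_ts([0.2, 0.5, 0.8], W))   # average per-agent cumulative regret
```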

Ilai Bistritz    Tavor Z. Baharav    Amir Leshem    Nicholas Bambos

Consider N cooperative but non-communicating players where each plays one out of M arms for T turns. Players have different utilities for each arm, represented as an N×M matrix. These utilities are unknown to the players. In each turn, players select an arm and receive a noisy observation of their utility for it. However, if any other players selected the same arm in that turn, all colliding players will receive zero utility due to the conflict. No communication between the players is possible. We propose two distributed algorithms which learn fair matchings between players and arms while minimizing the regret. We show that our first algorithm learns a max-min fairness matching with near-O(log T) regret (up to a log log T factor). However, if one has a known target Quality of Service (QoS) (which may vary between players), then we show that our second algorithm learns a matching where all players obtain an expected reward of at least their QoS with constant regret, given that such a matching exists. In particular, if the max-min value is known, a max-min fairness matching can be learned with O(1) regret.

Vivek Bagaria    Tavor Z. Baharav    Govinda M. Kamath    David N. Tse

The celebrated Monte Carlo method estimates an expensive-to-compute quantity by random sampling. Bandit-based Monte Carlo optimization is a general technique for computing the minimum of many such expensive-to-compute quantities by adaptive random sampling. The technique converts an optimization problem into a statistical estimation problem which is then solved via multi-armed bandits. We apply this technique to solve the problem of high-dimensional $k$-nearest neighbors, developing an algorithm which we prove is able to identify exact nearest neighbors with high probability. We show that under regularity assumptions on a dataset of $n$ points in $d$-dimensional space, the complexity of our algorithm scales logarithmically with the dimension of the data as $O\left((n+d)\log^{2} \frac{nd}{\delta}\right)$ for error probability $\delta$, rather than linearly as in exact computation requiring $O(nd)$. We corroborate our theoretical results with numerical simulations, showing that our algorithm outperforms both exact computation and state-of-the-art algorithms such as kGraph, NGT, and LSH on real datasets.
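
A hedged sketch of the adaptive-sampling idea applied to nearest-neighbor search: estimate each candidate's squared distance to the query from randomly sampled coordinates and eliminate candidates whose lower confidence bound exceeds the smallest upper confidence bound. The confidence radius and sampling schedule are simplified illustrative choices, not the paper's exact algorithm or guarantees.

```python
import numpy as np

def adaptive_nearest_neighbor(X, q, delta=0.05, batch=32, seed=0):
    """X: (n, d) data matrix; q: (d,) query. Returns an index into X that
    is, with high probability, the nearest neighbor of q."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    active = np.arange(n)
    sums = np.zeros(n)       # running sums of sampled squared coordinate gaps
    counts = np.zeros(n)
    while len(active) > 1 and counts[active[0]] < d:
        coords = rng.integers(0, d, size=batch)
        gaps = (X[np.ix_(active, coords)] - q[coords]) ** 2   # (|active|, batch)
        sums[active] += gaps.sum(axis=1)
        counts[active] += batch
        # Mean squared coordinate gap estimates (squared distance) / d.
        means = sums[active] / counts[active]
        rad = np.sqrt(np.log(2 * n * counts[active] / delta) / counts[active])
        lcb, ucb = means - rad, means + rad
        active = active[lcb <= ucb.min()]   # keep candidates that may still be closest
    if len(active) == 1:
        return int(active[0])
    # Fallback: exact distances on the few survivors.
    dists = ((X[active] - q) ** 2).sum(axis=1)
    return int(active[np.argmin(dists)])

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 1000))
q = X[123] + 0.01 * rng.normal(size=1000)
print(adaptive_nearest_neighbor(X, q))   # should print 123
```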

Junyu Zhang    Amrit Singh Bedi    Mengdi Wang    Alec Koppel

We study the estimation of risk-sensitive policies in reinforcement learning (RL) problems defined by a Markov Decision Process (MDP) whose state and action spaces are finite. Prior efforts are predominantly afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent, and hence modifications of Bellman's equations are required, which often do not permit efficient fixed-point schemes. To ameliorate this issue, we propose a new definition of risk, which we call caution, as a penalty function added to the dual of the linear programming (LP) formulation of tabular RL. Caution quantifies the distributional risk of a policy and is a function of the policy's long-term state occupancy measure. When the risk is a convex function of the occupancy measure, we propose to solve this problem in an online, model-free manner with a stochastic variant of the primal-dual method that uses the Kullback-Leibler (KL) divergence as its proximal term. We establish that the sample complexity required to attain near-optimal solutions matches tight dependencies on the cardinality of the spaces, but also depends on the infinity norm of the risk measure's gradient. Moreover, when the risk is non-convex, we propose a block-coordinate augmentation of the aforementioned approach and establish its convergence to KKT points. Experiments demonstrate that this approach improves the reliability of reward accumulation without additional computation as compared to risk-neutral LP solvers, for both convex and non-convex risks, illustrated respectively by the KL divergence to a prior and by the variance.
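
For concreteness, here is a generic form of the occupancy-measure (dual) LP with a convex risk penalty, which is the kind of formulation the caution framework builds on; the normalization and notation below are standard assumptions rather than the paper's exact statement:

$$
\max_{\mu \ge 0} \;\; \sum_{s,a} r(s,a)\,\mu(s,a) \;-\; c\,\rho(\mu)
\quad \text{s.t.} \quad
\sum_{a} \mu(s',a) \;=\; (1-\gamma)\,\xi(s') + \gamma \sum_{s,a} P(s' \mid s,a)\,\mu(s,a) \;\; \forall s',
$$

where $\mu$ is the normalized discounted state-action occupancy measure, $\xi$ the initial-state distribution, $\rho$ the convex caution functional, and $c>0$ a risk weight; a policy can then be recovered from a solution via $\pi(a \mid s) \propto \mu(s,a)$.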

Salman Habib    Allison Beemer    Jörg Kliewer

In this work, we consider the decoding of short sparse graph-based channel codes via reinforcement learning (RL). Specifically, we focus on low-density parity-check (LDPC) codes, which, for example, have been standardized in the context of 5G cellular communication systems due to their excellent error-correcting performance. LDPC codes are typically decoded via belief propagation on the corresponding bipartite (Tanner) graph of the code using a flooding schedule, i.e., all check and variable nodes in the Tanner graph are updated at once. We model the node-wise sequential LDPC scheduling scheme as a Markov decision process (MDP), and obtain optimized check node (CN) scheduling policies via RL to improve sequential decoding performance as compared to flooding. In each RL step, an agent decides which CN to schedule next by observing a reward associated with each choice. Repeated scheduling enables the agent to discover the optimized CN scheduling policy, which is later incorporated in our RL-based sequential LDPC decoder. In order to reduce RL complexity, we propose a novel graph-induced CN clustering approach to partition the state space of the MDP in such a way that dependencies between clusters are minimized. Compared to standard decoding approaches from the literature, some of our proposed RL schemes not only improve the decoding performance but also reduce the decoding complexity dramatically once the scheduling policy is learned. By concatenating an outer Hamming code with an inner LDPC code which is decoded based on our learned policy, we demonstrate significant improvements in decoding performance compared to other LDPC decoding policies.

Daniel Vial    Vijay Subramanian

We devise algorithms for the policy evaluation problem in reinforcement learning, assuming access to a simulator and certain side information called the supergraph. Our algorithms explore backward from high-cost states to find high-value ones, in contrast to approaches that work forward from all states. While several papers have demonstrated the utility of backward exploration empirically, we conduct rigorous analyses which show that our algorithms can reduce average-case sample complexity from $O(S \log S)$ to as low as $O(\log S)$ . Analytically, we adapt tools from the network science literature to provide a new methodology for reinforcement learning problems.

Shuang Qiu    Zhuoran Yang    Jieping Ye    Zhaoran Wang

Actor-critic algorithms and their extensions have achieved great success in real-world decision-making problems. In contrast to this empirical success, the theoretical understanding of actor-critic methods remains unsatisfactory. Most existing results only show asymptotic convergence, derived mainly by approximating the dynamics of the actor and critic with ordinary differential equations. However, the finite-time convergence analysis of the actor-critic algorithm remains to be explored. The main challenges lie in the nonconvexity of parameterized policies, the coupling of the actor and critic updates, and the data-sampling dependency in online settings. In this paper, we provide a finite-time convergence analysis for an online actor-critic algorithm in the infinite-horizon average-reward setting. In the critic step, we give a theoretical analysis of the TD(0) algorithm for the average reward with dependent data in online settings. We also show that the sequence of actor iterates converges at a sublinear rate to a stationary point, up to some irremovable bias due to approximating the value function by linear functions. To the best of our knowledge, our work provides the first finite-time convergence analysis for an online actor-critic algorithm with TD learning.
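
A minimal sketch of the critic step in the average-reward setting: TD(0) with linear features jointly tracks a running estimate of the average reward and the differential value-function weights. The two-state Markov reward process, features, and step sizes below are toy assumptions for illustration.

```python
import numpy as np

def average_reward_td0(n_steps=50_000, alpha=0.01, beta=0.01, seed=0):
    """TD(0) critic for the infinite-horizon average-reward setting,
    run on a toy 2-state Markov reward process with linear features."""
    rng = np.random.default_rng(seed)
    P = np.array([[0.9, 0.1],        # transition matrix of the toy MRP
                  [0.2, 0.8]])
    r = np.array([1.0, 0.0])         # reward depends on the current state
    phi = np.eye(2)                  # one-hot features (tabular special case)
    w = np.zeros(2)                  # differential value-function weights
    eta = 0.0                        # running estimate of the average reward
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(2, p=P[s])
        reward = r[s]
        # Average-reward TD error: subtract the running average reward.
        delta = reward - eta + phi[s_next] @ w - phi[s] @ w
        w += alpha * delta * phi[s]
        eta += beta * (reward - eta)
        s = s_next
    return eta, w

eta, w = average_reward_td0()
print(f"estimated average reward {eta:.3f}, differential values {w}")
```

On this toy chain the stationary distribution is (2/3, 1/3), so the average-reward estimate should settle near 2/3.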

Michael K. Cohen    Elliot Catt    Marcus Hutter

Reinforcement learners are agents that learn to pick actions that lead to high reward. Ideally, the value of a reinforcement learner’s policy approaches optimality—where the optimal informed policy is the one which maximizes reward. Unfortunately, we show that if an agent is guaranteed to be “asymptotically optimal” in any (stochastically computable) environment, then subject to an assumption about the true environment, this agent will be either “destroyed” or “incapacitated” with probability 1. Much work in reinforcement learning uses an ergodicity assumption to avoid this problem. Often, doing theoretical research under simplifying assumptions prepares us to provide practical solutions even in the absence of those assumptions, but the ergodicity assumption in reinforcement learning may have led us entirely astray in preparing safe and effective exploration strategies for agents in dangerous environments. Rather than assuming away the problem, we present an agent, Mentee, with the modest guarantee of approaching the performance of a mentor, doing safe exploration instead of reckless exploration. Critically, Mentee’s exploration probability depends on the expected information gain from exploring. In a simple non-ergodic environment with a weak mentor, we find Mentee outperforms existing asymptotically optimal agents and its mentor.

Michael K. Cohen    Badri Vellambi    Marcus Hutter

Algorithmic Information Theory has inspired intractable constructions of general intelligence (AGI), and undiscovered tractable approximations are likely feasible. Reinforcement Learning (RL), the dominant paradigm by which an agent might learn to solve arbitrary solvable problems, gives an agent a dangerous incentive: to gain arbitrary “power” in order to intervene in the provision of their own reward. We review the arguments that generally intelligent algorithmic-information-theoretic reinforcement learners such as Hutter’s [2] AIXI would seek arbitrary power, including over us. Then, using an information-theoretic exploration schedule, and a setup inspired by causal influence theory, we present a variant of AIXI which learns to not seek arbitrary power; we call it “unambitious”. We show that our agent learns to accrue reward at least as well as a human mentor, while relying on that mentor with diminishing probability. And given a formal assumption that we probe empirically, we show that eventually, the agent’s world-model incorporates the following true fact: intervening in the “outside world” will have no effect on reward acquisition; hence, it has no incentive to shape the outside world.

Jaehyeok Shin    Aaditya Ramdas    Alessandro Rinaldo

We develop a nonparametric extension of the sequential generalized likelihood ratio (GLR) test and corresponding time-uniform confidence sequences for the mean of a univariate distribution. By utilizing a geometric interpretation of the GLR statistic, we derive a simple analytic upper bound on the probability that it exceeds any prespecified boundary; such probabilities are intractable to approximate via simulation due to the infinite horizon of the tests and the composite nonparametric nulls under consideration. Using time-uniform boundary-crossing inequalities, we carry out a unified nonasymptotic analysis of the expected sample sizes of one-sided and open-ended tests over nonparametric classes of distributions (including sub-Gaussian, sub-exponential, sub-gamma, and exponential families). Finally, we present a flexible and practical method to construct time-uniform confidence sequences that are easily tunable to be uniformly close to the pointwise Chernoff bound over any target time interval.

Shubhanshu Shekhar    Mohammad Ghavamzadeh    Tara Javidi

We construct and analyze active learning algorithms for the problem of binary classification with abstention, in which the learner has an additional option to withhold its decision on certain points in the input space. We consider this problem in the fixed-cost setting, where the learner incurs a cost $\lambda \in (0, 1/2)$ every time the abstain option is invoked. Our proposed algorithm can work with the three most commonly used active learning query models, namely, membership-query, pool-based, and stream-based models. We obtain a high-probability upper bound on the excess risk of our algorithm, and establish its minimax near-optimality by deriving a matching lower bound (modulo polylogarithmic factors). Since our algorithm relies on knowledge of the smoothness parameters of the regression function, we also describe a new strategy to adapt to these unknown parameters in a data-driven manner under an additional quality assumption. We show that with this strategy our algorithm achieves the same performance in terms of excess risk as its counterpart that knows the smoothness parameters. We end the paper with a discussion of the extension of our results to the setting of a bounded rate of abstention.

Shachar Shayovitz    Meir Feder

Modern machine learning systems require massive amounts of labeled training data in order to achieve high accuracy, which is very expensive in terms of time and cost. Active learning is an approach which uses feedback to label only the most informative data points and significantly reduce the labeling effort. Many heuristics for selecting data points have been developed in recent years, usually tailored to a specific task, and a general unified framework is lacking. In this work, a new information-theoretic criterion is proposed based on a minimax log-loss regret formulation of the active learning problem. First, a Redundancy Capacity theorem for active learning is derived along with an optimal learner. This leads to a new active learning criterion which naturally induces an exploration-exploitation trade-off in feature selection and generalizes previously proposed heuristic criteria. The new criterion is compared analytically and via empirical simulation to other commonly used information-theoretic active learning criteria. Next, the linear hyperplane hypothesis class with possibly asymmetric label noise is considered. The achievable performance of the proposed criterion is analyzed using a new low-complexity greedy algorithm based on the Posterior Matching scheme for communication with feedback. It is shown that, for general label noise and bounded feature distributions, the new information-theoretic criterion decays exponentially fast to zero.

Meng-Che Chang    Matthieu R. Bloch

We consider a situation in which a decision maker takes sequential and adaptive sensing actions to collect measurements and estimate an unknown parameter taking finitely many values, in the presence of an adversary who also collects measurements whenever a sensing action is taken. This situation can be viewed as an abstraction in which to analyze the mitigation of the information leakage inherent to control actions in systems with feedback, such as cyber-physical systems. Specifically, we formulate an evasive active hypothesis testing problem in which the objective is for the decision maker to control the risk of its test while minimizing the detection ability of the adversary, measured in terms of the asymptotic error-exponent ratio between the adversary and the decision maker. We develop bounds on the exponent ratio that offer insight into optimal strategies that the decision maker can deploy to evade the adversary's detection. We illustrate the results with a numerical example corresponding to the detection of a wireless transmission.

Liyan Xie    Yao Xie

We present a new non-parametric statistic, called the weighted $\ell_2$ divergence, based on empirical distributions for sequential change detection. We start by constructing the weighted $\ell_2$ divergence as a fundamental building block for two-sample tests and change detection. The proposed statistic is proved to attain the optimal sample complexity in the offline setting. We then study sequential change detection using the weighted $\ell_2$ divergence and characterize the fundamental performance metrics, including the average run length (ARL) and the expected detection delay (EDD). We also present practical algorithms to find the optimal projection to handle high-dimensional data and the optimal weights, which is critical for quick detection since, in such settings, there are not many post-change samples. Simulation results and real-data examples are provided to validate the good performance of the proposed method.
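
One simplified reading of the construction is sketched below: bin two samples into a common partition, compute a weighted squared difference between the resulting empirical distributions, and scan a stream with a sliding window. The binning, weights, window length, and threshold are illustrative assumptions; the paper derives the optimal projection and weights.

```python
import numpy as np

def weighted_l2_divergence(x, y, weights=None, n_bins=20):
    """Weighted l2 divergence between the empirical distributions of two
    one-dimensional samples, using a shared binning."""
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=n_bins)
    p = np.histogram(x, bins=edges)[0] / len(x)
    q = np.histogram(y, bins=edges)[0] / len(y)
    w = np.ones_like(p, dtype=float) if weights is None else np.asarray(weights)
    return float(np.sum(w * (p - q) ** 2))

def sequential_detector(stream, reference, window=200, threshold=0.02):
    """Scan the stream with a sliding window and raise an alarm when the
    divergence between the window and a pre-change reference is large."""
    for t in range(window, len(stream)):
        stat = weighted_l2_divergence(reference, stream[t - window:t])
        if stat > threshold:
            return t      # alarm index (None if no alarm)
    return None

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 2000)                      # pre-change sample
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)])
print(sequential_detector(stream, reference))
```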

Georgios Rovatsos    George V. Moustakides    Venugopal V. Veeravalli

The problem of sequentially detecting a moving anomaly is studied, in which the anomaly affects different parts of a sensor network over time. Each network sensor is characterized by a pre- and a post-change distribution. Initially, the observations of each sensor are generated according to the corresponding pre-change distribution. After some unknown but deterministic time instant, a moving anomaly emerges, affecting different sets of sensors as time progresses. Our goal is to design a stopping procedure to detect the emergence of the anomaly as quickly as possible, subject to false alarm constraints. The problem is studied in a quickest change detection framework where it is assumed that the evolution of the anomaly is unknown but deterministic. A modification of Lorden's detection delay is proposed to account for the trajectory of the anomaly that maximizes the detection delay of a detection procedure. It is established that a Cumulative Sum-type test solves the resulting sequential detection problem exactly when the sensors are homogeneous. For the case of heterogeneous sensors, the proposed detection scheme can be modified to provide a first-order asymptotically optimal algorithm.

Deniz Sargun    Can Emre Koksal

We study the robust transient and quickest change detection problems with unknown post-change distributions under finite alphabets. When the distribution after the change point is unknown, the change in the distribution of observations may occur in multiple ways without much structure on the observations, whereas, before the change point, a false alarm is highly structured, following a particular sample path with respect to the distribution of observations and the detection scheme. We first characterize these most likely deviation events and propose a method that tests whether the empirical distribution, relative to the most likely way for such a deviation to occur, is an outlier. We show performance guarantees and prove asymptotic optimality for a robust quickest change detection problem up to a multiplicative constant, before benchmarking our method against the finite moving average (FMA) method, the generalized likelihood ratio test (GLRT), and the M-statistic kernel (MSK) change point detection method under four different performance criteria, including run-time complexity. Finally, we apply our method to economic market indicators and climate data. Our method successfully captures the regime shifts during times of historical significance for the markets and identifies the current climate change phenomenon to be a highly likely regime shift rather than a random event.

Haider Al-Lawati    Tharindu B. Adikari    Stark C. Draper

Large-scale learning and optimization problems are often solved in parallel. In a master-worker distributed setup, worker nodes are most often assigned fixed-size minibatches of data points to process. However, workers may take different amounts of time to complete their per-batch calculations. To deal with such variability in processing times, an alternative approach has recently been proposed wherein each worker is assigned a fixed duration to complete the calculations associated with each batch. This fixed-time approach results in time-varying minibatch sizes and has been shown to outperform the fixed-minibatch approach in synchronous optimization. In this paper we make a number of contributions in the analysis and experimental verification of such systems. First, we formally present a system model of an asynchronous optimization scheme with variable-sized minibatches and derive the expected minibatch size. Second, we show that for our fixed-time asynchronous approach, the expected gradient staleness does not depend on the number of workers, in contrast to existing schemes. Third, we prove that for convex smooth objective functions the asynchronous variable-minibatch method achieves the optimal regret and optimality-gap bounds. Finally, we run experiments comparing the performance of the asynchronous fixed-time and fixed-minibatch methods. We present results for CIFAR-10 and ImageNet.
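
A hedged toy simulation of the contrast studied here: workers with random per-sample processing times either compute a fixed-size minibatch (so a synchronous master waits for the slowest worker) or compute for a fixed wall-clock budget (so minibatch sizes become random). The exponential processing-time model and parameters are illustrative assumptions, not the paper's system model.

```python
import numpy as np

def fixed_size_round(rng, n_workers, batch_size, mean_time=1.0):
    """Each worker processes `batch_size` samples; a synchronous master
    waits for the slowest worker. Returns (samples processed, round time)."""
    # Per-sample processing times are exponential (an illustrative choice).
    times = rng.exponential(mean_time, size=(n_workers, batch_size)).sum(axis=1)
    return n_workers * batch_size, times.max()

def fixed_time_round(rng, n_workers, budget, mean_time=1.0):
    """Each worker processes samples until a wall-clock budget expires,
    so minibatch sizes vary across workers. Returns (samples, round time)."""
    total = 0
    for _ in range(n_workers):
        elapsed, count = 0.0, 0
        while True:
            t = rng.exponential(mean_time)
            if elapsed + t > budget:
                break
            elapsed += t
            count += 1
        total += count
    return total, budget

rng = np.random.default_rng(0)
n_workers, batch_size, rounds = 8, 32, 500
budget = batch_size * 1.0            # same nominal work budget per round
for name, fn, arg in [("fixed-size", fixed_size_round, batch_size),
                      ("fixed-time", fixed_time_round, budget)]:
    samples, time = zip(*(fn(rng, n_workers, arg) for _ in range(rounds)))
    print(f"{name}: {np.sum(samples) / np.sum(time):.2f} samples per unit time")
```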

Guanghui Lan    Yi Zhou

In this paper, we introduce an asynchronous decentralized accelerated stochastic gradient descent type of algorithm for decentralized stochastic optimization. Since communication and synchronization costs are the major bottlenecks in decentralized optimization, we attempt to reduce these costs through algorithmic design; in particular, we reduce the number of agents involved in each round of update via randomization. Our major contribution is to develop a class of accelerated randomized decentralized algorithms for solving general convex composite problems. We establish $\mathcal{O}(1/\epsilon)$ (resp., $\mathcal{O}(1/\sqrt{\epsilon})$) communication complexity and $\mathcal{O}(1/\epsilon^{2})$ (resp., $\mathcal{O}(1/\epsilon)$) sampling complexity for solving general convex (resp., strongly convex) problems. It is worth mentioning that the complexity of our proposed algorithm depends only sublinearly on the Lipschitz constant when a smooth component is present in the objective function. Moreover, we also conduct preliminary numerical experiments to demonstrate the advantages of our proposed algorithms over a state-of-the-art synchronous decentralized algorithm.