Restless Bandits with Many Arms: Beating the Central Limit Theorem


Restless bandits play an important role in recommender systems, active learning, revenue management, and beyond. In this problem, we allocate limited resources across a collection of “arms”, choosing to engage or “pull” some arms and leave others idle. An arm generates rewards and changes state stochastically according to a probability distribution that depends on whether it was pulled. While an optimal policy can, in principle, be computed using dynamic programming, the required computation scales exponentially in the number of arms N. Thus, there is substantial value in understanding the performance of policies that scale computationally to large N.
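To make the setup concrete, here is a minimal simulation of a toy restless bandit: N two-state arms, a fixed per-period pull budget, and a simple myopic priority rule. All parameters and the priority rule are illustrative choices of ours, not the model or policy from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy restless bandit: N two-state arms (0 = bad, 1 = good).
# Pulling an arm in the good state yields reward 1. Each arm's state
# evolves by a Markov transition that depends on whether it was pulled.
N, horizon = 100, 50
budget = N // 10                      # pull a fixed fraction of arms per period

P_pull = np.array([[0.7, 0.3],        # rows: current state; cols: next state
                   [0.4, 0.6]])       # transitions if pulled
P_idle = np.array([[0.9, 0.1],
                   [0.2, 0.8]])       # transitions if left idle

states = rng.integers(0, 2, size=N)
total_reward = 0.0
for t in range(horizon):
    # Myopic priority rule: pull the arms with the highest immediate
    # expected reward (here, those currently in the good state).
    priority = states + rng.uniform(0, 1e-6, size=N)   # tiny noise breaks ties
    is_pulled = np.zeros(N, dtype=bool)
    is_pulled[np.argsort(priority)[-budget:]] = True

    total_reward += states[is_pulled].sum()

    # Advance each arm under the action-dependent transition law.
    u = rng.uniform(size=N)
    next_good = np.where(is_pulled,
                         u < P_pull[states, 1],
                         u < P_idle[states, 1])
    states = next_good.astype(int)

print(f"average reward per period: {total_reward / horizon:.2f}")
```

An exact dynamic-programming solution would instead track the joint state of all N arms, a state space of size 2^N here, which is what makes scalable priority-style policies attractive.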

We study the optimality gap, i.e., the loss in expected performance compared to an optimal policy, for finite-horizon restless bandits in a classical asymptotic regime proposed by Whittle. In this regime, N grows while the fraction of arms that can be pulled per period is held constant. Intuition from the Central Limit Theorem, together with the tightest previous theoretical bounds, suggests that this optimality gap should grow like O(sqrt(N)). Surprisingly, we outperform this bound. We characterize a non-degeneracy condition and a wide class of novel scalable policies, called fluid-priority policies, under which the optimality gap is O(1). This class includes most widely used index policies. When the non-degeneracy condition does not hold, we show that fluid-priority policies nevertheless have an optimality gap that is O(sqrt(N)), significantly generalizing the class of policies for which convergence rates are known. In numerical experiments, we demonstrate that fluid-priority policies offer state-of-the-art performance on a collection of restless bandit problems.
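For readers unfamiliar with the "fluid" terminology: policies of this kind are typically built from a linear-programming relaxation in which the hard per-period budget constraint is required to hold only in expectation. One common formulation (the notation below is ours, written for illustration, and may differ from the paper's) is:

```latex
\begin{aligned}
\max_{x \ge 0}\quad
  & \sum_{t=0}^{T-1} \sum_{s,a} r(s,a)\, x_t(s,a) \\
\text{s.t.}\quad
  & \sum_{a} x_{t+1}(s',a) \;=\; \sum_{s,a} x_t(s,a)\, P_a(s,s')
    && \forall s',\; t = 0,\dots,T-2, \\
  & \sum_{s} x_t(s,1) \;\le\; \alpha
    && \forall t, \\
  & \sum_{a} x_0(s,a) \;=\; \mu_0(s)
    && \forall s,
\end{aligned}
```

Here x_t(s,a) is the fraction of arms in state s taking action a (1 = pull, 0 = idle) at time t, P_a is the action-dependent transition kernel, α is the fraction of arms that may be pulled per period, and μ₀ is the initial state distribution. A fluid-priority policy uses an optimal solution of this LP to rank arms and pulls them in priority order until the budget is exhausted.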

This paper (https://arxiv.org/pdf/2107.11911.pdf), co-authored with Xiangyu Zhang, won the 2021 INFORMS Applied Probability Society Student Paper Prize.


Peter Frazier is the Eleanor and Howard Morgan Professor of Operations Research and Information Engineering at Cornell University. He is also a Senior Staff Scientist at Uber. At Cornell, he does research in Bayesian optimization, multi-armed bandits, and other areas at the interface between machine learning and operations research. He also leads Cornell’s COVID-19 Mathematical Modeling Team, which helped design Cornell’s asymptomatic COVID-19 testing program and provides university leadership with decision support for the pandemic. At Uber, he managed UberPool’s data science group and currently helps to design Uber’s pricing and incentive systems. 


Zoom Meeting Details