Imperial College London

Dr Ciara Pike-Burke

Faculty of Natural Sciences, Department of Mathematics

Lecturer in Statistics
 
 
 

Contact

 

+44 (0)20 7594 2976
c.pike-burke

 
 

Location

 

522 Huxley Building, South Kensington Campus



 

Publications



Lugosi G, Pike-Burke C, Savalle P-A, 2023, Bandit problems with fidelity rewards, Journal of Machine Learning Research, Vol: 24, Pages: 1-44, ISSN: 1532-4435

The fidelity bandits problem is a variant of the K-armed bandit problem in which the reward of each arm is augmented by a fidelity reward that provides the player with an additional payoff depending on how ‘loyal’ the player has been to that arm in the past. We propose two models for fidelity. In the loyalty-points model the amount of extra reward depends on the number of times the arm has previously been played. In the subscription model the additional reward depends on the current number of consecutive draws of the arm. We consider both stochastic and adversarial problems. Since single-arm strategies are not always optimal in stochastic problems, the notion of regret in the adversarial setting needs careful adjustment. We introduce three possible notions of regret and investigate which can be bounded sublinearly. We study in detail the special cases of increasing, decreasing and coupon (where the player gets an additional reward after every m plays of an arm) fidelity rewards. For the models which do not necessarily enjoy sublinear regret, we provide a worst case lower bound. For those models which exhibit sublinear regret, we provide algorithms and bound their regret.
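
For illustration, here is a minimal Python sketch of the two fidelity models described above (not code from the paper; the arm means, bonus sizes, coupon period m and the pulling pattern are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)
base_means = [0.5, 0.6, 0.4]           # hypothetical base arm means

def loyalty_bonus(total_plays, m=5, bonus=1.0):
    # Coupon special case of the loyalty-points model:
    # an extra payoff after every m plays of the arm.
    return bonus if total_plays > 0 and total_plays % m == 0 else 0.0

def subscription_bonus(streak, rate=0.1):
    # Subscription model: the bonus depends on the current
    # number of consecutive draws of the arm.
    return rate * streak

total = [0] * len(base_means)          # plays of each arm so far
streak, last_arm = 0, None             # current run on the same arm

for t in range(12):
    arm = t % 2                        # an arbitrary pulling pattern
    streak = streak + 1 if arm == last_arm else 1
    total[arm] += 1
    base = rng.normal(base_means[arm], 0.1)
    print(t, arm,
          round(base + loyalty_bonus(total[arm]), 2),    # loyalty-points reward
          round(base + subscription_bonus(streak), 2))   # subscription reward
    last_arm = arm
```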

Journal article

Robert A, Pike-Burke C, Faisal A, 2023, Sample complexity of goal-conditioned hierarchical reinforcement learning, Neural Information Processing Systems (NeurIPS 2023)

Hierarchical Reinforcement Learning (HRL) algorithms can perform planning at multiple levels of abstraction. Empirical results have shown that state or temporal abstractions might significantly improve the sample efficiency of algorithms. Yet, we still do not have a complete understanding of the basis of those efficiency gains nor any theoretically grounded design rules. In this paper, we derive a lower bound on the sample complexity for the considered class of goal-conditioned HRL algorithms. The proposed lower bound empowers us to quantify the benefits of hierarchical decomposition and leads to the design of a simple Q-learning-type algorithm that leverages hierarchical decompositions. We empirically validate our theoretical findings by investigating the sample complexity of the proposed hierarchical algorithm on a spectrum of tasks (hierarchical n-rooms, Gymnasium’s Taxi). The hierarchical n-rooms tasks were designed to allow us to dial their complexity over multiple orders of magnitude. Our theory and algorithmic findings provide a step towards answering the foundational question of quantifying the improvement hierarchical decomposition offers over monolithic solutions in reinforcement learning.

Conference paper

Johnson E, Pike-Burke C, Rebeschini P, 2023, Optimal convergence rate for exact policy mirror descent in discounted Markov decision processes, Neural Information Processing Systems (NeurIPS 2023)

Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. Motivated by the instability of policy iteration (PI) with inexact policy evaluation, PMD algorithmically regularises the policy improvement step of PI. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor γ of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free γ-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the γ-rate is optimal for PMD methods as well as PI, and that the adaptive step-size is necessary for PMD to achieve it. Our work is the first to relate PMD to rate-optimality and step-size necessity. Our study of the convergence of PMD avoids the use of the performance difference lemma, which leads to a direct analysis of independent interest. We also extend the analysis to the inexact setting and establish the first dimension-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
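
To make the PMD update concrete, the following is a minimal sketch under stated assumptions: a small random MDP, exact policy evaluation via a linear solve, the KL mirror map (so the update is a softmax re-weighting by exp(ηQ)), and a geometrically growing step size as a stand-in for the paper's adaptive choice:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, size=(S, A))           # rewards in [0, 1]

def evaluate(pi):
    # Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi, then form Q.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    return V, Q

pi = np.full((S, A), 1.0 / A)                # uniform initial policy
eta = 1.0
for k in range(15):
    V, Q = evaluate(pi)
    # Unregularised PMD step with the KL mirror map: re-weight the policy by exp(eta * Q).
    logits = np.log(pi) + eta * Q
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    eta /= gamma    # geometrically growing step size (a stand-in for the adaptive rule)

print(np.round(evaluate(pi)[0], 3))          # value of the resulting policy
```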

Conference paper

Vakili S, Ahmed D, Bernacchia A, Pike-Burke C et al., 2023, Delayed feedback in kernel bandits, 40th International Conference on Machine Learning, Publisher: ML Research Press, Pages: 34779-34792, ISSN: 2640-3498

Black box optimisation of an unknown function from expensive and noisy evaluations is a ubiquitous problem in machine learning, academic research and industrial production. An abstraction of the problem can be formulated as a kernel based bandit problem (also known as Bayesian optimisation), where a learner aims at optimising a kernelized function through sequential noisy observations. The existing work predominantly assumes feedback is immediately available; an assumption which fails in many real world situations, including recommendation systems, clinical trials and hyperparameter tuning. We consider a kernel bandit problem under stochastically delayed feedback, and propose an algorithm with Õ(√(Γ_k(T)·T) + E[τ]) regret, where T is the number of time steps, Γ_k(T) is the maximum information gain of the kernel with T observations, and τ is the delay random variable. This represents a significant improvement over the state of the art regret bound of Õ(Γ_k(T)·√T + E[τ]·Γ_k(T)) reported in (Verma et al., 2022). In particular, for very non-smooth kernels, the information gain grows almost linearly in time, trivializing the existing results. We also validate our theoretical results with simulations.

Conference paper

van der Hoeven D, Pike-Burke C, Qiu H, Cesa-Bianchi N et al., 2023, Trading-off payments and accuracy in online classification with paid stochastic experts, 40th International Conference on Machine Learning, Publisher: ML Research Press, Pages: 34809-34830, ISSN: 2640-3498

We investigate online classification with paid stochastic experts. Here, before making their prediction, each expert must be paid. The amount that we pay each expert directly influences the accuracy of their prediction through some unknown Lipschitz “productivity” function. In each round, the learner must decide how much to pay each expert and then make a prediction. They incur a cost equal to a weighted sum of the prediction error and upfront payments for all experts. We introduce an online learning algorithm whose total cost after T rounds exceeds that of a predictor which knows the productivity of all experts in advance by at most O(K^2 (ln T) √T) where K is the number of experts. In order to achieve this result, we combine Lipschitz bandits and online classification with surrogate losses. These tools allow us to improve upon the bound of order T^(2/3) one would obtain in the standard Lipschitz bandit setting. Our algorithm is empirically evaluated on synthetic data.
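
As a toy illustration of this cost structure (not the paper's algorithm), the sketch below uses made-up Lipschitz productivity curves: each expert's probability of giving correct advice grows with the payment it receives, and the learner's per-round cost is its prediction error plus the total payments:

```python
import numpy as np

rng = np.random.default_rng(6)
K = 3

def accuracy(expert, payment):
    # Hypothetical Lipschitz "productivity" functions mapping a payment in [0, 1]
    # to the probability that the expert's advice matches the true label.
    slopes = [0.4, 0.6, 0.3]
    return 0.5 + min(slopes[expert] * payment, 0.5)

payments = np.array([0.3, 0.5, 0.1])   # the learner's chosen payments this round
true_label = 1
advice = [true_label if rng.random() < accuracy(i, payments[i]) else 1 - true_label
          for i in range(K)]
prediction = int(np.mean(advice) >= 0.5)               # e.g. a simple majority vote
cost = float(prediction != true_label) + payments.sum()
print(advice, prediction, round(cost, 2))
```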

Conference paper

Howson B, Pike-Burke C, Filippi S, 2023, Optimism and delays in episodic reinforcement learning, Artificial Intelligence and Statistics (AISTATS 2023), Publisher: PMLR, Pages: 1-34

There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well-understood from a theoretical perspective, providing that the sequences of states, actions and rewards associated with each episode are available to the algorithm updating the policy immediately after every interaction with the environment. However, feedback is almost always delayed in practice. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first involves updating as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states, actions, episode length, the expected delay and an algorithm-dependent constant. We empirically investigate the impact of various delay distributions on the regret of optimistic algorithms to validate our theoretical results.

Conference paper

Howson B, Pike-Burke C, Filippi S, 2023, Delayed feedback in generalised linear bandits revisited, Artificial Intelligence and Statistics (AISTATS 2023), Publisher: PMLR, Pages: 1-25, ISSN: 2640-3498

The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner. We show that a natural adaptation of an optimistic algorithm to the delayed feedback setting can achieve regret of Õ(d√T + d^(3/2)·E[τ]), where E[τ] denotes the expected delay, d is the dimension and T is the time horizon. This significantly improves upon existing approaches for this setting where the best known regret bound was Õ(√(dT)·√(d + E[τ])). We verify our theoretical results through experiments on simulated data.

Conference paper

Monaci M, Pike-Burke C, Santini A, 2022, Exact algorithms for the 0-1 Time-Bomb Knapsack Problem, Computers and Operations Research, Vol: 145, ISSN: 0305-0548

We consider a stochastic version of the 0–1 Knapsack Problem in which, in addition to profit and weight, each item is associated with a probability of exploding and destroying all the contents of the knapsack. The objective is to maximise the expected profit of the selected items. The resulting problem, denoted as 0–1 Time-Bomb Knapsack Problem (01-TB-KP), has applications in logistics and cloud computing scheduling. We introduce a nonlinear mathematical formulation of the problem, study its computational complexity, and propose techniques to derive upper and lower bounds using convex optimisation and integer linear programming. We present three exact approaches based on enumeration, branch and bound, and dynamic programming, and computationally evaluate their performance on a large set of benchmark instances. The computational analysis shows that the proposed methods outperform the direct application of nonlinear solvers on the mathematical model, and provide high quality solutions in a limited amount of time.
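
A minimal brute-force sketch of this objective on made-up data (not one of the paper's three exact algorithms): if any selected item explodes the whole knapsack is lost, so a feasible subset S has expected profit (sum of profits in S) × Π(1 − p_i) over i in S:

```python
from itertools import combinations
from math import prod

profits = [10, 7, 4, 6]          # hypothetical item profits
weights = [5, 4, 2, 3]           # hypothetical item weights
p_explode = [0.10, 0.05, 0.00, 0.20]
capacity = 9

best_value, best_subset = 0.0, ()
items = range(len(profits))
for r in range(len(profits) + 1):
    for subset in combinations(items, r):
        if sum(weights[i] for i in subset) > capacity:
            continue   # infeasible: exceeds the knapsack capacity
        expected = (sum(profits[i] for i in subset)
                    * prod(1 - p_explode[i] for i in subset))
        if expected > best_value:
            best_value, best_subset = expected, subset

print(best_subset, round(best_value, 3))
```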

Journal article

Garcelon E, Perchet V, Pike-Burke C, Pirotta M et al., 2021, Local differential privacy for regret minimization in reinforcement learning, Advances in Neural Information Processing Systems, Publisher: NeurIPS, Pages: 1-13, ISSN: 1049-5258

Reinforcement learning algorithms are widely used in domains where it is desirable to provide a personalized service. In these domains it is common that user data contains sensitive information that needs to be protected from third parties. Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user side. We formulate this notion of privacy for RL by leveraging the local differential privacy (LDP) framework. We establish a lower bound for regret minimization in finite-horizon MDPs with LDP guarantees which shows that guaranteeing privacy has a multiplicative effect on the regret. This result shows that while LDP is an appealing notion of privacy, it makes the learning problem significantly more complex. Finally, we present an optimistic algorithm that simultaneously satisfies ε-LDP requirements, and achieves √K/ε regret in any finite-horizon MDP after K episodes, matching the lower bound dependency on the number of episodes K.
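
A minimal sketch of the user-side obfuscation idea under illustrative assumptions (this is not the paper's mechanism): each user adds Laplace noise, calibrated to an episode return in [0, H], to their own statistic before it reaches the learner, so the learner only ever sees ε-LDP reports:

```python
import numpy as np

rng = np.random.default_rng(2)

def privatise_return(episode_return, H, eps):
    # Per-step rewards in [0, 1] over an episode of length H give sensitivity H,
    # so Laplace noise with scale H / eps makes the report eps-LDP.
    return episode_return + rng.laplace(scale=H / eps)

H, epsilon = 10, 1.0
true_returns = rng.uniform(0, H, size=5)          # hypothetical per-user episode returns
reports = [privatise_return(r, H, epsilon) for r in true_returns]
print(np.round(true_returns, 2), np.round(reports, 2))
```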

Conference paper

Neu G, Pike-Burke C, 2020, A unifying view of optimism in episodic reinforcement learning, Neural Information Processing Systems (NeurIPS 2020), ISSN: 1049-5258

The principle of “optimism in the face of uncertainty” underpins many theoretically successful reinforcement learning algorithms. In this paper we provide a general framework for designing, analyzing and implementing such algorithms in the episodic reinforcement learning problem. This framework is built upon Lagrangian duality, and demonstrates that every model-optimistic algorithm that constructs anoptimistic MDP has an equivalent representation as a value-optimistic dynamic programming algorithm. Typically, it was thought that these two classes of algorithms were distinct, with model-optimistic algorithms benefiting from a cleaner probabilistic analysis while value-optimistic algorithms are easier to implement and thus more practical. With the framework developed in this paper, we show that it is possible to get the best of both worlds by providing a class of algorithms which have a computationally efficient dynamic-programming implementation and also a simple probabilistic analysis. Besides being able to capture many existing algorithms in the tabular setting, our framework can also address large-scale problems under realizable function approximation, where it enables a simple model-based analysis of some recently proposed methods.
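
As a small illustration of the value-optimistic dynamic-programming form (a generic sketch, not any particular algorithm from the paper), the code below runs backward induction on an estimated finite-horizon model with a count-based bonus added to the rewards; the model estimates and visit counts are all made up:

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, H = 4, 2, 5
P_hat = rng.dirichlet(np.ones(S), size=(S, A))   # estimated transition probabilities
R_hat = rng.uniform(0, 1, size=(S, A))           # estimated mean rewards
counts = rng.integers(1, 50, size=(S, A))        # visit counts for each state-action pair

bonus = np.sqrt(1.0 / counts)                    # a generic 1/sqrt(n) optimism bonus
V = np.zeros((H + 1, S))                         # V[H] = 0: terminal values
for h in range(H - 1, -1, -1):
    Q = R_hat + bonus + np.einsum("sat,t->sa", P_hat, V[h + 1])
    V[h] = np.minimum(Q.max(axis=1), H - h)      # truncate optimistic values at the horizon
print(np.round(V[0], 2))                         # optimistic values of the initial states
```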

Conference paper

Pike-Burke C, Grunewalder S, 2019, Recovering bandits, 33rd Conference on Neural Information Processing Systems (NeurIPS), Publisher: Neural Information Processing Systems (NIPS), Pages: 1-10, ISSN: 1049-5258

We study the recovering bandits problem, a variant of the stochastic multi-armed bandit problem where the expected reward of each arm varies according to some unknown function of the time since the arm was last played. While being a natural extension of the classical bandit problem that arises in many real-world settings, this variation is accompanied by significant difficulties. In particular, methods need to plan ahead and estimate many more quantities than in the classical bandit setting. In this work, we explore the use of Gaussian processes to tackle the estimation and planning problem. We also discuss different regret definitions that let us quantify the performance of the methods. To improve the computational efficiency of the methods, we provide an optimistic planning approximation. We complement these discussions with regret bounds and empirical studies.
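
For concreteness, here is a minimal simulation of the recovering-bandits observation model (not the paper's method; the recovery curves, noise level and pulling policy are made up): each arm's expected reward is a function of the number of rounds since that arm was last played:

```python
import numpy as np

rng = np.random.default_rng(4)
K, z_max = 3, 20

def expected_reward(arm, z):
    # Hypothetical recovery functions of z = rounds since the arm was last played.
    curves = [lambda z: 1 - np.exp(-0.3 * z),     # recovers towards 1 with rest
              lambda z: 0.6,                      # constant (an ordinary bandit arm)
              lambda z: 0.9 * np.exp(-0.1 * z)]   # decays the longer you wait
    return curves[arm](min(z, z_max))

since_played = np.full(K, z_max)                  # all arms start fully "rested"
for t in range(9):
    arm = t % K                                   # an arbitrary round-robin policy
    reward = expected_reward(arm, since_played[arm]) + rng.normal(0, 0.05)
    since_played[arm] = 0
    since_played += 1                             # one more round since each arm's last play
    print(t, arm, round(reward, 2))
```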

Conference paper

Pike-Burke C, Agrawal S, Szepesvári C, Grünewälder S et al., 2018, Bandits with delayed, aggregated anonymous feedback, 35th International Conference on Machine Learning, Publisher: PMLR, Pages: 4105-4113, ISSN: 2640-3498

We study a variant of the stochastic K-armed bandit problem, which we call “bandits with delayed, aggregated anonymous feedback”. In this problem, when the player pulls an arm, a reward is generated; however, it is not immediately observed. Instead, at the end of each round the player observes only the sum of a number of previously generated rewards which happen to arrive in the given round. The rewards are stochastically delayed and, due to the aggregated nature of the observations, the information of which arm led to a particular reward is lost. The question is: what is the cost of the information loss due to this delayed, aggregated anonymous feedback? Previous works have studied bandits with stochastic, non-anonymous delays and found that the regret increases only by an additive factor relating to the expected delay. In this paper, we show that this additive regret increase can be maintained in the harder delayed, aggregated anonymous feedback setting when the expected delay (or a bound on it) is known. We provide an algorithm that matches the worst case regret of the non-anonymous problem exactly when the delays are bounded, and up to logarithmic factors or an additive variance term for unbounded delays.
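
For concreteness, here is a minimal simulation of this observation model (a sketch with made-up arm means and a geometric delay distribution, not the paper's algorithm): each pull queues its reward for a future round, and the player only ever sees the per-round sum, with the arm identities lost:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(5)
K, T = 3, 15
means = np.array([0.3, 0.5, 0.7])      # hypothetical arm means

arriving = defaultdict(float)           # round index -> sum of rewards landing in that round
for t in range(T):
    arm = rng.integers(K)               # any pulling strategy; uniform here
    reward = rng.binomial(1, means[arm])
    delay = rng.geometric(0.4)          # stochastic delay, in rounds
    arriving[t + delay] += reward       # the reward is queued; its arm identity is lost
    observed = arriving.pop(t, 0.0)     # all the player sees this round is this sum
    print(t, arm, observed)
```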

Conference paper

Pike-Burke C, Grunewalder S, 2017, Optimistic planning for the stochastic knapsack problem, 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Publisher: MICROTOME PUBLISHING, Pages: 1114-1122, ISSN: 2640-3498

The stochastic knapsack problem is a stochastic resource allocation problem that arises frequently and yet is exceptionally hard to solve. We derive and study an optimistic planning algorithm specifically designed for the stochastic knapsack problem. Unlike other optimistic planning algorithms for MDPs, our algorithm, OpStoK, avoids the use of discounting and is adaptive to the amount of resources available. We achieve this behavior by means of a concentration inequality that simultaneously applies to capacity and reward estimates. Crucially, we are able to guarantee that the aforementioned confidence regions hold collectively over all time steps by an application of Doob’s inequality. We demonstrate that the method returns an ε-optimal solution to the stochastic knapsack problem with high probability. To the best of our knowledge, our algorithm is the first which provides such guarantees for the stochastic knapsack problem. Furthermore, our algorithm is an anytime algorithm and will return a good solution even if stopped prematurely. This is particularly important given the difficulty of the problem. We also provide theoretical conditions to guarantee OpStoK does not expand all policies and demonstrate favorable performance in a simple experimental setting.

Conference paper

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
