TY - CPAPER AB - Distributional reinforcement learning (Distributional RL)maintains the entire probability distribution of the reward-to-go, i.e. the return, providing a more principled approach to account for the uncertainty associated with policy performance, which may be beneficial for trading off exploration and exploitation and policy learning in general. Previous work in distributional RL focused mainly on computing the state-action-return distributions, here we model the state-return distributions. This enables us to translate successful conventional RL algorithms that are based on state values into distributional RL. We formulate the distributional Bell-man operation as an inference-based auto-encoding process that minimises Wasserstein metrics between target/model re-turn distributions. Our algorithm, BDPG (Bayesian Distributional Policy Gradients), uses adversarial training in joint-contrastive learning to learn a variational posterior from there turns. Moreover, we can now interpret the return prediction uncertainty as an information gain, which allows to obtain anew curiosity measure that helps BDPG steer exploration actively and efficiently. In our experiments, Atari 2600 games and MuJoCo tasks, we demonstrate how BDPG learns generally faster and with higher asymptotic performance than reference distributional RL algorithms, including well known hard exploration tasks. AU - Li,L AU - Faisal,A PB - AAAI PY - 2020/// TI - Bayesian distributional policy gradients UR - http://hdl.handle.net/10044/1/86088 ER -