Reinforcement Learning

Module aims

The course provides both basic and advanced knowledge in reinforcement learning across three core skills: theory, implementation, and evaluation. Students will learn the fundamentals of both tabular reinforcement learning and deep reinforcement learning, and will gain experience in designing and implementing these methods for practical applications.

Specifically, students will:

  • Learn the theoretical foundations of reinforcement learning (Markov decision processes & dynamic programming).
  • Learn the algorithmic foundations of reinforcement learning (temporal difference and Monte-Carlo learning).
  • Gain experience in framing low-dimensional problems and implementing solutions using tabular reinforcement learning.
  • Learn about the motivation behind deep reinforcement learning and its relevance to high-dimensional applications, such as video game playing and robotics.
  • Discover state-of-the-art deep reinforcement learning algorithms, such as Deep Q-Networks (DQN), Proximal Policy Optimisation (PPO), and Soft Actor-Critic (SAC).
  • Implement and experiment with a range of different deep reinforcement learning algorithms in Python and PyTorch, and learn how to visualise and evaluate their performance.

Learning outcomes

Upon completion of this module, students should be able to:
1. Describe the core principles of autonomous systems learning.
2. Calculate mathematical solutions to problems using reinforcement learning theory.
3. Compare and contrast a range of reinforcement learning approaches.
4. Propose solutions to decision-making problems using knowledge of the state of the art.
5. Translate mathematical concepts into software to solve practical problems using Python and PyTorch.
6. Evaluate the performance of a range of methods and propose appropriate improvements.
7. Summarise complex data through clear visualisations to assist with evaluation.

Module syllabus

The first half of the course will include:

  • Introduction to Reinforcement Learning and its Mathematical Foundations
  • The Markov Decision Process Framework
    • Markov Reward Processes
    • The Policy
    • Markov Decision Processes
  • Dynamic Programming
  • Model-Free Learning & Control
    • Monte-Carlo Learning 
    • Temporal Difference Learning
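
As an illustration of how the dynamic programming and model-free topics above relate, the following minimal sketch solves one small problem both ways. It is not taken from the module materials: the four-state chain, the reward of 1 for reaching the final state, the discount factor, and the function names `step`, `value_iteration` and `q_learning` are all assumptions chosen for brevity. Value iteration uses the known dynamics directly, while Q-learning (a temporal difference method) estimates the same values from sampled transitions only.

```python
import random

# Hypothetical toy problem (not from the module): a 4-state chain.
N_STATES = 4          # states 0..3; state 3 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor (assumed value)


def step(state, action):
    """Deterministic chain dynamics: returns (next_state, reward, done)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def value_iteration(tol=1e-10):
    """Dynamic programming: repeat Bellman optimality backups using the model."""
    v = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES - 1):          # the terminal state keeps value 0
            backups = []
            for a in ACTIONS:
                s2, r, done = step(s, a)
                backups.append(r + (0.0 if done else GAMMA * v[s2]))
            new_v = max(backups)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


def q_learning(episodes=2000, alpha=0.5, seed=0):
    """Model-free control: learn Q from sampled transitions only."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Uniformly random behaviour policy; Q-learning is off-policy,
            # so it can still learn the greedy (optimal) action values.
            a = rng.choice(ACTIONS)
            s2, r, done = step(s, a)
            target = r + (0.0 if done else GAMMA * max(q[s2]))
            q[s][a] += alpha * (target - q[s][a])   # temporal difference update
            s = s2
    return q


v_star = value_iteration()
q_hat = q_learning()
```

In this deterministic setting, `max(q_hat[s])` should agree closely with `v_star[s]` for each non-terminal state, which gives a simple check that the two approaches reach the same solution.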

 
The second half of the course will include:

  • Motivation for function approximation:
    • High-dimensional state and action spaces
    • Continuous state and action spaces
  • Deep Q-learning:
    • Q update through back propagation
    • Experience replay buffer
    • Target and Q networks
  • Policy gradients:
    • The REINFORCE algorithm
    • Policy update through back propagation
    • Proximal Policy Optimisation
  • Advanced topics:
    • Soft Actor Critic
    • Learning from demonstration
    • Model-based reinforcement learning

Teaching methods

The module will be delivered in two halves. The first half will focus on the theory underpinning reinforcement learning, and the second half will focus on applications of deep reinforcement learning. Each half will consist of both lectures and computer lab sessions, and each will have one coursework.

The courseworks and the exam are structured to cover three core skills: theory, implementation, and evaluation. Coursework 1 assesses fundamental theory and mathematical solutions. Coursework 2 assesses practical application through implementation and evaluation. The exam covers both theory and evaluation.

Reinforcement learning has a strong practical element and is best appreciated through implementation and evaluation. A true understanding of the theoretical concepts comes only from hands-on experience and from observing the effects of different design choices. As such, the coursework will require a high level of involvement and will contribute 50% towards the overall grade.

An online service will be used as a discussion forum for the module. 

Assessments

The first coursework will focus on mathematical and theoretical understanding of the foundations of reinforcement learning. It will encompass translating real-world problems into mathematical formulations within the reinforcement learning framework, as well as solving simple Markov Decision Processes “by hand”, enabling students to evaluate their understanding of the theory. The coursework will be solvable with pen and paper and is complemented by lab practicals in which students can develop code to solve the problems.

The second coursework will involve implementing a number of different deep reinforcement learning algorithms in Python and PyTorch. During lab sessions, students will be provided with basic tutorials for implementing these methods for a particular learning task. The coursework will then require students to implement similar methods, but for a different task. Both tasks will involve a robot navigating a “maze”, but they will differ in the maze layouts and in the state and action spaces. The coursework will contain some basic implementations that all students should be able to achieve, some more challenging implementations assessing the core of the lecture material, and some advanced implementations that will challenge the top students.

For the first and second coursework, students will work independently and submit individually. Courseworks involve “tasks” which the students must solve using reinforcement learning, and these tasks are unique to each student to prevent plagiarism. To achieve this, tasks will be automatically generated using the student’s CID number.
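
The idea of deriving a unique task from a CID can be illustrated with a short sketch. This is only one possible approach and is not the module's actual generator: the function name `generate_maze`, the grid size, the wall probability, and the example CID are all hypothetical. The sketch seeds a pseudo-random generator with the CID, so the same student always receives the same layout.

```python
import random


def generate_maze(cid, size=5, wall_prob=0.2):
    """Return a size x size grid derived only from `cid` (hypothetical scheme).

    '#' marks a wall, '.' a free cell, 'S' the start and 'G' the goal.
    """
    rng = random.Random(int(cid))   # same CID -> same pseudo-random stream
    grid = [
        ["#" if rng.random() < wall_prob else "." for _ in range(size)]
        for _ in range(size)
    ]
    grid[0][0] = "S"                # start in the top-left corner
    grid[size - 1][size - 1] = "G"  # goal in the bottom-right corner
    return ["".join(row) for row in grid]


# Example with a made-up CID; the layout is reproducible across runs.
maze = generate_maze("1234567")
```

Note that this sketch does not check that the goal is reachable from the start; a real generator would need such a check.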
For the first coursework, each student will submit a document containing their worked solutions. Students may submit accompanying visualisations. Submissions will be marked by teaching assistants, who will assess the work quantitatively against a marking scheme.

For the second coursework, each student will submit a piece of code, which an automated system will use to train an agent with deep reinforcement learning. The performance of the agent on an unknown task (one which the student has not seen before) will count towards part of the grade. The remainder of the grade will come from a report, which will describe the student's implementation and include visualisations that can be checked against ground-truth data.

Reading list

Module leaders

Dr Nicole Salomons
Professor Aldo Faisal