Machine Learning Systems and Hardware

Module aims

This module presents the systems and hardware foundations of modern machine learning, bridging neural network workloads, compiler optimizations, and advanced AI hardware. The course covers cutting-edge AI hardware, including GPUs, TPUs, and wafer-scale accelerators, highlighting the design trade-offs between cloud and edge deployment. Topics such as efficient ML (quantization, pruning), algorithm–hardware co-design, federated learning, and system support for large language models (LLMs) are also covered. Through lectures, tutorials, and hands-on programming lab sessions, students gain practical skills and a systems-level understanding of ML deployment, preparing them for advanced research or industry roles in ML system design.

Learning outcomes

Upon successful completion of this module you will be able to:
1. Understand the computation of deep neural networks and examine workload characteristics using analytical models.
2. Analyse the architectural features and design trade-offs of modern ML hardware, including GPUs, TPUs, and emerging AI accelerators.
3. Implement and evaluate efficient ML techniques such as quantization, pruning, and knowledge distillation.
4. Understand the principles of algorithm–hardware co-design and apply AutoML techniques for architecture search and optimization.

Module syllabus

1. Introduction & Module Overview
•    Motivation for studying ML systems and hardware
•    Learning outcomes
•    Overview of module structure
2. DNN Workload Analysis
•    Recap of MLPs, CNNs, and attention-based networks
•    Computation Breakdown (Operators)
•    Roofline Model
•    Analysis of memory-bound and compute-bound operators
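The roofline analysis covered in this topic can be illustrated with a short sketch. The function below estimates the arithmetic intensity of a single dense matrix multiply and classifies it as memory- or compute-bound; the default peak throughput and memory bandwidth are illustrative round numbers loosely based on an A100-class GPU, not figures from the module.

```python
def roofline(M, K, N, peak_flops=312e12, mem_bw=1.6e12, bytes_per_elem=4):
    """Roofline analysis of one M x K x N matrix multiply (illustrative)."""
    flops = 2 * M * K * N                              # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)  # read A and B, write C once
    ai = flops / bytes_moved                           # arithmetic intensity (FLOPs/byte)
    ridge = peak_flops / mem_bw                        # ridge point of the roofline
    attainable = min(peak_flops, ai * mem_bw)          # performance ceiling at this AI
    bound = "compute-bound" if ai >= ridge else "memory-bound"
    return ai, attainable, bound
```

For example, a large square GEMM lands on the compute-bound side of the ridge, while a matrix–vector product (M = 1), typical of LLM decoding, is firmly memory-bound.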
3. ML Compiler
•    Construction of computational graph
•    Optimizations of computational graph (Kernel/Operator Fusion)
•    Scheduling optimization
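The benefit of kernel/operator fusion in this topic can be sketched with a toy example: an unfused elementwise chain writes and re-reads intermediate buffers, while the fused version touches each element once. The functions are hypothetical stand-ins for compiler-generated kernels, not any real compiler's API.

```python
def unfused(x, a, b):
    """Three separate 'kernels': each materializes an intermediate array."""
    t1 = [xi * a for xi in x]             # kernel 1: scale, writes t1
    t2 = [ti + b for ti in t1]            # kernel 2: shift, reads t1, writes t2
    return [max(ti, 0.0) for ti in t2]    # kernel 3: ReLU, reads t2

def fused(x, a, b):
    """One fused kernel: each element is read once and written once."""
    return [max(xi * a + b, 0.0) for xi in x]
```

For N elements, the unfused chain moves roughly 6N elements through memory (three reads, three writes) versus 2N for the fused kernel, which is why fusion helps most for memory-bound elementwise operators.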
4. GPU Hardware
•    GPU architecture
•    TensorCore
5. ML Hardware
•    Google TPU / Samsung NPU
•    Design focus of cloud versus edge chips
•    Cerebras / Tenstorrent / SambaNova
•    Wafer-scale integration, networks-on-chip (NoC), and runtime reconfigurability in advanced AI hardware
6. Efficient ML
•    Quantization / Pruning / Low-rank factorization
•    Knowledge Distillation
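As a minimal sketch of one technique from this topic, the snippet below implements symmetric per-tensor int8 post-training quantization with NumPy. The function names and the simple max-abs calibration are illustrative assumptions; real frameworks offer per-channel scales and more careful calibration.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (max-abs calibration)."""
    scale = np.abs(w).max() / 127.0            # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 values and a scale."""
    return q.astype(np.float32) * scale
```

The round-trip error of each weight is bounded by about half the scale, so the scheme works well when the weight distribution has no extreme outliers, which is one motivation for per-channel quantization.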
7. AutoML and Algorithm–Hardware Co-design
•    Neural Architecture Search
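The search component of NAS can be sketched with a toy random-search baseline over a small discrete space. The search space, the proxy accuracy model, and the latency model below are entirely hypothetical stand-ins for real training and on-device measurement; the sketch only shows the hardware-aware search loop structure.

```python
import random

# Hypothetical search space of architecture hyperparameters.
SEARCH_SPACE = {
    "depth": [2, 4, 6],
    "width": [64, 128, 256],
    "kernel": [3, 5],
}

def proxy_score(cfg, latency_budget_ms=10.0):
    """Toy objective: stand-in accuracy, penalized if a toy latency model
    exceeds the budget (mimicking hardware-aware NAS)."""
    acc = 0.5 + 0.05 * cfg["depth"] + 0.0005 * cfg["width"]
    lat = 0.3 * cfg["depth"] * cfg["width"] * cfg["kernel"] / 64
    if lat <= latency_budget_ms:
        return acc
    return acc - (lat - latency_budget_ms)     # penalize budget violations

def random_search(n_trials=100, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = proxy_score(cfg)
        if best is None or score > best[0]:
            best = (score, cfg)
    return best
```

Random search is a standard NAS baseline; more advanced methods (evolutionary search, differentiable NAS) replace the sampling loop but keep the same accuracy-versus-hardware-cost objective.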

Teaching methods

The module will be taught through lectures, backed up by practical exercises that you will solve in class. Graduate Teaching Assistants (GTAs) will be on hand to provide advice and feedback. There will be one assessed coursework, designed to reinforce your understanding of the theoretical aspects of the material as well as give you hands-on experience.

An online service will be used as a discussion forum for the module.

Assessments

The assessment will include one individual submission (code and report) and an exam.

1.    Assignment (20%): Implement and optimize machine learning operators.
2.    Exam (80%): Questions are based on taught lectures.
        
Detailed written feedback will be provided for each assignment, along with class-wide feedback that highlights common pitfalls and suggests improvements. In addition, tutorials provide a further opportunity for clarifying feedback, addressing specific concerns, and guiding students toward a stronger understanding and performance.

Reading list

Supplementary

Module leaders

Dr Hongxiang Fan