Scalable Systems for the Cloud

Module aims

The course provides an overview of data-centre technologies, the challenges of building and managing a large-scale infrastructure designed to accommodate a variety of applications, and the key design decisions in engineering scalable distributed applications on top of a data-centre environment. We cover fundamental concepts required to provide suitable abstraction and virtualization of the underlying hardware and the full system stack for managing data-centre resources, which is essential to support large-scale applications. We discuss the fundamental design principles for scalable systems and investigate the concepts and techniques that make up such systems, with a focus on distributed storage, coordination, computation mechanisms and resource allocation.

The course will require reading of research papers on the topics listed below.

Learning outcomes

• Understand the requirements and challenges when architecting a large-scale data-centre infrastructure.
• Understand the design principles of modern technologies encompassing the software stack that manages the data-centre resources and be able to critically assess them.
• Understand the requirements and challenges when designing, building and managing distributed systems.
• Understand the theoretical foundations and building blocks for scalable system design in relation to distributed storage, coordination and computation.
• Critically assess the trade-offs between different requirements when designing scalable distributed systems.
• Learn about the state of the art in distributed systems research from research papers and be able to understand, discuss, compare and criticise proposed approaches.

Module syllabus

The course covers the advanced concepts that enable the design and implementation of scalable distributed systems. We give an overview of the main design principles of data centres and the resources they manage, and discuss the software stack needed to run various application workloads efficiently over a large pool of heterogeneous resources. In the second part of the course we demonstrate how theoretical techniques from distributed systems are applied to design practical, scalable, reliable and performant distributed applications and services.

The course includes mandatory reading of key, recently published research papers on aspects of scalable distributed systems design.

1. Overview of scalable distributed system design

  • Goals and example systems: Facebook, Google, Microsoft
  • Deployment environments: data centres, internet-wide systems
  • Challenges: scalability, high availability, performance (throughput, tail latency), consistency

2. Overview of data-centres, their technologies and the link to cloud computing

  • Introducing the data-centre as a computer (the key components and the main requirements)
  • Challenges in designing and managing a large-scale infrastructure
  • Fundamental requirements: resource abstraction, isolation

3. Rack-scale Computing

  • Rack-scale computers as a unit of deployment in data-centres
  • Resource disaggregation: low-latency, high-bandwidth networks and their communication protocols (RDMA, InfiniBand), compute-storage separation

4. The OS for the Data-centre and Cloud Computing (IaaS, PaaS, SaaS, FaaS)

  • Virtualization
    • Full virtualization and paravirtualization
    • Virtualizing memory, network and storage
    • Security in VMs
  • Containers and Serverless
    • Kernel namespaces and cgroups
    • Use cases: Docker, SELinux
  • Software Defined Networking (SDN)
  • Software Defined Storage (SDS)

5. Designing Scalable Services and Applications

  • Typical reference architecture
  • Statelessness, caching, data movement
  • Fault tolerance techniques
  • Decentralization vs. Distribution

6. Distributed Storage

  • Fundamental trade-offs: CAP theorem, etc.
  • Consistency: strong, weak, eventual, etc.
  • Databases: ACID, NoSQL, KV stores
  • Use cases: MySQL, Bigtable, Dynamo

7. Distributed Coordination

  • Consensus protocols: 2PC -> 3PC, Paxos, Raft
  • Use cases: ZooKeeper, Chubby

8. Distributed Computation

  • Dataflow processing models
  • Graph processing models
  • Use cases: MapReduce, Naiad, Pregel, SEEP

9. Managing distributed resources

  • Resource allocation and isolation
  • Cluster schedulers
  • Use cases: YARN, Kubernetes


Recommended courses

211, 212, 347

Teaching methods

  Lectures and tutorials


*This is a level 7/M course

Module leaders

Dr Jana Giceva
Professor Peter Pietzuch