Scalable Systems and Data

Module aims

In this module you will have the opportunity to:

  • get an overview of data-centre technologies, the infrastructure needed to run a variety of workloads, and the design decisions involved in engineering scalable distributed applications
  • analyse the full system stack for managing and scheduling data-centre resources
  • discuss the design principles for scalable systems
  • investigate concepts and techniques for building large-scale systems, with a focus on distributed storage, coordination, computation and resource allocation
  • get an overview of NewSQL and NoSQL technologies
  • understand new data models, their associated query languages and systems
  • discuss new storage technology and its impact on query execution and data management systems in general

Learning outcomes

  • Define the requirements and challenges when architecting, building and managing a large-scale data-centre infrastructure with distributed systems
  • Know the design principles of modern technologies, encompassing the software stack that manages data-centre resources, and be able to critically assess them
  • Use the theoretical foundations from distributed algorithms to define the building blocks for scalable system design in relation to distributed storage, coordination and computation
  • Critically assess the trade-offs between different requirements when designing scalable distributed systems
  • Discuss, compare and criticise the proposed approaches presented in state-of-the-art research papers targeting distributed systems
  • Get a basic understanding of NewSQL and NoSQL technologies driven by new data models
  • Understand the trade-offs in converting between data models and database tools
  • Understand the implications of new hardware (storage class memory, SSD, main memory and multicores) on database management systems

Module syllabus

The module includes mandatory reading of key, recently published research papers on aspects of scalable distributed systems design. It covers the following topics:

  • Overview of scalable distributed system design (goals and example systems, deployment environments, and challenges)
  • Overview of data centres, their technologies and their link to cloud computing, and an introduction to the latest technologies, including rack-scale computing
  • Hardware virtualization (full virtualization and paravirtualization, virtualization of different resources, security) and OS virtualization (containers and serverless computing)
  • Design of scalable services and applications (reference architectures, requirements and design principles for high-performance, fault-tolerant systems)
  • Distributed Storage (CAP theorem; consistency types: strong, weak, eventual, etc.)
  • Distributed Coordination (consensus protocols and use cases)
  • Distributed Computation (data-flow and graph processing models)
  • Managing distributed resources (resource management and allocation, cluster schedulers)
  • NoSQL and NewSQL; overview of the new data management landscape
  • Key/value stores and their successors, extensible record stores
  • Document stores; schema-free data: shifting the complexity from ETL to querying
  • New storage media; relative (random) access cost on disk compared to main memory/SSD
  • Main memory databases; transactions in main memory; random access in main memory
  • Flash/SSD databases; reading and writing (block erase); relative cost; wear levelling for write access; possibly an outlook on phase-change memory
  • Transactions on multicores; CPU cache hierarchy; multi-socket architecture; (workload-driven) data placement

Teaching methods

The material will be taught through a mix of traditional lectures, tutorials and in-class discussions of state-of-the-art technologies, systems and publications.

An online service will be used as a discussion forum for the module.

Assessments

There will be two courseworks that contribute 20% of the marks for the module.
There will be a final written exam, which counts for the remaining 80% of the marks.

Written feedback will be given on the assessed courseworks approximately 14 days after submission.

Module leaders

Dr Thomas Heinis
Professor Peter Pietzuch