# MSc Statistics (Biostatistics)

## Useful information

This one-year full-time programme provides outstanding training in both theoretical and applied statistics with a focus on Biostatistics. The modules will focus on the statistical methods that are widely used for the analysis and interpretation of medical data. In addition, students will be introduced to the concepts of statistical genetics and to the statistical methods widely used for analysing the large and complex datasets that can be found in these fields. This course will equip students with a range of transferable skills, including programming, problem-solving, critical thinking, scientific writing, project work and presentation, to enable them to take on prominent roles in a wide array of employment and research sectors.

The programme is split between taught **core** and **optional** modules in the Autumn and Spring terms (66.67% weighting) and a **research project** in the Summer term (33.33% weighting).

## Core modules

**Core modules are offered in the Autumn and Spring terms:**

## Autumn term core modules

### Applied Statistics (7.5 ECTS)

The module focuses on statistical modelling and regression when applied to realistic problems and real data. We will cover the following topics:

The Normal Linear Model (estimation, residuals, residual sum of squares, goodness of fit, hypothesis testing, ANOVA, model comparison). Improving Designs and Explanatory Variables (categorical variables and multi-level regression, experimental design, random and mixed effects models). Diagnostics and Model Selection and Revision (outliers, leverage, misfit, exploratory and criterion-based model selection, Box-Cox transformations, weighted regression). Generalised Linear Models (exponential family of distributions, iteratively re-weighted least squares, model selection and diagnostics). In addition, we will introduce more advanced topics related to regression, such as penalised regression, and links with related problems in time series, classification and state space modelling.

### Computational Statistics (7.5 ECTS)

This module covers a number of computational methods that are key in modern statistics. Topics include statistical computing (R programming: data structures, programming constructs, object system, graphics); numerical methods (root finding, numerical integration, optimisation methods such as EM-type algorithms); simulation (generating random variates, Monte Carlo integration); and simulation approaches in inference (randomisation and permutation procedures, bootstrap, Markov chain Monte Carlo).
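As a flavour of the simulation topics listed above, here is a minimal sketch of Monte Carlo integration. It is an illustration only and not course material: the module itself teaches R, whereas this sketch uses Python, and the function name `mc_integrate` and the choice of integrand are assumptions made for the example.

```python
import math
import random


def mc_integrate(f, a, b, n, seed=0):
    """Estimate the integral of f over [a, b] from n uniform draws.

    Returns the estimate and its estimated standard error.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    values = [f(rng.uniform(a, b)) for _ in range(n)]
    mean = sum(values) / n
    # Sample variance of f(U); the estimator's variance is (b - a)^2 * var / n.
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    estimate = (b - a) * mean
    std_error = (b - a) * math.sqrt(var / n)
    return estimate, std_error


# Illustrative integrand: sin(x) on [0, pi], whose exact integral is 2.
estimate, std_error = mc_integrate(math.sin, 0.0, math.pi, 100_000)
print(f"estimate = {estimate:.4f}, standard error = {std_error:.4f}")
```

The estimate should land close to the exact value of 2, and the standard error shrinks at the rate 1/sqrt(n) as the number of draws grows.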

### Fundamentals of Statistical Inference (7.5 ECTS)

In statistical inference, experimental or observational data are modelled as the observed values of random variables, to provide a framework from which inductive conclusions may be drawn about the mechanism giving rise to the data. This is done by supposing that the random variable has an assumed parametric probability distribution: the inference is performed by assessing some aspect of the parameter of the distribution.

This module develops the main approaches to statistical inference for point estimation, hypothesis testing and confidence set construction. Focus is on description of the key elements of Bayesian, frequentist and Fisherian inference through development of the central underlying principles of statistical theory. Formal treatment is given of a decision-theoretic formulation of statistical inference. Key elements of Bayesian and frequentist theory are described, focussing on inferential methods deriving from important special classes of parametric problem and application of principles of data reduction. General purpose methods of inference deriving from the principle of maximum likelihood are detailed. Throughout, particular attention is given to evaluation of the comparative properties of competing methods of inference.

### Probability for Statistics (7.5 ECTS)

The module Probability for Statistics introduces the key concepts of probability theory in a rigorous way. Topics covered include: the elements of a probability space, random variables and vectors, distribution functions, independence of random variables/vectors, a concise review of the Lebesgue-Stieltjes integration theory, expectation, modes of convergence of random variables, laws of large numbers, central limit theorems, characteristic functions, conditional probability and expectation.

The second part of the module will introduce discrete-time Markov chains and their key properties, including the Chapman-Kolmogorov equations, classification of states, recurrence and transience, stationarity, time reversibility, ergodicity. Moreover, a concise overview of Poisson processes, continuous-time Markov chains and Brownian motion will be given.
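To make two of the Markov chain ideas above concrete, the following sketch checks the Chapman-Kolmogorov equations numerically and shows convergence to the stationary distribution. It is an illustration only: the two-state transition matrix `P` and the helper names `mat_mul` and `mat_pow` are hypothetical choices for this example, not part of the module.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_pow(p, n):
    """Return the n-step transition matrix p^n by repeated multiplication."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


# Hypothetical two-state chain: each row sums to one.
P = [[0.9, 0.1],
     [0.5, 0.5]]

# Chapman-Kolmogorov: the 5-step matrix equals the 2-step times the 3-step matrix.
lhs = mat_pow(P, 5)
rhs = mat_mul(mat_pow(P, 2), mat_pow(P, 3))
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))

# For a two-state chain with a = P[0][1] and b = P[1][0], the stationary
# distribution is (b / (a + b), a / (a + b)); here that is (5/6, 1/6).
a, b = P[0][1], P[1][0]
stationary = (b / (a + b), a / (a + b))
long_run = mat_pow(P, 50)

print("max Chapman-Kolmogorov gap:", max_gap)
print("stationary distribution:", stationary)
print("rows of P^50:", long_run)
```

Because this chain is irreducible and aperiodic, every row of the many-step matrix approaches the stationary distribution, which is the ergodic behaviour the module studies in general.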

## Spring term core modules

### Biomedical Statistics (5 ECTS)

The students will be introduced to modern statistical approaches and tests performed when analysing data collected from observational studies, such as case-control studies, longitudinal studies and clinical trial studies. The course will introduce central techniques for modelling and inference in biostatistics, from generalised linear regression models to complex Bayesian multi-level models for clinical, environmental and ecological data. Case examples will illustrate recent theoretical advances in action, covering variable selection, principles of handling missing data, meta-analysis, aspects of causal inference, and the effective design of biostatistical studies. Particular emphasis will be on state-of-the-art computing, introducing students to the R tidyverse environment for data science, techniques for handling big data, and the Stan software for inference.

### Statistical Genetics and Bioinformatics (5 ECTS)

Advances in biotechnology are making routine use of DNA sequencing and microarray technology in biomedical research and clinical use a reality. Innovations in the field of Genomics are not only driving new investigations in the understanding of biology and disease but also fuelling rapid developments in computer science, statistics and engineering in order to support the massive information processing requirements. In this module, students will be introduced to the world of Statistical Genetics and Bioinformatics, which over the last 10-15 years have become two of the dominant areas of research and application for modern Statistics. We will develop models and tools to understand complex and high-dimensional genetic datasets. This will include statistical and machine learning techniques for: multiple testing, penalised regression, clustering, p-value combination, dimension reduction. The module will cover both Frequentist and Bayesian statistical approaches. In addition to the statistical approaches, the students will be introduced to genome-wide association and expression studies data, next generation sequencing and other OMICS datasets.

## Optional modules

**A total of 20-22.5 ECTS are to be obtained from the following lists with at least one module taken from the Optional A list. Students will be restricted to a maximum of two modules each worth 7.5 ECTS. Optional modules run in the Spring term unless otherwise stated.**

## Optional A modules

### Advanced Simulation Methods (5 ECTS)

Modern problems in Statistics require sampling from complicated probability distributions defined on a variety of spaces and setups. In this course we will visit popular advanced sampling techniques, such as Importance Sampling, Markov Chain Monte Carlo and Sequential Monte Carlo. We will consider the underlying principles of each method as well as practical aspects related to implementation, computational cost and efficiency. By the end of the course the students will be familiar with these sampling methods and will have applied them to popular models, such as Hidden Markov Models, which are ubiquitous across scientific disciplines.

### Bayesian Methods (5 ECTS)

This module introduces the fundamental definitions of probability which underlie Bayesian inference and then explores the implications of these basic rules for generic statistical tasks. These include parameter inference, model comparison using the marginal likelihood, hypothesis testing, and experimental design. The module will also cover the formulation of inference problems, with a particular focus on hierarchical models and links to more heuristic approaches (e.g., least-squares fitting). Particular emphasis will also be placed on the assignment of probabilities and distributions, including prior distributions for parameter inference, with a focus on information theoretical considerations that lead to the maximum entropy distributions.

### Contemporary Statistical Theory (5 ECTS)

This course aims to give an introduction to key developments in contemporary statistical theory. It describes ideas of: multiple testing, inference under sparsity conditions; parametric higher-order likelihood theory for statistical inference; objective Bayes inference; bootstrap methodology and theory; key concepts and methods of selective inference.

### Multivariate Analysis (5 ECTS)

Multivariate Analysis is concerned with the theory and analysis of data that have more than one outcome variable at a time, a situation that is ubiquitous across all areas of science. Repeated use of univariate statistical analysis is insufficient in settings where interdependencies between the multiple random variables are of influence and interest. In this module we look at some of the key ideas associated with multivariate analysis. Topics covered include: multivariate notation, the covariance matrix, multivariate characteristic functions, a detailed treatment of the multivariate normal distribution including the maximum likelihood estimators for mean and covariance, the Wishart distribution, Hotelling's T^2 statistic, likelihood ratio tests, principal component analysis, ordinary, partial and multiple correlation, multivariate discriminant analysis.

### Data Science (5 ECTS)

TBC

### Deep Learning with TensorFlow (5 ECTS)

This module teaches the building blocks of deep learning models, and how to design network architectures for specific applications, in both supervised and unsupervised contexts. It covers practical skills in implementing neural networks in the popular deep learning library TensorFlow. Students will learn how to build, train and evaluate networks using this framework. In the latter part of the module, the focus is on probabilistic deep learning models, such as normalising flows and variational autoencoders (VAEs).

### Graphical Models (5 ECTS)

Graphical models are those probability models whose independence structure is characterised by a graph, the conditional independence graph. In this module we will look at some aspects of graphical modelling for both (a) a vector of random variables, and (b) vector-valued time series. We will look at models and their estimation. Topics covered include: dependence structure and graphical representation; Markov properties for undirected graphs; the conditional independence graph; decomposable models; graphical Gaussian models; model selection; acyclic directed graphical models; global directed Markov property; Bayesian networks; graphical modelling of time series; model selection for time series graphs.

### Machine Learning (5 ECTS)

This module will provide an introduction to Bayesian statistical pattern recognition and machine learning. The lectures will focus on a variety of useful techniques including methods for feature extraction, dimensionality reduction, data clustering and pattern classification. State-of-the-art approaches such as Gaussian processes and exact and approximate inference methods will be introduced. Real-world applications will illustrate how the techniques are applied to real data sets. Assessment is continuous, through coursework.

### Introduction to Statistical Finance (5 ECTS)

The module “Introduction to Statistical Finance” introduces fundamental concepts in financial economics and quantitative finance and presents suitable statistical tools which are widely used when analysing financial data. The module will start off with an introduction to risk-neutral pricing theory followed by a short survey on risk measures such as value at risk and expected shortfall which are widely used in financial risk management. Next, an introduction to time series analysis will be given, where the main focus will be on so-called ARMA-GARCH processes. Such processes can describe some of the stylised facts widely observed in financial data, including non-Gaussian returns and heteroscedasticity. Finally, methods for forecasting financial time series will be introduced.

### Advanced Statistical Finance (5 ECTS)

Advanced Statistical Finance focuses on modern statistical methods for analysis of financial data. During the last two decades, the increasing availability of large financial data sets has prompted development of new statistical and econometric methods that can cope with high-dimensional data, high-frequency observations and extreme values in data.

The module will first introduce the basics of extreme value theory, which will be used to develop models and estimation methods for extremes in financial data. The second part of the module will provide a concise introduction to the theory of stochastic integration and Itô calculus, which provide a theoretical foundation for volatility estimation from high-frequency data using the concept of realised variance. The asymptotic properties of realised variance will be elucidated and applied to draw inference on realised volatility.

The third part introduces some recently developed volatility forecasting models that incorporate volatility information from high-frequency data and demonstrates how the performance of such models can be assessed and compared using modern forecast evaluation methods such as the Diebold-Mariano test and the model confidence set.

The final part of the module provides an overview of covariance matrix estimation in a high-dimensional setting, motivated by applications to variance-optimal portfolios. The pitfalls of using the standard sample covariance matrix with high-dimensional data are first exemplified. Then it is shown how shrinkage methods can be applied to estimate covariance matrices accurately using high-dimensional data.

### Big Data (5 ECTS)

The emergence of Big Data as a recognised and sought-after technological capability is due to the following factors: the general recognition that data is omnipresent, an asset from which organisations can derive business value; the efficient interconnectivity of sensors, devices, networks, services and consumers, allowing data to be transported with relative ease; the emergence of middleware processing platforms, such as Hadoop, InfoSphere Streams, Accumulo, Storm, Spark, Elastic Search, …, which, in general terms, empower the developer to efficiently create distributed fault-tolerant applications that execute statistical analytics at scale.

To promote the use of advanced statistical methods within a Big Data environment - an essential requirement if correct conclusions are to be reached - it is necessary for statisticians to utilise Big Data tools when supporting or performing statistical analysis in the modern world. The objective of this module is to train statistically minded practitioners in the use of common Big Data tools, with an emphasis on the use of advanced statistical methods for analysis. The module will focus on the application of statistical methods in the processing platforms Hadoop and Spark. Assessment will be through coursework.

## Additional optional modules

### Survival Models and Actuarial Applications (7.5 ECTS)

Survival models are fundamental to actuarial work, as well as being a key concept in medical statistics. This module will introduce the ideas, placing particular emphasis on actuarial applications. Concepts of survival models, right and left censored and randomly censored data. Estimation procedures for lifetime distributions: empirical survival functions, Kaplan-Meier estimates, Cox model. Statistical models of transfers between multiple states, maximum likelihood estimators. Counting process models.
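For readers unfamiliar with the Kaplan-Meier estimate mentioned above, the following is a minimal sketch of the product-limit calculation for right-censored data. It is an illustration only, not course material: the function name `kaplan_meier` and the small data set are hypothetical, and the sketch follows the usual convention that subjects censored at a given time are still counted as at risk at that time.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times:  observed follow-up times.
    events: 1 if the event was observed, 0 if the time is right-censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical data: six subjects, two of them right-censored.
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.4f}")
```

Censored subjects never trigger a drop in the curve themselves, but they do reduce the number at risk for later event times, which is how the estimator uses the partial information they carry.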

Actuarial Applications: Life table data and expectation of life. Binomial model of mortality. The Poisson model. Estimation of transition intensities that depend on age. Graduation and testing crude and smoothed estimates for consistency.

For M4S14/M5S14: all of the above and, additionally, master's-level material to be self-studied (based on a master's-level textbook, research monograph or paper).

### Time Series (7.5 ECTS)

**Please note: this module currently runs in the Autumn term**

Time series analysis is an important area of statistics with applications in finance, engineering and many physical sciences, as well as areas of medicine such as neuroscience. This module covers introductory ideas in both the time domain and frequency domain areas of the subject. Topics:

Real examples, stationarity, autocovariance sequences, covariance matrices for segments, examples of discrete stationary processes, trend removal and seasonal adjustment, the general linear process, spectral representation, sampling and aliasing, linear filtering, estimation of mean and autocovariance, spectral estimation via the periodogram, tapering for bias reduction, autoregressive processes and estimation of their parameters, parametric and non-parametric bivariate time series, coherence, forecasting.