# MSc Statistics (Data Science)

## Useful information

This one-year full-time programme provides outstanding training in both theoretical and applied statistics, with a focus on Data Science.

The modules focus on a wide variety of tools and techniques for the scientific handling of data at scale, including machine learning theory, data transformation and representation, data visualisation and the use of analytic software.

This course will equip students with a range of transferable skills, including programming, problem-solving, critical thinking, scientific writing, project work and presentation, to enable them to take on prominent roles in a wide array of employment and research sectors.

The programme is split between taught **core** and **optional** modules in the Autumn and Spring terms (66.67% weighting) and a **research project** in the Summer term (33.33% weighting).

**PLEASE NOTE: The programme is substantially the same from year to year but there may be some changes to the modules listed below.**

## Core modules

**Core modules are offered in the Autumn and Spring terms:**

## Autumn term core modules

### Applied Statistics (7.5 ECTS)

The module focuses on statistical modelling and regression when applied to realistic problems and real data. We will cover the following topics:

The Normal Linear Model (estimation, residuals, residual sum of squares, goodness of fit, hypothesis testing, ANOVA, model comparison). Improving Designs and Explanatory Variables (categorical variables and multi-level regression, experimental design, random and mixed effects models). Diagnostics and Model Selection and Revision (outliers, leverage, misfit, exploratory and criterion-based model selection, Box-Cox transformations, weighted regression). Generalised Linear Models (exponential family of distributions, iteratively re-weighted least squares, model selection and diagnostics). In addition, we will introduce more advanced regression topics, such as penalised regression, and links with related problems in Time Series, Classification and State Space modelling.

### Computational Statistics (7.5 ECTS)

This module covers a number of computational methods that are key in modern statistics. Topics include: Statistical computing in R (data structures, programming constructs, object system, graphics). Numerical methods (root finding, numerical integration, optimisation methods such as EM-type algorithms). Simulation (generating random variates, Monte Carlo integration). Simulation approaches in inference (randomisation and permutation procedures, bootstrap, Markov chain Monte Carlo).
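
As an informal illustration of two of the topics listed above (Monte Carlo integration and the bootstrap), the following minimal sketch may help. It is not module material: the module itself uses R, and the function names, the integrand and the simulated sample are all assumptions made for this example.

```python
import math
import random
import statistics


def monte_carlo_integral(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] as (b - a) times the mean of f at uniform draws."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n


def bootstrap_se(data, stat, n_boot, rng):
    """Bootstrap standard error: spread of stat over resamples drawn with replacement."""
    reps = [stat(rng.choices(data, k=len(data))) for _ in range(n_boot)]
    return statistics.stdev(reps)


rng = random.Random(0)  # fixed seed so that the run is reproducible

# Integral of exp(-x^2) over [0, 1]; the exact value is sqrt(pi)/2 * erf(1).
mc_estimate = monte_carlo_integral(lambda x: math.exp(-x * x), 0.0, 1.0, 100_000, rng)
exact_value = math.sqrt(math.pi) / 2 * math.erf(1.0)

# Bootstrap standard error of the mean of 200 simulated standard normal observations.
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
se_mean = bootstrap_se(sample, statistics.fmean, 1000, rng)

print(mc_estimate, exact_value, se_mean)
```

The bootstrap standard error should be close to the textbook value of one over the square root of the sample size, which gives a quick sanity check on the resampling code.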

### Fundamentals of Statistical Inference (7.5 ECTS)

In statistical inference, experimental or observational data are modelled as the observed values of random variables, to provide a framework from which inductive conclusions may be drawn about the mechanism giving rise to the data. This is done by supposing that the random variable has an assumed parametric probability distribution: the inference is performed by assessing some aspect of the parameter of the distribution.

This module develops the main approaches to statistical inference for point estimation, hypothesis testing and confidence set construction. Focus is on description of the key elements of Bayesian, frequentist and Fisherian inference through development of the central underlying principles of statistical theory. Formal treatment is given of a decision-theoretic formulation of statistical inference. Key elements of Bayesian and frequentist theory are described, focussing on inferential methods deriving from important special classes of parametric problem and application of principles of data reduction. General purpose methods of inference deriving from the principle of maximum likelihood are detailed. Throughout, particular attention is given to evaluation of the comparative properties of competing methods of inference.

### Probability for Statistics (7.5 ECTS)

The module Probability for Statistics introduces the key concepts of probability theory in a rigorous way. Topics covered include: the elements of a probability space, random variables and vectors, distribution functions, independence of random variables and vectors, a concise review of the Lebesgue-Stieltjes integration theory, expectation, modes of convergence of random variables, laws of large numbers, central limit theorems, characteristic functions, conditional probability and expectation.

The second part of the module will introduce discrete-time Markov chains and their key properties, including the Chapman-Kolmogorov equations, classification of states, recurrence and transience, stationarity, time reversibility, ergodicity. Moreover, a concise overview of Poisson processes, continuous-time Markov chains and Brownian motion will be given.
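
To make two of the Markov chain properties above concrete, here is a small sketch that checks the Chapman-Kolmogorov equations numerically and reads off a stationary distribution. The two-state transition matrix `P` and the helper names are hypothetical, chosen only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matpow(p, n):
    """n-step transition matrix P^n, computed by repeated multiplication."""
    result = [[float(i == j) for j in range(len(p))] for i in range(len(p))]
    for _ in range(n):
        result = matmul(result, p)
    return result


# Hypothetical two-state chain: each row holds the transition probabilities out of one state.
P = [[0.9, 0.1],
     [0.5, 0.5]]

# Chapman-Kolmogorov: the 5-step matrix equals the 2-step matrix times the 3-step matrix.
five_step = matpow(P, 5)
two_then_three = matmul(matpow(P, 2), matpow(P, 3))

# For this irreducible, aperiodic chain, every row of P^n approaches the stationary distribution.
limit = matpow(P, 200)
print(limit[0])
```

For this particular matrix, solving the stationarity equations by hand gives the distribution (5/6, 1/6), which the rows of `limit` reproduce.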

## Spring term core modules

### Data Science I: Data (5 ECTS)

Data scientific methods are wide in scope, drawing equally from computational statistics and computer science. This course focuses on the “data” part of data science. It will cover:

Computing with data using R, Python and C++ in open and reproducible workflows: RMarkdown or Jupyter notebooks, version control, unit testing. Complex computational pipelines.

Data exploration and the field of Exploratory Data Analysis, which often reveals anomalies in real-world datasets. Data preparation, in which datasets are reformatted, cleaned, and pre-processed.

Data representation, covering databases and data formats (e.g. SQL and HDF5), data structures, and data transformations using mathematical representations (e.g. the Fourier transform) and deep learning approaches (e.g. word2vec).

### Data Science II: Science (5 ECTS)

Data scientific methods are wide in scope, drawing equally from computational statistics and computer science. This course focuses on the “science” part of data science. It will cover:

The visualisation and presentation of data, including raw datasets, intermediate results during the data cleaning and model fitting stages of an analysis, and final outputs for public or business communication.

Data modelling, covering both the generative modelling framework of applied statistics and the predictive modelling framework of machine learning, with a focus on the deployment of scalable and reproducible methods.

Science about data science, covering what data analysts really do and thinking critically about appropriate uses and misuses of data science.

### Machine Learning (5 ECTS)

This module will provide an introduction to Bayesian statistical pattern recognition and machine learning. The lectures will focus on a variety of useful techniques including methods for feature extraction, dimensionality reduction, data clustering and pattern classification. State-of-the-art approaches such as Gaussian processes and exact and approximate inference methods will be introduced. Real-world applications will illustrate how the techniques are applied to real data sets. Assessment is continuous, through coursework.

### Big Data (5 ECTS)

The emergence of Big Data as a recognised and sought-after technological capability is due to the following factors: the general recognition that data is omnipresent and an asset from which organisations can derive business value; the efficient interconnectivity of sensors, devices, networks, services and consumers, allowing data to be transported with relative ease; and the emergence of middleware processing platforms, such as Hadoop, InfoSphere Streams, Accumulo, Storm, Spark, Elastic Search, …, which, in general terms, empower the developer to efficiently create distributed fault-tolerant applications that execute statistical analytics at scale.

To promote the use of advanced statistical methods within a Big Data environment (an essential requirement if correct conclusions are to be reached), statisticians need to use Big Data tools when supporting or performing statistical analysis in the modern world. The objective of this module is to train statistically minded practitioners in the use of common Big Data tools, with an emphasis on the use of advanced statistical methods for analysis. The module will focus on the application of statistical methods in the processing platforms Hadoop and Spark. Assessment will be through coursework.

## Optional modules

**A total of 10-12.5 ECTS are to be obtained from the following list of modules. Students will be restricted to a maximum of one module worth 7.5 ECTS, so the total is made up of either two 5 ECTS modules or one 5 ECTS module and one 7.5 ECTS module. Optional modules run in the Spring term unless otherwise stated.**

### Advanced Statistical Theory (5 ECTS)

This module aims to give an introduction to key developments in contemporary statistical theory, building on ideas developed in the core module Fundamentals of Statistical Inference. There are several reasons for wishing to extend the techniques of that module. Optimal procedures of inference, as described, say, by Neyman-Pearson theory, may only be tractable in unrealistically simple statistical models. Distributional approximations, such as those provided by asymptotic likelihood theory, may be judged to be inadequate, especially when confronted with small data samples (as often arise in various fields, such as particle physics and in examination of operational loss in financial systems). It may be desirable to develop general purpose inference methods, such as those given by likelihood theory, to explicitly incorporate ideas of appropriate conditioning. In many settings, such as bioinformatics, we are confronted with the need to simultaneously test many hypotheses. More generally, we may be confronted with problems where the dimensionality of the parameter of the model increases with sample size, rather than remaining fixed. The data structures being analysed may represent extremes of sets of observations, such as environmental or financial maxima.

We consider in this module a number of topics motivated by such considerations. These include: developments in likelihood-based inference, driven by accurate analytic approximation techniques; objective Bayes and bootstrap approaches to inference in parametric problems; multiple testing and estimation; extreme value theory, including distribution theory for maxima and upper order statistics and their associated domain of attraction; theoretical notions involved in high-dimensional inference.

### Bayesian Methods (5 ECTS)

This module introduces the fundamental definitions of probability which underlie Bayesian inference and then explores the implications of these basic rules for generic statistical tasks. These include parameter inference, model comparison using the marginal likelihood, hypothesis testing, and experimental design. The module will also cover the formulation of inference problems, with a particular focus on hierarchical models and links to more heuristic approaches (e.g., least-squares fitting). Particular emphasis will also be placed on the assignment of probabilities and distributions, including prior distributions for parameter inference, with a focus on information-theoretic considerations that lead to the maximum entropy distributions.

### Non-Parametric Smoothing and Wavelets (5 ECTS)

Non-parametric methods, as opposed to parametric methods, are desirable when we cannot confidently assume parametric models for our observations. In such situations we need flexible, data-driven methods for estimating distributions or performing regression. This module looks at a number of non-parametric methods.

These will include: Non-parametric density estimation: histograms, kernel estimators, window width, adaptive kernel estimators. Non-parametric regression: regressograms, kernel regression, local polynomial regression, cross-validation. Regularisation and spline smoothing: roughness penalty, cubic splines, spline smoothing, Reinsch algorithm. Basis function approach: B-splines and wavelets (discrete wavelet transform, wavelet variance, wavelet shrinkage, thresholding).

### Multivariate Analysis (5 ECTS)

Multivariate Analysis is concerned with the theory and analysis of data that have more than one outcome variable at a time, a situation that is ubiquitous across all areas of science. Repeated use of univariate statistical analysis is insufficient in settings where the interdependency between the random variables is of influence and interest. In this module we look at some of the key ideas associated with multivariate analysis. Topics covered include: multivariate notation, the covariance matrix, multivariate characteristic functions, a detailed treatment of the multivariate normal distribution including the maximum likelihood estimators for mean and covariance, the Wishart distribution, Hotelling's T^2 statistic, likelihood ratio tests, principal component analysis, ordinary, partial and multiple correlation, multivariate discriminant analysis.

### Graphical Models (5 ECTS)

Graphical models are those probability models whose independence structure is characterised by a graph, the conditional independence graph. In this module we will look at some aspects of graphical modelling for both (a) a vector of random variables, and (b) vector-valued time series. We will look at models and their estimation. Topics covered include: dependence structure and graphical representation; Markov properties for undirected graphs; the conditional independence graph; decomposable models; graphical Gaussian models; model selection; acyclic directed graphical models; global directed Markov property; Bayesian networks; graphical modelling of time series; model selection for time series graphs.

### Introduction to Statistical Finance (5 ECTS)

The module “Introduction to Statistical Finance” introduces fundamental concepts in financial economics and quantitative finance and presents suitable statistical tools which are widely used when analysing financial data. The module will start with an introduction to risk-neutral pricing theory, followed by a short survey of risk measures such as value at risk and expected shortfall, which are widely used in financial risk management. Next, an introduction to time series analysis will be given, where the main focus will be on so-called ARMA-GARCH processes. Such processes can describe some of the stylised facts widely observed in financial data, including non-Gaussian returns and heteroscedasticity. Finally, methods for forecasting financial time series will be introduced.
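
For orientation, the two risk measures named above can be estimated from a sample of losses along the following lines. This is a minimal sketch and not module material: the function name `historical_var_es`, the simulated standard normal losses and the particular empirical quantile convention are all assumptions made for illustration.

```python
import random


def historical_var_es(losses, alpha):
    """Empirical value at risk and expected shortfall at level alpha from a sample of losses."""
    ordered = sorted(losses)
    # Index of the empirical alpha-quantile (one simple convention among several in use).
    k = min(int(alpha * len(ordered)), len(ordered) - 1)
    value_at_risk = ordered[k]
    tail = ordered[k:]  # the losses at or beyond the value at risk
    expected_shortfall = sum(tail) / len(tail)
    return value_at_risk, expected_shortfall


rng = random.Random(1)  # fixed seed so that the run is reproducible
losses = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # simulated, not real market data
var95, es95 = historical_var_es(losses, 0.95)
print(var95, es95)
```

Expected shortfall averages the losses beyond the value at risk, so it is never smaller than the value at risk at the same level.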

### Advanced Statistical Finance (5 ECTS)

Advanced Statistical Finance focuses on modern statistical methods for analysis of financial data. During the last two decades, the increasing availability of large financial data sets has prompted development of new statistical and econometric methods that can cope with high-dimensional data, high-frequency observations and extreme values in data.

The module will first introduce the basics of extreme value theory, which will be used to develop models and estimation methods for extremes in financial data. The second part of the module will provide a concise introduction to the theory of stochastic integration and Itô calculus, which provide a theoretical foundation for volatility estimation from high-frequency data using the concept of realised variance. The asymptotic properties of realised variance will be elucidated and applied to draw inference on realised volatility.

The third part introduces some recently developed volatility forecasting models that incorporate volatility information from high-frequency data and demonstrates how the performance of such models can be assessed and compared using modern forecast evaluation methods such as the Diebold-Mariano test and the model confidence set.

The final part of the module provides an overview of covariance matrix estimation in a high-dimensional setting, motivated by applications to variance-optimal portfolios. The pitfalls of using the standard sample covariance matrix with high-dimensional data are first exemplified. Then it is shown how shrinkage methods can be applied to estimate covariance matrices accurately using high-dimensional data.
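
As a rough illustration of the realised variance mentioned in the second part of the module, the sketch below sums squared intraday log returns on a simulated price path. Everything specific here is an assumption made for the example: the function name, the constant volatility `sigma`, the number of intraday observations and the driftless geometric Brownian motion used to simulate prices.

```python
import math
import random


def realised_variance(prices):
    """Sum of squared log returns between consecutive intraday prices."""
    returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return sum(r * r for r in returns)


rng = random.Random(2)  # fixed seed so that the run is reproducible
n = 10_000              # number of intraday returns in one day (assumed)
sigma = 0.02            # constant daily volatility (assumed)

prices = [100.0]
for _ in range(n):
    # Each log-price increment has variance sigma^2 / n, so the daily variance is sigma^2.
    prices.append(prices[-1] * math.exp(rng.gauss(0.0, sigma / math.sqrt(n))))

rv = realised_variance(prices)
print(rv, sigma ** 2)
```

With many intraday observations and no market microstructure noise, the realised variance should lie close to the true daily variance `sigma ** 2`.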

### Biomedical Statistics (5 ECTS)

The students will be introduced to modern statistical approaches and tests performed when analysing data collected from observational studies, such as case-control studies, longitudinal studies and clinical trial studies. The course will introduce central techniques for modelling and inference in biostatistics, from generalized linear regression models to complex Bayesian multi-level models for clinical, environmental and ecological data. Case examples will illustrate recent theoretical advances in action, covering variable selection, principles of handling missing data, meta-analysis, aspects of causal inference, and the effective design of biostatistical studies. Particular emphasis will be on state-of-the-art computing, introducing students to the R tidyverse environment for data science, techniques for handling big data, and the Stan software for inference.

### Statistical Genetics and Bioinformatics (5 ECTS)

Advances in biotechnology are making routine use of DNA sequencing and microarray technology in biomedical research and clinical use a reality. Innovations in the field of Genomics are not only driving new investigations in the understanding of biology and disease but also fuelling rapid developments in computer science, statistics and engineering in order to support the massive information processing requirements. In this module, students will be introduced to Statistical Genetics and Bioinformatics, which over the last 10-15 years have become two of the dominant areas of research and application for modern Statistics. We will develop models and tools to understand complex and high-dimensional genetics datasets. This will include statistical and machine learning techniques for multiple testing, penalised regression, clustering, p-value combination and dimension reduction. The module will cover both Frequentist and Bayesian statistical approaches. In addition to the statistical approaches, students will be introduced to data from genome-wide association and expression studies, next generation sequencing and other OMICS datasets.

### Algorithmic Trading and Machine Learning (5 ECTS)

**Please note: this module currently runs in the Autumn term**

*The Algorithmic Trading and Machine Learning module is part of the MSc in Mathematics and Finance. Any MSc in Statistics student interested in the module is welcome to attend the lectures. A limited number of MSc in Statistics students will be allowed to take this module for credit towards their degree. Priority will be given to the students following the Statistical Finance stream. The final selection of students allowed to take the module for credit will be decided by both Programme Directors after the students' registration for the January exams.*

The aim of the course is to present a series of cutting-edge topics in the area of “Algorithmic trading” in a unified and systematic fashion. For each of the problems presented, we try to emphasize both the mathematical theory and industry applications. The course consists of two main parts: 1) Optimal Execution Problems and 2) Machine Learning in Finance. Optimal execution techniques are particularly relevant for market makers and quantitative brokers, whereas machine learning is often used by hedge funds and prop desks to generate trading signals. However, machine learning algorithms can also be applied as part of optimal execution tools, for example in order to choose order types or speed of execution. The basic optimal execution problem consists of an agent (e.g. a bank or a broker) who needs to buy or sell a pre-specified number of units of a given asset within a fixed time frame (e.g. an hour, a day, etc.). Assuming that the purchase or sale of the asset will have an impact on its price, what is the execution policy which minimizes market impact? Having decided on the execution schedule, what type of order (market or limit order) is better to submit? The first problem can be formulated as a trade-off between the expected execution cost and the price risk due to exogenous factors. We shall solve the optimization problem for different types of:

- Price dynamics (arithmetic vs geometric Brownian motion, with or without drift);
- Market impact type (temporary, transient, permanent);
- Exogenous risk functions (variance, value at risk).
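
One common way to write the trade-off described above, in the style of the Almgren-Chriss framework and with variance as the risk function, is sketched below. The notation is assumed for illustration and is not taken from the module: $x_t$ is the number of units still to be traded at time $t$, $X$ the initial position, $T$ the time horizon, $C(x)$ the execution cost of schedule $x$, and $\lambda \ge 0$ a risk-aversion parameter.

$$
\min_{x} \; \mathbb{E}\left[C(x)\right] + \lambda \, \mathrm{Var}\left[C(x)\right]
\qquad \text{subject to} \qquad x_0 = X, \quad x_T = 0 .
$$

Setting $\lambda = 0$ minimises expected cost alone, while larger values of $\lambda$ favour schedules that trade sooner and so reduce exposure to price risk.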

Machine learning techniques are becoming increasingly popular in the financial industry. They are typically used to help predict asset price patterns, volatility regimes, etc. The course starts by formalizing the concept of “learning” and providing an overview of various learning techniques. The subsequent lectures analyze in detail some of the most popular machine learning algorithms, such as neural networks and support vector machines. We then introduce various smoothing tools (kernel regression, wavelets, HHTs) which have historically been developed for signal processing applications but have found their way into finance over the last few years. Those methods can be used on their own or jointly with other learning algorithms, e.g. SVM. Finally, we shall analyze issues related to model selection and how to combine different models to improve the learning outcome. Trading applications using real market data will be presented during the course.

### Survival Models and Actuarial Applications (7.5 ECTS)

Survival models are fundamental to actuarial work, as well as being a key concept in medical statistics. This module will introduce the ideas, placing particular emphasis on actuarial applications. Topics include: concepts of survival models; right-censored, left-censored and randomly censored data; estimation procedures for lifetime distributions (empirical survival functions, Kaplan-Meier estimates, Cox model); statistical models of transfers between multiple states and their maximum likelihood estimators; counting process models.

Actuarial Applications: Life table data and expectation of life. Binomial model of mortality. The Poisson model. Estimation of transition intensities that depend on age. Graduation and testing crude and smoothed estimates for consistency.

For M4S14/M5S14: all of the above and, additionally, Master's-level material to be self-studied (based on a Master's-level textbook, research monograph or paper).
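
To give a flavour of the Kaplan-Meier estimate listed above, here is a minimal sketch for right-censored data. The function name and the five toy observations are hypothetical and chosen only to keep the arithmetic easy to follow by hand.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve as a list of (time, survival probability) pairs.

    events[i] is True if a death was observed at times[i] and False if the
    observation was right-censored at that time.
    """
    survival = 1.0
    curve = []
    # The estimate only changes at times where at least one death was observed.
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy data: deaths at times 2, 3 and 5; censored observations at times 3 and 8.
times = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
km_curve = kaplan_meier(times, events)
print(km_curve)
```

At each death time the running survival probability is multiplied by the fraction of those still at risk who survive, so censored observations contribute to the risk set up to their censoring time without counting as deaths.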

### Quantitative Methods in Retail Finance (7.5 ECTS)

Profitability and behavioural models are introduced for credit risk, based on survival and Markov transition models. Profit and expected profit models are derived based on these formulations, allowing for risk-based pricing and optimization on profit.

State-of-the-art fraud detection methods are introduced such as artificial neural networks and anomaly detectors, along with the use of social network data. Assessment methods for fraud are also discussed.

Evaluation methods based on cross-validation and bootstrap are given, along with a critique of AUC, widely used in retail finance, and derivation of the H-measure.

Capital requirement calculations are given, based on the Basel Accord. In particular, the one-factor Merton model is derived. This leads to models for LGD estimation and panel model methods for estimating asset correlations.

### Time Series (7.5 ECTS)

**Please note: this module currently runs in the Autumn term**

Time series analysis is an important area of statistics with applications in finance, engineering and many physical sciences, as well as in areas of medicine such as neuroscience. This module covers introductory ideas in both the time domain and frequency domain areas of the subject. Topics:

Real examples, stationarity, autocovariance sequences, covariance matrices for segments, examples of discrete stationary processes, trend removal and seasonal adjustment, the general linear process, spectral representation, sampling and aliasing, linear filtering, estimation of mean and autocovariance, spectral estimation via the periodogram, tapering for bias reduction, autoregressive processes and estimation of their parameters, parametric and non-parametric bivariate time series, coherence, forecasting.