Imperial College London

Hilary Watt CStat FHEA MSc MA(Oxon) BA

Faculty of Medicine, School of Public Health

Senior Teaching Fellow in Statistics



+44 (0)20 7594 7451 | h.watt

322, Reynolds Building, Charing Cross Campus




Burwalls Statistics Teaching conference resources and Innovative Teaching Methods

BURWALLS DISCUSSION ON TEACHING CONFIDENCE INTERVALS: RESOURCES compiled during Burwalls online Statistics teaching conference, 11-13 July 2022. Video presentations based on these conference talks/ discussions will be posted shortly.

A discussion on teaching methods for confidence intervals, and a second discussion on teaching p-values and the desired focus on results amongst study participants, were led by myself, Kay Leedham-Green and Renata Medeiros Mirra. Damian Farnell also contributed to the survey of Burwalls participants, whose results were presented.

Burwalls views on what we need to teach, so that students get CIs: what do students need to understand to get CIs

Confidence interval (CI) misconceptions that Burwalls conference participants feel we need to watch out for. Feel free to add your own: Misconceptions sheet

Teaching resources and ideas for teaching CIs, from Burwalls conference participants. Feel free to add your own: Teaching resources

Burwalls participants' comments on the merit of interpreting results in study participants, and any lessons on how we teach p-values: P-value/ study participants discussion.

BURWALLS: Annual Conference for Teaching of Statistics in Medicine and Allied Health Sciences: venues list: Many thanks to Dr Vikki O'Neill (Queen's University Belfast) for hosting and co-ordinating Burwalls 2022 and Burwalls 2023. Many thanks to Dr Damian Farnell and Dr Renata Medeiros Mirra (Cardiff Dental School) for hosting and co-ordinating Burwalls 2020 & 2021, and for their work in editing the Burwalls Statistics teaching book, based on expert contributions to Burwalls conferences. Earlier Burwalls conferences were hosted at Norwich Medical School and, in 2018, by Dr Margaret McDougall at Edinburgh Medical School.

STATISTICS MASTERCLASS: new module, led by Hilary Watt: teaching practical interpretations and implications of CIs and of p-values, including the influence of strategies of analysis on these. Teaching selection of regression models for practical research questions and interpretation of regression results. This module will be open for staff, PhD students and MSc/ MPH/ BSc students from the Department of Public Health, Imperial College London to attend, so that many people can benefit from her ability to draw out core concepts, including use of innovative teaching methods.

“Your (Int J Epi stats teaching) paper was enormously helpful to me when I returned to teaching”. Dr Vicky Ryan, lecturer in biostatistics at Newcastle University. July 2021, in online chat at Burwalls statistics teaching conference and via Twitter.

“You've gained a reputation for explaining concepts really clearly. This cohort is remarkable for the lack of any students coming forwards to complain that they do not understand statistics, despite a particularly high proportion coming forward over exam stress because of their more in-depth and broader statistical examinations.” Dr Christine Franey, senior academic tutor on MPH at Imperial College. January 2019.


Key ideas are covered in Hilary's talk (view by clicking here).

Statisticians have expressed concern over many years about poor standards of academic statistical interpretations, with many people finding Statistics hard to learn. Formal definitions (of p-values and confidence intervals) are far from intuitive, being misquoted even by many statisticians who teach medical statistics. Some consider these definitions to be impractical, with many such educators relying on informal interpretations; yet standard informal interpretations may be too crude to be much help, with some even leading towards common misconceptions. (Three abstracts based on a survey of statistics educators were accepted by the Royal Statistical Society conference as talk/ poster presentations, and the survey was the basis for two discussion forums at the Burwalls annual conference 2022.)

P-value and confidence interval definitions are based on their calculation methods, which involve taking repeated theoretical random samples from some “population”. Yet in practical medical research, participants are rarely selected at random, and we never take many repeated random samples. Hence the “population” can rarely be clearly defined; definitions and interpretations rely on an understanding of this concept and its practical relevance in the absence of random sampling. Some people cannot easily deduce the practical value of p-values and confidence intervals when only one set of study participants is available. Hence Hilary has designed definitions that are rooted in practical research, with one set of study participants available. Her definitions explain the overall purpose of p-values and confidence intervals within them, responding to evidence that many people do not develop such understanding from traditional teaching methods. She has developed images, and practical exercises relating to these images, which focus on further development of conceptual understanding. She noticed that intentionally making definitions more transparent, and deliberately fostering conceptual understanding, can help the subject make more sense and become easier to understand. Clear, transparent interpretations support the agenda of improving standards of statistical interpretation.

There is a risk that distortions in informal interpretations mean that the subject does not make much sense, which is one reason why students find it hard or fail to understand. This may be linked to widespread poor standards of interpretation in the applied literature, which the ASA is attempting to address. Hence greater clarity in language can potentially make statistics easier to understand.

Referring to results in study participants: Statistics teachers generally refer to relative risk in the sample, which implies a random sample of some population. Yet participants are rarely selected randomly from any population, hence this population cannot be clearly defined. It is unnecessarily convoluted to refer to study participants via their relationship to this theoretical population. Hence for clarity I use the words "study participants". This clear starting point helps students to understand confidence intervals, which can then be explained by contrast with it. Watt HC 2020

Example of clear, transparent confidence interval interpretation: Risk of myocardial infarction was 19% higher in male compared to female study participants. We are 95% sure that the risk is between 5% and 35% higher amongst males in the population than amongst females. [More precisely: The 95% confidence interval (from 1·05 to 1·35) contains values for the population relative risk that could have given rise to the relative risk calculated amongst study participants, at the 95% confidence level] This interval reflects imprecision based on generalising beyond study participants to the population from which they were selected at random, i.e. reflecting imprecision based solely on assumed random choices in selection of study participants. Watt HC 2020
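The calculation behind an interval of this kind can be sketched in a few lines. The following is a minimal illustration, not taken from Watt HC 2020: it assumes the standard large-sample method for a relative risk (a normal approximation on the log scale), and the counts used are hypothetical, chosen only to give a relative risk of 1·19. They are not the data behind the interval quoted above and do not reproduce it. The function name `relative_risk_ci` is also an illustrative choice.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Relative risk of group A versus group B amongst study participants,
    with a confidence interval for the population relative risk.

    Assumes participants were selected at random from the population
    and uses the usual large-sample approximation on the log scale.
    """
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard error of log(RR) for two independent groups
    se_log_rr = sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    # Normal quantile for the chosen confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 119 events in 1000 men, 100 events in 1000 women
rr, lower, upper = relative_risk_ci(119, 1000, 100, 1000)
print(f"Risk was {100 * (rr - 1):.0f}% higher in male study participants")
print(f"95% CI for the population relative risk: {lower:.2f} to {upper:.2f}")
```

The first printed line describes the study participants directly; only the second line involves the population and the assumption of random sampling, which mirrors the two-step wording of the interpretation above.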

Standard practice when interpreting confidence intervals: informal interpretations usually omit any mention of who forms the “population”, and of the underlying assumption of random sampling. Yet understanding the population is potentially the trickiest part of understanding confidence intervals, including understanding why we focus on the population, and what it represents when there is no random sampling. Statistics courses often repeat CI “definitions” throughout, to reinforce their meaning. Yet the value of such repetition may be lost when the trickiest aspect of understanding what CIs represent is not reflected in them. I have therefore developed transparent explanatory definitions that explain the concept of the population within them.

Example of clear, transparent p-value interpretation: Study data showed low compatibility (p=0·006) with being selected by probability random sampling from a population where men and women had the same risk of myocardial infarction, assuming that the assumptions underpinning the calculation methods are correct. Use of the term "probability random sampling" makes the distinction from the everyday meaning of the word "random" (something that is unexpected or bizarre). Watt HC 2020

Poor practice when interpreting p-values: it is common practice to refer to p<0.05 as “statistically significant”, yet in plain English this implies large enough to matter. This language may be one reason why many people ignore strengths of association, focusing on whether p<0.05. For instance, when p>0.05, many people refer to “no association” or even having “proven no association”, even when the relative risk amongst study participants is large.

P-value interpretations often omit to refer to the population, which may be one reason why many researchers, including around half of statistical authors, have been shown to erroneously believe that p>0.05 implies lack of knowledge of whether the relative risk is above one amongst study participants, rather than merely amongst the population. This fundamental misunderstanding, with excessive focus on whether p<0.05, can lead to researchers lacking the ability to sensibly summarise the evidence across the research literature.

Figure to promote understanding of the continuous nature of p-values: The x-axis shows the ratio of strength of association (such as difference in means between groups) to precision of estimate (of this difference in means in the population, assuming study participants are selected at random from this population). The y-axis shows the p-value, which reflects compatibility of study data with being selected at random from a population where there is no difference in means. This also applies to some other measures of strength of association. Watt HC 2020
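The relationship shown in the figure can be tabulated directly. The sketch below is illustrative only and assumes a normal approximation, in which the x-axis ratio (estimate divided by its standard error) is a z-value and the two-sided p-value is the probability of a value at least that far from zero under random sampling from a population with no difference. The function name `two_sided_p_value` is a hypothetical choice, not something from the source.

```python
from math import erfc, sqrt


def two_sided_p_value(z):
    """Two-sided p-value for a z-value (estimate / standard error),
    assuming a standard normal distribution when there is no
    difference in the population."""
    return erfc(abs(z) / sqrt(2))


# The p-value falls smoothly as the ratio grows; nothing special
# happens as it passes 0.05.
for z in (0.0, 0.5, 1.0, 1.5, 1.96, 2.5, 2.75, 3.5):
    print(f"z = {z:4.2f}   p = {two_sided_p_value(z):.4f}")
```

Printing a range of values rather than a single threshold test is deliberate: it shows p as a continuous measure of compatibility rather than a yes/no verdict.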

The following figure (can be downloaded) is a compromise, interpreting p-values as strength of evidence for associations in the population (assuming random sampling of participants from this population). This is arguably appropriate for primary outcomes in medical research.

P-values vs Z-values: Interpreting as strength of evidence for association in population (assuming random sampling of participants from population)

Data management in R and in Stata

The most time-consuming part of data analysis is generally to prepare the data for analysis. If attention isn't given to the accuracy of data at this stage, results can be based on erroneous data, potentially rendering them worthless.

Hilary has prepared resources in both Stata and R. Whereas most software teaching is arranged around types of command, she prepares resources arranged around practical problems that people face. There are resources for data cleaning, teaching how to get data into good order. She teaches important checks and requirements prior to merging, and some commands that may be useful to tidy up the data for this purpose. She teaches how to check results and respond to some common error messages. She teaches restructuring of datasets, including the commands needed to prepare the data before the restructuring commands can be applied. This saves students from taking, for example, a whole module to learn string functions when they need only one string function command and might not recognise which one is relevant.
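As an illustration of the kind of check that is worth making before a merge, here is a minimal sketch. It is written in plain Python rather than Stata or R, it is not drawn from Hilary's resources, and the function names, the `patient_id` key and the tiny datasets are all hypothetical. It checks that the identifier is unique in each dataset before a one-to-one merge, and reports which identifiers appear in only one dataset.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the key values that appear more than once in a dataset."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def merge_one_to_one(left, right, key):
    """Merge two lists of records on `key`, refusing to proceed if the
    key is duplicated in either dataset. Also reports unmatched keys."""
    for name, rows in (("left", left), ("right", right)):
        dups = duplicate_keys(rows, key)
        if dups:
            raise ValueError(f"{name} dataset has duplicate {key} values: {dups}")
    right_by_key = {row[key]: row for row in right}
    left_keys = {row[key] for row in left}
    merged = [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]
    left_only = sorted(left_keys - right_by_key.keys())
    right_only = sorted(right_by_key.keys() - left_keys)
    return merged, left_only, right_only


# Hypothetical example data
demographics = [
    {"patient_id": 1, "sex": "F"},
    {"patient_id": 2, "sex": "M"},
    {"patient_id": 3, "sex": "F"},
]
outcomes = [
    {"patient_id": 2, "mi": 0},
    {"patient_id": 3, "mi": 1},
    {"patient_id": 4, "mi": 0},
]

merged, left_only, right_only = merge_one_to_one(demographics, outcomes, "patient_id")
print(merged)
print("Only in demographics:", left_only)
print("Only in outcomes:", right_only)
```

Raising an error on duplicate identifiers, rather than merging silently, forces the duplicates to be investigated before any results are based on the merged data.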

Some PhD students attended these Stata training sessions towards the end of their dissertations, after using Google to learn Stata. They were amazed at how many efficient data wrangling functions they had missed, and how much time these would have saved them.

MPH students have described resources as “awesome”. Some commented that they were still using the resources a few years after finishing their studies.

The Stata data preparation documents were previously available on the internet, with researchers requesting the associated datasets and praising the documents highly for how well laid out they are.