Imperial College London

What's stats got to do with it?


Image credit: Prof Mark Girolami

Data science is a fast-developing discipline affecting myriad aspects of our everyday lives.

From healthcare to banking, travel to advertising, construction to retail: data science unifies broad areas of research, including mathematics, statistics, computer science and machine learning. It explores the acquisition, storage and curation of vast and complex data sets, churned out by modern, tech-focused, device-driven culture. Last month the Statistical Data Science conference examined the cross-disciplinary nature of this field, considering the subject from the perspective of differing scientific communities, focusing particularly on the role of statistics.

Among those giving talks were Professors David Hand (Imperial College London), Michael Jordan (University of California, Berkeley) and David Leslie (Lancaster University); here they explain where their interest in data science lies, exploring the rise, relevance and future of the field.

Tell us a bit about the talks you gave at the conference, and why this area of data science interests you.

Professor Jordan: Data science is, in part, a combination of computational thinking and inferential thinking. Those two styles are not that easy to merge together and this is the goal of data science – to build systems that do statistical inference but that are also computationally robust and scalable. It’s an intellectually interesting area; there are a lot of difficult problems to solve. It’s also of great interest to modern industry; there are tens of thousands of companies worldwide who need to try and solve these issues, so you immediately get to work on real-world, mission-critical problems.

...these new phrases “big data” and “data science” came along; perceptions began to change, and now we have this sort of avalanche because of that.

– Professor David Hand

Imperial College London

Professor Hand: I work in what are called supervised classification problems where the aim is to assign an object to one of a pre-specified number of classes, so it might be diagnosing someone to determine whether they’ve got disease A or disease B, or identifying a spoken word as affirmative or negative. I have particular interest in evaluating the performance of these methods, and questioning how you can improve them... I’ve done a lot of work in the retail finance industry – credit cards, mortgages, bank loans and so on – where there is this fundamental problem of should you give someone a loan or a product or not. Why am I so interested in this area? Ubiquity; it crops up all over the place; I think that’s why I find it particularly worthwhile.

Professor Leslie: [Currently there’s a lot of focus on] finding the best point predictions and ignoring any measure of uncertainty. My research tries to use uncertainty to an advantage to make sensible decisions; it is often worth experimenting with an action for which there is high uncertainty, even if it seems like it is probably a bad option, to make sure we learn to avoid missing out on high rewards in the future. This research can be applied to internet advertising; by exploring different advertisement options companies can learn about both their adverts and their customers, but a balance needs to be achieved in order to make this effective. Too little exploration into untested adverts means the company will never learn which adverts to show to which people; likewise, too much exploration can negatively impact profits as it can impede the amount of effective advertising. It’s a hot topic right on the cross-over of the inference and the decision-making communities … I’ve always put myself in the gaps because that’s where I find things of interest.
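For readers who want to see the exploration trade-off in action, here is a minimal sketch in Python. It uses Thompson sampling, one common way of letting uncertainty guide which advert to show; it is an illustration chosen for this article, not a description of Professor Leslie's own methods. The function name, the three click rates and the number of rounds are all made-up assumptions.

```python
import random

def thompson_sampling(true_click_rates, rounds, seed=0):
    """Simulate advert selection with Beta-Bernoulli Thompson sampling.

    true_click_rates are hypothetical and hidden from the algorithm;
    they are used only to simulate whether a shown advert gets a click.
    """
    rng = random.Random(seed)
    n = len(true_click_rates)
    clicks = [0] * n   # observed clicks per advert
    misses = [0] * n   # observed non-clicks per advert

    for _ in range(rounds):
        # Draw one plausible click rate per advert from its Beta posterior.
        # Adverts we know little about have wide posteriors, so they
        # occasionally draw a high value and get shown: that is exploration.
        samples = [rng.betavariate(clicks[i] + 1, misses[i] + 1) for i in range(n)]
        chosen = max(range(n), key=lambda i: samples[i])

        # Simulate the customer's response to the chosen advert.
        if rng.random() < true_click_rates[chosen]:
            clicks[chosen] += 1
        else:
            misses[chosen] += 1

    shows = [clicks[i] + misses[i] for i in range(n)]
    return shows, clicks

# Three hypothetical adverts with assumed click rates of 2%, 5% and 10%.
shows, clicks = thompson_sampling([0.02, 0.05, 0.10], rounds=5000)
print("times shown:", shows)
print("clicks:     ", clicks)
```

In a run like this, every advert is tried early on, and the display counts then tend to concentrate on whichever advert appears to perform best, with only occasional checks on the others.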

What is the contemporary relevance of data science?

The goal of data science – to build systems that do statistical inference but that are also computationally robust and scalable.

– Professor Michael Jordan

University of California, Berkeley

Professor Hand: In the past we [statisticians] would spend a lot of time trying to convince people that this subject was a very exciting, necessary, universal discipline, but one consistently ran up against the perception that statistics was dry and dusty, to do with adding up columns of numbers rather than a high-tech discipline involving sophisticated computer tools to probe data for understanding. But then, with the rise of high-tech companies like Google and Amazon, which are based on data, these new phrases “big data” and “data science” came along; perceptions began to change, and now we have this sort of avalanche because of that.

Professor Leslie: It’s not that [data science] is more relevant now, it’s that finally people have realised it’s relevant … also people can now get access to computer power, which in the past was very much a specialist resource.

Professor Jordan: Now you have data about not just groups of people, but about every individual person, so you can start to build more personalised services within existing services and that’s a very new phenomenon.

What challenges face this area of research?

Professor Leslie: I think one of the biggest challenges at the moment is a lack of people. There is data everywhere, but not enough skilled people to develop and implement the techniques to extract meaning and act on it. We’re training people as fast as we can, but more investment is needed.

Professor Hand: One of the challenges is data quality; no matter how poor your data, if you throw it into an algorithm the algorithm will spit out a number – you know there’s that old saying “garbage in, garbage out”, and that’s perhaps even more pertinent nowadays. In the old days, if you had a set of a hundred data points, or a thousand data points, you could look at them one by one – is this sensible, does it make sense, is there something wrong with it? [This approach is impossible] if you’ve got a billion data points; and just because you’ve got more data, doesn’t make the problems go away. Quality is even more critical. Failing to acknowledge data quality issues can be catastrophic for companies.
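When there are too many data points to inspect by eye, one option is to write the "does it make sense?" questions down as rules and count how many records break each one. The Python sketch below illustrates that idea under simple assumptions; the field names (`age`, `income`), the rules and the sample records are all hypothetical.

```python
import math

def quality_report(records, rules):
    """Count how many records fail each named rule."""
    failures = {name: 0 for name in rules}
    for record in records:
        for name, is_valid in rules.items():
            if not is_valid(record):
                failures[name] += 1
    return failures

# Hypothetical rules; each returns True when a record looks sensible.
rules = {
    "age_present": lambda r: r.get("age") is not None,
    # A missing age is counted by the rule above, not here.
    "age_in_range": lambda r: r.get("age") is None or 0 <= r["age"] <= 120,
    "income_finite": lambda r: isinstance(r.get("income"), (int, float))
    and math.isfinite(r["income"]),
}

# Made-up records, three of which contain one problem each.
records = [
    {"age": 34, "income": 28000.0},
    {"age": -1, "income": 51000.0},           # impossible age
    {"age": None, "income": 40000.0},         # missing age
    {"age": 52, "income": float("nan")},      # income is not a number
]

report = quality_report(records, rules)
print(report)
```

Checks like these only catch the problems someone thought to write a rule for, so they complement rather than replace careful thought about where the data came from.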

Is the role of statistics in data science unclear?

We need to make sure that the constituent communities of Data Science listen to each other a bit better.

– Professor David Leslie

Lancaster University

Professor Hand: I think it’s pretty clear. I divide data science problems into two types, although the overlap is colossal. On the one hand is the purely computational, which involves using algorithms to search, match and aggregate data; the other, which focuses on using the results to understand and model inference, is fundamentally a statistical thing. Data science involves computing and mathematics to decide how best to manipulate and juggle data, but for extracting understanding and illumination from data you need statistics.

Professor Leslie: A lot of buzz in the data science community is about black box methods such as deep learning, which require huge amounts of data and computation. The statistical role in data science is to ensure the lessons we have learned about issues such as data quality, uncertainty measures and interpretability are not ignored. But statisticians need to move away from being overly cautious and adopt some more of the “can do” attitude that is prevalent in data science.

Professor Niall Adams, one of the organisers, commented that the aim of the conference was “to explore the relationships, differences, clashes between data science and statistics … how to get strength from one to the other in both directions”. What have you taken away from the conference?

Professor Hand: It’s interesting, I think it’s been a very good conference and I’ve been asking myself why. I go to many conferences where there are a whole series of pretty narrow technical talks; people are talking about their most recent piece of research, and if you don’t work in that area you soon lose track of what they’re talking about. I think here, because we’ve had such a diversity of topics – from high level academics and people from industry – we’ve got this heterogeneity. I think that’s made it more exciting.

Professor Leslie: Lots of traditional statisticians are engaging with the data science world and that’s great. But we need to make sure that the constituent communities of Data Science listen to each other a bit better.

Keep up to date with upcoming events and seminars organised by the Statistics section, and find out more about their research.


Claudia Cannon
Faculty of Natural Sciences