How can social media be used to detect the next pandemic?

20 July 2017

Researchers at the Institute for Security Science and Technology (ISST) are using machine learning and social media to identify illness outbreaks.

At the 2017 Munich Security Conference, Bill Gates expressed concern over biosecurity and deadly pandemics:

“Whether it occurs by a quirk of nature or at the hand of a terrorist, epidemiologists say a fast-moving airborne pathogen could kill more than 30 million people in less than a year. And they say there is a reasonable probability the world will experience such an outbreak in the next 10-15 years.”

Early detection of such pathogens is an essential component of biosecurity. The U.S. Department of Defense's Defense Threat Reduction Agency (DTRA) is funding ISST, through the UK Defence Science and Technology Laboratory (Dstl), to support the development of the Biosurveillance Ecosystem (BSVE). The BSVE aims to provide earlier warning and situational awareness of biological threats.

As part of this effort, researchers at the ISST are turning their expertise in machine learning to detecting disease outbreaks via social media. Dr Ovidiu Serban and Nick Thapen tell us more.

What are you up to?

We’re basically scanning Twitter for clues that people might be experiencing an illness. So they might tweet the symptoms they are experiencing, or medication they are on.

We’re focused on the United States at the moment and have two applications.

The first is event detection, combining Twitter geo-location data with tweeted symptoms to spot a spike in cases at the state level, which might suggest an outbreak. We also cross-reference this with data crawled from 16,000 RSS feeds of news agencies.
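To make the idea of spotting a spike concrete, here is a minimal sketch in Python. It is not the researchers' method: it assumes a simple rule in which a day is flagged when its tweet count sits several standard deviations above the average of the preceding week. The function name, the window, the threshold and the counts are all hypothetical.

```python
from statistics import mean, stdev

def detect_spikes(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds the trailing-window
    mean by more than `threshold` standard deviations."""
    spikes = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        mu = mean(baseline)
        # Floor the spread at 1.0 so a perfectly flat baseline
        # does not cause a division by zero.
        sigma = max(stdev(baseline), 1.0)
        if (daily_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Hypothetical daily counts of symptom tweets geolocated to one state.
counts = [20, 22, 19, 21, 20, 23, 21, 22, 60, 24]
print(detect_spikes(counts))  # [8]
```

In this made-up series only the jump to 60 tweets is flagged. A real system would also need to account for weekly patterns, population size and news-driven chatter.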

The second application is called nowcasting. This aims to get a picture of how many people are experiencing a particular set of symptoms now, for example to better forecast medical needs. In the U.S., the Centers for Disease Control and Prevention (CDC) looks at various indicators to monitor individual illnesses for this reason. But there is often a one- to two-week delay in compiling their figures, so we want to see if social media can give earlier warning.

How easy is it to find these clues?

We scan millions of tweets coming out of the U.S. each day. But it’s not as simple as mining for keywords, because keywords are ambiguous. The word ‘cold’ could mean that someone has a cold or that they are cold.

So before looking at keywords, we look at sentences to decide if there is a health context. For this we have designed a classifier: a deep neural network which, through machine learning, teaches itself to distinguish what is a health context and what isn’t.

How does the computer teach itself?

I suppose we should say the computer learns from us. We take a set of tweets, in our case around 5,000, and annotate them manually as either health-related or not. We feed these to the computer which looks for language patterns in how we distinguished the health tweets from the non-health tweets. It can then apply this to new tweets.
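The learn-from-labelled-examples loop described here can be sketched in a few lines of Python. Note the substitution: this is a simple bag-of-words perceptron, not the deep neural network the researchers use, and the four example tweets and all function names are invented for illustration.

```python
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

def train_perceptron(examples, epochs=10):
    """examples: list of (tweet_text, label), label 1 = health, 0 = not."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            tokens = tokenize(text)
            score = bias + sum(weights[t] for t in tokens)
            predicted = 1 if score > 0 else 0
            error = label - predicted  # +1, 0 or -1
            if error:
                # Nudge every word in a misclassified tweet
                # towards the correct label.
                for t in tokens:
                    weights[t] += error
                bias += error
    return weights, bias

def predict(model, text):
    weights, bias = model
    score = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
    return 1 if score > 0 else 0

# Hypothetical hand-annotated tweets (the real set has around 5,000).
examples = [
    ("i have a cold and a sore throat", 1),
    ("it is so cold outside today", 0),
    ("my stomach is painful", 1),
    ("my phone is broken", 0),
]
model = train_perceptron(examples)
print(predict(model, "i have a sore throat"))
```

Because ‘cold’ appears in both a health tweet and a non-health tweet, the model ends up relying on the surrounding words instead, which is the same ambiguity problem in miniature.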

Inevitably, however, in the real world it’s going to encounter language outside of our training data set. So we also need some general language rules to allow our classifier to deal with this.

For this we use a word-embedding tool called GloVe, which analyses lots of text (400 million tweets in our case) and assigns each word a numeric representation according to how it relates to other words. If plotted on a graph, more closely related words will have closer coordinates.

Can you give an example?

Imagine our classifier sees ‘my stomach is painful’ in training, which we annotated as health-related. If it later sees ‘my stomach is in agony’, having not previously seen the word ‘agony’, it might not know that this is also about health.

But GloVe would show that ‘agony’ is closely related to ‘pain’, and so the classifier calculates that the phrase is probably health-related.
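One common way to measure how related two word vectors are is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; they are not real GloVe values (real vectors typically have tens to hundreds of dimensions), and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors, NOT taken from a trained GloVe model.
toy_vectors = {
    "pain": [0.9, 0.1, 0.0],
    "agony": [0.8, 0.2, 0.1],
    "football": [0.0, 0.1, 0.9],
}

def nearest_known_word(word, known_words, vectors):
    """Return the known word whose vector is most similar to `word`."""
    return max(
        known_words,
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )

print(nearest_known_word("agony", ["pain", "football"], toy_vectors))  # pain
```

With these toy values, ‘agony’ lands next to ‘pain’ and far from ‘football’, so a classifier that has only seen ‘pain’ in training still has something to go on.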

So how accurate is the system?

We measured accuracy against a benchmark: the level of agreement between two human annotators. Our annotators agreed with each other between 84% and 89% of the time when deciding whether tweets were health-related or not. Our neural network is currently around 85% accurate, so we know it’s almost as reliable as us!
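Both figures are, at heart, the same calculation: the share of tweets on which two sets of labels match. A minimal sketch, assuming simple percentage agreement (the article does not say which exact measure was used) and using invented labels that are unrelated to the results quoted above:

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items on which two label lists agree."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

# Hypothetical labels for ten tweets (1 = health-related, 0 = not).
annotator_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
model_preds = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

print(percent_agreement(annotator_1, annotator_2))  # 80.0
print(percent_agreement(annotator_1, model_preds))  # 90.0
```

The first number plays the role of the human benchmark; the second is the model's accuracy when one annotator's labels are treated as the reference.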

We’re running the system on Twitter as we speak, and you can see one of our published research papers here.

Article supporters

Defence Science & Technology Laboratory

Reporter

Max Swinscow-Hall

Institute for Security Science & Technology

Email: press.office@imperial.ac.uk