New data synthesis tool can rapidly sort complex datasets to aid decision-makers

11 July 2022

A new data synthesis tool can help tackle ‘infodemics’ by quickly sorting through large, complex datasets to make optimal decisions for society.

Data scientists from Imperial College London, in collaboration with commercial partners AWS, MirrorWeb and CloudWick, have developed a new platform called Realtime Data Synthesis and Analysis (REDASA) to help curate and filter large amounts of complex information quickly. The platform is intended to help stakeholders make optimal decisions for society, which should ultimately improve outcomes in areas such as public health and safety.

While the huge scale of the scientific response to the COVID-19 pandemic has unquestionably saved lives, the sheer volume and velocity of new information published each day has triggered an unprecedented ‘infodemic’ – an overabundance of information both online and offline.

By combining the knowledge of medical experts with the efficiency of an artificial-intelligence-enabled engine, the team, including scientists from Imperial’s Data Science Institute, developed a data extraction methodology that retains only the documents carrying the most relevant and important information about COVID-19. This new method can be applied to other extensive datasets in the future.

The study, published in the Journal of Medical Internet Research, used REDASA to create one of the world’s largest and most up-to-date sources of COVID-19-related evidence, consisting of over 104,000 documents.

An ‘infodemic’

COVID-19 is the first pandemic in history in which technology and social media are being used on a massive scale to keep people safe, informed, productive, and connected.

However, at the same time, the technology we rely on has enabled an ‘infodemic’ that has undermined the global response and measures to control the pandemic.

The rapid publication of large amounts of data across both peer-reviewed and non-peer-reviewed sources presents considerable challenges for stakeholders such as policy makers, clinicians, and patients.

These stakeholders must rapidly synthesise information to make optimal, evidence-based decisions for the benefit of society and for the protection of public health. However, current methods of synthesising data cannot keep up with the pace of the rapidly changing information landscape.

Therefore, there is an urgent need to capture, structure and interpret large and complex datasets in real time.

Human-in-the-loop methodology

REDASA’s design adopts a human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine. This means that human experts are involved in the decision-making process along with AI components: the automated system filters the information down to a manageable amount, human experts then check the results and assess the quality of the selected work, and another AI component verifies that the experts are consistent.
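The three stages described above can be illustrated with a minimal Python sketch. It is not REDASA's implementation: the keyword-overlap score stands in for the natural language processing search engine, the reviewer labels are invented, and Cohen's kappa is used here only as one common way to measure whether two reviewers agree. All function names, the threshold and the sample documents are hypothetical.

```python
from collections import Counter


def relevance_score(text, query_terms):
    """Fraction of query terms found in the text (toy stand-in for a search engine)."""
    words = set(text.lower().split())
    return sum(term in words for term in query_terms) / len(query_terms)


def shortlist(documents, query_terms, threshold=0.5):
    """Stage 1: automated filtering down to a manageable set of document ids."""
    return [doc_id for doc_id, text in documents.items()
            if relevance_score(text, query_terms) >= threshold]


def cohen_kappa(labels_a, labels_b):
    """Stage 3: chance-corrected agreement between two reviewers' labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both reviewers used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical corpus and query (illustrative only).
documents = {
    "d1": "covid vaccine trial results in adults",
    "d2": "football league results",
    "d3": "covid transmission in schools",
}
selected = shortlist(documents, ["covid", "vaccine"])

# Stage 2: human experts rate the quality of each shortlisted document.
reviewer_a = {"d1": "high", "d3": "low"}
reviewer_b = {"d1": "high", "d3": "low"}

agreement = cohen_kappa([reviewer_a[d] for d in selected],
                        [reviewer_b[d] for d in selected])
print(selected, agreement)
```

In this sketch, a kappa close to 1 suggests the reviewers are rating consistently, while a value near 0 means their agreement is no better than chance and the ratings may need a second look.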

The platform has been designed for use across a wide range of data-rich subject areas while keeping application and impact in mind. It continuously captures and synthesises both academic literature and relevant ‘grey’ literature (including news websites, policy documentation and social media posts) to develop a data curation approach that could supplement machine-learning methodologies.

Next steps

According to co-author Dr Ovidiu Serban, “This study represents a promising tool for use in the future of information retrieval research and data quality in medicine. We are currently validating the same pipeline in cancer research, while also working with research groups looking at systematic reviews for biomarkers.”

Moving forward, he said: “We are working with publishers to get more data and to ensure all data is fully accessible. We are also looking into using this tool for analysing social media to incorporate public opinion and discussions around various medical treatments. Ideally by showing the evolution of medical evidence over time, and being more open with existing medical evidence, we would be able to counteract future fake news phenomenon.”

The development of the REDASA platform is a promising step towards ensuring important decisions about public health, infrastructure and society can be made quickly, safely and efficiently in the future.

‘Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study’ by Vaghela et al. was published on 23 May 2021.

The REDASA Covid-19 Open Data snapshot can be found here.