Imperial College London

Dr Joram M. Posma PhD MSc B AS MRSC

Faculty of Medicine, Department of Metabolism, Digestion and Reproduction

Senior Lecturer in Biomedical Informatics
 
 
 

Contact

 

j.posma11 Website

 
 

Location

 

E305, Burlington Danes, Hammersmith Campus


Summary

 

Publications

Citation

BibTeX format

@inproceedings{Li:2021:10.6084/m9.figshare.14784858,
author = {Li, Z and Makraduli, F and Yeung, C and McQuibban, NAR and Popovici, C and Sun, S and Hu, Y and Rowlands, T and Posma, JM and Beck, T},
doi = {10.6084/m9.figshare.14784858},
title = {Auto-CORPus: Automated and Consistent Outputs from Research Publications},
url = {http://dx.doi.org/10.6084/m9.figshare.14784858},
year = {2021}
}

RIS format (EndNote, RefMan)

TY  - CPAPER
AB - The availability of improved natural language processing (NLP) algorithms and models enables researchers to analyse larger corpora using open source tools. Text mining of biomedical literature is one area for which NLP has been used in recent years with large untapped potential. However, to generate corpora that can be analysed using machine learning NLP algorithms, these need to be standardized. Summarizing data from literature to be stored into databases typically requires manual curation, especially for extracting data from result tables. We present an automated pipeline that cleans HTML files from biomedical literature. The outputs are JSON files that contain the text for each section, table data in machine-readable format and lists of the phenotypes, assays, chemical compounds, SNPs, P-values and abbreviations found in the article. We analysed a total of 2,441 Open Access articles from PubMed Central, from both Genome-Wide and Metabolome-Wide Association Studies, and developed a model to standardize the section headers based on the Information Artifact Ontology. Extraction of table data was developed on PubMed articles and fine-tuned using the equivalent publisher versions. As part of this work we found evidence of tables being converted to figures by authors as well as publishers. To this end we have developed a pipeline that converts the table from an image back to text while keeping the table structure intact. We have fine-tuned the Tesseract optical character recognition (OCR) algorithm specifically for biomedical table data. We have improved the accuracy of recognising characters in table-images using the original Tesseract algorithm from 53% to 90% when evaluated on 233 tables from 80 publications. In summary, Auto-CORPus can be used to create a corpus for different fields where the section headers are standardised to allow NLP algorithms to be applied to specific paragraphs, rather than only on abstracts or the full text.
AU - Li,Z
AU - Makraduli,F
AU - Yeung,C
AU - McQuibban,NAR
AU - Popovici,C
AU - Sun,S
AU - Hu,Y
AU - Rowlands,T
AU - Posma,JM
AU - Beck,T
DO - 10.6084/m9.figshare.14784858
PY - 2021///
TI - Auto-CORPus: Automated and Consistent Outputs from Research Publications
UR - http://dx.doi.org/10.6084/m9.figshare.14784858
UR - http://hdl.handle.net/10044/1/93935
ER -