Imperial College London

Dr Joram M. Posma PhD MSc BAS MRSC

Faculty of Medicine, Department of Metabolism, Digestion and Reproduction

Senior Lecturer in Biomedical Informatics
 
 
 
Contact

 

j.posma11 Website

 
 
Location

 

E305, Burlington Danes, Hammersmith Campus

Publications

Citation

BibTeX format

@unpublished{Hu:2021:10.1101/2021.01.08.425887,
author = {Hu, Y and Sun, S and Rowlands, T and Beck, T and Posma, JM},
doi = {10.1101/2021.01.08.425887},
publisher = {bioRxiv},
title = {Auto-CORPus: automated and consistent outputs from research publications},
url = {http://dx.doi.org/10.1101/2021.01.08.425887},
year = {2021}
}

RIS format (EndNote, RefMan)

TY  - UNPB
AB  - Motivation: The availability of improved natural language processing (NLP) algorithms and models enables researchers to analyse larger corpora using open source tools. Text mining of biomedical literature is one area for which NLP has been used in recent years with large untapped potential. However, in order to generate corpora that can be analyzed using machine learning NLP algorithms, these need to be standardized. Summarizing data from literature to be stored in databases typically requires manual curation, especially for extracting data from result tables. Results: We present here an automated pipeline that cleans HTML files from biomedical literature. The output is a single JSON file that contains the text for each section, table data in machine-readable format and lists of phenotypes and abbreviations found in the article. We analyzed a total of 2,441 Open Access articles from PubMed Central, from both Genome-Wide and Metabolome-Wide Association Studies, and developed a model to standardize the section headers based on the Information Artifact Ontology. Extraction of table data was developed on PubMed articles and fine-tuned using the equivalent publisher versions. Availability: The Auto-CORPus package is freely available with detailed instructions from GitHub at https://github.com/jmp111/AutoCORPus/.
AU  - Hu,Y
AU  - Sun,S
AU  - Rowlands,T
AU  - Beck,T
AU  - Posma,JM
DO  - 10.1101/2021.01.08.425887
PB  - bioRxiv
PY  - 2021///
TI  - Auto-CORPus: automated and consistent outputs from research publications
UR  - http://dx.doi.org/10.1101/2021.01.08.425887
UR  - https://www.biorxiv.org/content/10.1101/2021.01.08.425887v1
UR  - http://hdl.handle.net/10044/1/88967
ER  - 