Imperial College London

Dr Kirill Veselkov

Faculty of Medicine, Department of Surgery & Cancer

Lecturer
 
 
 
Contact

 

+44 (0)20 7594 3899
kirill.veselkov04

Location

 

Sir Alexander Fleming Building, South Kensington Campus

Publications

Citation

BibTeX format

@article{Galea:2018:bioinformatics/bty152,
author = {Galea, D and Laponogov, I and Veselkov, K},
doi = {10.1093/bioinformatics/bty152},
journal = {Bioinformatics},
pages = {2472--2482},
title = {Exploiting and assessing multi-source data for supervised biomedical named entity recognition},
url = {http://dx.doi.org/10.1093/bioinformatics/bty152},
volume = {34},
year = {2018}
}

RIS format (EndNote, RefMan)

TY  - JOUR
AB - Motivation: Recognition of biomedical entities from scientific text is a critical component of natural language processing and automated information extraction platforms. Modern named entity recognition approaches rely heavily on supervised machine learning techniques, which are critically dependent on annotated training corpora. These approaches have been shown to perform well when trained and tested on the same source. However, in such scenario, the performance and evaluation of these models may be optimistic, as such models may not necessarily generalize to independent corpora, resulting in potential non-optimal entity recognition for large-scale tagging of widely diverse articles in databases such as PubMed. Results: Here we aggregated published corpora for the recognition of biomolecular entities (such as genes, RNA, proteins, variants, drugs, and metabolites), identified entity class overlap and performed leave-corpus-out cross validation strategy to test the efficiency of existing models. We demonstrate that accuracies of models trained on individual corpora decrease substantially for recognition of the same biomolecular entity classes in independent corpora. This behavior is possibly due to limited generalizability of entity-class-related features captured by individual corpora (model “overtraining”) which we investigated further at the orthographic level, as well as potential annotation standard differences. We show that the combined use of multi-source training corpora results in overall more generalizable models for named entity recognition, while achieving comparable individual performance. By performing learning-curve-based power analysis we further identified that performance is often not limited by the quantity of the annotated data.
AU - Galea,D
AU - Laponogov,I
AU - Veselkov,K
DO - 10.1093/bioinformatics/bty152
EP - 2482
PY - 2018///
SN - 1367-4803
SP - 2472
TI - Exploiting and assessing multi-source data for supervised biomedical named entity recognition
T2 - Bioinformatics
UR - http://dx.doi.org/10.1093/bioinformatics/bty152
UR - http://hdl.handle.net/10044/1/57872
VL - 34
ER -