Background
Biomedical publications are available in varied and inconsistent formats such as HTML, XML, and PDF, which makes it difficult to apply Natural Language Processing (NLP) and machine learning methods to biomedical text mining. Auto-CORPus addresses this by standardising scientific publications into machine-readable corpora in the BioC format, so that full texts, tables, and abbreviations can be processed efficiently. It converts each publication into three outputs: (1) BioC-formatted full text annotated with standardised section labels from the Information Artifact Ontology, (2) a custom JSON representation of tables extracted from inline or linked HTML, and (3) a JSON file linking abbreviations to their long forms. By automating these conversions, Auto-CORPus enables scalable and accurate text analytics workflows, removing the bottleneck of manual preprocessing and facilitating tasks such as named entity recognition, relation extraction, and literature-based discovery.
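To make the third output more concrete, the sketch below shows one simple way an abbreviation-to-long-form mapping could be built and serialised as JSON. It is an illustration only and not the Auto-CORPus implementation: the function name `find_abbreviations`, the initial-letter matching heuristic, and the flat JSON layout are all assumptions made for this example.

```python
import json
import re


def find_abbreviations(text):
    """Return {short_form: long_form} for 'long form (SF)' patterns.

    Hypothetical heuristic: accept a pair only when the initials of the
    words directly before the brackets spell out the short form.
    """
    pairs = {}
    # A bracketed token of 2-10 characters that starts with a capital letter.
    for match in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()
        # Take as many preceding words as the short form has characters.
        candidate = words[-len(short):]
        initials = "".join(word[0] for word in candidate).upper()
        if len(candidate) == len(short) and initials == short.upper():
            pairs[short] = " ".join(candidate)
    return pairs


sentence = (
    "Natural Language Processing (NLP) supports tasks such as "
    "named entity recognition (NER)."
)
abbreviations = find_abbreviations(sentence)
print(json.dumps(abbreviations, indent=2))
```

Real abbreviation detection has to cope with cases this heuristic misses, such as short forms that skip words or take several letters from a single word.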
Our Contribution
This project followed on from an earlier, successful OSB project. Its main goal was to make it easier to integrate alternative processing pipelines into the Auto-CORPus software, including pipelines for XML versions of publications and for supplementary materials in PDF, Word and Excel formats. The existing code assumed that every input file would be HTML, so we refactored it to be more modular and functional, making such features easier to add. Much of the project was conducted alongside the research group, who added the new features while we completed our refactoring. Working in parallel posed its own challenges, but ultimately achieved far more than we could have done by ourselves.
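One common way to make a codebase open to new input formats is a small registry that maps file types to converter functions, so that supporting a new format means adding a function and leaving the existing ones untouched. The sketch below illustrates that general pattern. It is not the Auto-CORPus API: every name in it (`register`, `convert`, the converter functions, the file suffixes, and the returned dictionaries) is hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: file suffix -> converter function.
Converter = Callable[[Path], dict]
_CONVERTERS: Dict[str, Converter] = {}


def register(*suffixes: str):
    """Decorator that registers a converter for one or more file suffixes."""
    def decorator(func: Converter) -> Converter:
        for suffix in suffixes:
            _CONVERTERS[suffix.lower()] = func
        return func
    return decorator


@register(".html", ".htm")
def convert_html(path: Path) -> dict:
    # Placeholder body: a real converter would parse the file contents.
    return {"source": path.name, "format": "html"}


@register(".xml")
def convert_xml(path: Path) -> dict:
    return {"source": path.name, "format": "xml"}


@register(".pdf", ".docx", ".xlsx")
def convert_supplementary(path: Path) -> dict:
    return {"source": path.name, "format": "supplementary"}


def convert(path) -> dict:
    """Dispatch to the converter registered for the file's suffix."""
    path = Path(path)
    try:
        converter = _CONVERTERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"No converter registered for {path.suffix!r}") from None
    return converter(path)


print(convert("paper.xml"))
print(convert("supplementary_table.XLSX"))
```

With this design, the dispatch logic in `convert` never changes when a format is added, and each converter can be tested in isolation.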
Outcomes
The v1.1.1 release, made at the conclusion of our project, included a significantly refactored codebase and added support for analysing XML files and multiple forms of supplementary material. The code is also now far easier to work with and test, which should make it simpler to add features in future and result in a more robust, trusted software tool.
Testimonials
“The RSE team rapidly integrated into the project, improving code quality, clarifying testing, and helping us streamline Auto-CORPus into a cleaner, more professional library that is better for both us within the CoDiet project and other users.” - Antoine Lain, Postdoc in Biomedical Natural Language Processing
“The RSE team have worked closely with two postdocs on aspects of software engineering, and the software is now much easier to contribute to, and their efforts in streamlining the code with the postdocs will facilitate introducing the next updates in a better way.” - Joram Posma, Associate Professor in Biomedical Informatics
“Bringing in the RSE team provided some critical momentum our software development project needed. Their expertise in applying current coding best practices helped to improve the robustness of our codebase. The contribution from the RSE team not only enhanced code quality but helped the wider project team understand new software development techniques which can be applied to other projects.” - Tim Beck, Associate Professor in Federated Systems and Bioinformatics (University of Nottingham)
“Working closely with the RSE team brought best practices into our project’s development cycle, refactoring existing code and putting new tools in our hands to ensure code quality, maintenance and testing is in line with more experienced software engineering workflows.” - Thomas Rowlands, Research Fellow in Health Informatics