Background
Auto‑CORPus addresses the difficulty of applying Natural Language Processing (NLP) and machine learning methods to biomedical publications, which are available in varied and inconsistent formats such as HTML, XML, and PDF. It standardises scientific publications into machine-readable corpora in the BioC format so that full texts, tables, and abbreviations can be processed efficiently. Auto‑CORPus converts each publication into three outputs: (1) BioC‑formatted full text annotated with standardised section labels using the Information Artifact Ontology, (2) a custom JSON representation of tables extracted from inline or linked HTML, and (3) a JSON file linking abbreviations to their long forms. By automating these conversions, Auto‑CORPus enables scalable and accurate text analytics workflows, removing the bottleneck of manual preprocessing and facilitating tasks such as named entity recognition, relation extraction, and literature-based discovery.
Our Contribution
The main goal of this OSB was to modernise and professionalise the Auto-CORPus codebase. The existing repository (tag v1.0.0) had limited and outdated tooling and dependency management, so we used our python-template to update it. This included adopting Poetry for packaging and dependency management, Ruff for linting and formatting, pytest for testing, and GitHub Actions for CI/CD. This required a small restructuring of the repository, but the main code was left untouched. To ensure that upgrading to newer package versions did not change any results, we added an integration test to the repository.
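As an illustration of the kind of integration test described here, the sketch below compares the output of a conversion step against a stored "golden" file produced by a trusted earlier version. It is a minimal example under stated assumptions, not the test in the Auto-CORPus repository: `convert`, `matches_golden`, and the file names are hypothetical stand-ins and do not reflect the real Auto-CORPus API.

```python
import json
from pathlib import Path


def convert(html: str) -> dict:
    """Hypothetical stand-in for the conversion under test (NOT the real API)."""
    # Collapse runs of whitespace so the output is deterministic.
    text = " ".join(html.split())
    return {"text": text, "n_chars": len(text)}


def matches_golden(input_path: Path, golden_path: Path) -> bool:
    """Return True if converting the input reproduces the stored output."""
    actual = convert(input_path.read_text(encoding="utf-8"))
    expected = json.loads(golden_path.read_text(encoding="utf-8"))
    return actual == expected


def test_output_unchanged(tmp_path: Path) -> None:
    """pytest test; tmp_path is pytest's built-in temporary directory fixture."""
    input_path = tmp_path / "paper.html"
    golden_path = tmp_path / "paper_expected.json"
    input_path.write_text("<p>Example   text</p>", encoding="utf-8")
    # In a real repository the golden file would be committed alongside the
    # test data, generated once from the trusted release. It is written here
    # only to keep the sketch self-contained.
    golden_path.write_text(
        json.dumps({"text": "<p>Example text</p>", "n_chars": 19}),
        encoding="utf-8",
    )
    assert matches_golden(input_path, golden_path)
```

If a dependency upgrade changed the converted output in any way, the comparison would fail and the change would surface in CI before a release.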
Outcomes
This work culminated in the release of version 1.1.0 on PyPI and set up the repository for easier, more sustainable development. It was followed a few months later by a longer RSE project (include link to other case study).
Testimonials
“The OSB brought us up to speed with current software development tools, approaches to code problems and automation of vital testing procedures. I feel more confident in our code quality and ability to maintain our growing codebase following the OSB.” - Thomas Rowlands, Research Fellow in Health Informatics
“Thanks to the RSE team's support in organising, structuring and professionalising the Auto-CORPus codebase, we have created a more welcoming and accessible open-source environment where others can easily contribute.” - Antoine Lain, Research Associate in Biomedical Natural Language Processing
“With the help of the OSB programme, the Auto-CORPus code was revamped and made pip installable with a PyPI distribution and also updated to work with the latest PubMed Central website configuration, which makes the software more widely usable than it was before.” - Joram Posma, Associate Professor in Biomedical Informatics