Imperial College London

Dr Ben Glocker

Faculty of Engineering, Department of Computing

Professor in Machine Learning for Imaging

Contact

+44 (0)20 7594 8334 · b.glocker

Location

377 Huxley Building, South Kensington Campus

Publications

352 results found

Khara G, Trivedi H, Newell MS, Patel R, Rijken T, Kecskemethy P, Glocker B et al., 2024, Generalisable deep learning method for mammographic density prediction across imaging techniques and self-reported race, Commun Med (Lond), Vol: 4

BACKGROUND: Breast density is an important risk factor for breast cancer complemented by a higher risk of cancers being missed during screening of dense breasts due to reduced sensitivity of mammography. Automated, deep learning-based prediction of breast density could provide subject-specific risk assessment and flag difficult cases during screening. However, there is a lack of evidence for generalisability across imaging techniques and, importantly, across race. METHODS: This study used a large, racially diverse dataset with 69,697 mammographic studies comprising 451,642 individual images from 23,057 female participants. A deep learning model was developed for four-class BI-RADS density prediction. A comprehensive performance evaluation assessed the generalisability across two imaging techniques, full-field digital mammography (FFDM) and two-dimensional synthetic (2DS) mammography. A detailed subgroup performance and bias analysis assessed the generalisability across participants' race. RESULTS: Here we show that a model trained on FFDM-only achieves a 4-class BI-RADS classification accuracy of 80.5% (79.7-81.4) on FFDM and 79.4% (78.5-80.2) on unseen 2DS data. When trained on both FFDM and 2DS images, the performance increases to 82.3% (81.4-83.0) and 82.3% (81.3-83.1). Racial subgroup analysis shows unbiased performance across Black, White, and Asian participants, despite a separate analysis confirming that race can be predicted from the images with a high accuracy of 86.7% (86.0-87.4). CONCLUSIONS: Deep learning-based breast density prediction generalises across imaging techniques and race. No substantial disparities are found for any subgroup, including races that were never seen during model development, suggesting that density predictions are unbiased.

Journal article
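
The abstract above reports accuracies followed by ranges in parentheses, which read as 95% confidence intervals. The abstract does not say how these were obtained; a common, model-agnostic way to produce such intervals is bootstrap resampling of test-set predictions. The sketch below is a generic illustration of that technique only, with synthetic placeholder labels, and is not the authors' evaluation code.

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap CI for classification accuracy."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    accs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample cases with replacement
        accs[b] = np.mean(y_true[idx] == y_pred[idx])
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return np.mean(y_true == y_pred), (lo, hi)

# Synthetic 4-class labels (e.g. BI-RADS a-d coded 0-3), for illustration only.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 4, size=1000)
y_pred = np.where(rng.random(1000) < 0.8, y_true, rng.integers(0, 4, size=1000))
print(bootstrap_accuracy_ci(y_true, y_pred))
```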

Doran SJ, Barfoot T, Wedlake L, Winfield JM, Petts J, Glocker B, Li X, Leach M, Kaiser M, Barwick TD, Chaidos A, Satchwell L, Soneji N, Elgendy K, Sheeka A, Wallitt K, Koh D-M, Messiou C, Rockall A et al., 2024, Curation of myeloma observational study MALIMAR using XNAT: solving the challenges posed by real-world data, Insights Imaging, Vol: 15, ISSN: 1869-4101

OBJECTIVES: MAchine Learning In MyelomA Response (MALIMAR) is an observational clinical study combining "real-world" and clinical trial data, both retrospective and prospective. Images were acquired on three MRI scanners over a 10-year window at two institutions, leading to a need for extensive curation. METHODS: Curation involved image aggregation, pseudonymisation, allocation between project phases, data cleaning, upload to an XNAT repository visible from multiple sites, annotation, incorporation of machine learning research outputs and quality assurance using programmatic methods. RESULTS: A total of 796 whole-body MR imaging sessions from 462 subjects were curated. A major change in scan protocol part way through the retrospective window meant that approximately 30% of available imaging sessions had properties that differed significantly from the remainder of the data. Issues were found with a vendor-supplied clinical algorithm for "composing" whole-body images from multiple imaging stations. Historic weaknesses in a digital video disk (DVD) research archive (already addressed by the mid-2010s) were highlighted by incomplete datasets, some of which could not be completely recovered. The final dataset contained 736 imaging sessions for 432 subjects. Software was written to clean and harmonise data. Implications for the subsequent machine learning activity are considered. CONCLUSIONS: MALIMAR exemplifies the vital role that curation plays in machine learning studies that use real-world data. A research repository such as XNAT facilitates day-to-day management, ensures robustness and consistency and enhances the value of the final dataset. The types of process described here will be vital for future large-scale multi-institutional and multi-national imaging projects. CRITICAL RELEVANCE STATEMENT: This article showcases innovative data curation methods using a state-of-the-art image repository platform; such tools will be vital for managing the l

Journal article

Maier-Hein L, Reinke A, Godau P, Tizabi MD, Buettner F, Christodoulou E, Glocker B, Isensee F, Kleesiek J, Kozubek M, Reyes M, Riegler MA, Wiesenfarth M, Kavur AE, Sudre CH, Baumgartner M, Eisenmann M, Heckmann-Nötzel D, Rädsch T, Acion L, Antonelli M, Arbel T, Bakas S, Benis A, Blaschko MB, Cardoso MJ, Cheplygina V, Cimini BA, Collins GS, Farahani K, Ferrer L, Galdran A, van Ginneken B, Haase R, Hashimoto DA, Hoffman MM, Huisman M, Jannin P, Kahn CE, Kainmueller D, Kainz B, Karargyris A, Karthikesalingam A, Kofler F, Kopp-Schneider A, Kreshuk A, Kurc T, Landman BA, Litjens G, Madani A, Maier-Hein K, Martel AL, Mattson P, Meijering E, Menze B, Moons KGM, Müller H, Nichyporuk B, Nickel F, Petersen J, Rajpoot N, Rieke N, Saez-Rodriguez J, Sánchez CI, Shetty S, van Smeden M, Summers RM, Taha AA, Tiulpin A, Tsaftaris SA, Van Calster B, Varoquaux G, Jäger PF et al., 2024, Metrics reloaded: recommendations for image analysis validation, Nature Methods, Vol: 21, Pages: 195-212, ISSN: 1548-7091

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint-a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.

Journal article

Reinke A, Tizabi MD, Baumgartner M, Eisenmann M, Heckmann-Nötzel D, Kavur AE, Rädsch T, Sudre CH, Acion L, Antonelli M, Arbel T, Bakas S, Benis A, Buettner F, Cardoso MJ, Cheplygina V, Chen J, Christodoulou E, Cimini BA, Farahani K, Ferrer L, Galdran A, van Ginneken B, Glocker B, Godau P, Hashimoto DA, Hoffman MM, Huisman M, Isensee F, Jannin P, Kahn CE, Kainmueller D, Kainz B, Karargyris A, Kleesiek J, Kofler F, Kooi T, Kopp-Schneider A, Kozubek M, Kreshuk A, Kurc T, Landman BA, Litjens G, Madani A, Maier-Hein K, Martel AL, Meijering E, Menze B, Moons KGM, Müller H, Nichyporuk B, Nickel F, Petersen J, Rafelski SM, Rajpoot N, Reyes M, Riegler MA, Rieke N, Saez-Rodriguez J, Sánchez CI, Shetty S, Summers RM, Taha AA, Tiulpin A, Tsaftaris SA, Van Calster B, Varoquaux G, Yaniv ZR, Jäger PF, Maier-Hein L et al., 2024, Understanding metric-related pitfalls in image analysis validation, Nature Methods, Vol: 21, Pages: 182-194, ISSN: 1548-7091

Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.

Journal article

Jones C, Castro DC, De Sousa Ribeiro F, Oktay O, McCradden M, Glocker B et al., 2024, A causal perspective on dataset bias in machine learning for medical imaging, Nature Machine Intelligence, Vol: 6, Pages: 138-146, ISSN: 2522-5839

As machine learning methods gain prominence within clinical decision-making, the need to address fairness concerns becomes increasingly urgent. Despite considerable work dedicated to detecting and ameliorating algorithmic bias, today’s methods are deficient, with potentially harmful consequences. Our causal Perspective sheds new light on algorithmic bias, highlighting how different sources of dataset bias may seem indistinguishable yet require substantially different mitigation strategies. We introduce three families of causal bias mechanisms stemming from disparities in prevalence, presentation and annotation. Our causal analysis underscores how current mitigation methods tackle only a narrow and often unrealistic subset of scenarios. We provide a practical three-step framework for reasoning about fairness in medical imaging, supporting the development of safe and equitable predictive models.

Journal article

Kori A, Locatello F, De Sousa Ribeiro F, Toni F, Glocker B et al., 2024, Grounded Object-Centric Learning, International Conference on Learning Representations (ICLR)

Conference paper

Santhirasekaram A, Winkler M, Rockall A, Glocker B et al., 2024, Hierarchical Compositionality in Hyperbolic Space for Robust Medical Image Segmentation, Pages: 52-62, ISSN: 0302-9743

Deep learning based medical image segmentation models need to be robust to domain shifts and image distortion for the safe translation of these models into clinical practice. The most popular methods for improving robustness are centred around data augmentation and adversarial training. Many image segmentation tasks exhibit regular structures with only limited variability. We aim to exploit this notion by learning a set of base components in the latent space whose composition can account for the entire structural variability of a specific segmentation task. We enforce a hierarchical prior in the composition of the base components and consider the natural geometry in which to build our hierarchy. Specifically, we embed the base components on a hyperbolic manifold which we claim leads to a more natural composition. We demonstrate that our method improves model robustness under various perturbations and in the task of single domain generalisation.

Conference paper

Åkerlund CAI, Holst A, Bhattacharyay S, Stocchetti N, Steyerberg E, Smielewski P, Menon DK, Ercole A, Nelson DW, CENTER-TBI participants and investigators et al., 2024, Clinical descriptors of disease trajectories in patients with traumatic brain injury in the intensive care unit (CENTER-TBI): a multicentre observational cohort study, Lancet Neurol, Vol: 23, Pages: 71-80

BACKGROUND: Patients with traumatic brain injury are a heterogeneous population, and the most severely injured individuals are often treated in an intensive care unit (ICU). The primary injury at impact, and the harmful secondary events that can occur during the first week of the ICU stay, will affect outcome in this vulnerable group of patients. We aimed to identify clinical variables that might distinguish disease trajectories among patients with traumatic brain injury admitted to the ICU. METHODS: We used data from the Collaborative European NeuroTrauma Effectiveness Research in Traumatic Brain Injury (CENTER-TBI) prospective observational cohort study. We included patients aged 18 years or older with traumatic brain injury who were admitted to the ICU at one of the 65 CENTER-TBI participating centres, which range from large academic hospitals to small rural hospitals. For every patient, we obtained pre-injury data and injury features, clinical characteristics on admission, demographics, physiological parameters, laboratory features, brain biomarkers (ubiquitin carboxy-terminal hydrolase L1 [UCH-L1], S100 calcium-binding protein B [S100B], tau, neurofilament light [NFL], glial fibrillary acidic protein [GFAP], and neuron-specific enolase [NSE]), and information about intracranial pressure lowering treatments during the first 7 days of ICU stay. To identify clinical variables that might distinguish disease trajectories, we applied a novel clustering method to these data, which was based on a mixture of probabilistic graph models with a Markov chain extension. The relation of clusters to the extended Glasgow Outcome Scale (GOS-E) was investigated. FINDINGS: Between Dec 19, 2014, and Dec 17, 2017, 4509 patients with traumatic brain injury were recruited into the CENTER-TBI core dataset, of whom 1728 were eligible for this analysis. Glucose variation (defined as the difference between daily maximum and minimum glucose concentrations) and brain biomarkers (S100B, NSE

Journal article

Ribeiro FDS, Glocker B, 2024, Demystifying Variational Diffusion Models, CoRR, Vol: abs/2401.06281

Journal article

Rockall AG, Li X, Johnson N, Lavdas I, Santhakumaran S, Prevost AT, Punwani S, Goh V, Barwick TD, Bharwani N, Sandhu A, Sidhu H, Plumb A, Burn J, Fagan A, Wengert GJ, Koh D-M, Reczko K, Dou Q, Warwick J, Liu X, Messiou C, Tunariu N, Boavida P, Soneji N, Johnston EW, Kelly-Morland C, De Paepe KN, Sokhi H, Wallitt K, Lakhani A, Russell J, Salib M, Vinnicombe S, Haq A, Aboagye EO, Taylor S, Glocker B et al., 2023, Development and evaluation of machine learning in whole-body magnetic resonance imaging for detecting metastases in patients with lung or colon cancer: a diagnostic test accuracy study, Investigative Radiology, Vol: 58, Pages: 823-831, ISSN: 0020-9996

OBJECTIVES: Whole-body magnetic resonance imaging (WB-MRI) has been demonstrated to be efficient and cost-effective for cancer staging. The study aim was to develop a machine learning (ML) algorithm to improve radiologists' sensitivity and specificity for metastasis detection and reduce reading times. MATERIALS AND METHODS: A retrospective analysis of 438 prospectively collected WB-MRI scans from multicenter Streamline studies (February 2013-September 2016) was undertaken. Disease sites were manually labeled using Streamline reference standard. Whole-body MRI scans were randomly allocated to training and testing sets. A model for malignant lesion detection was developed based on convolutional neural networks and a 2-stage training strategy. The final algorithm generated lesion probability heat maps. Using a concurrent reader paradigm, 25 radiologists (18 experienced, 7 inexperienced in WB-/MRI) were randomly allocated WB-MRI scans with or without ML support to detect malignant lesions over 2 or 3 reading rounds. Reads were undertaken in the setting of a diagnostic radiology reading room between November 2019 and March 2020. Reading times were recorded by a scribe. Prespecified analysis included sensitivity, specificity, interobserver agreement, and reading time of radiology readers to detect metastases with or without ML support. Reader performance for detection of the primary tumor was also evaluated. RESULTS: Four hundred thirty-three evaluable WB-MRI scans were allocated to algorithm training (245) or radiology testing (50 patients with metastases, from primary 117 colon [n = 117] or lung [n = 71] cancer). Among a total 562 reads by experienced radiologists over 2 reading rounds, per-patient specificity was 86.2% (ML) and 87.7% (non-ML) (-1.5% difference; 95% confidence interval [CI], -6.4%, 3.5%; P = 0.39). Sensitivity was 66.0% (ML) and 70.0% (non-ML) (-4.0% difference; 95% CI, -13.5%, 5.5%; P = 0.344). Among 161 reads by inexperienced readers, per-patient spec

Journal article

Ng AY, Oberije CJG, Ambrózay É, Szabó E, Serfőző O, Karpati E, Fox G, Glocker B, Morris EA, Forrai G, Kecskemethy PD et al., 2023, Prospective implementation of AI-assisted screen reading to improve early detection of breast cancer, Nat Med, Vol: 29, Pages: 3044-3049

Artificial intelligence (AI) has the potential to improve breast cancer screening; however, prospective evidence of the safe implementation of AI into real clinical practice is limited. A commercially available AI system was implemented as an additional reader to standard double reading to flag cases for further arbitration review among screened women. Performance was assessed prospectively in three phases: a single-center pilot rollout, a wider multicenter pilot rollout and a full live rollout. The results showed that, compared to double reading, implementing the AI-assisted additional-reader process could achieve 0.7-1.6 additional cancer detection per 1,000 cases, with 0.16-0.30% additional recalls, 0-0.23% unnecessary recalls and a 0.1-1.9% increase in positive predictive value (PPV) after 7-11% additional human reads of AI-flagged cases (equating to 4-6% additional overall reading workload). The majority of cancerous cases detected by the AI-assisted additional-reader process were invasive (83.3%) and small-sized (≤10 mm, 47.0%). This evaluation suggests that using AI as an additional reader can improve the early detection of breast cancer with relevant prognostic features, with minimal to no unnecessary recalls. Although the AI-assisted additional-reader workflow requires additional reads, the higher PPV suggests that it can increase screening effectiveness.

Journal article

Glocker B, Jones C, Bernhardt M, Winzeck S et al., 2023, Risk of bias in chest radiography deep learning foundation models, Radiology: Artificial Intelligence, Vol: 5, ISSN: 2638-6100

Purpose: To analyze a recently published chest radiography foundation model for the presence of biases that could lead to subgroup performance disparities across biologic sex and race. Materials and Methods: This Health Insurance Portability and Accountability Act–compliant retrospective study used 127 118 chest radiographs from 42 884 patients (mean age, 63 years ± 17 [SD]; 23 623 male, 19 261 female) from the CheXpert dataset that were collected between October 2002 and July 2017. To determine the presence of bias in features generated by a chest radiography foundation model and baseline deep learning model, dimensionality reduction methods together with two-sample Kolmogorov–Smirnov tests were used to detect distribution shifts across sex and race. A comprehensive disease detection performance analysis was then performed to associate any biases in the features to specific disparities in classification performance across patient subgroups. Results: Ten of 12 pairwise comparisons across biologic sex and race showed statistically significant differences in the studied foundation model, compared with four significant tests in the baseline model. Significant differences were found between male and female (P < .001) and Asian and Black (P < .001) patients in the feature projections that primarily capture disease. Compared with average model performance across all subgroups, classification performance on the “no finding” label decreased between 6.8% and 7.8% for female patients, and performance in detecting “pleural effusion” decreased between 10.7% and 11.6% for Black patients. Conclusion: The studied chest radiography foundation model demonstrated racial and sex-related bias, which led to disparate performance across patient subgroups; thus, this model may be unsafe for clinical applications.

Journal article
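
The bias analysis above combines dimensionality reduction with two-sample Kolmogorov–Smirnov tests to detect feature-distribution shifts across subgroups. A minimal sketch of that general recipe follows; the feature matrix and group labels are synthetic placeholders, and the exact reduction method and any multiple-comparison correction used in the paper may differ.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

# Hypothetical stand-ins: 1000 image feature vectors and a binary subgroup label.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 256))
group = rng.integers(0, 2, size=1000)       # e.g. 0 = male, 1 = female

# Reduce the features to a few components before testing.
proj = PCA(n_components=4).fit_transform(features)

# Two-sample KS test per component: does the projected feature
# distribution differ between the two subgroups?
for c in range(proj.shape[1]):
    res = ks_2samp(proj[group == 0, c], proj[group == 1, c])
    print(f"component {c}: KS statistic={res.statistic:.3f}, p={res.pvalue:.3g}")
```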

Li Z, Kamnitsas K, Dou Q, Qin C, Glocker B et al., 2023, Joint optimization of class-specific training- and test-time data augmentation in segmentation, IEEE Transactions on Medical Imaging, Vol: 42, Pages: 3323-3335, ISSN: 0278-0062

This paper presents an effective and general data augmentation framework for medical image segmentation. We adopt a computationally efficient and data-efficient gradient-based meta-learning scheme to explicitly align the distribution of training and validation data which is used as a proxy for unseen test data. We improve the current data augmentation strategies with two core designs. First, we learn class-specific training-time data augmentation (TRA) effectively increasing the heterogeneity within the training subsets and tackling the class imbalance common in segmentation. Second, we jointly optimize TRA and test-time data augmentation (TEA), which are closely connected as both aim to align the training and test data distribution but were so far considered separately in previous works. We demonstrate the effectiveness of our method on four medical image segmentation tasks across different scenarios with two state-of-the-art segmentation models, DeepMedic and nnU-Net. Extensive experimentation shows that the proposed data augmentation framework can significantly and consistently improve the segmentation performance when compared to existing solutions. Code is publicly available at https://github.com/ZerojumpLine/JCSAugment.

Journal article
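
The paper jointly optimizes class-specific training- and test-time augmentation via meta-learning; a full reimplementation is out of scope here, but the inference side reduces to averaging predictions over augmented views of the input. The sketch below shows only that generic test-time idea, with a hypothetical `model` function standing in for a segmentation network such as DeepMedic or nnU-Net.

```python
import numpy as np

def model(image):
    """Hypothetical segmentation network: per-pixel foreground probability."""
    return 1.0 / (1.0 + np.exp(-image))     # placeholder logistic of intensities

def tta_predict(image):
    """Average predictions over flipped views (a simple test-time policy)."""
    views = [
        (image, lambda p: p),
        (np.flip(image, axis=0), lambda p: np.flip(p, axis=0)),
        (np.flip(image, axis=1), lambda p: np.flip(p, axis=1)),
    ]
    probs = [undo(model(view)) for view, undo in views]
    return np.mean(probs, axis=0)

image = np.random.default_rng(0).normal(size=(64, 64))
segmentation = tta_predict(image) > 0.5
print("foreground fraction:", segmentation.mean())
```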

Roschewitz M, Khara G, Yearsley J, Sharma N, James JJ, Ambrózay É, Heroux A, Kecskemethy P, Rijken T, Glocker B et al., 2023, Automatic correction of performance drift under acquisition shift in medical image classification, Nat Commun, Vol: 14

Image-based prediction models for disease detection are sensitive to changes in data acquisition such as the replacement of scanner hardware or updates to the image processing software. The resulting differences in image characteristics may lead to drifts in clinically relevant performance metrics which could cause harm in clinical decision making, even for models that generalise in terms of area under the receiver-operating characteristic curve. We propose Unsupervised Prediction Alignment, a generic automatic recalibration method that requires no ground truth annotations and only limited amounts of unlabelled example images from the shifted data distribution. We illustrate the effectiveness of the proposed method to detect and correct performance drift in mammography-based breast cancer screening and on publicly available histopathology data. We show that the proposed method can preserve the expected performance in terms of sensitivity/specificity under various realistic scenarios of image acquisition shift, thus offering an important safeguard for clinical deployment.

Journal article
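
Unsupervised Prediction Alignment recalibrates model outputs using only unlabelled images from the shifted acquisition. One simple way to realise the underlying idea, matching the distribution of prediction scores on shifted data to the distribution observed on reference data with a monotone piecewise-linear map, is sketched below; this is an illustration of the concept under stated assumptions, not necessarily the exact procedure in the paper.

```python
import numpy as np

def fit_quantile_alignment(ref_scores, shifted_scores, n_knots=101):
    """Monotone piecewise-linear map from shifted-score to reference-score quantiles."""
    qs = np.linspace(0.0, 1.0, n_knots)
    src = np.quantile(shifted_scores, qs)    # knots on the shifted distribution
    dst = np.quantile(ref_scores, qs)        # matching reference values
    return lambda s: np.interp(s, src, dst)

# Hypothetical example: prediction scores drift downwards after a scanner change.
rng = np.random.default_rng(0)
ref_scores = rng.beta(2, 5, size=5000)            # scores on the original acquisition
shifted_scores = 0.7 * rng.beta(2, 5, size=2000)  # unlabelled scores after the shift

align = fit_quantile_alignment(ref_scores, shifted_scores)
recalibrated = align(shifted_scores)
print(np.quantile(ref_scores, 0.9), np.quantile(recalibrated, 0.9))
```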

Islam M, Seenivasan L, Sharan SP, Viekash VK, Gupta B, Glocker B, Ren H et al., 2023, Paced-curriculum distillation with prediction and label uncertainty for image segmentation, International Journal of Computer Assisted Radiology and Surgery, Vol: 18, Pages: 1875-1883, ISSN: 1861-6410

PURPOSE: In curriculum learning, the idea is to train on easier samples first and gradually increase the difficulty, while in self-paced learning, a pacing function defines the speed to adapt the training progress. While both methods heavily rely on the ability to score the difficulty of data samples, an optimal scoring function is still under exploration. METHODOLOGY: Distillation is a knowledge transfer approach where a teacher network guides a student network by feeding a sequence of random samples. We argue that guiding student networks with an efficient curriculum strategy can improve model generalization and robustness. For this purpose, we design an uncertainty-based paced curriculum learning in self-distillation for medical image segmentation. We fuse the prediction uncertainty and annotation boundary uncertainty to develop a novel paced-curriculum distillation (P-CD). We utilize the teacher model to obtain prediction uncertainty and spatially varying label smoothing with Gaussian kernel to generate segmentation boundary uncertainty from the annotation. We also investigate the robustness of our method by applying various types and severity of image perturbation and corruption. RESULTS: The proposed technique is validated on two medical datasets of breast ultrasound image segmentation and robot-assisted surgical scene segmentation and achieved significantly better performance in terms of segmentation and robustness. CONCLUSION: P-CD improves the performance and obtains better generalization and robustness over the dataset shift. While curriculum learning requires extensive tuning of hyper-parameters for pacing function, the level of performance improvement suppresses this limitation.

Journal article
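
One ingredient described above is spatially varying label smoothing with a Gaussian kernel, which turns a hard annotation into a boundary-uncertainty map. A minimal sketch of that idea on a toy 2-D binary mask follows; the kernel width and the way this map would be fused with prediction uncertainty are assumptions, not the authors' settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Hypothetical binary annotation: a filled square on a 64x64 grid.
mask = np.zeros((64, 64))
mask[20:44, 20:44] = 1.0

# Blur the hard labels with a Gaussian kernel; values strictly between 0 and 1
# appear only near the boundary, giving spatially varying soft labels.
soft = gaussian_filter(mask, sigma=2.0)

# A simple boundary-uncertainty map: highest where the soft label is near 0.5.
boundary_uncertainty = 1.0 - np.abs(2.0 * soft - 1.0)
print(soft.min(), soft.max(), boundary_uncertainty.max())
```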

Ribeiro FDS, Xia T, Monteiro M, Pawlowski N, Glocker B et al., 2023, High fidelity image counterfactuals with probabilistic causal models, ICML 2023, Publisher: ML Research Press, Pages: 7390-7425

We present a general causal generative modelling framework for accurate estimation of high fidelity image counterfactuals with deep structural causal models. Estimation of interventional and counterfactual queries for high-dimensional structured variables, such as images, remains a challenging task. We leverage ideas from causal mediation analysis and advances in generative modelling to design new deep causal mechanisms for structured variables in causal models. Our experiments demonstrate that our proposed mechanisms are capable of accurate abduction and estimation of direct, indirect and total effects as measured by axiomatic soundness of counterfactuals.

Conference paper

Pinto MS, Winzeck S, Kornaropoulos EN, Richter S, Paolella R, Correia MM, Glocker B, Williams G, Vik A, Posti JP, Haberg A, Stenberg J, Guns P-J, den Dekker AJ, Menon DK, Sijbers J, Van Dyck P, Newcombe VFJ et al., 2023, Use of Support Vector Machines Approach via ComBat Harmonized Diffusion Tensor Imaging for the Diagnosis and Prognosis of Mild Traumatic Brain Injury: A CENTER-TBI Study, Journal of Neurotrauma, Vol: 40, Pages: 1317-1338, ISSN: 0897-7151

Journal article

Li Z, Kamnitsas K, Ouyang C, Chen C, Glocker B et al., 2023, Context label learning: improving background class representations in semantic segmentation, IEEE Transactions on Medical Imaging, Vol: 42, Pages: 1885-1896, ISSN: 0278-0062

Background samples provide key contextual information for segmenting regions of interest (ROIs). However, they always cover a diverse set of structures, causing difficulties for the segmentation model to learn good decision boundaries with high sensitivity and precision. The issue concerns the highly heterogeneous nature of the background class, resulting in multi-modal distributions. Empirically, we find that neural networks trained with heterogeneous background struggle to map the corresponding contextual samples to compact clusters in feature space. As a result, the distribution over background logit activations may shift across the decision boundary, leading to systematic over-segmentation across different datasets and tasks. In this study, we propose context label learning (CoLab) to improve the context representations by decomposing the background class into several subclasses. Specifically, we train an auxiliary network as a task generator, along with the primary segmentation model, to automatically generate context labels that positively affect the ROI segmentation accuracy. Extensive experiments are conducted on several challenging segmentation tasks and datasets. The results demonstrate that CoLab can guide the segmentation model to map the logits of background samples away from the decision boundary, resulting in significantly improved segmentation accuracy. Code is available.

Journal article

Mackay K, Bernstein D, Glocker B, Kamnitsas K, Taylor A et al., 2023, A review of the metrics used to assess auto-contouring systems in radiotherapy, Clinical Oncology, Vol: 35, Pages: 354-369, ISSN: 0936-6555

Auto-contouring could revolutionise future planning of radiotherapy treatment. The lack of consensus on how to assess and validate auto-contouring systems currently limits clinical use. This review formally quantifies the assessment metrics used in studies published during one calendar year and assesses the need for standardised practice. A PubMed literature search was undertaken for papers evaluating radiotherapy auto-contouring published during 2021. Papers were assessed for types of metric and the methodology used to generate ground-truth comparators. Our PubMed search identified 212 studies, of which 117 met the criteria for clinical review. Geometric assessment metrics were used in 116 of 117 studies (99.1%). This includes the Dice Similarity Coefficient used in 113 (96.6%) studies. Clinically relevant metrics, such as qualitative, dosimetric and time-saving metrics, were less frequently used in 22 (18.8%), 27 (23.1%) and 18 (15.4%) of 117 studies, respectively. There was heterogeneity within each category of metric. Over 90 different names for geometric measures were used. Methods for qualitative assessment were different in all but two papers. Variation existed in the methods used to generate radiotherapy plans for dosimetric assessment. Consideration of editing time was only given in 11 (9.4%) papers. A single manual contour as a ground-truth comparator was used in 65 (55.6%) studies. Only 31 (26.5%) studies compared auto-contours to usual inter- and/or intra-observer variation. In conclusion, significant variation exists in how research papers currently assess the accuracy of automatically generated contours. Geometric measures are the most popular, however their clinical utility is unknown. There is heterogeneity in the methods used to perform clinical assessment. Considering the different stages of system implementation may provide a framework to decide the most appropriate metrics. This analysis supports the need for a consensus on the clinical implement

Journal article
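
Since the Dice Similarity Coefficient dominates the geometric assessment reported above, a reference computation on binary masks is included below. This is the standard formulation applied to a toy example, not code from any of the reviewed systems.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    """Dice Similarity Coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    intersection = np.logical_and(a, b).sum()
    return (2.0 * intersection + eps) / (a.sum() + b.sum() + eps)

# Toy example: an auto-contour shifted by two voxels relative to the manual contour.
manual = np.zeros((64, 64), dtype=bool)
manual[20:44, 20:44] = True
auto = np.roll(manual, shift=2, axis=1)
print(f"DSC = {dice(manual, auto):.3f}")
```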

Xu M, Islam M, Glocker B, Ren H et al., 2023, Confidence-Aware Paced-Curriculum Learning by Label Smoothing for Surgical Scene Understanding, IEEE Transactions on Automation Science and Engineering, ISSN: 1545-5955

Journal article

Li L, Heselgrave A, Soreq E, Nattino G, Rosnati M, Garbero E, Zimmerman K, Graham N, Moro F, Novelli D, Gradisek P, Magnoni S, Glocker B, Zetterberg H, Bertolini G, Sharp D et al., 2023, Investigating the characteristics and correlates of systemic inflammation after traumatic brain injury: the TBI-BraINFLAMM study, BMJ Open, Vol: 13, ISSN: 2044-6055

Introduction: A significant environmental risk factor for neurodegenerative disease is traumatic brain injury (TBI). However, it is not clear how TBI results in ongoing chronic neurodegeneration. Animal studies show that systemic inflammation is signalled to the brain. This can result in sustained and aggressive microglial activation, which in turn is associated with widespread neurodegeneration. We aim to evaluate systemic inflammation as a mediator of ongoing neurodegeneration after TBI.Methods and analysis: TBI-braINFLAMM will combine data already collected from two large prospective TBI studies. The CREACTIVE study, a broad consortium which enrolled >8000 patients with TBI to have CT scans and blood samples in the hyperacute period, has data available from 854 patients. The BIO-AX-TBI study recruited 311 patients to have acute CT scans, longitudinal blood samples and longitudinal MRI brain scans. The BIO-AX-TBI study also has data from 102 healthy and 24 non-TBI trauma controls, comprising blood samples (both control groups) and MRI scans (healthy controls only). All blood samples from BIO-AX-TBI and CREACTIVE have already been tested for neuronal injury markers (GFAP, tau and NfL), and CREACTIVE blood samples have been tested for inflammatory cytokines. We will additionally test inflammatory cytokine levels from the already collected longitudinal blood samples in the BIO-AX-TBI study, as well as matched microdialysate and blood samples taken during the acute period from a subgroup of patients with TBI (n=18).We will use this unique dataset to characterise post-TBI systemic inflammation, and its relationships with injury severity and ongoing neurodegeneration.Ethics and dissemination: Ethical approval for this study has been granted by the London—Camberwell St Giles Research Ethics Committee (17/LO/2066). Results will be submitted for publication in peer-review journals, presented at conferences and inform the design of larger observational and experime

Journal article

Sharma N, Ng AY, James JJ, Khara G, Ambrozay E, Austin CC, Forrai G, Fox G, Glocker B, Heindl A, Karpati E, Rijken TM, Venkataraman V, Yearsley JE, Kecskemethy PD et al., 2023, Multi-vendor evaluation of artificial intelligence as an independent reader for double reading in breast cancer screening on 275,900 mammograms, BMC Cancer, Vol: 23

Journal article

Ng AY, Glocker B, Oberije C, Fox G, Sharma N, James JJ, Ambrozay E, Nash J, Karpati E, Kerruish S, Kecskemethy PD et al., 2023, Artificial intelligence as supporting reader in breast screening: a novel workflow to preserve quality and reduce workload, Journal of Breast Imaging, Vol: 5, Pages: 267-276, ISSN: 2631-6110

Objective: To evaluate the effectiveness of a new strategy for using artificial intelligence (AI) as supporting reader for the detection of breast cancer in mammography-based double reading screening practice. Methods: Large-scale multi-site, multi-vendor data were used to retrospectively evaluate a new paradigm of AI-supported reading. Here, the AI served as the second reader only if it agrees with the recall/no-recall decision of the first human reader. Otherwise, a second human reader made an assessment followed by the standard clinical workflow. The data included 280 594 cases from 180 542 female participants screened for breast cancer at seven screening sites in two countries and using equipment from four hardware vendors. The statistical analysis included non-inferiority and superiority testing of cancer screening performance and evaluation of the reduction in workload, measured as arbitration rate and number of cases requiring second human reading. Results: Artificial intelligence as a supporting reader was found to be superior or noninferior on all screening metrics compared with human double reading while reducing the number of cases requiring second human reading by up to 87% (245 395/280 594). Compared with AI as an independent reader, the number of cases referred to arbitration was reduced from 13% (35 199/280 594) to 2% (5056/280 594). Conclusion: The simulation indicates that the proposed workflow retains screening performance of human double reading while substantially reducing the workload. Further research should study the impact on the second human reader because they would only assess cases in which the AI prediction and first human reader disagree.

Journal article
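
The workflow evaluated above can be summarised as a small decision rule: the AI output stands in for the second human read only when it agrees with the first reader; otherwise a second human reads, and reader disagreement proceeds to arbitration as in standard double reading. The sketch below encodes that logic with boolean recall decisions; the function and variable names are hypothetical, not part of the described system.

```python
def double_read_with_ai_support(first_reader_recall: bool,
                                ai_recall: bool,
                                second_reader_recall=None):
    """Return (decision, second_human_read_needed) for one screening case.

    If the AI agrees with the first human reader, its output stands in for the
    second read; otherwise a second human reads, and reader disagreement goes
    to arbitration as in standard double reading.
    """
    if ai_recall == first_reader_recall:
        return ("recall" if first_reader_recall else "no recall"), False
    if second_reader_recall is None:
        raise ValueError("second human read required when the AI disagrees")
    if second_reader_recall == first_reader_recall:
        return ("recall" if first_reader_recall else "no recall"), True
    return "arbitration", True

# Hypothetical cases:
print(double_read_with_ai_support(True, True))           # -> ('recall', False)
print(double_read_with_ai_support(True, False, False))   # -> ('arbitration', True)
```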

Santhirasekaram A, Kori A, Winkler M, Rockall A, Toni F, Glocker B et al., 2023, Robust Hierarchical Symbolic Explanations in Hyperbolic Space for Image Classification, Computer Vision and Pattern Recognition

Conference paper

Glocker B, Jones C, Bernhardt M, Winzeck S et al., 2023, Algorithmic encoding of protected characteristics in chest X-ray disease detection models, EBioMedicine, Vol: 89, Pages: 1-19, ISSN: 2352-3964

Background: It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. An algorithm may encode protected characteristics, and then use this information for making predictions due to undesirable correlations in the (historical) training data. It remains unclear how we can establish whether such information is actually used. Besides the scarcity of data from underserved populations, very little is known about how dataset biases manifest in predictive models and how this may result in disparate performance. This article aims to shed some light on these issues by exploring methodology for subgroup analysis in image-based disease detection models. Methods: We utilize two publicly available chest X-ray datasets, CheXpert and MIMIC-CXR, to study performance disparities across race and biological sex in deep learning models. We explore test set resampling, transfer learning, multitask learning, and model inspection to assess the relationship between the encoding of protected characteristics and disease detection performance across subgroups. Findings: We confirm subgroup disparities in terms of shifted true and false positive rates which are partially removed after correcting for population and prevalence shifts in the test sets. We find that transfer learning alone is insufficient for establishing whether specific patient information is used for making predictions. The proposed combination of test-set resampling, multitask learning, and model inspection reveals valuable insights about the way protected characteristics are encoded in the feature representations of deep neural networks. Interpretation: Subgroup analysis is key for identifying performance disparities of AI models, but statistical differences across subgroups need to be taken into account when analyzing potential biases in disease detection. The proposed methodology provides a comprehensive framework for subgroup analysis enabling further research into the underlyi

Journal article
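
One component of the methodology above is test-set resampling, so that subgroups are compared at matched disease prevalence before true- and false-positive rates are interpreted. A simplified sketch of prevalence-matched resampling is given below; the column names, target prevalence and decision threshold are illustrative assumptions, not the paper's settings.

```python
import numpy as np
import pandas as pd

def resample_to_prevalence(df, label_col="disease", target_prev=0.3, n=2000, seed=0):
    """Sample n cases with a fixed positive-label prevalence (with replacement)."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_pos = int(round(n * target_prev))
    sampled_pos = pos.sample(n=n_pos, replace=True, random_state=seed)
    sampled_neg = neg.sample(n=n - n_pos, replace=True, random_state=seed)
    return pd.concat([sampled_pos, sampled_neg]).reset_index(drop=True)

# Hypothetical test set with a subgroup column and a binary disease label.
rng = np.random.default_rng(1)
test = pd.DataFrame({
    "sex": rng.choice(["M", "F"], size=5000),
    "disease": rng.binomial(1, 0.12, size=5000),
    "pred": rng.random(5000),
})

# Compare subgroups on resampled sets that share the same prevalence.
for sex, group in test.groupby("sex"):
    matched = resample_to_prevalence(group)
    positives = matched["disease"] == 1
    tpr = ((matched["pred"] > 0.5) & positives).sum() / positives.sum()
    print(sex, f"TPR at matched prevalence: {tpr:.3f}")
```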

Menten MJ, Holland R, Leingang O, Bogunovic H, Hagag AM, Kaye R, Riedl S, Traber GL, Hassan ON, Pawlowski N, Glocker B, Fritsche LG, Scholl HPN, Sivaprasad S, Schmidt-Erfurth U, Rueckert D, Lotery AJ et al., 2023, Exploring healthy retinal aging with deep learning, Ophthalmology Science, Vol: 3, Pages: 1-10, ISSN: 2666-9145

Purpose: To study the individual course of retinal changes caused by healthy aging using deep learning. Design: Retrospective analysis of a large data set of retinal OCT images. Participants: A total of 85 709 adults between the age of 40 and 75 years of whom OCT images were acquired in the scope of the UK Biobank population study. Methods: We created a counterfactual generative adversarial network (GAN), a type of neural network that learns from cross-sectional, retrospective data. It then synthesizes high-resolution counterfactual OCT images and longitudinal time series. These counterfactuals allow visualization and analysis of hypothetical scenarios in which certain characteristics of the imaged subject, such as age or sex, are altered, whereas other attributes, crucially the subject’s identity and image acquisition settings, remain fixed. Main Outcome Measures: Using our counterfactual GAN, we investigated subject-specific changes in the retinal layer structure as a function of age and sex. In particular, we measured changes in the retinal nerve fiber layer (RNFL), combined ganglion cell layer plus inner plexiform layer (GCIPL), inner nuclear layer to the inner boundary of the retinal pigment epithelium (INL-RPE), and retinal pigment epithelium (RPE). Results: Our counterfactual GAN is able to smoothly visualize the individual course of retinal aging. Across all counterfactual images, the RNFL, GCIPL, INL-RPE, and RPE changed by −0.1 μm ± 0.1 μm, −0.5 μm ± 0.2 μm, −0.2 μm ± 0.1 μm, and 0.1 μm ± 0.1 μm, respectively, per decade of age. These results agree well with previous studies based on the same cohort from the UK Biobank population study. Beyond population-wide average measures, our counterfactual GAN allows us to explore whether the retinal layers of a given eye will increase in thickness, decrease in thickness, or stagnate as a subject ages. Conclusion: This study demonstrates how counterfactual GANs

Journal article

Monteiro M, De Sousa Ribeiro F, Pawlowski N, Coelho De Castro D, Glocker B et al., 2023, Measuring axiomatic soundness of counterfactual image models, International Conference on Learning Representations (ICLR)

We use the axiomatic definition of counterfactual to derive metrics that enable quantifying the correctness of approximate counterfactual inference models. We present a general framework for evaluating image counterfactuals. The power and flexibility of deep generative models make them valuable tools for learning mechanisms in structural causal models. However, their flexibility makes counterfactual identifiability impossible in the general case. Motivated by these issues, we revisit Pearl's axiomatic definition of counterfactuals to determine the necessary constraints of any counterfactual inference model: composition, reversibility, and effectiveness. We frame counterfactuals as functions of an input variable, its parents, and counterfactual parents and use the axiomatic constraints to restrict the set of functions that could represent the counterfactual, thus deriving distance metrics between the approximate and ideal functions. We demonstrate how these metrics can be used to compare and choose between different approximate counterfactual inference models and to provide insight into a model's shortcomings and trade-offs.

Conference paper
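
The paper derives metrics from the axioms of composition (a null intervention should reproduce the observed input), reversibility (intervening and then undoing the intervention should return to the start) and effectiveness (the intervened attribute should take its target value). A toy sketch of how such distances can be measured for a counterfactual function is shown below; the linear mechanism, the scalar variables and the absolute-error distance are illustrative assumptions, not the paper's image models.

```python
import numpy as np

def cf(x, parents, new_parents):
    """Toy counterfactual function: x is assumed to depend linearly on its parent."""
    return x + 2.0 * (new_parents - parents)

def composition_error(x, parents):
    """Composition: a null intervention should reproduce the observed input."""
    return np.abs(cf(x, parents, parents) - x).mean()

def reversibility_error(x, parents, new_parents):
    """Reversibility: intervening and then undoing it should recover the input."""
    x_cf = cf(x, parents, new_parents)
    return np.abs(cf(x_cf, new_parents, parents) - x).mean()

# Synthetic data standing in for images and a single scalar parent.
rng = np.random.default_rng(0)
parents = rng.normal(size=1000)
x = 2.0 * parents + rng.normal(scale=0.1, size=1000)
new_parents = parents + 1.0

# Effectiveness (checking the intervened attribute takes its target value) would
# additionally need a predictor of the parent from the output; omitted here.
print("composition error:  ", composition_error(x, parents))
print("reversibility error:", reversibility_error(x, parents, new_parents))
```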

