Imperial College London

Dr Ben Glocker

Faculty of Engineering, Department of Computing

Reader in Machine Learning for Imaging

Contact

 

+44 (0)20 7594 8334, b.glocker

Location

 

377, Huxley Building, South Kensington Campus

Publications

238 results found

Liu X, Glocker B, McCradden MM, Ghassemi M, Denniston AK, Oakden-Rayner L et al., 2022, The medical algorithmic audit, The Lancet Digital Health, Vol: 4, Pages: e384-e397, ISSN: 2589-7500

Artificial intelligence systems for health care, like any other medical device, have the potential to fail. However, specific qualities of artificial intelligence systems, such as the tendency to learn spurious correlates in training data, poor generalisability to new deployment settings, and a paucity of reliable explainability mechanisms, mean they can yield unpredictable errors that might be entirely missed without proactive investigation. We propose a medical algorithmic audit framework that guides the auditor through a process of considering potential algorithmic errors in the context of a clinical task, mapping the components that might contribute to the occurrence of errors, and anticipating their potential consequences. We suggest several approaches for testing algorithmic errors, including exploratory error analysis, subgroup testing, and adversarial testing, and provide examples from our own work and previous studies. The medical algorithmic audit is a tool that can be used to better understand the weaknesses of an artificial intelligence system and put in place mechanisms to mitigate their impact. We propose that safety monitoring and medical algorithmic auditing should be a joint responsibility between users and developers, and encourage the use of feedback mechanisms between these groups to promote learning and maintain safe deployment of artificial intelligence systems.

Journal article

Dou Q, So TY, Jiang M, Liu Q, Vardhanabhuti V, Kaissis G, Li Z, Si W, Lee HHC, Yu K, Feng Z, Dong L, Burian E, Jungmann F, Braren R, Makowski M, Kainz B, Rueckert D, Glocker B, Yu SCH, Heng PA et al., 2022, Federated deep learning for detecting COVID-19 lung abnormalities in CT: a privacy-preserving multinational validation study (vol 5, 56, 2022), npj Digital Medicine, Vol: 5, ISSN: 2398-6352

Journal article

Newcombe VFJ, Ashton NJ, Posti JP, Glocker B, Manktelow A, Chatfield DA, Winzeck S, Needham E, Correia MM, Williams GB, Simrén J, Takala RSK, Katila AJ, Maanpää H-R, Tallus J, Frantzén J, Blennow K, Tenovuo O, Zetterberg H, Menon DK et al., 2022, Post-acute blood biomarkers and disease progression in traumatic brain injury, Brain

There is substantial interest in the potential for traumatic brain injury to result in progressive neurological deterioration. While blood biomarkers such as glial fibrillary acidic protein (GFAP) and neurofilament light have been widely explored in characterising acute traumatic brain injury, their use in the chronic phase is limited. Given increasing evidence that these proteins may be markers of ongoing neurodegeneration in a range of diseases, we examined their relationship to imaging changes and functional outcome in the months to years following traumatic brain injury. Two hundred and three patients were recruited in two separate cohorts: six months post-injury (n=165) and >5 years post-injury (n=38; 12 of whom also provided data ∼8 months post-TBI). Subjects underwent blood biomarker sampling (n=199) and magnetic resonance imaging (n=172; including diffusion tensor imaging). Data from patient cohorts were compared to 59 healthy volunteers and 21 non-brain injury trauma controls. Mean diffusivity and fractional anisotropy were calculated in cortical grey matter, deep grey matter and whole brain white matter. Accelerated brain ageing was calculated at a whole brain level as the predicted age difference defined using T1-weighted images, and at a voxel-based level as the annualised Jacobian determinants in white matter and grey matter, referenced to a population of 652 healthy control subjects. Serum neurofilament light concentrations were elevated in the early chronic phase. While GFAP values were within the normal range at ∼8 months, many patients showed secondary and temporally distinct elevations up to >5 years after injury. Biomarker elevation at six months was significantly related to metrics of microstructural injury on diffusion tensor imaging. Biomarker levels at ∼8 months predicted white matter volume loss at >5 years, and annualised brain volume loss between ∼8 months and 5 years. Patients who worsened functionally between ∼8 mon

Journal article

Bernhardt M, Castro DC, Tanno R, Schwaighofer A, Tezcan KC, Monteiro M, Bannur S, Lungren M, Nori A, Glocker B, Alvarez-Valle J, Oktay O et al., 2022, Active label cleaning for improved dataset quality under resource constraints, Publisher: Nature Portfolio

Working paper

Whitehouse DP, Monteiro M, Czeiter E, Vyvere TV, Valerio F, Ye Z, Amrein K, Kamnitsas K, Xu H, Yang Z, Verheyden J, Das T, Kornaropoulos EN, Steyerberg E, Maas AIR, Wang KKW, Büki A, Glocker B, Menon DK, Newcombe VFJ, CENTER-TBI Participants and Investigators et al., 2021, Relationship of admission blood proteomic biomarker levels to lesion type and lesion burden in traumatic brain injury: A CENTER-TBI study, EBioMedicine, Vol: 75, Pages: 1-15, ISSN: 2352-3964

BACKGROUND: We aimed to understand the relationship between serum biomarker concentration and lesion type and volume found on computed tomography (CT) following all severities of TBI. METHODS: Concentrations of six serum biomarkers (GFAP, NFL, NSE, S100B, t-tau and UCH-L1) were measured in samples obtained <24 hours post-injury from 2869 patients with all severities of TBI, enrolled in the CENTER-TBI prospective cohort study (NCT02210221). Imaging phenotypes were defined as intraparenchymal haemorrhage (IPH), oedema, subdural haematoma (SDH), extradural haematoma (EDH), traumatic subarachnoid haemorrhage (tSAH), diffuse axonal injury (DAI), and intraventricular haemorrhage (IVH). Multivariable polynomial regression was performed to examine the association between biomarker levels and both distinct lesion types and lesion volumes. Hierarchical clustering was used to explore imaging phenotypes, and principal component analysis and k-means clustering of acute biomarker concentrations were used to explore patterns of biomarker clustering. FINDINGS: 2869 patients were included, 68% (n=1946) male with a median age of 49 years (range 2-96). All severities of TBI (mild, moderate and severe) were included for analysis, with the majority (n=1946, 68%) having a mild injury (GCS 13-15). Patients with severe diffuse injury (Marshall III/IV) showed significantly higher levels of all measured biomarkers, with the exception of NFL, than patients with focal mass lesions (Marshall grades V/VI). Patients with either DAI+IVH or SDH+IPH+tSAH had significantly higher biomarker concentrations than patients with EDH. Higher biomarker concentrations were associated with greater volume of IPH (GFAP, S100B, t-tau; adj r2 range: 0.48-0.49; p<0.05), oedema (GFAP, NFL, NSE, t-tau, UCH-L1; adj r2 range: 0.44-0.44; p<0.01), and IVH (S100B; adj r2 range: 0.48-0.49; p<0.05). Unsupervised k-means biomarker clustering revealed two clusters explaining 83.9% of varian

Journal article

Popescu SG, Glocker B, Sharp DJ, Cole JH et al., 2021, Local brain-age: A U-Net model, Frontiers in Aging Neuroscience, Vol: 13, Pages: 1-17, ISSN: 1663-4365

We propose a new framework for estimating neuroimaging-derived “brain-age” at a local level within the brain, using deep learning. The local approach, contrary to existing global methods, provides spatial information on anatomical patterns of brain ageing. We trained a U-Net model using brain MRI scans from n = 3,463 healthy people (aged 18–90 years) to produce individualised 3D maps of brain-predicted age. When testing on n = 692 healthy people, we found a median (across participants) mean absolute error (within participant) of 9.5 years. Performance was more accurate (MAE around 7 years) in the prefrontal cortex and periventricular areas. We also introduce a new voxelwise method to reduce the age-bias when predicting local brain-age “gaps.” To validate local brain-age predictions, we tested the model in people with mild cognitive impairment or dementia using data from OASIS3 (n = 267). Different local brain-age patterns were evident between healthy controls and people with mild cognitive impairment or dementia, particularly in subcortical regions such as the accumbens, putamen, pallidum, hippocampus, and amygdala. Comparing groups based on mean local brain-age over regions-of-interest resulted in large effect sizes, with Cohen's d values >1.5, for example when comparing people with stable and progressive mild cognitive impairment. Our local brain-age framework has the potential to provide spatial information leading to a more mechanistic understanding of individual differences in patterns of brain ageing in health and disease.

Journal article
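
The summary statistic quoted in the local brain-age abstract above, a median across participants of the within-participant mean absolute error, can be sketched as follows. This is only an illustration of the arithmetic: the function names and the toy numbers are hypothetical, and each participant's voxelwise predictions are flattened into a plain list rather than a 3D map.

```python
from statistics import median


def within_participant_mae(predicted_ages, true_age):
    """Mean absolute error over one participant's voxelwise age predictions."""
    return sum(abs(p - true_age) for p in predicted_ages) / len(predicted_ages)


def median_of_participant_maes(participants):
    """Median across participants of each participant's own MAE.

    participants: list of (voxelwise_predictions, chronological_age) pairs.
    """
    return median(within_participant_mae(preds, age) for preds, age in participants)


# Toy data, purely illustrative (not taken from the paper).
toy_participants = [
    ([60.0, 70.0], 65.0),        # MAE = 5.0
    ([30.0, 50.0, 40.0], 40.0),  # MAE = 20/3
    ([20.0, 22.0], 21.0),        # MAE = 1.0
]

print(median_of_participant_maes(toy_participants))
```

Taking the median across participants makes the summary robust to a few participants with unusually poor predictions, which a mean across participants would not be.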

Folgoc LL, Baltatzis V, Alansary A, Desai S, Devaraj A, Ellis S, Manzanera OEM, Kanavati F, Nair A, Schnabel J, Glocker B et al., 2021, Bayesian analysis of the prevalence bias: learning and predicting from imbalanced data, Publisher: arXiv

Datasets are rarely a realistic approximation of the target population. For example, prevalence may be misrepresented or image quality may be above clinical standards. This mismatch is known as sampling bias. Sampling biases are a major hindrance for machine learning models. They cause significant gaps between model performance in the lab and in the real world. Our work is a solution to prevalence bias. Prevalence bias is the discrepancy between the prevalence of a pathology and its sampling rate in the training dataset, introduced upon collecting data or due to the practitioner rebalancing the training batches. This paper lays the theoretical and computational framework for training models, and for prediction, in the presence of prevalence bias. Concretely, a bias-corrected loss function, as well as bias-corrected predictive rules, are derived under the principles of Bayesian risk minimization. The loss exhibits a direct connection to the information gain. It offers a principled alternative to heuristic training losses and complements test-time procedures based on selecting an operating point from summary curves. It integrates seamlessly in the current paradigm of (deep) learning using stochastic backpropagation and naturally with Bayesian models.

Working paper
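
To make the notion of prevalence bias in the entry above concrete, the sketch below applies the textbook prior-shift (Bayes) correction to a binary classifier's output. This is explicitly not the bias-corrected loss or predictive rule derived in the paper, which the abstract does not spell out; it only illustrates how a mismatch between the training sampling rate and the true prevalence distorts a predicted probability. The function name and the example rates are hypothetical.

```python
def correct_for_prevalence(p_train, train_rate, true_prevalence):
    """Re-weight a predicted positive-class probability for a changed class prior.

    p_train:          probability output by a model trained at `train_rate`
    train_rate:       fraction of positives in the (possibly rebalanced) training data
    true_prevalence:  fraction of positives in the target population
    Assumes the class-conditional image distributions are unchanged.
    """
    pos = p_train * true_prevalence / train_rate
    neg = (1.0 - p_train) * (1.0 - true_prevalence) / (1.0 - train_rate)
    return pos / (pos + neg)


# A model trained on balanced batches (50% positives) that outputs 0.5
# corresponds to a much lower probability when the true prevalence is 10%.
print(correct_for_prevalence(0.5, train_rate=0.5, true_prevalence=0.1))
```

When the training rate equals the true prevalence the function returns its input unchanged, which is a quick sanity check on the formula.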

Sounderajah V, Ashrafian H, Rose S, Shah NH, Ghassemi M, Golub R, Kahn CE, Esteva A, Karthikesalingam A, Mateen B, Webster D, Milea D, Ting D, Treanor D, Cushnan D, King D, McPherson D, Glocker B, Greaves F, Harling L, Ordish J, Cohen JF, Deeks J, Leeflang M, Diamond M, McInnes MDF, McCradden M, Abramoff MD, Normahani P, Markar SR, Chang S, Liu X, Mallett S, Shetty S, Denniston A, Collins GS, Moher D, Whiting P, Bossuyt PM, Darzi A et al., 2021, A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI, Nature Medicine, Vol: 27, Pages: 1663-1665, ISSN: 1078-8956

Journal article

Baltatzis V, Bintsi K-M, Folgoc LL, Manzanera OEM, Ellis S, Nair A, Desai S, Glocker B, Schnabel JA et al., 2021, The pitfalls of sample selection: a case study on lung nodule classification, Predictive Intelligence in Medicine at MICCAI

Using publicly available data to determine the performance of methodological contributions is important as it facilitates reproducibility and allows scrutiny of the published results. In lung nodule classification, for example, many works report results on the publicly available LIDC dataset. In theory, this should allow a direct comparison of the performance of proposed methods and assess the impact of individual contributions. When analyzing seven recent works, however, we find that each employs a different data selection process, leading to widely varying total numbers of samples and ratios between benign and malignant cases. As each subset will have different characteristics with varying difficulty for classification, a direct comparison between the proposed methods is thus not always possible, nor fair. We study the particular effect of truthing when aggregating labels from multiple experts. We show that specific choices can have severe impact on the data distribution where it may be possible to achieve superior performance on one sample distribution but not on another. While we show that we can further improve on the state-of-the-art on one sample selection, we also find that on a more challenging sample selection, on the same database, the more advanced models underperform with respect to very simple baseline methods, highlighting that the selected data distribution may play an even more important role than the model architecture. This raises concerns about the validity of claimed methodological contributions. We believe the community should be aware of these pitfalls and make recommendations on how these can be avoided in future work.

Conference paper

Qaiser T, Winzeck S, Barfoot T, Barwick T, Doran SJ, Kaiser MF, Wedlake L, Tunariu N, Koh D-M, Messiou C, Rockall A, Glocker B et al., 2021, Multiple instance learning with auxiliary task weighting for multiple myeloma classification, International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI)

Whole body magnetic resonance imaging (WB-MRI) is the recommended modality for diagnosis of multiple myeloma (MM). WB-MRI is used to detect sites of disease across the entire skeletal system, but it requires significant expertise and is time-consuming to report due to the great number of images. To aid radiological reading, we propose an auxiliary task-based multiple instance learning approach (ATMIL) for MM classification with the ability to localize sites of disease. This approach is appealing as it only requires patient-level annotations where an attention mechanism is used to identify local regions with active disease. We borrow ideas from multi-task learning and define an auxiliary task with adaptive reweighting to support and improve learning efficiency in the presence of data scarcity. We validate our approach on both synthetic and real multi-center clinical data. We show that the MIL attention module provides a mechanism to localize bone regions while the adaptive reweighting of the auxiliary task considerably improves the performance.

Conference paper

Baltatzis V, Folgoc LL, Ellis S, Manzanera OEM, Bintsi K-M, Nair A, Desai S, Glocker B, Schnabel JA et al., 2021, The effect of the loss on generalization: empirical study on synthetic lung nodule data, Interpretability of Machine Intelligence in Medical Image Computing at MICCAI 2021

Convolutional Neural Networks (CNNs) are widely used for image classification in a variety of fields, including medical imaging. While most studies deploy cross-entropy as the loss function in such tasks, a growing number of approaches have turned to a family of contrastive learning-based losses. Even though performance metrics such as accuracy, sensitivity and specificity are regularly used for the evaluation of CNN classifiers, the features that these classifiers actually learn are rarely identified and their effect on the classification performance on out-of-distribution test samples is insufficiently explored. In this paper, motivated by the real-world task of lung nodule classification, we investigate the features that a CNN learns when trained and tested on different distributions of a synthetic dataset with controlled modes of variation. We show that different loss functions lead to different features being learned and consequently affect the generalization ability of the classifier on unseen data. This study provides some important insights into the design of deep learning solutions for medical imaging tasks.

Conference paper

Kamnitsas K, Winzeck S, Kornaropoulos EN, Whitehouse D, Englman C, Phyu P, Pao N, Menon DK, Rueckert D, Das T, Newcombe VFJ, Glocker B et al., 2021, Transductive image segmentation: Self-training and effect of uncertainty estimation, MICCAI Workshop on Domain Adaptation and Representation Transfer

Semi-supervised learning (SSL) uses unlabeled data during training to learn better models. Previous studies on SSL for medical image segmentation focused mostly on improving model generalization to unseen data. In some applications, however, our primary interest is not generalization but to obtain optimal predictions on a specific unlabeled database that is fully available during model development. Examples include population studies for extracting imaging phenotypes. This work investigates an often overlooked aspect of SSL, transduction. It focuses on the quality of predictions made on the unlabeled data of interest when they are included for optimization during training, rather than improving generalization. We focus on the self-training framework and explore its potential for transduction. We analyze it through the lens of Information Gain and reveal that learning benefits from the use of calibrated or under-confident models. Our extensive experiments on a large MRI database for multi-class segmentation of traumatic brain lesions show promising results when comparing transductive with inductive predictions. We believe this study will inspire further research on transductive learning, a well-suited paradigm for medical image analysis.

Conference paper

Budd S, Sinclair M, Day T, Vlontzos A, Tan J, Liu T, Matthew J, Skelton E, Simpson J, Razavi R, Glocker B, Rueckert D, Robinson EC, Kainz B et al., 2021, Detecting hypo-plastic left heart syndrome in fetal ultrasound via disease-specific atlas maps, 24th International Conference on Medical Image Computing and Computer Assisted Intervention, Publisher: Springer, Pages: 207-217, ISSN: 0302-9743

Fetal ultrasound screening during pregnancy plays a vital role in the early detection of fetal malformations which have potential long-term health impacts. The level of skill required to diagnose such malformations from live ultrasound during examination is high and resources for screening are often limited. We present an interpretable, atlas-learning segmentation method for automatic diagnosis of Hypo-plastic Left Heart Syndrome (HLHS) from a single ‘4 Chamber Heart’ view image. We propose to extend the recently introduced Image-and-Spatial Transformer Networks (Atlas-ISTN) into a framework that enables sensitising atlas generation to disease. In this framework we can jointly learn image segmentation, registration, atlas construction and disease prediction while providing a maximum level of clinical interpretability compared to direct image classification methods. As a result our segmentation allows diagnoses competitive with expert-derived manual diagnosis and yields an AUC-ROC of 0.978 (1043 cases for training, 260 for validation and 325 for testing).

Conference paper

Filbrandt G, Kamnitsas K, Bernstein D, Taylor A, Glocker B et al., 2021, Learning from Partially Overlapping Labels: Image Segmentation under Annotation Shift, MICCAI Workshop on Domain Adaptation and Representation Transfer

Scarcity of high quality annotated images remains a limiting factor for training accurate image segmentation models. While more and more annotated datasets become publicly available, the number of samples in each individual database is often small. Combining different databases to create larger amounts of training data is appealing yet challenging due to the heterogeneity as a result of differences in data acquisition and annotation processes, often yielding incompatible or even conflicting information. In this paper, we investigate and propose several strategies for learning from partially overlapping labels in the context of abdominal organ segmentation. We find that combining a semi-supervised approach with an adaptive cross entropy loss can successfully exploit heterogeneously annotated data and substantially improve segmentation accuracy compared to baseline and alternative approaches.

Conference paper

Glocker B, Musolesi M, Richens J, Uhler C et al., 2021, Causality in digital medicine, Nature Communications, Vol: 12, Pages: 1-6, ISSN: 2041-1723

Ben Glocker (an expert in machine learning for medical imaging, Imperial College London), Mirco Musolesi (a data science and digital health expert, University College London), Jonathan Richens (an expert in diagnostic machine learning models, Babylon Health) and Caroline Uhler (a computational biology expert, MIT) talked to Nature Communications about their research interests in causality inference and how this can provide a robust framework for digital medicine studies and their implementation, across different fields of application.

Journal article

Chen X, Pawlowski N, Glocker B, Konukoglu E et al., 2021, Normative ascent with local gaussians for unsupervised lesion detection, Medical Image Analysis, Vol: 74, ISSN: 1361-8415

Journal article

Li J, Pimentel P, Szengel A, Ehlke M, Lamecker H, Zachow S, Estacio L, Doenitz C, Ramm H, Shi H, Chen X, Matzkin F, Newcombe V, Ferrante E, Jin Y, Ellis DG, Aizenberg MR, Kodym O, Spanel M, Herout A, Mainprize JG, Fishman Z, Hardisty MR, Bayat A, Shit S, Wang B, Liu Z, Eder M, Pepe A, Gsaxner C, Alves V, Zefferer U, von Campe G, Pistracher K, Schaefer U, Schmalstieg D, Menze BH, Glocker B, Egger J et al., 2021, AutoImplant 2020-First MICCAI Challenge on Automatic Cranial Implant Design, IEEE Transactions on Medical Imaging, Vol: 40, Pages: 2329-2342, ISSN: 0278-0062

Journal article

Usynin D, Ziller A, Makowski M, Braren R, Rueckert D, Glocker B, Kaissis G, Passerat-Palmbach J et al., 2021, Adversarial interference and its mitigations in privacy-preserving collaborative machine learning, Nature Machine Intelligence, Vol: 3, Pages: 749-758

Journal article

Co KT, Muñoz-González L, Kanthan L, Glocker B, Lupu EC et al., 2021, Universal adversarial robustness of texture and shape-biased models, IEEE International Conference on Image Processing (ICIP)

Increasing shape-bias in deep neural networks has been shown to improve robustness to common corruptions and noise. In this paper we analyze the adversarial robustness of texture and shape-biased models to Universal Adversarial Perturbations (UAPs). We use UAPs to evaluate the robustness of DNN models with varying degrees of shape-based training. We find that shape-biased models do not markedly improve adversarial robustness, and we show that ensembles of texture and shape-biased models can improve universal adversarial robustness while maintaining strong performance.

Conference paper

Osuala R, Kushibar K, Garrucho L, Linardos A, Szafranowska Z, Klein S, Glocker B, Diaz O, Lekadir K et al., 2021, A review of generative adversarial networks in cancer imaging: new applications, new solutions, Publisher: arXiv

Despite technological and medical advances, the detection, interpretation, and treatment of cancer based on imaging data continue to pose significant challenges. These include high inter-observer variability, difficulty of small-sized lesion detection, nodule interpretation and malignancy determination, inter- and intra-tumour heterogeneity, class imbalance, segmentation inaccuracies, and treatment effect uncertainty. The recent advancements in Generative Adversarial Networks (GANs) in computer vision as well as in medical imaging may provide a basis for enhanced capabilities in cancer detection and analysis. In this review, we assess the potential of GANs to address a number of key challenges of cancer imaging, including data scarcity and imbalance, domain and dataset shifts, data access and privacy, data annotation and quantification, as well as cancer detection, tumour profiling and treatment planning. We provide a critical appraisal of the existing literature of GANs applied to cancer imagery, together with suggestions on future research directions to address these challenges. We analyse and discuss 163 papers that apply adversarial training techniques in the context of cancer imaging and elaborate their methodologies, advantages and limitations. With this work, we strive to bridge the gap between the needs of the clinical cancer imaging community and the current and prospective research on GANs in the artificial intelligence community.

Working paper

Sekuboyina A, Husseini ME, Bayat A, Löffler M, Liebl H, Li H, Tetteh G, Kukačka J, Payer C, Štern D, Urschler M, Chen M, Cheng D, Lessmann N, Hu Y, Wang T, Yang D, Xu D, Ambellan F, Amiranashvili T, Ehlke M, Lamecker H, Lehnert S, Lirio M, Olaguer NPD, Ramm H, Sahu M, Tack A, Zachow S, Jiang T, Ma X, Angerman C, Wang X, Brown K, Wolf M, Kirszenberg A, Puybareau É, Chen D, Bai Y, Rapazzo BH, Yeah T, Zhang A, Xu S, Hou F, He Z, Zeng C, Xiangshang Z, Liming X, Netherton TJ, Mumme RP, Court LE, Huang Z, He C, Wang L-W, Ling SH, Huynh LD, Boutry N, Jakubicek R, Chmelik J, Mulay S, Sivaprakasam M, Paetzold JC, Shit S, Ezhov I, Wiestler B, Glocker B, Valentinitsch A, Rempfler M, Menze BH, Kirschke JS et al., 2021, VerSe: a vertebrae labelling and segmentation benchmark for multi-detector CT images, Medical Image Analysis, ISSN: 1361-8415

Vertebral labelling and segmentation are two fundamental tasks in an automated spine processing pipeline. Reliable and accurate processing of spine images is expected to benefit clinical decision-support systems for diagnosis, surgery planning, and population-based analysis on spine and bone health. However, designing automated algorithms for spine processing is challenging predominantly due to considerable variations in anatomy and acquisition protocols and due to a severe shortage of publicly available data. Addressing these limitations, the Large Scale Vertebrae Segmentation Challenge (VerSe) was organised in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) in 2019 and 2020, with a call for algorithms towards labelling and segmentation of vertebrae. Two datasets containing a total of 374 multi-detector CT scans from 355 patients were prepared and 4505 vertebrae have individually been annotated at voxel-level by a human-machine hybrid algorithm (https://osf.io/nqjyw/, https://osf.io/t98fz/). A total of 25 algorithms were benchmarked on these datasets. In this work, we present the results of this evaluation and further investigate the performance-variation at vertebra-level, scan-level, and at different fields-of-view. We also evaluate the generalisability of the approaches to an implicit domain shift in data by evaluating the top performing algorithms of one challenge iteration on data from the other iteration. The principal takeaway from VerSe: the performance of an algorithm in labelling and segmenting a spine scan hinges on its ability to correctly identify vertebrae in cases of rare anatomical variations. The content and code concerning VerSe can be accessed at: https://github.com/anjany/verse.

Journal article

Islam M, Seenivasan L, Ren H, Glocker B et al., 2021, Class-distribution-aware calibration for long-tailed visual recognition, ICML Workshop on Uncertainty and Robustness in Deep Learning

Despite impressive accuracy, deep neural networks are often miscalibrated and tend to make overly confident predictions. Recent techniques like temperature scaling (TS) and label smoothing (LS) show effectiveness in obtaining a well-calibrated model by smoothing logits and hard labels with scalar factors, respectively. However, the use of a uniform TS or LS factor may not be optimal for calibrating models trained on a long-tailed dataset where the model produces overly confident probabilities for high-frequency classes. In this study, we propose class-distribution-aware TS (CDA-TS) and LS (CDA-LS) by incorporating class frequency information in model calibration in the context of long-tailed distribution. In CDA-TS, the scalar temperature value is replaced with the CDA temperature vector encoded with class frequency to compensate for the over-confidence. Similarly, CDA-LS uses a vector smoothing factor and flattens the hard labels according to their corresponding class distribution. We also integrate the CDA optimal temperature vector with distillation loss, which reduces miscalibration in self-distillation (SD). We empirically show that class-distribution-aware TS and LS can accommodate the imbalanced data distribution yielding superior performance in both calibration error and predictive accuracy. We also observe that SD with an extremely imbalanced dataset is less effective in terms of calibration performance. Code is available at https://github.com/mobarakol/Class-Distribution-Aware-TS-LS.

Conference paper
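
The difference between scalar temperature scaling and a per-class temperature vector, as described in the class-distribution-aware calibration abstract above, can be sketched as follows. Standard temperature scaling divides every logit by one scalar before the softmax; the vector variant divides each logit by its own class temperature. The abstract does not state how the temperature vector is encoded from class frequencies, so `frequency_temperatures` below is a hypothetical encoding chosen only for illustration, and all names and numbers are assumptions rather than the authors' implementation (see the linked repository for that).

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_scale(logits, temperature):
    """Standard TS: a single scalar temperature divides every logit."""
    return softmax([z / temperature for z in logits])


def vector_temperature_scale(logits, temperatures):
    """Per-class variant: each logit is divided by its own temperature."""
    return softmax([z / t for z, t in zip(logits, temperatures)])


def frequency_temperatures(class_counts, base=1.0, strength=1.0):
    """HYPOTHETICAL encoding: more frequent classes get a higher temperature."""
    total = sum(class_counts)
    return [base + strength * c / total for c in class_counts]


# Toy long-tailed setting: class 0 is the high-frequency (head) class.
logits = [4.0, 1.0, 0.5]
temps = frequency_temperatures([900, 90, 10])

uncalibrated = softmax(logits)
calibrated = vector_temperature_scale(logits, temps)
print(uncalibrated[0], calibrated[0])
```

In this toy example the head class receives the largest temperature, so its predicted probability is pulled down more than it would be under a shared scalar of 1.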

Islam M, Glocker B, 2021, Spatially varying label smoothing: capturing uncertainty from expert annotations, Information Processing in Medical Imaging (IPMI) 2021, Publisher: Springer Verlag, Pages: 677-688, ISSN: 0302-9743

The task of image segmentation is inherently noisy due to ambiguities regarding the exact location of boundaries between anatomical structures. We argue that this information can be extracted from the expert annotations at no extra cost, and when integrated into state-of-the-art neural networks, it can lead to improved calibration between soft probabilistic predictions and the underlying uncertainty. We build upon label smoothing (LS) where a network is trained on 'blurred' versions of the ground truth labels which has been shown to be effective for calibrating output predictions. However, LS does not take the local structure into account and results in overly smoothed predictions with low confidence even for non-ambiguous regions. Here, we propose Spatially Varying Label Smoothing (SVLS), a soft labeling technique that captures the structural uncertainty in semantic segmentation. SVLS also naturally lends itself to incorporate inter-rater uncertainty when multiple labelmaps are available. The proposed approach is extensively validated on four clinical segmentation tasks with different imaging modalities, number of classes and single and multi-rater expert annotations. The results demonstrate that SVLS, despite its simplicity, obtains superior boundary prediction with improved uncertainty and model calibration.

Conference paper
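
The contrast drawn in the spatially varying label smoothing abstract above, between uniform label smoothing and smoothing that respects local structure, can be illustrated on a toy 1D label map. Uniform LS softens every pixel identically, including pixels far from any boundary. The spatial variant below averages one-hot labels over a small neighbourhood, so only pixels near a class boundary become soft. The box-filter neighbourhood is an assumption made for this sketch; the abstract does not specify the weighting used in SVLS, and the function names are hypothetical.

```python
def one_hot(label, num_classes):
    """One-hot encoding of an integer class label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def uniform_label_smoothing(label, num_classes, eps=0.1):
    """Classic LS: mix the one-hot target with a uniform distribution."""
    return [(1.0 - eps) * v + eps / num_classes for v in one_hot(label, num_classes)]


def spatial_label_smoothing_1d(labels, num_classes, radius=1):
    """Soft target per pixel = class frequencies within a local window.

    Assumption: an unweighted (box) window stands in for whatever kernel
    a real implementation would use.
    """
    soft = []
    for i in range(len(labels)):
        lo, hi = max(0, i - radius), min(len(labels), i + radius + 1)
        window = labels[lo:hi]
        soft.append([
            sum(1.0 for lab in window if lab == k) / len(window)
            for k in range(num_classes)
        ])
    return soft


# Toy 1D "segmentation": background (0) on the left, structure (1) on the right.
label_map = [0, 0, 0, 1, 1, 1]
soft_targets = spatial_label_smoothing_1d(label_map, num_classes=2)
for pixel, target in zip(label_map, soft_targets):
    print(pixel, target)
```

Pixels at the ends of the toy map keep hard targets, while the two pixels adjacent to the boundary receive mixed targets, which is the qualitative behaviour the abstract attributes to spatially varying smoothing.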

Kart T, Fischer M, Kuestner T, Hepp T, Bamberg F, Winzeck S, Glocker B, Rueckert D, Gatidis S et al., 2021, Deep Learning-Based Automated Abdominal Organ Segmentation in the UK Biobank and German National Cohort Magnetic Resonance Imaging Studies, Investigative Radiology, Vol: 56, Pages: 401-408, ISSN: 0020-9996

Journal article

Popescu SG, Sharp DJ, Cole JH, Kamnitsas K, Glocker B et al., 2021, Distributional gaussian process layers for outlier detection in image segmentation, Information Processing in Medical Imaging (IPMI) 2021, Publisher: arXiv

We propose a parameter efficient Bayesian layer for hierarchical convolutional Gaussian Processes that incorporates Gaussian Processes operating in Wasserstein-2 space to reliably propagate uncertainty. This directly replaces convolving Gaussian Processes with a distance-preserving affine operator on distributions. Our experiments on brain tissue-segmentation show that the resulting architecture approaches the performance of well-established deterministic segmentation algorithms (U-Net), which has never been achieved with previous hierarchical Gaussian Processes. Moreover, by applying the same segmentation model to out-of-distribution data (i.e., images with pathology such as brain tumors), we show that our uncertainty estimates result in out-of-distribution detection that outperforms the capabilities of previous Bayesian networks and reconstruction-based approaches that learn normative distributions.

Conference paper

Reinke A, Eisenmann M, Tizabi MD, Sudre CH, Rädsch T, Antonelli M, Arbel T, Bakas S, Cardoso MJ, Cheplygina V, Farahani K, Glocker B, Heckmann-Nötzel D, Isensee F, Jannin P, Kahn CE, Kleesiek J, Kurc T, Kozubek M, Landman BA, Litjens G, Maier-Hein K, Menze B, Müller H, Petersen J, Reyes M, Rieke N, Stieltjes B, Summers RM, Tsaftaris SA, Ginneken BV, Kopp-Schneider A, Jäger P, Maier-Hein L et al., 2021, Common limitations of image processing metrics: a picture story, Publisher: arXiv

While the importance of automatic image analysis is increasing at an enormous pace, recent meta-research revealed major flaws with respect to algorithm validation. Specifically, performance metrics are key for objective, transparent and comparative performance assessment, but relatively little attention has been given to the practical pitfalls when using specific metrics for a given image analysis task. A common mission of several international initiatives is therefore to provide researchers with guidelines and tools to choose the performance metrics in a problem-aware manner. This dynamically updated document has the purpose to illustrate important limitations of performance metrics commonly applied in the field of image analysis. The current version is based on a Delphi process on metrics conducted by an international consortium of image analysis experts.

Working paper

Manzanera OEM, Ellis S, Baltatzis V, Nair A, Le Folgoc L, Desai S, Glocker B, Schnabel JA et al., 2021, Patient-specific 3D cellular automata nodule growth synthesis in lung cancer without the need of external data, Pages: 925-928, ISSN: 1945-7928

We propose a novel patient-specific generative approach to simulate the emergence and growth of lung nodules using 3D cellular automata (CA) in computed tomography (CT). Our proposed method can be applied to individual images, thus eliminating the need for external images that can contaminate and influence the generative process, a valuable characteristic in the medical domain. Firstly, we employ inpainting to generate pseudo-healthy representations of lung CT scans prior to the visible appearance of each lung nodule. Then, for each nodule, we train a 3D CA to simulate nodule growth and progression using the image of that same nodule as a target. After each CA is trained, we generate early versions of each nodule from a single voxel until the growing nodule closely matches the appearance of the original nodule. These synthesized nodules are inserted where the original nodule was located in the pseudo-healthy inpainted versions of the CTs, which provide realistic context to the generated nodule. We utilize the simulated images for data augmentation, yielding false positive reduction in a nodule detector. We found statistically significant improvements (p < 0.001) in the detection of lung nodules.

Conference paper

Dou Q, So TY, Jiang M, Liu Q, Vardhanabhuti V, Kaissis G, Li Z, Si W, Lee HHC, Yu K, Feng Z, Dong L, Burian E, Jungmann F, Braren R, Makowski M, Kainz B, Rueckert D, Glocker B, Yu SCH, Heng PA et al., 2021, Federated deep learning for detecting COVID-19 lung abnormalities in CT: a privacy-preserving multinational validation study, npj Digital Medicine, Vol: 4, Pages: 1-11, ISSN: 2398-6352

Data privacy mechanisms are essential for rapidly scaling medical training databases to capture the heterogeneity of patient data distributions toward robust and generalizable machine learning systems. In the current COVID-19 pandemic, a major focus of artificial intelligence (AI) is interpreting chest CT, which can be readily used in the assessment and management of the disease. This paper demonstrates the feasibility of a federated learning method for detecting COVID-19 related CT abnormalities with external validation on patients from a multinational study. We recruited 132 patients from seven multinational different centers, with three internal hospitals from Hong Kong for training and testing, and four external, independent datasets from Mainland China and Germany, for validating model generalizability. We also conducted case studies on longitudinal scans for automated estimation of lesion burden for hospitalized COVID-19 patients. We explore the federated learning algorithms to develop a privacy-preserving AI model for COVID-19 medical image diagnosis with good generalization capability on unseen multinational datasets. Federated learning could provide an effective mechanism during pandemics to rapidly develop clinically useful AI across institutions and countries overcoming the burden of central aggregation of large amounts of sensitive data.
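The abstract does not specify the aggregation rule used, so the following is only a generic sketch of weighted federated averaging (FedAvg-style), in which each site trains locally and shares model parameters rather than patient data. The function name, the dictionary layout of the weights, and the example numbers are hypothetical and are not taken from the study.

```python
import numpy as np


def federated_average(client_weights, client_sizes):
    """Combine per-site model parameters into one global model.

    client_weights: list of dicts mapping parameter name -> np.ndarray
    client_sizes:   list of local training-set sizes, one per site
    Each site's contribution is weighted by its share of the total data.
    """
    total = float(sum(client_sizes))
    names = client_weights[0].keys()
    return {
        name: sum(w[name] * (n / total)
                  for w, n in zip(client_weights, client_sizes))
        for name in names
    }


# Two hypothetical sites; only parameters leave each site, never images.
site_a = {"w": np.array([1.0, 3.0])}
site_b = {"w": np.array([3.0, 5.0])}
global_model = federated_average([site_a, site_b], client_sizes=[1, 3])
```

In a full training loop this aggregation step would be repeated each round, with the averaged parameters sent back to every site as the starting point for the next round of local training.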

Journal article

Li Z, Kamnitsas K, Glocker B, 2021, Analyzing overfitting under class imbalance in neural networks for image segmentation, IEEE Transactions on Medical Imaging, Vol: 40, Pages: 1065-1077, ISSN: 0278-0062

Class imbalance poses a challenge for developing unbiased, accurate predictive models. In particular, in image segmentation neural networks may overfit to the foreground samples from small structures, which are often heavily underrepresented in the training set, leading to poor generalization. In this study, we provide new insights on the problem of overfitting under class imbalance by inspecting the network behavior. We find empirically that when training with limited data and strong class imbalance, at test time the distribution of logit activations may shift across the decision boundary, while samples of the well-represented class seem unaffected. This bias leads to a systematic under-segmentation of small structures. This phenomenon is consistently observed for different databases, tasks and network architectures. To tackle this problem, we introduce new asymmetric variants of popular loss functions and regularization techniques including a large margin loss, focal loss, adversarial training, mixup and data augmentation, which are explicitly designed to counter logit shift of the underrepresented classes. Extensive experiments are conducted on several challenging segmentation tasks. Our results demonstrate that the proposed modifications to the objective function can lead to significantly improved segmentation accuracy compared to baselines and alternative approaches.

Journal article

Korkinof D, Harvey H, Heindl A, Karpati E, Williams G, Rijken T, Kecskemethy P, Glocker B et al., 2021, Perceived Realism of High-Resolution Generative Adversarial Network-derived Synthetic Mammograms, Radiol Artif Intell, Vol: 3

PURPOSE: To explore whether generative adversarial networks (GANs) can enable synthesis of realistic medical images that are indiscernible from real images, even by domain experts. MATERIALS AND METHODS: In this retrospective study, progressive growing GANs were used to synthesize mammograms at a resolution of 1280 × 1024 pixels by using images from 90 000 patients (average age, 56 years ± 9) collected between 2009 and 2019. To evaluate the results, a method to assess distributional alignment for ultra-high-dimensional pixel distributions was used, which was based on moment plots. This method was able to reveal potential sources of misalignment. A total of 117 volunteer participants (55 radiologists and 62 nonradiologists) took part in a study to assess the realism of synthetic images from GANs. RESULTS: A quantitative evaluation of distributional alignment shows 60%-78% mutual-information score between the real and synthetic image distributions, and 80%-91% overlap in their support, which are strong indications against mode collapse. It also reveals shape misalignment as the main difference between the two distributions. Obvious artifacts were found by an untrained observer in 13.6% and 6.4% of the synthetic mediolateral oblique and craniocaudal images, respectively. A reader study demonstrated that real and synthetic images are perceptually inseparable by the majority of participants, even by trained breast radiologists. Only one out of the 117 participants was able to reliably distinguish real from synthetic images, and this study discusses the cues they used to do so. CONCLUSION: On the basis of these findings, it appears possible to generate realistic synthetic full-field digital mammograms by using a progressive GAN architecture up to a resolution of 1280 × 1024 pixels. Supplemental material is available for this article. © RSNA, 2020.

Journal article

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.