# Dr Ben Glocker

Faculty of Engineering, Department of Computing

Professor in Machine Learning for Imaging

### Contact

+44 (0)20 7594 8334, b.glocker

### Location

377 Huxley Building, South Kensington Campus

## Publications

276 results found

Islam M, Seenivasan L, Sharan SP, Viekash VK, Gupta B, Glocker B, Ren H et al., 2023, Paced-curriculum distillation with prediction and label uncertainty for image segmentation, International Journal of Computer Assisted Radiology and Surgery, Pages: 1-9, ISSN: 1861-6410

PURPOSE: In curriculum learning, the idea is to train on easier samples first and gradually increase the difficulty, while in self-paced learning, a pacing function defines the speed to adapt the training progress. While both methods heavily rely on the ability to score the difficulty of data samples, an optimal scoring function is still under exploration. METHODOLOGY: Distillation is a knowledge transfer approach where a teacher network guides a student network by feeding a sequence of random samples. We argue that guiding student networks with an efficient curriculum strategy can improve model generalization and robustness. For this purpose, we design an uncertainty-based paced curriculum learning in self-distillation for medical image segmentation. We fuse the prediction uncertainty and annotation boundary uncertainty to develop a novel paced-curriculum distillation (P-CD). We utilize the teacher model to obtain prediction uncertainty and spatially varying label smoothing with Gaussian kernel to generate segmentation boundary uncertainty from the annotation. We also investigate the robustness of our method by applying various types and severity of image perturbation and corruption. RESULTS: The proposed technique is validated on two medical datasets of breast ultrasound image segmentation and robot-assisted surgical scene segmentation and achieved significantly better performance in terms of segmentation and robustness. CONCLUSION: P-CD improves the performance and obtains better generalization and robustness over the dataset shift. While curriculum learning requires extensive tuning of hyper-parameters for pacing function, the level of performance improvement suppresses this limitation.

Journal article

Glocker B, Jones C, Bernhardt M, Winzeck S et al., 2023, Algorithmic encoding of protected characteristics in chest X-ray disease detection models, EBioMedicine, Vol: 89, Pages: 1-19, ISSN: 2352-3964

Background: It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. An algorithm may encode protected characteristics, and then use this information for making predictions due to undesirable correlations in the (historical) training data. It remains unclear how we can establish whether such information is actually used. Besides the scarcity of data from underserved populations, very little is known about how dataset biases manifest in predictive models and how this may result in disparate performance. This article aims to shed some light on these issues by exploring methodology for subgroup analysis in image-based disease detection models. Methods: We utilize two publicly available chest X-ray datasets, CheXpert and MIMIC-CXR, to study performance disparities across race and biological sex in deep learning models. We explore test set resampling, transfer learning, multitask learning, and model inspection to assess the relationship between the encoding of protected characteristics and disease detection performance across subgroups. Findings: We confirm subgroup disparities in terms of shifted true and false positive rates which are partially removed after correcting for population and prevalence shifts in the test sets. We find that transfer learning alone is insufficient for establishing whether specific patient information is used for making predictions. The proposed combination of test-set resampling, multitask learning, and model inspection reveals valuable insights about the way protected characteristics are encoded in the feature representations of deep neural networks. Interpretation: Subgroup analysis is key for identifying performance disparities of AI models, but statistical differences across subgroups need to be taken into account when analyzing potential biases in disease detection. The proposed methodology provides a comprehensive framework for subgroup analysis enabling further research into the underlyi

Journal article

Li Z, Kamnitsas K, Ouyang C, Chen C, Glocker B et al., 2023, Context label learning: improving background class representations in semantic segmentation, IEEE Transactions on Medical Imaging, Pages: 1-12, ISSN: 0278-0062

Background samples provide key contextual information for segmenting regions of interest (ROIs). However, they always cover a diverse set of structures, causing difficulties for the segmentation model to learn good decision boundaries with high sensitivity and precision. The issue concerns the highly heterogeneous nature of the background class, resulting in multi-modal distributions. Empirically, we find that neural networks trained with heterogeneous background struggle to map the corresponding contextual samples to compact clusters in feature space. As a result, the distribution over background logit activations may shift across the decision boundary, leading to systematic over-segmentation across different datasets and tasks. In this study, we propose context label learning (CoLab) to improve the context representations by decomposing the background class into several subclasses. Specifically, we train an auxiliary network as a task generator, along with the primary segmentation model, to automatically generate context labels that positively affect the ROI segmentation accuracy. Extensive experiments are conducted on several challenging segmentation tasks and datasets. The results demonstrate that CoLab can guide the segmentation model to map the logits of background samples away from the decision boundary, resulting in significantly improved segmentation accuracy. Code is available.

Journal article

Monteiro M, De Sousa Ribeiro F, Pawlowski N, Coelho De Castro D, Glocker B et al., 2023, Measuring axiomatic soundness of counterfactual image models, International Conference on Learning Representations (ICLR)

We use the axiomatic definition of counterfactual to derive metrics that enable quantifying the correctness of approximate counterfactual inference models. Abstract: We present a general framework for evaluating image counterfactuals. The power and flexibility of deep generative models make them valuable tools for learning mechanisms in structural causal models. However, their flexibility makes counterfactual identifiability impossible in the general case. Motivated by these issues, we revisit Pearl's axiomatic definition of counterfactuals to determine the necessary constraints of any counterfactual inference model: composition, reversibility, and effectiveness. We frame counterfactuals as functions of an input variable, its parents, and counterfactual parents and use the axiomatic constraints to restrict the set of functions that could represent the counterfactual, thus deriving distance metrics between the approximate and ideal functions. We demonstrate how these metrics can be used to compare and choose between different approximate counterfactual inference models and to provide insight into a model's shortcomings and trade-offs.

Conference paper

Mackay K, Bernstein D, Glocker B, Kamnitsas K, Taylor A et al., 2023, A Review of the Metrics Used to Assess Auto-Contouring Systems in Radiotherapy, Clin Oncol (R Coll Radiol)

Auto-contouring could revolutionise future planning of radiotherapy treatment. The lack of consensus on how to assess and validate auto-contouring systems currently limits clinical use. This review formally quantifies the assessment metrics used in studies published during one calendar year and assesses the need for standardised practice. A PubMed literature search was undertaken for papers evaluating radiotherapy auto-contouring published during 2021. Papers were assessed for types of metric and the methodology used to generate ground-truth comparators. Our PubMed search identified 212 studies, of which 117 met the criteria for clinical review. Geometric assessment metrics were used in 116 of 117 studies (99.1%). This includes the Dice Similarity Coefficient used in 113 (96.6%) studies. Clinically relevant metrics, such as qualitative, dosimetric and time-saving metrics, were less frequently used in 22 (18.8%), 27 (23.1%) and 18 (15.4%) of 117 studies, respectively. There was heterogeneity within each category of metric. Over 90 different names for geometric measures were used. Methods for qualitative assessment were different in all but two papers. Variation existed in the methods used to generate radiotherapy plans for dosimetric assessment. Consideration of editing time was only given in 11 (9.4%) papers. A single manual contour as a ground-truth comparator was used in 65 (55.6%) studies. Only 31 (26.5%) studies compared auto-contours to usual inter- and/or intra-observer variation. In conclusion, significant variation exists in how research papers currently assess the accuracy of automatically generated contours. Geometric measures are the most popular, however their clinical utility is unknown. There is heterogeneity in the methods used to perform clinical assessment. Considering the different stages of system implementation may provide a framework to decide the most appropriate metrics. This analysis supports the need for a consensus on the clinical implement

Journal article

Pati S, Baid U, Edwards B, Sheller M, Wang S-H, Reina GA, Foley P, Gruzdev A, Karkada D, Davatzikos C, Sako C, Ghodasara S, Bilello M, Mohan S, Vollmuth P, Brugnara G, Preetha CJ, Sahm F, Maier-Hein K, Zenk M, Bendszus M, Wick W, Calabrese E, Rudie J, Villanueva-Meyer J, Cha S, Ingalhalikar M, Jadhav M, Pandey U, Saini J, Garrett J, Larson M, Jeraj R, Currie S, Frood R, Fatania K, Huang RY, Chang K, Balaña C, Capellades J, Puig J, Trenkler J, Pichler J, Necker G, Haunschmidt A, Meckel S, Shukla G, Liem S, Alexander GS, Lombardo J, Palmer JD, Flanders AE, Dicker AP, Sair HI, Jones CK, Venkataraman A, Jiang M, So TY, Chen C, Heng PA, Dou Q, Kozubek M, Lux F, Michálek J, Matula P, Keřkovský M, Kopřivová T, Dostál M, Vybíhal V, Vogelbaum MA, Mitchell JR, Farinhas J, Maldjian JA, Yogananda CGB, Pinho MC, Reddy D, Holcomb J, Wagner BC, Ellingson BM, Cloughesy TF, Raymond C, Oughourlian T, Hagiwara A, Wang C, To M-S, Bhardwaj S, Chong C, Agzarian M, Falcão AX, Martins SB, Teixeira BCA, Sprenger F, Menotti D, Lucio DR, LaMontagne P, Marcus D, Wiestler B, Kofler F, Ezhov I, Metz M, Jain R, Lee M, Lui YW, McKinley R, Slotboom J, Radojewski P, Meier R, Wiest R, Murcia D, Fu E, Haas R, Thompson J, Ormond DR, Badve C, Sloan AE, Vadmal V, Waite K, Colen RR, Pei L, Ak M, Srinivasan A, Bapuraj JR, Rao A, Wang N, Yoshiaki O, Moritani T, Turk S, Lee J, Prabhudesai S, Morón F, Mandel J, Kamnitsas K, Glocker B, Dixon LVM, Williams M, Zampakis P, Panagiotopoulos V, Tsiganos P, Alexiou S, Haliassos I, Zacharaki EI, Moustakas K, Kalogeropoulou C, Kardamakis DM, Choi YS, Lee S-K, Chang JH, Ahn SS, Luo B, Poisson L, Wen N, Tiwari P, Verma R, Bareja R, Yadav I, Chen J, Kumar N, Smits M, van der Voort SR, Alafandi A, Incekara F, Wijnenga MMJ, Kapsas G, Gahrmann R, Schouten JW, Dubbink HJ, Vincent AJPE, van den Bent MJ, French PJ, Klein S, Yuan Y, Sharma S, Tseng T-C, Adabi S, Niclou SP, Keunen O, Hau A-C, Vallières M, Fortin D, Lepage M, Landman B, Ramadass K, Xu K, Chotai S, Chambless LB, 
Mi et al., 2023, Author Correction: Federated learning enables big data for rare cancer boundary detection, Nature Communications, Vol: 14, Pages: 436-436, ISSN: 2041-1723

Journal article

Batten J, Sinclair M, Glocker B, Schaap M et al., 2023, Image To Tree with Recursive Prompting

Extracting complex structures from grid-based data is a common key step in automated medical image analysis. The conventional solution to recovering tree-structured geometries typically involves computing the minimal cost path through intermediate representations derived from segmentation masks. However, this methodology has significant limitations in the context of projective imaging of tree-structured 3D anatomical data such as coronary arteries, since there are often overlapping branches in the 2D projection. In this work, we propose a novel approach to predicting tree connectivity structure which reformulates the task as an optimization problem over individual steps of a recursive process. We design and train a two-stage model which leverages the UNet and Transformer architectures and introduces an image-based prompting technique. Our proposed method achieves compelling results on a pair of synthetic datasets, and outperforms a shortest-path baseline.

Working paper

Dorent R, Kujawa A, Ivory M, Bakas S, Rieke N, Joutard S, Glocker B, Cardoso J, Modat M, Batmanghelich K, Belkov A, Calisto MB, Choi JW, Dawant BM, Dong H, Escalera S, Fan Y, Hansen L, Heinrich MP, Joshi S, Kashtanova V, Kim HG, Kondo S, Kruse CN, Lai-Yuen SK, Li H, Liu H, Ly B, Oguz I, Shin H, Shirokikh B, Su Z, Wang G, Wu J, Xu Y, Yao K, Zhang L, Ourselin S, Shapey J, Vercauteren T et al., 2023, CrossMoDA 2021 challenge: Benchmark of cross-modality domain adaptation techniques for vestibular schwannoma and cochlea segmentation, Publisher: Elsevier

Working paper

Xu M, Islam M, Glocker B, Ren H et al., 2022, Confidence-Aware Paced-Curriculum Learning by Label Smoothing for Surgical Scene Understanding

Curriculum learning and self-paced learning are the training strategies that gradually feed the samples from easy to more complex. They have captivated increasing attention due to their excellent performance in robotic vision. Most recent works focus on designing curricula based on difficulty levels in input samples or smoothing the feature maps. However, smoothing labels to control the learning utility in a curriculum manner is still unexplored. In this work, we design a paced curriculum by label smoothing (P-CBLS) using paced learning with uniform label smoothing (ULS) for classification tasks and fuse uniform and spatially varying label smoothing (SVLS) for semantic segmentation tasks in a curriculum manner. In ULS and SVLS, a bigger smoothing factor value enforces a heavy smoothing penalty in the true label and limits learning less information. Therefore, we design the curriculum by label smoothing (CBLS). We set a bigger smoothing value at the beginning of training and gradually decreased it to zero to control the model learning utility from lower to higher. We also designed a confidence-aware pacing function and combined it with our CBLS to investigate the benefits of various curricula. The proposed techniques are validated on four robotic surgery datasets of multi-class, multi-label classification, captioning, and segmentation tasks. We also investigate the robustness of our method by corrupting validation data into different severity levels. Our extensive analysis shows that the proposed method improves prediction accuracy and robustness.

Working paper

Gatidis S, Kart T, Fischer M, Winzeck S, Glocker B, Bai W, Bülow R, Emmel C, Friedrich L, Kauczor H-U, Keil T, Kröncke T, Mayer P, Niendorf T, Peters A, Pischon T, Schaarschmidt BM, Schmidt B, Schulze MB, Umutle L, Völzke H, Küstner T, Bamberg F, Schölkopf B, Rueckert D et al., 2022, Better Together: Data Harmonization and Cross-Study Analysis of Abdominal MRI Data From UK Biobank and the German National Cohort, Invest Radiol

OBJECTIVES: The UK Biobank (UKBB) and German National Cohort (NAKO) are among the largest cohort studies, capturing a wide range of health-related data from the general population, including comprehensive magnetic resonance imaging (MRI) examinations. The purpose of this study was to demonstrate how MRI data from these large-scale studies can be jointly analyzed and to derive comprehensive quantitative image-based phenotypes across the general adult population. MATERIALS AND METHODS: Image-derived features of abdominal organs (volumes of liver, spleen, kidneys, and pancreas; volumes of kidney hilum adipose tissue; and fat fractions of liver and pancreas) were extracted from T1-weighted Dixon MRI data of 17,996 participants of UKBB and NAKO based on quality-controlled deep learning generated organ segmentations. To enable valid cross-study analysis, we first analyzed the data generating process using methods of causal discovery. We subsequently harmonized data from UKBB and NAKO using the ComBat approach for batch effect correction. We finally performed quantile regression on harmonized data across studies providing quantitative models for the variation of image-derived features stratified for sex and dependent on age, height, and weight. RESULTS: Data from 8791 UKBB participants (49.9% female; age, 63 ± 7.5 years) and 9205 NAKO participants (49.1% female, age: 51.8 ± 11.4 years) were analyzed. Analysis of the data generating process revealed direct effects of age, sex, height, weight, and the data source (UKBB vs NAKO) on image-derived features. Correction of data source-related effects resulted in markedly improved alignment of image-derived features between UKBB and NAKO. Cross-study analysis on harmonized data revealed comprehensive quantitative models for the phenotypic variation of abdominal organs across the general adult population. CONCLUSIONS: Cross-study analysis of MRI data from UKBB and NAKO as proposed in this work can be helpful for futur

Journal article

Chalkidou A, Shokraneh F, Kijauskaite G, Taylor-Phillips S, Halligan S, Wilkinson L, Glocker B, Garrett P, Denniston AK, Mackie A, Seedat F et al., 2022, Recommendations for the development and use of imaging test sets to investigate the test performance of artificial intelligence in health screening, Lancet Digit Health, Vol: 4, Pages: e899-e905

Rigorous evaluation of artificial intelligence (AI) systems for image classification is essential before deployment into health-care settings, such as screening programmes, so that adoption is effective and safe. A key step in the evaluation process is the external validation of diagnostic performance using a test set of images. We conducted a rapid literature review on methods to develop test sets, published from 2012 to 2020, in English. Using thematic analysis, we mapped themes and coded the principles using the Population, Intervention, and Comparator or Reference standard, Outcome, and Study design framework. A group of screening and AI experts assessed the evidence-based principles for completeness and provided further considerations. From the final 15 principles recommended here, five affect population, one intervention, two comparator, one reference standard, and one both reference standard and comparator. Finally, four are applicable to outcome and one to study design. Principles from the literature were useful to address biases from AI; however, they did not account for screening specific biases, which we now incorporate. The principles set out here should be used to support the development and use of test sets for studies that assess the accuracy of AI within screening programmes, to ensure they are fit for purpose and minimise bias.

Journal article

Kart T, Fischer M, Winzeck S, Glocker B, Bai W, Buelow R, Emmel C, Friedrich L, Kauczor H-U, Keil T, Kroencke T, Mayer P, Niendorf T, Peters A, Pischon T, Schaarschmidt BM, Schmidt B, Schulze MB, Umutle L, Voelzke H, Kuestner T, Bamberg F, Schoelkopf B, Rueckert D, Gatidis S et al., 2022, Automated imaging-based abdominal organ segmentation and quality control in 20,000 participants of the UK Biobank and German National Cohort Studies, Scientific Reports, Vol: 12, ISSN: 2045-2322

Journal article

Rosnati M, Ribeiro FDS, Monteiro M, Castro DCD, Glocker B et al., 2022, Analysing the effectiveness of a generative model for semi-supervised medical image segmentation

Image segmentation is important in medical imaging, providing valuable, quantitative information for clinical decision-making in diagnosis, therapy, and intervention. The state-of-the-art in automated segmentation remains supervised learning, employing discriminative models such as U-Net. However, training these models requires access to large amounts of manually labelled data which is often difficult to obtain in real medical applications. In such settings, semi-supervised learning (SSL) attempts to leverage the abundance of unlabelled data to obtain more robust and reliable models. Recently, generative models have been proposed for semantic segmentation, as they make an attractive choice for SSL. Their ability to capture the joint distribution over input images and output label maps provides a natural way to incorporate information from unlabelled images. This paper analyses whether deep generative models such as the SemanticGAN are truly viable alternatives to tackle challenging medical image segmentation problems. To that end, we thoroughly evaluate the segmentation performance, robustness, and potential subgroup disparities of discriminative and generative segmentation methods when applied to large-scale, publicly available chest X-ray datasets.

Working paper

Kori A, Glocker B, Toni F, 2022, Visual Debates

The natural way of obtaining different perspectives on any given topic is by conducting a debate, where participants argue for and against the topic. Here, we propose a novel debate framework for understanding the classifier's reasoning for making a particular prediction by modelling it as a multiplayer sequential zero-sum game. The players aim to maximise their utilities by adjusting their arguments with respect to other players' counterarguments. The contrastive nature of our framework encourages players to put forward diverse arguments, picking up the reasoning trails missed by their opponents. Thus, our framework answers the question: why did the classifier make a certain prediction?, by allowing players to argue for and against the classifier's decision. In the proposed setup, given the question and the classifier's latent knowledge, both agents take turns in proposing arguments to support or contradict the classifier's decision; arguments here correspond to the selection of specific features from the discretised latent space of the continuous classifier. By the end of the debate, we collect sets of supportive and manipulative features, serving as an explanation depicting the internal reasoning of the classifier. We demonstrate our Visual Debates on the geometric SHAPE and MNIST datasets for subjective validation, followed by the high-resolution AFHQ dataset. For further investigation, our framework is available at https://github.com/koriavinash1/VisualDebates.

Working paper

Shehata N, Bain W, Glocker B, 2022, A Comparative Study of Graph Neural Networks for Shape Classification in Neuroimaging, Proceedings of Machine Learning Research, GeoMedIA Workshop

Graph neural networks have emerged as a promising approach for the analysis of non-Euclidean data such as meshes. In medical imaging, mesh-like data plays an important role for modelling anatomical structures, and shape classification can be used in computer aided diagnosis and disease detection. However, with a plethora of options, the best architectural choices for medical shape analysis using GNNs remain unclear. We conduct a comparative analysis to provide practitioners with an overview of the current state-of-the-art in geometric deep learning for shape classification in neuroimaging. Using biological sex classification as a proof-of-concept task, we find that using FPFH as node features substantially improves GNN performance and generalisation to out-of-distribution data; we compare the performance of three alternative convolutional layers; and we reinforce the importance of data augmentation for graph based learning. We then confirm these results hold for a clinically relevant task, using the classification of Alzheimer's disease.

Conference paper

Rosnati M, Soreq E, Monteiro M, Li L, Graham NSN, Zimmerman K, Rossi C, Carrara G, Bertolini G, Sharp DJ, Glocker B et al., 2022, Automatic lesion analysis for increased efficiency in outcome prediction of traumatic brain injury, 5th International Workshop, MLCN 2022, Publisher: Springer Nature Switzerland, Pages: 135-146, ISSN: 0302-9743

The accurate prognosis for traumatic brain injury (TBI) patients is difficult yet essential to inform therapy, patient management, and long-term after-care. Patient characteristics such as age, motor and pupil responsiveness, hypoxia and hypotension, and radiological findings on computed tomography (CT), have been identified as important variables for TBI outcome prediction. CT is the acute imaging modality of choice in clinical practice because of its acquisition speed and widespread availability. However, this modality is mainly used for qualitative and semi-quantitative assessment, such as the Marshall scoring system, which is prone to subjectivity and human errors. This work explores the predictive power of imaging biomarkers extracted from routinely-acquired hospital admission CT scans using a state-of-the-art, deep learning TBI lesion segmentation method. We use lesion volumes and corresponding lesion statistics as inputs for an extended TBI outcome prediction model. We compare the predictive power of our proposed features to the Marshall score, independently and when paired with classic TBI biomarkers. We find that automatically extracted quantitative CT features perform similarly or better than the Marshall score in predicting unfavourable TBI outcomes. Leveraging automatic atlas alignment, we also identify frontal extra-axial lesions as important indicators of poor outcome. Our work may contribute to a better understanding of TBI, and provides new insights into how automated neuroimaging analysis can be used to improve prognostication after TBI.

Conference paper

Satchwell L, Wedlake L, Greenlay E, Li X, Messiou C, Glocker B, Barwick T, Barfoot T, Doran S, Leach MO, Koh DM, Kaiser M, Winzeck S, Qaiser T, Aboagye E, Rockall A et al., 2022, Development of machine learning support for reading whole body diffusion-weighted MRI (WB-MRI) in myeloma for the detection and quantification of the extent of disease before and after treatment (MALIMAR): protocol for a cross-sectional diagnostic test accuracy study, BMJ Open, Vol: 12, Pages: 1-9, ISSN: 2044-6055

Journal article

Islam M, Glocker B, 2022, Frequency Dropout: Feature-Level Regularization via Randomized Filtering

Deep convolutional neural networks have shown remarkable performance on various computer vision tasks, and yet, they are susceptible to picking up spurious correlations from the training signal. So-called 'shortcuts' can occur during learning, for example, when there are specific frequencies present in the image data that correlate with the output predictions. Both high and low frequencies can be characteristic of the underlying noise distribution caused by the image acquisition rather than in relation to the task-relevant information about the image content. Models that learn features related to this characteristic noise will not generalize well to new data. In this work, we propose a simple yet effective training strategy, Frequency Dropout, to prevent convolutional neural networks from learning frequency-specific imaging features. We employ randomized filtering of feature maps during training which acts as a feature-level regularization. In this study, we consider common image processing filters such as Gaussian smoothing, Laplacian of Gaussian, and Gabor filtering. Our training strategy is model-agnostic and can be used for any computer vision task. We demonstrate the effectiveness of Frequency Dropout on a range of popular architectures and multiple tasks including image classification, domain adaptation, and semantic segmentation using both computer vision and medical imaging datasets. Our results suggest that the proposed approach does not only improve predictive accuracy but also improves robustness against domain shift.

Working paper

Li Z, Kamnitsas K, Islam M, Chen C, Glocker B et al., 2022, Estimating model performance under domain shifts with class-specific confidence scores, MICCAI 2022 25th International Conference, Publisher: Springer Nature Switzerland, Pages: 693-703, ISSN: 0302-9743

Machine learning models are typically deployed in a test setting that differs from the training setting, potentially leading to decreased model performance because of domain shift. If we could estimate the performance that a pre-trained model would achieve on data from a specific deployment setting, for example a certain clinic, we could judge whether the model could safely be deployed or if its performance degrades unacceptably on the specific data. Existing approaches estimate this based on the confidence of predictions made on unlabeled test data from the deployment’s domain. We find existing methods struggle with data that present class imbalance, because the methods used to calibrate confidence do not account for bias induced by class imbalance, consequently failing to estimate class-wise accuracy. Here, we introduce class-wise calibration within the framework of performance estimation for imbalanced datasets. Specifically, we derive class-specific modifications of state-of-the-art confidence-based model evaluation methods including temperature scaling (TS), difference of confidences (DoC), and average thresholded confidence (ATC). We also extend the methods to estimate Dice similarity coefficient (DSC) in image segmentation. We conduct experiments on four tasks and find the proposed modifications consistently improve the estimation accuracy for imbalanced datasets. Our methods improve accuracy estimation by 18% in classification under natural domain shifts, and double the estimation accuracy on segmentation tasks, when compared with prior methods (Code is available at https://github.com/ZerojumpLine/ModelEvaluationUnderClassImbalance).

Conference paper

Glocker B, Jones C, Bernhardt M, Winzeck S et al., 2022, Risk of Bias in Chest X-ray Foundation Models

Foundation models are considered a breakthrough in all applications of AI, promising robust and reusable mechanisms for feature extraction, alleviating the need for large amounts of high quality annotated training data for task-specific prediction models. However, foundation models may potentially encode and even reinforce existing biases present in historic datasets. Given the limited ability to scrutinize foundation models, it remains unclear whether the opportunities outweigh the risks in safety critical applications such as clinical decision making. In our statistical bias analysis of a recently published, and publicly accessible chest X-ray foundation model, we found reasons for concern as the model seems to encode protected characteristics including biological sex and racial identity. When used for the downstream application of disease detection, we observed substantial degradation of performance of the foundation model compared to a standard model with specific disparities in protected subgroups. While research into foundation models for healthcare applications is in an early stage, we hope to raise awareness of the risks by highlighting the importance of conducting thorough bias and subgroup performance analyses.

Working paper

Rasal R, Castro DC, Pawlowski N, Glocker B et al., 2022, Deep Structural Causal Shape Models

Causal reasoning provides a language to ask important interventional and counterfactual questions beyond purely statistical association. In medical imaging, for example, we may want to study the causal effect of genetic, environmental, or lifestyle factors on the normal and pathological variation of anatomical phenotypes. However, while anatomical shape models of 3D surface meshes, extracted from automated image segmentation, can be reliably constructed, there is a lack of computational tooling to enable causal reasoning about morphological variations. To tackle this problem, we propose deep structural causal shape models (CSMs), which utilise high-quality mesh generation techniques, from geometric deep learning, within the expressive framework of deep structural causal models. CSMs enable subject-specific prognoses through counterfactual mesh generation ("How would this patient's brain structure change if they were ten years older?"), which is in contrast to most current works on purely population-level statistical shape modelling. We demonstrate the capabilities of CSMs at all levels of Pearl's causal hierarchy through a number of qualitative and quantitative experiments leveraging a large dataset of 3D brain structures.

Working paper

Ellis S, Manzanera OEM, Baltatzis V, Nawaz I, Nair A, Folgoc LL, Desai S, Glocker B, Schnabel JA et al., 2022, Evaluation of 3D GANs for lung tissue modelling in pulmonary CT, The Journal of Machine Learning for Biomedical Imaging

GANs are able to model accurately the distribution of complex, high-dimensional datasets, e.g. images. This makes high-quality GANs useful for unsupervised anomaly detection in medical imaging. However, differences in training datasets such as output image dimensionality and appearance of semantically meaningful features mean that GAN models from the natural image domain may not work 'out-of-the-box' for medical imaging, necessitating re-implementation and re-evaluation. In this work we adapt and evaluate three GAN models to the task of modelling 3D healthy image patches for pulmonary CT. To the best of our knowledge, this is the first time that such an evaluation has been performed. The DCGAN, styleGAN and the bigGAN architectures were investigated due to their ubiquity and high performance in natural image processing. We train different variants of these methods and assess their performance using the FID score. In addition, the quality of the generated images was evaluated by a human observer study, the ability of the networks to model 3D domain-specific features was investigated, and the structure of the GAN latent spaces was analysed. Results show that the 3D styleGAN produces realistic-looking images with meaningful 3D structure, but suffers from mode collapse which must be addressed during training to obtain sample diversity. Conversely, the 3D DCGAN models show a greater capacity for image variability, but at the cost of poor-quality images. The 3D bigGAN models provide an intermediate level of image quality, but most accurately model the distribution of selected semantically meaningful features. The results suggest that future development is required to realise a 3D GAN with sufficient capacity for patch-based lung CT anomaly detection and we offer recommendations for future areas of research, such as experimenting with other architectures and incorporation of position-encoding.

Journal article

Vasey B, Nagendran M, Campbell B, Clifton DA, Collins GS, Denaxas S, Denniston AK, Faes L, Geerts B, Ibrahim M, Liu X, Mateen BA, Mathur P, McCradden MD, Morgan L, Ordish J, Rogers C, Saria S, Ting DSW, Watkinson P, Weber W, Wheatstone P, McCulloch P et al., 2022, Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI (May, 10.1038/s41591-022-01772-9, 2022), NATURE MEDICINE, ISSN: 1078-8956

Journal article

Santhirasekaram A, Kori A, Rockall A, Winkler M, Toni F, Glocker B et al., 2022, Hierarchical Symbolic Reasoning in Hyperbolic Space for Deep Discriminative Models

Explanations for black-box models help us understand model decisions as well as provide information on model biases and inconsistencies. Most of the current explainability techniques provide a single level of explanation, often in terms of feature importance scores or feature attention maps in input space. Our focus is on explaining deep discriminative models at multiple levels of abstraction, from fine-grained to fully abstract explanations. We achieve this by using the natural properties of hyperbolic geometry to more efficiently model a hierarchy of symbolic features and generate hierarchical symbolic rules as part of our explanations. Specifically, for any given deep discriminative model, we distill the underpinning knowledge by discretisation of the continuous latent space using vector quantisation to form symbols, followed by a hyperbolic reasoning block to induce an abstraction tree. We traverse the tree to extract explanations in terms of symbolic rules and its corresponding visual semantics. We demonstrate the effectiveness of our method on the MNIST and AFHQ high-resolution animal faces dataset. Our framework is available at https://github.com/koriavinash1/SymbolicInterpretability.

Working paper

Kori A, Toni F, Glocker B, 2022, GLANCE: Global to Local Architecture-Neutral Concept-based Explanations

Most of the current explainability techniques focus on capturing the importance of features in input space. However, given the complexity of models and data-generating processes, the resulting explanations are far from being 'complete', in that they lack an indication of feature interactions and visualization of their 'effect'. In this work, we propose a novel twin-surrogate explainability framework to explain the decisions made by any CNN-based image classifier (irrespective of the architecture). For this, we first disentangle latent features from the classifier, followed by aligning these features to observed/human-defined 'context' features. These aligned features form semantically meaningful concepts that are used for extracting a causal graph depicting the 'perceived' data-generating process, describing the inter- and intra-feature interactions between unobserved latent features and observed 'context' features. This causal graph serves as a global model from which local explanations of different forms can be extracted. Specifically, we provide a generator to visualize the 'effect' of interactions among features in latent space and draw feature importance therefrom as local explanations. Our framework utilizes adversarial knowledge distillation to faithfully learn a representation from the classifiers' latent space and use it for extracting visual explanations. We use the styleGAN-v2 architecture with an additional regularization term to enforce disentanglement and alignment. We demonstrate and evaluate explanations obtained with our framework on Morpho-MNIST and on the FFHQ human faces dataset. Our framework is available at https://github.com/koriavinash1/GLANCE-Explanations.

Working paper

Taylor-Phillips S, Seedat F, Kijauskaite G, Marshall J, Halligan S, Hyde C, Given-Wilson R, Wilkinson L, Denniston AK, Glocker B, Garrett P, Mackie A, Steele RJ et al., 2022, UK National Screening Committee's approach to reviewing evidence on artificial intelligence in breast cancer screening, The Lancet Digital Health, Vol: 4, Pages: e558-e565, ISSN: 2589-7500

Artificial intelligence (AI) could have the potential to accurately classify mammograms according to the presence or absence of radiological signs of breast cancer, replacing or supplementing human readers (radiologists). The UK National Screening Committee's assessments of the use of AI systems to examine screening mammograms continues to focus on maximising benefits and minimising harms to women screened, when deciding whether to recommend the implementation of AI into the Breast Screening Programme in the UK. Maintaining or improving programme specificity is important to minimise anxiety from false positive results. When considering cancer detection, AI test sensitivity alone is not sufficiently informative, and additional information on the spectrum of disease detected and interval cancers is crucial to better understand the benefits and harms of screening. Although large retrospective studies might provide useful evidence by directly comparing test accuracy and spectrum of disease detected between different AI systems and by population subgroup, most retrospective studies are biased due to differential verification (ie, the use of different reference standards to verify the target condition among study participants). Enriched, multiple-reader, multiple-case, test set laboratory studies are also biased due to the laboratory effect (ie, radiologists' performance in retrospective, laboratory, observer studies is substantially different to their performance in a clinical environment). Therefore, assessment of the effect of incorporating any AI system into the breast screening pathway in prospective studies is required as it will provide key evidence for the effect of the interaction of medical staff with AI, and the impact on women's outcomes.

Journal article

Bernhardt M, Jones C, Glocker B, 2022, Potential sources of dataset bias complicate investigation of underdiagnosis by machine learning algorithms, NATURE MEDICINE, Vol: 28, Pages: 1157-+, ISSN: 1078-8956

Journal article

Maier-Hein L, Reinke A, Godau P, Tizabi MD, Büttner F, Christodoulou E, Glocker B, Isensee F, Kleesiek J, Kozubek M, Reyes M, Riegler MA, Wiesenfarth M, Kavur E, Sudre CH, Baumgartner M, Eisenmann M, Heckmann-Nötzel D, Rädsch AT, Acion L, Antonelli M, Arbel T, Bakas S, Benis A, Blaschko M, Cardoso MJ, Cheplygina V, Cimini BA, Collins GS, Farahani K, Ferrer L, Galdran A, Ginneken BV, Haase R, Hashimoto DA, Hoffman MM, Huisman M, Jannin P, Kahn CE, Kainmueller D, Kainz B, Karargyris A, Karthikesalingam A, Kenngott H, Kofler F, Kopp-Schneider A, Kreshuk A, Kurc T, Landman BA, Litjens G, Madani A, Maier-Hein K, Martel AL, Mattson P, Meijering E, Menze B, Moons KGM, Müller H, Nichyporuk B, Nickel F, Petersen J, Rajpoot N, Rieke N, Saez-Rodriguez J, Sánchez CI, Shetty S, Smeden MV, Summers RM, Taha AA, Tiulpin A, Tsaftaris SA, Calster BV, Varoquaux G, Jäger PF et al., 2022, Metrics reloaded: Pitfalls and recommendations for image analysis validation

Working paper

Bernhardt M, Ribeiro FDS, Glocker B, 2022, Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed

Failure detection in automated image classification is a critical safeguard for clinical deployment. Detected failure cases can be referred to human assessment, ensuring patient safety in computer-aided clinical decision making. Despite its paramount importance, there is insufficient evidence about the ability of state-of-the-art confidence scoring methods to detect test-time failures of classification models in the context of medical imaging. This paper provides a reality check, establishing the performance of in-domain misclassification detection methods, benchmarking 9 widely used confidence scores on 6 medical imaging datasets with different imaging modalities, in multiclass and binary classification settings. Our experiments show that the problem of failure detection is far from being solved. We found that none of the benchmarked advanced methods proposed in the computer vision and machine learning literature can consistently outperform a simple softmax baseline, demonstrating that improved out-of-distribution detection or model calibration do not necessarily translate to improved in-domain misclassification detection. Our developed testbed facilitates future work in this important area.

Working paper

Langley J, Monteiro M, Jones C, Pawlowski N, Glocker B et al., 2022, Structured Uncertainty in the Observation Space of Variational Autoencoders

Variational autoencoders (VAEs) are a popular class of deep generative models with many variants and a wide range of applications. Improvements upon the standard VAE mostly focus on the modelling of the posterior distribution over the latent space and the properties of the neural network decoder. In contrast, improving the model for the observational distribution is rarely considered and typically defaults to a pixel-wise independent categorical or normal distribution. In image synthesis, sampling from such distributions produces spatially-incoherent results with uncorrelated pixel noise, resulting in only the sample mean being somewhat useful as an output prediction. In this paper, we aim to stay true to VAE theory by improving the samples from the observational distribution. We propose SOS-VAE, an alternative model for the observation space, encoding spatial dependencies via a low-rank parameterisation. We demonstrate that this new observational distribution has the ability to capture relevant covariance between pixels, resulting in spatially-coherent samples. In contrast to pixel-wise independent distributions, our samples seem to contain semantically-meaningful variations from the mean allowing the prediction of multiple plausible outputs with a single forward pass.

Working paper
