Imperial College London

Dr Ben Glocker

Faculty of Engineering, Department of Computing

Professor in Machine Learning for Imaging
 
 
 

Contact

 

+44 (0)20 7594 8334 b.glocker Website CV

 
 

Location

 

377 Huxley Building, South Kensington Campus


Publications


261 results found

Dorent R, Kujawa A, Ivory M, Bakas S, Rieke N, Joutard S, Glocker B, Cardoso J, Modat M, Batmanghelich K, Belkov A, Calisto MB, Choi JW, Dawant BM, Dong H, Escalera S, Fan Y, Hansen L, Heinrich MP, Joshi S, Kashtanova V, Kim HG, Kondo S, Kruse CN, Lai-Yuen SK, Li H, Liu H, Ly B, Oguz I, Shin H, Shirokikh B, Su Z, Wang G, Wu J, Xu Y, Yao K, Zhang L, Ourselin S, Shapey J, Vercauteren T et al., 2023, CrossMoDA 2021 challenge: Benchmark of cross-modality domain adaptation techniques for vestibular schwannoma and cochlea segmentation, Publisher: ELSEVIER

Working paper

Chalkidou A, Shokraneh F, Kijauskaite G, Taylor-Phillips S, Halligan S, Wilkinson L, Glocker B, Garrett P, Denniston AK, Mackie A, Seedat F et al., 2022, Recommendations for the development and use of imaging test sets to investigate the test performance of artificial intelligence in health screening, Lancet Digit Health, Vol: 4, Pages: e899-e905

Rigorous evaluation of artificial intelligence (AI) systems for image classification is essential before deployment into health-care settings, such as screening programmes, so that adoption is effective and safe. A key step in the evaluation process is the external validation of diagnostic performance using a test set of images. We conducted a rapid literature review on methods to develop test sets, published from 2012 to 2020, in English. Using thematic analysis, we mapped themes and coded the principles using the Population, Intervention, and Comparator or Reference standard, Outcome, and Study design framework. A group of screening and AI experts assessed the evidence-based principles for completeness and provided further considerations. From the final 15 principles recommended here, five affect population, one intervention, two comparator, one reference standard, and one both reference standard and comparator. Finally, four are applicable to outcome and one to study design. Principles from the literature were useful to address biases from AI; however, they did not account for screening specific biases, which we now incorporate. The principles set out here should be used to support the development and use of test sets for studies that assess the accuracy of AI within screening programmes, to ensure they are fit for purpose and minimise bias.

Journal article

Kart T, Fischer M, Winzeck S, Glocker B, Bai W, Buelow R, Emmel C, Friedrich L, Kauczor H-U, Keil T, Kroencke T, Mayer P, Niendorf T, Peters A, Pischon T, Schaarschmidt BM, Schmidt B, Schulze MB, Umutle L, Voelzke H, Kuestner T, Bamberg F, Schoelkopf B, Rueckert D, Gatidis S et al., 2022, Automated imaging-based abdominal organ segmentation and quality control in 20,000 participants of the UK Biobank and German National Cohort Studies, SCIENTIFIC REPORTS, Vol: 12, ISSN: 2045-2322

Journal article

Satchwell L, Wedlake L, Greenlay E, Li X, Messiou C, Glocker B, Barwick T, Barfoot T, Doran S, Leach MO, Koh DM, Kaiser M, Winzeck S, Qaiser T, Aboagye E, Rockall A et al., 2022, Development of machine learning support for reading whole body diffusion-weighted MRI (WB-MRI) in myeloma for the detection and quantification of the extent of disease before and after treatment (MALIMAR): protocol for a cross-sectional diagnostic test accuracy study, BMJ OPEN, Vol: 12, ISSN: 2044-6055

Journal article

Li Z, Kamnitsas K, Islam M, Chen C, Glocker B et al., 2022, Estimating model performance under domain shifts with class-specific confidence scores, MICCAI 2022 25th International Conference, Publisher: Springer Nature Switzerland, Pages: 693-703, ISSN: 0302-9743

Machine learning models are typically deployed in a test setting that differs from the training setting, potentially leading to decreased model performance because of domain shift. If we could estimate the performance that a pre-trained model would achieve on data from a specific deployment setting, for example a certain clinic, we could judge whether the model could safely be deployed or if its performance degrades unacceptably on the specific data. Existing approaches estimate this based on the confidence of predictions made on unlabeled test data from the deployment’s domain. We find existing methods struggle with data that present class imbalance, because the methods used to calibrate confidence do not account for bias induced by class imbalance, consequently failing to estimate class-wise accuracy. Here, we introduce class-wise calibration within the framework of performance estimation for imbalanced datasets. Specifically, we derive class-specific modifications of state-of-the-art confidence-based model evaluation methods including temperature scaling (TS), difference of confidences (DoC), and average thresholded confidence (ATC). We also extend the methods to estimate Dice similarity coefficient (DSC) in image segmentation. We conduct experiments on four tasks and find the proposed modifications consistently improve the estimation accuracy for imbalanced datasets. Our methods improve accuracy estimation by 18% in classification under natural domain shifts, and double the estimation accuracy on segmentation tasks, when compared with prior methods (Code is available at https://github.com/ZerojumpLine/ModelEvaluationUnderClassImbalance).

Conference paper

Vasey B, Nagendran M, Campbell B, Clifton DA, Collins GS, Denaxas S, Denniston AK, Faes L, Geerts B, Ibrahim M, Liu X, Mateen BA, Mathur P, McCradden MD, Morgan L, Ordish J, Rogers C, Saria S, Ting DSW, Watkinson P, Weber W, Wheatstone P, McCulloch P et al., 2022, Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI (May, 10.1038/s41591-022-01772-9, 2022), NATURE MEDICINE, ISSN: 1078-8956

Journal article

Taylor-Phillips S, Seedat F, Kijauskaite G, Marshall J, Halligan S, Hyde C, Given-Wilson R, Wilkinson L, Denniston AK, Glocker B, Garrett P, Mackie A, Steele RJ et al., 2022, UK National Screening Committee's approach to reviewing evidence on artificial intelligence in breast cancer screening, The Lancet Digital Health, Vol: 4, Pages: e558-e565, ISSN: 2589-7500

Artificial intelligence (AI) could have the potential to accurately classify mammograms according to the presence or absence of radiological signs of breast cancer, replacing or supplementing human readers (radiologists). The UK National Screening Committee's assessments of the use of AI systems to examine screening mammograms continue to focus on maximising benefits and minimising harms to women screened, when deciding whether to recommend the implementation of AI into the Breast Screening Programme in the UK. Maintaining or improving programme specificity is important to minimise anxiety from false positive results. When considering cancer detection, AI test sensitivity alone is not sufficiently informative, and additional information on the spectrum of disease detected and interval cancers is crucial to better understand the benefits and harms of screening. Although large retrospective studies might provide useful evidence by directly comparing test accuracy and spectrum of disease detected between different AI systems and by population subgroup, most retrospective studies are biased due to differential verification (ie, the use of different reference standards to verify the target condition among study participants). Enriched, multiple-reader, multiple-case, test set laboratory studies are also biased due to the laboratory effect (ie, radiologists' performance in retrospective, laboratory, observer studies is substantially different to their performance in a clinical environment). Therefore, assessment of the effect of incorporating any AI system into the breast screening pathway in prospective studies is required as it will provide key evidence for the effect of the interaction of medical staff with AI, and the impact on women's outcomes.

Journal article

Bernhardt M, Jones C, Glocker B, 2022, Potential sources of dataset bias complicate investigation of underdiagnosis by machine learning algorithms, NATURE MEDICINE, Vol: 28, Pages: 1157-+, ISSN: 1078-8956

Journal article

Liu X, Glocker B, McCradden MM, Ghassemi M, Denniston AK, Oakden-Rayner L et al., 2022, The medical algorithmic audit (vol 4, pg e384, 2022), LANCET DIGITAL HEALTH, Vol: 4, Pages: E405-E405

Journal article

Grzech D, Azampour MF, Qiu H, Glocker B, Kainz B, Folgoc LL et al., 2022, Uncertainty quantification in non-rigid image registration via stochastic gradient Markov chain Monte Carlo, Publisher: ArXiv

We develop a new Bayesian model for non-rigid registration of three-dimensional medical images, with a focus on uncertainty quantification. Probabilistic registration of large images with calibrated uncertainty estimates is difficult for both computational and modelling reasons. To address the computational issues, we explore connections between the Markov chain Monte Carlo by backpropagation and the variational inference by backpropagation frameworks, in order to efficiently draw samples from the posterior distribution of transformation parameters. To address the modelling issues, we formulate a Bayesian model for image registration that overcomes the existing barriers when using a dense, high-dimensional, and diffeomorphic transformation parametrisation. This results in improved calibration of uncertainty estimates. We compare the model in terms of both image registration accuracy and uncertainty quantification to VoxelMorph, a state-of-the-art image registration model based on deep learning.

Working paper

Liu X, Glocker B, McCradden MM, Ghassemi M, Denniston AK, Oakden-Rayner L et al., 2022, The medical algorithmic audit, The Lancet Digital Health, Vol: 4, Pages: e384-e397, ISSN: 2589-7500

Artificial intelligence systems for health care, like any other medical device, have the potential to fail. However, specific qualities of artificial intelligence systems, such as the tendency to learn spurious correlates in training data, poor generalisability to new deployment settings, and a paucity of reliable explainability mechanisms, mean they can yield unpredictable errors that might be entirely missed without proactive investigation. We propose a medical algorithmic audit framework that guides the auditor through a process of considering potential algorithmic errors in the context of a clinical task, mapping the components that might contribute to the occurrence of errors, and anticipating their potential consequences. We suggest several approaches for testing algorithmic errors, including exploratory error analysis, subgroup testing, and adversarial testing, and provide examples from our own work and previous studies. The medical algorithmic audit is a tool that can be used to better understand the weaknesses of an artificial intelligence system and put in place mechanisms to mitigate their impact. We propose that safety monitoring and medical algorithmic auditing should be a joint responsibility between users and developers, and encourage the use of feedback mechanisms between these groups to promote learning and maintain safe deployment of artificial intelligence systems.

Journal article

Pati S, Baid U, Edwards B, Sheller M, Wang S-H, Reina GA, Foley P, Gruzdev A, Karkada D, Davatzikos C, others et al., 2022, Federated learning enables big data for rare cancer boundary detection, Publisher: arXiv

Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even not feasible) due to various limitations. Federated ML (FL) provides an alternative to train accurate and generalizable ML models, by only sharing numerical model updates. Here we present findings from the largest FL study to-date, involving data from 71 healthcare institutions across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, utilizing the largest dataset of such patients ever used in the literature (25,256 MRI scans from 6,314 patients). We demonstrate a 33% improvement over a publicly trained model to delineate the surgically targetable tumor, and 23% improvement over the tumor's entire extent. We anticipate our study to: 1) enable more studies in healthcare informed by large and diverse data, ensuring meaningful results for rare diseases and underrepresented populations, 2) facilitate further quantitative analyses for glioblastoma via performance optimization of our consensus model for eventual public release, and 3) demonstrate the effectiveness of FL at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing.

Working paper

Dou Q, So TY, Jiang M, Liu Q, Vardhanabhuti V, Kaissis G, Li Z, Si W, Lee HHC, Yu K, Feng Z, Dong L, Burian E, Jungmann F, Braren R, Makowski M, Kainz B, Rueckert D, Glocker B, Yu SCH, Heng PA et al., 2022, Author Correction: Federated deep learning for detecting COVID-19 lung abnormalities in CT: a privacy-preserving multinational validation study, npj Digital Medicine, Vol: 5, ISSN: 2398-6352

Correction to: npj Digital Medicine https://doi.org/10.1038/s41746-021-00431-6, published online 29 March 2021

Journal article

Newcombe VFJ, Ashton NJ, Posti JP, Glocker B, Manktelow A, Chatfield DA, Winzeck S, Needham E, Correia MM, Williams GB, Simren J, Takala RSK, Katila AJ, Maanpaa H-R, Tallus J, Frantzen J, Blennow K, Tenovuo O, Zetterberg H, Menon DK et al., 2022, Post-acute blood biomarkers and disease progression in traumatic brain injury, BRAIN, Vol: 145, Pages: 2064-2076, ISSN: 0006-8950

Journal article

Popescu SG, Sharp DJ, Cole JH, Kamnitsas K, Glocker B et al., 2022, Distributional Gaussian Processes Layers for Out-of-Distribution Detection, Journal of Machine Learning for Biomedical Imaging

Machine learning models deployed on medical imaging tasks must be equipped with out-of-distribution detection capabilities in order to avoid erroneous predictions. It is unsure whether out-of-distribution detection models reliant on deep neural networks are suitable for detecting domain shifts in medical imaging. Gaussian Processes can reliably separate in-distribution data points from out-of-distribution data points via their mathematical construction. Hence, we propose a parameter efficient Bayesian layer for hierarchical convolutional Gaussian Processes that incorporates Gaussian Processes operating in Wasserstein-2 space to reliably propagate uncertainty. This directly replaces convolving Gaussian Processes with a distance-preserving affine operator on distributions. Our experiments on brain tissue-segmentation show that the resulting architecture approaches the performance of well-established deterministic segmentation algorithms (U-Net), which has not been achieved with previous hierarchical Gaussian Processes. Moreover, by applying the same segmentation model to out-of-distribution data (i.e., images with pathology such as brain tumors), we show that our uncertainty estimates result in out-of-distribution detection that outperforms the capabilities of previous Bayesian networks and reconstruction-based approaches that learn normative distributions. To facilitate future work our code is publicly available.

Journal article

Bernhardt M, Castro DC, Tanno R, Schwaighofer A, Tezcan KC, Monteiro M, Bannur S, Lungren M, Nori A, Glocker B, Alvarez-Valle J, Oktay O et al., 2022, Active label cleaning for improved dataset quality under resource constraints, NATURE COMMUNICATIONS, Vol: 13

Journal article

Santhirasekaram A, Kori A, Winkler M, Rockall A, Ben G et al., 2022, Vector Quantisation for Robust Segmentation, Publisher: SPRINGER INTERNATIONAL PUBLISHING AG

Working paper

Grzech D, Azampour MF, Glocker B, Schnabel J, Navab N, Kainz B, Folgoc LL et al., 2022, A variational Bayesian method for similarity learning in non-rigid image registration, Pages: 119-128, ISSN: 1063-6919

We propose a novel variational Bayesian formulation for diffeomorphic non-rigid registration of medical images, which learns in an unsupervised way a data-specific similarity metric. The proposed framework is general and may be used together with many existing image registration models. We evaluate it on brain MRI scans from the UK Biobank and show that use of the learnt similarity metric, which is parametrised as a neural network, leads to more accurate results than use of traditional functions, e.g. SSD and LCC, to which we initialise the model, without a negative impact on image registration speed or transformation smoothness. In addition, the method estimates the uncertainty associated with the transformation. The code and the trained models are available in a public repository: https://github.com/dgrzech/learnsim.

Conference paper

Rosnati M, Soreq E, Monteiro M, Li L, Graham NSN, Zimmerman K, Rossi C, Carrara G, Bertolini G, Sharp DJ, Glocker B et al., 2022, Automatic Lesion Analysis for Increased Efficiency in Outcome Prediction of Traumatic Brain Injury, Pages: 135-146, ISSN: 0302-9743

The accurate prognosis for traumatic brain injury (TBI) patients is difficult yet essential to inform therapy, patient management, and long-term after-care. Patient characteristics such as age, motor and pupil responsiveness, hypoxia and hypotension, and radiological findings on computed tomography (CT), have been identified as important variables for TBI outcome prediction. CT is the acute imaging modality of choice in clinical practice because of its acquisition speed and widespread availability. However, this modality is mainly used for qualitative and semi-quantitative assessment, such as the Marshall scoring system, which is prone to subjectivity and human errors. This work explores the predictive power of imaging biomarkers extracted from routinely-acquired hospital admission CT scans using a state-of-the-art, deep learning TBI lesion segmentation method. We use lesion volumes and corresponding lesion statistics as inputs for an extended TBI outcome prediction model. We compare the predictive power of our proposed features to the Marshall score, independently and when paired with classic TBI biomarkers. We find that automatically extracted quantitative CT features perform similarly or better than the Marshall score in predicting unfavourable TBI outcomes. Leveraging automatic atlas alignment, we also identify frontal extra-axial lesions as important indicators of poor outcome. Our work may contribute to a better understanding of TBI, and provides new insights into how automated neuroimaging analysis can be used to improve prognostication after TBI.

Conference paper

Whitehouse DP, Monteiro M, Czeiter E, Vyvere TV, Valerio F, Ye Z, Amrein K, Kamnitsas K, Xu H, Yang Z, Verheyden J, Das T, Kornaropoulos EN, Steyerberg E, Maas AIR, Wang KKW, Büki A, Glocker B, Menon DK, Newcombe VFJ, CENTER-TBI Participants and Investigators et al., 2021, Relationship of admission blood proteomic biomarkers levels to lesion type and lesion burden in traumatic brain injury: A CENTER-TBI study, EBioMedicine, Vol: 75, Pages: 1-15, ISSN: 2352-3964

BACKGROUND: We aimed to understand the relationship between serum biomarker concentration and lesion type and volume found on computed tomography (CT) following all severities of TBI. METHODS: Concentrations of six serum biomarkers (GFAP, NFL, NSE, S100B, t-tau and UCH-L1) were measured in samples obtained <24 hours post-injury from 2869 patients with all severities of TBI, enrolled in the CENTER-TBI prospective cohort study (NCT02210221). Imaging phenotypes were defined as intraparenchymal haemorrhage (IPH), oedema, subdural haematoma (SDH), extradural haematoma (EDH), traumatic subarachnoid haemorrhage (tSAH), diffuse axonal injury (DAI), and intraventricular haemorrhage (IVH). Multivariable polynomial regression was performed to examine the association between biomarker levels and both distinct lesion types and lesion volumes. Hierarchical clustering was used to explore imaging phenotypes; and principal component analysis and k-means clustering of acute biomarker concentrations to explore patterns of biomarker clustering. FINDINGS: 2869 patients were included, 68% (n=1946) male with a median age of 49 years (range 2-96). All severities of TBI (mild, moderate and severe) were included for analysis with the majority (n=1946, 68%) having a mild injury (GCS 13-15). Patients with severe diffuse injury (Marshall III/IV) showed significantly higher levels of all measured biomarkers, with the exception of NFL, than patients with focal mass lesions (Marshall grades V/VI). Patients with either DAI+IVH or SDH+IPH+tSAH, had significantly higher biomarker concentrations than patients with EDH. Higher biomarker concentrations were associated with greater volume of IPH (GFAP, S100B, t-tau;adj r2 range:0·48-0·49; p<0·05), oedema (GFAP, NFL, NSE, t-tau, UCH-L1;adj r2 range:0·44-0·44; p<0·01), IVH (S100B;adj r2 range:0.48-0.49; p<0.05), Unsupervised k-means biomarker clustering revealed two clusters explaining 83·9% of varian

Journal article

Popescu SG, Glocker B, Sharp DJ, Cole JH et al., 2021, Local brain-age: A u-net model, Frontiers in Aging Neuroscience, Vol: 13, Pages: 1-17, ISSN: 1663-4365

We propose a new framework for estimating neuroimaging-derived “brain-age” at a local level within the brain, using deep learning. The local approach, contrary to existing global methods, provides spatial information on anatomical patterns of brain ageing. We trained a U-Net model using brain MRI scans from n = 3,463 healthy people (aged 18–90 years) to produce individualised 3D maps of brain-predicted age. When testing on n = 692 healthy people, we found a median (across participant) mean absolute error (within participant) of 9.5 years. Performance was more accurate (MAE around 7 years) in the prefrontal cortex and periventricular areas. We also introduce a new voxelwise method to reduce the age-bias when predicting local brain-age “gaps.” To validate local brain-age predictions, we tested the model in people with mild cognitive impairment or dementia using data from OASIS3 (n = 267). Different local brain-age patterns were evident between healthy controls and people with mild cognitive impairment or dementia, particularly in subcortical regions such as the accumbens, putamen, pallidum, hippocampus, and amygdala. Comparing groups based on mean local brain-age over regions-of-interest resulted in large effect sizes, with Cohen's d values >1.5, for example when comparing people with stable and progressive mild cognitive impairment. Our local brain-age framework has the potential to provide spatial information leading to a more mechanistic understanding of individual differences in patterns of brain ageing in health and disease.

Journal article

Folgoc LL, Baltatzis V, Alansary A, Desai S, Devaraj A, Ellis S, Manzanera OEM, Kanavati F, Nair A, Schnabel J, Glocker B et al., 2021, Bayesian analysis of the prevalence bias: learning and predicting from imbalanced data, Publisher: ArXiv

Datasets are rarely a realistic approximation of the target population. Say, prevalence is misrepresented, image quality is above clinical standards, etc. This mismatch is known as sampling bias. Sampling biases are a major hindrance for machine learning models. They cause significant gaps between model performance in the lab and in the real world. Our work is a solution to prevalence bias. Prevalence bias is the discrepancy between the prevalence of a pathology and its sampling rate in the training dataset, introduced upon collecting data or due to the practitioner rebalancing the training batches. This paper lays the theoretical and computational framework for training models, and for prediction, in the presence of prevalence bias. Concretely a bias-corrected loss function, as well as bias-corrected predictive rules, are derived under the principles of Bayesian risk minimization. The loss exhibits a direct connection to the information gain. It offers a principled alternative to heuristic training losses and complements test-time procedures based on selecting an operating point from summary curves. It integrates seamlessly in the current paradigm of (deep) learning using stochastic backpropagation and naturally with Bayesian models.

Working paper

Sounderajah V, Ashrafian H, Rose S, Shah NH, Ghassemi M, Golub R, Kahn CE, Esteva A, Karthikesalingam A, Mateen B, Webster D, Milea D, Ting D, Treanor D, Cushnan D, King D, McPherson D, Glocker B, Greaves F, Harling L, Ordish J, Cohen JF, Deeks J, Leeflang M, Diamond M, McInnes MDF, McCradden M, Abramoff MD, Normahani P, Markar SR, Chang S, Liu X, Mallett S, Shetty S, Denniston A, Collins GS, Moher D, Whiting P, Bossuyt PM, Darzi A et al., 2021, A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI, NATURE MEDICINE, Vol: 27, Pages: 1663-1665, ISSN: 1078-8956

Journal article

Baltatzis V, Bintsi K-M, Folgoc LL, Manzanera OEM, Ellis S, Nair A, Desai S, Glocker B, Schnabel JA et al., 2021, The pitfalls of sample selection: a case study on lung nodule classification, Predictive Intelligence in Medicine at MICCAI, Publisher: Springer, Pages: 201-211

Using publicly available data to determine the performance of methodological contributions is important as it facilitates reproducibility and allows scrutiny of the published results. In lung nodule classification, for example, many works report results on the publicly available LIDC dataset. In theory, this should allow a direct comparison of the performance of proposed methods and assess the impact of individual contributions. When analyzing seven recent works, however, we find that each employs a different data selection process, leading to largely varying total number of samples and ratios between benign and malignant cases. As each subset will have different characteristics with varying difficulty for classification, a direct comparison between the proposed methods is thus not always possible, nor fair. We study the particular effect of truthing when aggregating labels from multiple experts. We show that specific choices can have severe impact on the data distribution where it may be possible to achieve superior performance on one sample distribution but not on another. While we show that we can further improve on the state-of-the-art on one sample selection, we also find that on a more challenging sample selection, on the same database, the more advanced models underperform with respect to very simple baseline methods, highlighting that the selected data distribution may play an even more important role than the model architecture. This raises concerns about the validity of claimed methodological contributions. We believe the community should be aware of these pitfalls and make recommendations on how these can be avoided in future work.

Conference paper

Qaiser T, Winzeck S, Barfoot T, Barwick T, Doran SJ, Kaiser MF, Wedlake L, Tunariu N, Koh D-M, Messiou C, Rockall A, Glocker B et al., 2021, Multiple instance learning with auxiliary task weighting for multiple myeloma classification, International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), Publisher: Springer, Pages: 786-796, ISSN: 0302-9743

Whole body magnetic resonance imaging (WB-MRI) is the recommended modality for diagnosis of multiple myeloma (MM). WB-MRI is used to detect sites of disease across the entire skeletal system, but it requires significant expertise and is time-consuming to report due to the great number of images. To aid radiological reading, we propose an auxiliary task-based multiple instance learning approach (ATMIL) for MM classification with the ability to localize sites of disease. This approach is appealing as it only requires patient-level annotations where an attention mechanism is used to identify local regions with active disease. We borrow ideas from multi-task learning and define an auxiliary task with adaptive reweighting to support and improve learning efficiency in the presence of data scarcity. We validate our approach on both synthetic and real multi-center clinical data. We show that the MIL attention module provides a mechanism to localize bone regions while the adaptive reweighting of the auxiliary task considerably improves the performance.

Conference paper

Baltatzis V, Folgoc LL, Ellis S, Manzanera OEM, Bintsi K-M, Nair A, Desai S, Glocker B, Schnabel JA et al., 2021, The effect of the loss on generalization: empirical study on synthetic lung nodule data, Interpretability of Machine Intelligence in Medical Image Computing at MICCAI 2021, Publisher: Springer, Pages: 56-64

Convolutional Neural Networks (CNNs) are widely used for image classification in a variety of fields, including medical imaging. While most studies deploy cross-entropy as the loss function in such tasks, a growing number of approaches have turned to a family of contrastive learning-based losses. Even though performance metrics such as accuracy, sensitivity and specificity are regularly used for the evaluation of CNN classifiers, the features that these classifiers actually learn are rarely identified and their effect on the classification performance on out-of-distribution test samples is insufficiently explored. In this paper, motivated by the real-world task of lung nodule classification, we investigate the features that a CNN learns when trained and tested on different distributions of a synthetic dataset with controlled modes of variation. We show that different loss functions lead to different features being learned and consequently affect the generalization ability of the classifier on unseen data. This study provides some important insights into the design of deep learning solutions for medical imaging tasks.

Conference paper

Kamnitsas K, Winzeck S, Kornaropoulos EN, Whitehouse D, Englman C, Phyu P, Pao N, Menon DK, Rueckert D, Das T, Newcombe VFJ, Glocker B et al., 2021, Transductive image segmentation: Self-training and effect of uncertainty estimation, MICCAI Workshop on Domain Adaptation and Representation Transfer, Publisher: Springer, Pages: 79-89

Semi-supervised learning (SSL) uses unlabeled data during training to learn better models. Previous studies on SSL for medical image segmentation focused mostly on improving model generalization to unseen data. In some applications, however, our primary interest is not generalization but to obtain optimal predictions on a specific unlabeled database that is fully available during model development. Examples include population studies for extracting imaging phenotypes. This work investigates an often overlooked aspect of SSL, transduction. It focuses on the quality of predictions made on the unlabeled data of interest when they are included for optimization during training, rather than improving generalization. We focus on the self-training framework and explore its potential for transduction. We analyze it through the lens of Information Gain and reveal that learning benefits from the use of calibrated or under-confident models. Our extensive experiments on a large MRI database for multi-class segmentation of traumatic brain lesions show promising results when comparing transductive with inductive predictions. We believe this study will inspire further research on transductive learning, a well-suited paradigm for medical image analysis.

Conference paper

Budd S, Sinclair M, Day T, Vlontzos A, Tan J, Liu T, Matthew J, Skelton E, Simpson J, Razavi R, Glocker B, Rueckert D, Robinson EC, Kainz B et al., 2021, Detecting hypo-plastic left heart syndrome in fetal ultrasound via disease-specific atlas maps, 24th International Conference on Medical Image Computing and Computer Assisted Intervention, Publisher: Springer, Pages: 207-217, ISSN: 0302-9743

Fetal ultrasound screening during pregnancy plays a vital role in the early detection of fetal malformations which have potential long-term health impacts. The level of skill required to diagnose such malformations from live ultrasound during examination is high and resources for screening are often limited. We present an interpretable, atlas-learning segmentation method for automatic diagnosis of Hypo-plastic Left Heart Syndrome (HLHS) from a single ‘4 Chamber Heart’ view image. We propose to extend the recently introduced Image-and-Spatial Transformer Networks (Atlas-ISTN) into a framework that enables sensitising atlas generation to disease. In this framework we can jointly learn image segmentation, registration, atlas construction and disease prediction while providing a maximum level of clinical interpretability compared to direct image classification methods. As a result our segmentation allows diagnoses competitive with expert-derived manual diagnosis and yields an AUC-ROC of 0.978 (1043 cases for training, 260 for validation and 325 for testing).

Conference paper

Filbrandt G, Kamnitsas K, Bernstein D, Taylor A, Glocker B et al., 2021, Learning from Partially Overlapping Labels: Image Segmentation under Annotation Shift, MICCAI Workshop on Domain Adaptation and Representation Transfer

Scarcity of high quality annotated images remains a limiting factor for training accurate image segmentation models. While more and more annotated datasets become publicly available, the number of samples in each individual database is often small. Combining different databases to create larger amounts of training data is appealing yet challenging due to the heterogeneity as a result of differences in data acquisition and annotation processes, often yielding incompatible or even conflicting information. In this paper, we investigate and propose several strategies for learning from partially overlapping labels in the context of abdominal organ segmentation. We find that combining a semi-supervised approach with an adaptive cross entropy loss can successfully exploit heterogeneously annotated data and substantially improve segmentation accuracy compared to baseline and alternative approaches.

Conference paper

Glocker B, Musolesi M, Richens J, Uhler C et al., 2021, Causality in digital medicine, Nature Communications, Vol: 12, Pages: 1-6, ISSN: 2041-1723

Ben Glocker (an expert in machine learning for medical imaging, Imperial College London), Mirco Musolesi (a data science and digital health expert, University College London), Jonathan Richens (an expert in diagnostic machine learning models, Babylon Health) and Caroline Uhler (a computational biology expert, MIT) talked to Nature Communications about their research interests in causality inference and how this can provide a robust framework for digital medicine studies and their implementation, across different fields of application.

Journal article

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
