Publications from our Researchers

Several of our current PhD candidates and fellow researchers at the Data Science Institute have published, or are in the process of publishing, papers presenting their research.


Search results

  • Conference paper
    Gadotti A, Houssiau F, Rocher L, Livshits B, de Montjoye Y-A et al., 2019,

    When the signal is in the noise: exploiting Diffix's sticky noise

    , 28th USENIX Security Symposium (USENIX Security '19), Publisher: USENIX, Pages: 1081-1098

    Anonymized data is highly valuable to both businesses and researchers. A large body of research has however shown the strong limits of the de-identification release-and-forget model, where data is anonymized and shared. This has led to the development of privacy-preserving query-based systems. Based on the idea of “sticky noise”, Diffix has been recently proposed as a novel query-based mechanism satisfying alone the EU Article 29 Working Party’s definition of anonymization. According to its authors, Diffix adds less noise to answers than solutions based on differential privacy while allowing for an unlimited number of queries. This paper presents a new class of noise-exploitation attacks, exploiting the noise added by the system to infer private information about individuals in the dataset. Our first differential attack uses samples extracted from Diffix in a likelihood ratio test to discriminate between two probability distributions. We show that using this attack against a synthetic best-case dataset allows us to infer private information with 89.4% accuracy using only 5 attributes. Our second cloning attack uses dummy conditions that conditionally strongly affect the output of the query depending on the value of the private attribute. Using this attack on four real-world datasets, we show that we can infer private attributes of at least 93% of the users in the dataset with accuracy between 93.3% and 97.1%, issuing a median of 304 queries per user. We show how to optimize this attack, targeting 55.4% of the users and achieving 91.7% accuracy, using a maximum of only 32 queries per user. Our attacks demonstrate that adding data-dependent noise, as done by Diffix, is not sufficient to prevent inference of private attributes. We furthermore argue that Diffix alone fails to satisfy Art. 29 WP’s definition of anonymization. We conclude by discussing how non-provable privacy-preserving systems can be combined with fundamental security principles su

  • Journal article
    Rocher L, Hendrickx J, de Montjoye Y-A, 2019,

    Estimating the success of re-identifications in incomplete datasets using generative models

    , Nature Communications, Vol: 10, ISSN: 2041-1723

    While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.

  • Conference paper
    Bai W, Chen C, Tarroni G, Duan J, Guitton F, Petersen SE, Guo Y, Matthews PM, Rueckert D et al., 2019,

    Self-supervised learning for cardiac MR image segmentation by anatomical position prediction

    , International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)

    In the recent years, convolutional neural networks have transformed the field of medical image analysis due to their capacity to learn discriminative image features for a variety of classification and regression tasks. However, successfully learning these features requires a large amount of manually annotated data, which is expensive to acquire and limited by the available resources of expert image analysts. Therefore, unsupervised, weakly-supervised and self-supervised feature learning techniques receive a lot of attention, which aim to utilise the vast amount of available data, while at the same time avoid or substantially reduce the effort of manual annotation. In this paper, we propose a novel way for training a cardiac MR image segmentation network, in which features are learnt in a self-supervised manner by predicting anatomical positions. The anatomical positions serve as a supervisory signal and do not require extra manual annotation. We demonstrate that this seemingly simple task provides a strong signal for feature learning and with self-supervised learning, we achieve a high segmentation accuracy that is better than or comparable to a U-net trained from scratch, especially at a small data setting. When only five annotated subjects are available, the proposed method improves the mean Dice metric from 0.811 to 0.852 for short-axis image segmentation, compared to the baseline U-net.

  • Report
    Crémer J, de Montjoye Y-A, Schweitzer H, 2019,

    Competition policy for the digital era

    , Competition policy for the digital era, Brussels, Publisher: EU Publications

  • Conference paper
    Jain S, Bensaid E, de Montjoye Y-A, 2019,

    UNVEIL: capture and visualise WiFi data leakages

    , The Web Conference 2019, Publisher: ACM, Pages: 3550-3554

    In the past few years, numerous privacy vulnerabilities have been discovered in the WiFi standards and their implementations for mobile devices. These vulnerabilities allow an attacker to collect large amounts of data on the device user, which could be used to infer sensitive information such as religion, gender, and sexual orientation. Solutions for these vulnerabilities are often hard to design and typically require many years to be widely adopted, leaving many devices at risk. In this paper, we present UNVEIL - an interactive and extendable platform to demonstrate the consequences of these attacks. The platform performs passive and active attacks on smartphones to collect and analyze data leaked through WiFi and communicate the analysis results to users through simple and interactive visualizations. The platform currently performs two attacks. First, it captures probe requests sent by nearby devices and combines them with public WiFi location databases to generate a map of locations previously visited by the device users. Second, it creates rogue access points with SSIDs of popular public WiFis (e.g. _Heathrow WiFi, Railways WiFi) and records the resulting internet traffic. This data is then analyzed and presented in a format that highlights the privacy leakage. The platform has been designed to be easily extendable to include more attacks and to be easily deployable in public spaces. We hope that UNVEIL will help raise public awareness of privacy risks of WiFi networks.

  • Journal article
    Brinkman P, Wagener AH, Hekking P-P, Bansal AT, Maitland-van der Zee A-H, Wang Y, Weda H, Knobel HH, Vink TJ, Rattray NJ, D'Amico A, Pennazza G, Santonico M, Lefaudeux D, De Meulder B, Auffray C, Bakke PS, Caruso M, Chanez P, Chung KF, Corfield J, Dahlen S-E, Djukanovic R, Geiser T, Horvath I, Krug N, Musial J, Sun K, Riley JH, Shaw DE, Sandstrom T, Sousa AR, Montuschi P, Fowler SJ, Sterk PJ et al., 2019,

    Identification and prospective stability of electronic nose (eNose)-derived inflammatory phenotypes in patients with severe asthma

    , Journal of Allergy and Clinical Immunology, Vol: 143, Pages: 1811-+, ISSN: 0091-6749

  • Journal article
    Tarroni G, Oktay O, Bai W, Schuh A, Suzuki H, Passerat-Palmbach J, de Marvao A, O'Regan D, Cook S, Glocker B, Matthews P, Rueckert D et al., 2019,

    Learning-based quality control for cardiac MR images

    , IEEE Transactions on Medical Imaging, Vol: 38, Pages: 1127-1138, ISSN: 0278-0062

    The effectiveness of a cardiovascular magnetic resonance (CMR) scan depends on the ability of the operator to correctly tune the acquisition parameters to the subject being scanned and on the potential occurrence of imaging artefacts such as cardiac and respiratory motion. In the clinical practice, a quality control step is performed by visual assessment of the acquired images: however, this procedure is strongly operator-dependent, cumbersome and sometimes incompatible with the time constraints in clinical settings and large-scale studies. We propose a fast, fully-automated, learning-based quality control pipeline for CMR images, specifically for short-axis image stacks. Our pipeline performs three important quality checks: 1) heart coverage estimation, 2) inter-slice motion detection, 3) image contrast estimation in the cardiac region. The pipeline uses a hybrid decision forest method - integrating both regression and structured classification models - to extract landmarks as well as probabilistic segmentation maps from both long- and short-axis images as a basis to perform the quality checks. The technique was tested on up to 3000 cases from the UK Biobank as well as on 100 cases from the UK Digital Heart Project, and validated against manual annotations and visual inspections performed by expert interpreters. The results show the capability of the proposed pipeline to correctly detect incomplete or corrupted scans (e.g. on UK Biobank, sensitivity and specificity respectively 88% and 99% for heart coverage estimation, 85% and 95% for motion detection), allowing their exclusion from the analysed dataset or the triggering of a new acquisition.

  • Journal article
    Cox DJ, Bai W, Price AN, Edwards AD, Rueckert D, Groves AM et al., 2019,

    Ventricular remodeling in preterm infants: computational cardiac magnetic resonance atlasing shows significant early remodeling of the left ventricle

    , Pediatric Research, Vol: 85, Pages: 807-815, ISSN: 0031-3998

  • Journal article
    Gomez-Romero J, Fernandez-Basso CJ, Cambronero MV, Molina-Solana M, Campana JR, Ruiz MD, Martin-Bautista MJ et al., 2019,

    A probabilistic algorithm for predictive control with full-complexity models in non-residential buildings

    , IEEE Access, Vol: 7, Pages: 38748-38765, ISSN: 2169-3536

    Despite the increasing capabilities of information technologies for data acquisition and processing, building energy management systems still require manual configuration and supervision to achieve optimal performance. Model predictive control (MPC) aims to leverage equipment control – particularly heating, ventilation and air conditioning (HVAC) – by using a model of the building to capture its dynamic characteristics and to predict its response to alternative control scenarios. Usually, MPC approaches are based on simplified linear models, which support faster computation but also present some limitations regarding interpretability, solution diversification and longer-term optimization. In this work, we propose a novel MPC algorithm that uses a full-complexity grey-box simulation model to optimize HVAC operation in non-residential buildings. Our system generates hundreds of candidate operation plans, typically for the next day, and evaluates them in terms of consumption and comfort by means of a parallel simulator configured according to the expected building conditions (weather, occupancy, etc.). The system has been implemented and tested in an office building in Helsinki, both in a simulated environment and in the real building, yielding energy savings around 35% during the intermediate winter season and 20% in the whole winter season with respect to the current operation of the heating equipment.

  • Journal article
    Creswell A, Bharath AA, 2019,

    Denoising adversarial autoencoders

    , IEEE Transactions on Neural Networks and Learning Systems, Vol: 30, Pages: 968-984, ISSN: 2162-2388

    Unsupervised learning is of growing interest because it unlocks the potential held in vast amounts of unlabeled data to learn useful representations for inference. Autoencoders, a form of generative model, may be trained by learning to reconstruct unlabeled input data from a latent representation space. More robust representations may be produced by an autoencoder if it learns to recover clean input samples from corrupted ones. Representations may be further improved by introducing regularization during training to shape the distribution of the encoded data in the latent space. We suggest denoising adversarial autoencoders (AAEs), which combine denoising and regularization, shaping the distribution of latent space using adversarial training. We introduce a novel analysis that shows how denoising may be incorporated into the training and sampling of AAEs. Experiments are performed to assess the contributions that denoising makes to the learning of representations for classification and sample synthesis. Our results suggest that autoencoders trained using a denoising criterion achieve higher classification performance and can synthesize samples that are more consistent with the input data than those trained without a corruption process.

  • Journal article
    Rueda R, Cuéllar M, Molina-Solana M, Guo Y, Pegalajar M et al., 2019,

    Generalised regression hypothesis induction for energy consumption forecasting

    , Energies, Vol: 12, Pages: 1069-1069, ISSN: 1996-1073

    This work addresses the problem of energy consumption time series forecasting. In our approach, a set of time series containing energy consumption data is used to train a single, parameterised prediction model that can be used to predict future values for all the input time series. As a result, the proposed method is able to learn the common behaviour of all time series in the set (i.e., a fingerprint) and use this knowledge to perform the prediction task, and to explain this common behaviour as an algebraic formula. To that end, we use symbolic regression methods trained with both single- and multi-objective algorithms. Experimental results validate this approach to learn and model shared properties of different time series, which can then be used to obtain a generalised regression model encapsulating the global behaviour of different energy consumption time series.

  • Journal article
    Robinson R, Valindria VV, Bai W, Oktay O, Kainz B, Suzuki H, Sanghvi MM, Aung N, Paiva JÉM, Zemrak F, Fung K, Lukaschuk E, Lee AM, Carapella V, Kim YJ, Piechnik SK, Neubauer S, Petersen SE, Page C, Matthews PM, Rueckert D, Glocker B et al., 2019,

    Automated quality control in image segmentation: application to the UK Biobank cardiac MR imaging study

    , Journal of Cardiovascular Magnetic Resonance, Vol: 21, ISSN: 1097-6647

    Background: The trend towards large-scale studies including population imaging poses new challenges in terms of quality control (QC). This is a particular issue when automatic processing tools, e.g. image segmentation methods, are employed to derive quantitative measures or biomarkers for later analyses. Manual inspection and visual QC of each segmentation isn't feasible at large scale. However, it's important to be able to automatically detect when a segmentation method fails so as to avoid inclusion of wrong measurements into subsequent analyses which could lead to incorrect conclusions. Methods: To overcome this challenge, we explore an approach for predicting segmentation quality based on Reverse Classification Accuracy, which enables us to discriminate between successful and failed segmentations on a per-case basis. We validate this approach on a new, large-scale manually-annotated set of 4,800 cardiac magnetic resonance scans. We then apply our method to a large cohort of 7,250 cardiac MRI on which we have performed manual QC. Results: We report results used for predicting segmentation quality metrics including Dice Similarity Coefficient (DSC) and surface-distance measures. As initial validation, we present data for 400 scans demonstrating 99% accuracy for classifying low and high quality segmentations using predicted DSC scores. As further validation we show high correlation between real and predicted scores and 95% classification accuracy on 4,800 scans for which manual segmentations were available. We mimic real-world application of the method on 7,250 cardiac MRI where we show good agreement between predicted quality metrics and manual visual QC scores. Conclusions: We show that RCA has the potential for accurate and fully automatic segmentation QC on a per-case basis in the context of large-scale population imaging as in the UK Biobank Imaging Study.

  • Conference paper
    Chen C, Bai W, Rueckert D, 2019,

    Multi-task learning for left atrial segmentation on GE-MRI

    , International Workshop on Statistical Atlases and Computational Models of the Heart, Publisher: Springer Verlag, Pages: 292-301, ISSN: 0302-9743

    Segmentation of the left atrium (LA) is crucial for assessing its anatomy in both pre-operative atrial fibrillation (AF) ablation planning and post-operative follow-up studies. In this paper, we present a fully automated framework for left atrial segmentation in gadolinium-enhanced magnetic resonance images (GE-MRI) based on deep learning. We propose a fully convolutional neural network and explore the benefits of multi-task learning for performing both atrial segmentation and pre/post ablation classification. Our results show that, by sharing features between related tasks, the network can gain additional anatomical information and achieve more accurate atrial segmentation, leading to a mean Dice score of 0.901 on a test set of 20 3D MRI images. Code of our proposed algorithm is available at

  • Journal article
    Gilbert K, Bai W, Mauger C, Medrano-Gracia P, Suinesiaputra A, Lee AM, Sanghvi MM, Aung N, Piechnik SK, Neubauer S, Petersen SE, Rueckert D, Young AA et al., 2019,

    Independent left ventricular morphometric atlases show consistent relationships with cardiovascular risk factors: A UK Biobank study

    , Scientific Reports, Vol: 9, ISSN: 2045-2322

    Left ventricular (LV) mass and volume are important indicators of clinical and pre-clinical disease processes. However, much of the shape information present in modern imaging examinations is currently ignored. Morphometric atlases enable precise quantification of shape and function, but there has been no objective comparison of different atlases in the same cohort. We compared two independent LV atlases using MRI scans of 4547 UK Biobank participants: (i) a volume atlas derived by automatic non-rigid registration of image volumes to a common template, and (ii) a surface atlas derived from manually drawn epicardial and endocardial surface contours. The strength of associations between atlas principal components and cardiovascular risk factors (smoking, diabetes, high blood pressure, high cholesterol and angina) were quantified with logistic regression models and five-fold cross validation, using area under the ROC curve (AUC) and Akaike Information Criterion (AIC) metrics. Both atlases exhibited similar principal components, showed similar relationships with risk factors, and had stronger associations (higher AUC and lower AIC) than a reference model based on LV mass and volume, for all risk factors (DeLong p < 0.05). Morphometric variations associated with each risk factor could be quantified and visualized and were similar between atlases. UK Biobank LV shape atlases are robust to construction method and show stronger relationships with cardiovascular risk factors than mass and volume.

  • Journal article
    Jevnikar Z, Östling J, Ax E, Calvén J, Thörn K, Israelsson E, Öberg L, Singhania A, Lau LCK, Wilson SJ, Ward JA, Chauhan A, Sousa AR, De Meulder B, Loza MJ, Baribaud F, Sterk PJ, Chung KF, Sun K, Guo Y, Adcock IM, Payne D, Dahlen B, Chanez P, Shaw DE, Krug N, Hohlfeld JM, Sandström T, Djukanovic R, James A, Hinks TSC, Howarth PH, Vaarala O, van Geest M, Olsson HK, U-BIOPRED study group et al., 2019,

    Epithelial IL-6 trans-signaling defines a new asthma phenotype with increased airway inflammation

    , Journal of Allergy and Clinical Immunology, Vol: 143, Pages: 577-590, ISSN: 0091-6749

    BACKGROUND: Although several studies link high levels of IL-6 and soluble IL-6 receptor (sIL-6R) with asthma severity and decreased lung function, the role of IL-6 trans-signaling (IL-6TS) in asthma is unclear. OBJECTIVE: To explore the association between epithelial IL-6TS pathway activation and molecular and clinical phenotypes in asthma. METHODS: An IL-6TS gene signature, obtained from air-liquid interface (ALI) cultures of human bronchial epithelial cells stimulated with IL-6 and sIL-6R, was used to stratify lung epithelium transcriptomic data (U-BIOPRED cohorts) by hierarchical clustering. IL-6TS-specific protein markers were used to stratify sputum biomarker data (Wessex cohort). Molecular phenotyping was based on transcriptional profiling of epithelial brushings, pathway analysis and immunohistochemical analysis of bronchial biopsies. RESULTS: Activation of IL-6TS in ALI cultures reduced epithelial integrity and induced a specific gene signature enriched in genes associated with airway remodeling. The IL-6TS signature identified a subset of IL-6TS High asthma patients with increased epithelial expression of IL-6TS inducible genes in absence of systemic inflammation. The IL-6TS High subset had an overrepresentation of frequent exacerbators, blood eosinophilia, and submucosal infiltration of T cells and macrophages. In bronchial brushings, TLR pathway genes were up-regulated while the expression of tight junction genes was reduced. Sputum sIL-6R and IL-6 levels correlated with sputum markers of remodeling and innate immune activation, in particular YKL-40, MMP3, MIP-1β, IL-8 and IL-1β. CONCLUSIONS: Local lung epithelial IL-6TS activation in absence of type 2 airway inflammation defines a novel subset of asthmatics and may drive airway inflammation and epithelial dysfunction in these patients.

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
