Publications from our Researchers

Several of our current PhD candidates and fellow researchers at the Data Science Institute have published, or are in the process of publishing, papers presenting their research.

  • Journal article
    Martínez V, Fernando S, Molina-Solana M, Guo Y, et al., 2020,

    Tuoris: A middleware for visualizing dynamic graphics in scalable resolution display environments

    Future Generation Computer Systems, ISSN: 0167-739X

    In the era of big data, large-scale information visualization has become an important challenge. Scalable resolution display environments (SRDEs) have emerged as a technological solution for building high-resolution display systems by tiling lower resolution screens. These systems offer significant advantages, including lower construction cost and better maintainability compared to other alternatives. However, they require not only specialized software but also purpose-built content to suit the inherently complex underlying systems. This creates several challenges when designing visualizations for big data, such as ensuring they can be reused across several SRDEs of varying dimensions. This is not yet a common practice but is becoming increasingly popular among those who engage in collaborative visual analytics in data observatories. In this paper, we define three key requirements for systems suitable for such environments, point out limitations of existing frameworks, and introduce Tuoris, a novel open-source middleware for visualizing dynamic graphics in SRDEs. Tuoris manages the complexity of distributing and synchronizing the information among different components of the system, eliminating the need for purpose-built content. This makes it possible for users to seamlessly port existing graphical content developed using standard web technologies, and simplifies the process of developing advanced, dynamic and interactive web applications for large-scale information visualization. Tuoris is designed to work with Scalable Vector Graphics (SVG), reducing bandwidth consumption and achieving high frame rates in visualizations with dynamic animations. It scales independently of the display wall resolution, in contrast with other frameworks that transmit visual information as blocks of images.

  • Journal article
    Fernando S, Amador Díaz López J, Şerban O, Gómez-Romero J, Molina-Solana M, Guo Y, et al., 2019,

    Towards a large-scale twitter observatory for political events

    Future Generation Computer Systems, ISSN: 0167-739X

    The explosion in social media usage has made its analysis a relevant topic of interest, particularly in political science. Within data science, few techniques are as widely accepted and appealing as visualisation. However, with datasets growing in size, visualisation tools also require a paradigm shift to remain useful in big data contexts. This work presents our proposal for a Large-Scale Twitter Observatory that enables researchers to efficiently retrieve, analyse and visualise data from this social network to gain actionable insights and knowledge related to political events. In addition to describing the supporting technologies, we put forward a working pipeline and validate the setup with different examples.

  • Journal article
    Rajpal H, Rosas De Andraca FE, Jensen HJ, 2019,

    Tangled worldview model of opinion dynamics

    Frontiers in Physics, Vol: 7, ISSN: 2296-424X

    We study the joint evolution of worldviews by proposing a model of opinion dynamics, which is inspired by notions from evolutionary ecology. Agents update their opinion on a specific issue based on their propensity to change – asserted by the social neighbours – weighted by their mutual similarity on other issues. Agents are, therefore, more influenced by neighbours with similar worldviews (sets of opinions on various issues), resulting in a complex co-evolution of each opinion. Simulations show that the worldview evolution exhibits events of intermittent polarization when the social network is scale-free. This, in turn, triggers extreme crashes and surges in the popularity of various opinions. Using the proposed model, we highlight the role of network structure, the bounded rationality of agents, and the role of key influential agents in causing polarization and intermittent reformation of worldviews on scale-free networks.
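
    A minimal sketch of an update rule in this spirit is shown below. It is illustrative only, not the authors' exact model: the network generator, the similarity weighting and all parameters are assumptions made for the example.

```python
# Illustrative opinion-dynamics sketch: agents hold +/-1 opinions on K issues
# and are influenced more strongly by neighbours whose other opinions (their
# "worldview") resemble their own. Network and parameters are assumptions.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
N, K, STEPS = 200, 5, 20_000
G = nx.barabasi_albert_graph(N, m=2, seed=0)    # scale-free social network
opinions = rng.choice([-1, 1], size=(N, K))     # one row of opinions per agent

def similarity(a, b, issue):
    """Fraction of issues (other than `issue`) on which agents a and b agree."""
    mask = np.arange(K) != issue
    return np.mean(opinions[a, mask] == opinions[b, mask])

for _ in range(STEPS):
    a = rng.integers(N)                 # pick a random agent
    k = rng.integers(K)                 # and a random issue
    neigh = list(G.neighbors(a))
    if not neigh:
        continue
    # Social pressure on issue k: neighbours' opinions weighted by worldview similarity.
    weights = np.array([similarity(a, b, k) for b in neigh])
    pressure = np.dot(weights, opinions[neigh, k]) / (weights.sum() + 1e-9)
    # Adopt the sign of the weighted pressure with probability |pressure|.
    if rng.random() < abs(pressure):
        opinions[a, k] = int(np.sign(pressure))

print("mean opinion per issue:", opinions.mean(axis=0))
```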

  • Journal article
    Cofré R, Herzog R, Corcoran D, Rosas FE, et al., 2019,

    A comparison of the maximum entropy principle across biological spatial scales

    Entropy: international and interdisciplinary journal of entropy and information studies, Vol: 21, Pages: 1-20, ISSN: 1099-4300

    Despite their differences, biological systems at different spatial scales tend to exhibit common organizational patterns. Unfortunately, these commonalities are often hard to grasp due to the highly specialized nature of modern science and the parcelled terminology employed by various scientific sub-disciplines. To explore these common organizational features, this paper provides a comparative study of diverse applications of the maximum entropy principle, which has found many uses at different biological spatial scales ranging from amino acids up to societies. By presenting these studies under a common approach and language, this paper aims to establish a unified view over these seemingly highly heterogeneous scenarios.
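
    For reference, the maximum entropy principle that all of these applications share can be written as the following constrained optimisation (a textbook formulation, not a result specific to this paper):

```latex
% Maximise Shannon entropy subject to reproducing the observed averages of
% chosen features f_i; the solution is the exponential (Gibbs) family.
\max_{p}\; H(p) = -\sum_{x} p(x)\log p(x)
\quad \text{s.t.} \quad
\sum_{x} p(x)\, f_i(x) = \langle f_i \rangle_{\mathrm{data}}, \qquad \sum_{x} p(x) = 1,
% whose solution is
p^{*}(x) = \frac{1}{Z(\lambda)} \exp\!\Big( \sum_i \lambda_i f_i(x) \Big),
\qquad
Z(\lambda) = \sum_{x} \exp\!\Big( \sum_i \lambda_i f_i(x) \Big).
```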

  • Journal article
    Cofré R, Videla L, Rosas F, 2019,

    An introduction to the non-equilibrium steady states of maximum entropy spike trains

    Entropy, Vol: 21, Pages: 1-28, ISSN: 1099-4300

    Although most biological processes are characterized by a strong temporal asymmetry, several popular mathematical models neglect this issue. Maximum entropy methods provide a principled way of addressing time irreversibility, which leverages powerful results and ideas from the literature of non-equilibrium statistical mechanics. This tutorial provides a comprehensive overview of these issues, with a focus on spike train statistics. We provide a detailed account of the mathematical foundations and work out examples to illustrate the key concepts and results from non-equilibrium statistical mechanics.
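
    As a hedged aside to make the time-irreversibility idea concrete (a standard notion from non-equilibrium statistical mechanics, not a result specific to this tutorial), the irreversibility of a stationary process can be quantified by its entropy production rate, the Kullback-Leibler divergence rate between the forward and time-reversed path distributions:

```latex
% Entropy production rate of a stationary process X; it vanishes exactly when
% the process is time-reversible (detailed balance / equilibrium).
\sigma \;=\; \lim_{n\to\infty} \frac{1}{n}\,
D_{\mathrm{KL}}\!\Big( P[x_1,\dots,x_n] \,\Big\|\, P[x_n,\dots,x_1] \Big) \;\ge\; 0 .
```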

  • Conference paper
    Truong N, Sun K, Guo Y,

    Blockchain-based personal data management: from fiction to solution

    The 18th IEEE International Symposium on Network Computing and Applications (NCA 2019), Publisher: IEEE

    The emerging blockchain technology has enabled various decentralised applications in a trustless environment without relying on a trusted intermediary. It is regarded as a promising solution for tackling sophisticated challenges in personal data management, thanks to its advanced features such as immutability, decentralisation and transparency. Although certain approaches have been proposed to address technical difficulties in personal data management, most of them only provide preliminary methodological exploration. Alarmingly, when utilising blockchain for developing a personal data management system, fictions have occurred in existing approaches and been promulgated in the literature. Such fictions are theoretically doable; however, by thoroughly breaking down consensus protocols and transaction validation processes, we clarify that such existing approaches are either impractical or highly inefficient due to the natural limitations of the blockchain and smart contract technologies. This encourages us to propose a feasible solution in which such fictions are reduced by designing a novel system architecture with a blockchain-based “proof of permission” protocol. We demonstrate the feasibility and efficiency of the proposed models by implementing a clinical data sharing service built on top of a public blockchain platform. We believe that our research resolves existing ambiguity and takes a step further towards providing a practically feasible solution for decentralised personal data management.

  • Conference paper
    Gadotti A, Houssiau F, Rocher L, Livshits B, de Montjoye Y-A, et al.,

    When the Signal is in the Noise: Exploiting Diffix's Sticky Noise

    28th USENIX Security Symposium (USENIX Security '19), Publisher: USENIX

    Anonymized data is highly valuable to both businesses and researchers. A large body of research has however shown the strong limits of the de-identification release-and-forget model, where data is anonymized and shared. This has led to the development of privacy-preserving query-based systems. Based on the idea of “sticky noise”, Diffix has been recently proposed as a novel query-based mechanism satisfying, on its own, the EU Article 29 Working Party’s definition of anonymization. According to its authors, Diffix adds less noise to answers than solutions based on differential privacy while allowing for an unlimited number of queries. This paper presents a new class of noise-exploitation attacks, exploiting the noise added by the system to infer private information about individuals in the dataset. Our first differential attack uses samples extracted from Diffix in a likelihood ratio test to discriminate between two probability distributions. We show that using this attack against a synthetic best-case dataset allows us to infer private information with 89.4% accuracy using only 5 attributes. Our second cloning attack uses dummy conditions that conditionally strongly affect the output of the query depending on the value of the private attribute. Using this attack on four real-world datasets, we show that we can infer private attributes of at least 93% of the users in the dataset with accuracy between 93.3% and 97.1%, issuing a median of 304 queries per user. We show how to optimize this attack, targeting 55.4% of the users and achieving 91.7% accuracy, using a maximum of only 32 queries per user. Our attacks demonstrate that adding data-dependent noise, as done by Diffix, is not sufficient to prevent inference of private attributes. We furthermore argue that Diffix alone fails to satisfy Art. 29 WP’s definition of anonymization. We conclude by discussing how non-provable privacy-preserving systems can be combined with fundamental security principles such as defense-in-depth.
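
    The differential attack mentioned above rests on a classical likelihood ratio test. Its generic form is given below purely to make the mechanics concrete; the actual distributions used against Diffix's sticky noise are derived in the paper.

```latex
% Given noisy query answers y_1, ..., y_m, decide between two hypotheses about
% the private attribute by comparing the likelihood ratio to a threshold tau.
\Lambda(y_1,\dots,y_m) \;=\; \frac{\prod_{j=1}^{m} p(y_j \mid H_1)}{\prod_{j=1}^{m} p(y_j \mid H_0)}
\;\;\begin{cases} \ge \tau & \Rightarrow \text{ decide } H_1,\\[2pt] < \tau & \Rightarrow \text{ decide } H_0. \end{cases}
```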

  • Journal article
    Rocher L, Hendrickx J, de Montjoye Y-A, 2019,

    Estimating the success of re-identifications in incomplete datasets using generative models

    Nature Communications, Vol: 10, ISSN: 2041-1723

    While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. Here we propose a generative copula-based method that can accurately estimate the likelihood that a specific person will be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with a low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
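
    A simplified way to see what is being estimated (an illustrative back-of-the-envelope model, not the paper's copula method): if p_x denotes the modelled probability that a random individual exhibits the attribute combination x, then in a population of N individuals the probability that a record matching x is unique, and hence correctly re-identified from those attributes alone, is

```latex
% Population uniqueness of attribute combination x, assuming individuals are
% drawn independently; p_x and N are the inputs of this toy version.
\xi_x \;=\; \Pr[\text{unique} \mid x] \;=\; (1 - p_x)^{\,N-1}.
```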

  • Conference paper
    Fernando S, Birch D, Molina-Solana M, McIlwraith D, Guo Y, et al., 2019,

    Compositional Microservices for Immersive Social Visual Analytics

    Pages: 216-223, ISSN: 1093-9547

    © 2019 IEEE. As humans, we have developed to process highly complex visual data from our surroundings. This is why data visualization and interaction are among the quickest ways to facilitate investigation and communicate understanding. To perform visual analytics effectively at the big data scale, it is crucial that we develop an integrated processing and visualization ecosystem. However, to date, in Large High-Resolution Display (LHRD) environments the worlds of data processing and visualization remain largely disconnected. In this paper, we propose a common architectural approach to enable integrated data processing and distributed visualization via the composition of discrete microservices. Each of these microservices provides a very specific, clearly defined function, such as analyzing data, creating a visualization, sharding data or providing a synchronization source. By defining common transport, data and API formats, we enable the composition of these microservices from processing raw data through to analytics, visualization and rendering. This compositionality, inspired by successful data-driven visualization frameworks, provides a common platform for immersive social visual analytics.

  • Report
    Crémer J, de Montjoye Y-A, Schweitzer H, 2019,

    Competition policy for the digital era

    Brussels, Publisher: EU Publications
  • Conference paper
    Jain S, Bensaid E, de Montjoye Y-A, 2019,

    UNVEIL: capture and visualise WiFi data leakages

    The Web Conference 2019, Publisher: ACM, Pages: 3550-3554

    In the past few years, numerous privacy vulnerabilities have been discovered in the WiFi standards and their implementations for mobile devices. These vulnerabilities allow an attacker to collect large amounts of data on the device user, which could be used to infer sensitive information such as religion, gender, and sexual orientation. Solutions for these vulnerabilities are often hard to design and typically require many years to be widely adopted, leaving many devices at risk. In this paper, we present UNVEIL - an interactive and extendable platform to demonstrate the consequences of these attacks. The platform performs passive and active attacks on smartphones to collect and analyze data leaked through WiFi and communicate the analysis results to users through simple and interactive visualizations. The platform currently performs two attacks. First, it captures probe requests sent by nearby devices and combines them with public WiFi location databases to generate a map of locations previously visited by the device users. Second, it creates rogue access points with SSIDs of popular public WiFis (e.g. _Heathrow WiFi, Railways WiFi) and records the resulting internet traffic. This data is then analyzed and presented in a format that highlights the privacy leakage. The platform has been designed to be easily extendable to include more attacks and to be easily deployable in public spaces. We hope that UNVEIL will help raise public awareness of privacy risks of WiFi networks.
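
    The passive probe-request capture that the first attack builds on can be reproduced with off-the-shelf tools. The sketch below is not the authors' implementation: it assumes scapy, root privileges and a Linux wireless interface already in monitor mode (the name "wlan0mon" is a placeholder), and simply logs which SSIDs nearby devices are probing for.

```python
# Minimal passive probe-request sniffer (illustrative; not part of UNVEIL).
# Requires root and a wireless interface in monitor mode.
from scapy.all import sniff, Dot11, Dot11ProbeReq, Dot11Elt

def handle(pkt):
    if pkt.haslayer(Dot11ProbeReq):
        mac = pkt[Dot11].addr2                                      # sender MAC address
        ssid = pkt[Dot11Elt].info.decode("utf-8", errors="ignore")  # SSID element of the probe
        print(f"{mac} is probing for {ssid or '<broadcast>'}")

# "wlan0mon" is an assumed interface name; adjust for your setup.
sniff(iface="wlan0mon", prn=handle, store=False)
```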

  • Journal article
    Brinkman P, Wagener AH, Hekking P-P, Bansal AT, Maitland-van der Zee A-H, Wang Y, Weda H, Knobel HH, Vink TJ, Rattray NJ, D'Amico A, Pennazza G, Santonico M, Lefaudeux D, De Meulder B, Auffray C, Bakke PS, Caruso M, Chanez P, Chung KF, Corfield J, Dahlen S-E, Djukanovic R, Geiser T, Horvath I, Krug N, Musial J, Sun K, Riley JH, Shaw DE, Sandstrom T, Sousa AR, Montuschi P, Fowler SJ, Sterk PJ, et al., 2019,

    Identification and prospective stability of electronic nose (eNose)-derived inflammatory phenotypes in patients with severe asthma

    Journal of Allergy and Clinical Immunology, Vol: 143, Pages: 1811-+, ISSN: 0091-6749
  • Journal article
    Gomez-Romero J, Fernandez-Basso CJ, Cambronero MV, Molina-Solana M, Campana JR, Ruiz MD, Martin-Bautista MJ, et al., 2019,

    A probabilistic algorithm for predictive control with full-complexity models in non-residential buildings

    IEEE Access, Vol: 7, Pages: 38748-38765, ISSN: 2169-3536

    Despite the increasing capabilities of information technologies for data acquisition and processing, building energy management systems still require manual configuration and supervision to achieve optimal performance. Model predictive control (MPC) aims to leverage equipment control – particularly heating, ventilation and air conditioning (HVAC) – by using a model of the building to capture its dynamic characteristics and to predict its response to alternative control scenarios. Usually, MPC approaches are based on simplified linear models, which support faster computation but also present some limitations regarding interpretability, solution diversification and longer-term optimization. In this work, we propose a novel MPC algorithm that uses a full-complexity grey-box simulation model to optimize HVAC operation in non-residential buildings. Our system generates hundreds of candidate operation plans, typically for the next day, and evaluates them in terms of consumption and comfort by means of a parallel simulator configured according to the expected building conditions (weather, occupancy, etc.). The system has been implemented and tested in an office building in Helsinki, both in a simulated environment and in the real building, yielding energy savings around 35% during the intermediate winter season and 20% in the whole winter season with respect to the current operation of the heating equipment.
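
    A minimal sketch of the generate-and-evaluate loop described above is given below. The plan encoding, the toy simulator and the comfort metric are hypothetical placeholders standing in for the paper's full-complexity grey-box model.

```python
# Sketch of simulation-based predictive control: sample candidate HVAC plans
# for the next day, simulate each under forecast conditions, keep the best
# consumption/comfort trade-off. All models below are illustrative stand-ins.
import random

HOURS = 24
random.seed(0)

def random_plan():
    """One heating setpoint (deg C) per hour -- a hypothetical plan encoding."""
    return [random.choice([17.0, 19.0, 21.0, 23.0]) for _ in range(HOURS)]

def simulate(plan, forecast):
    """Placeholder for the building simulator: returns (energy, comfort_penalty)."""
    energy = sum(0.5 * max(sp - forecast["outdoor_temp"][h], 0.0) for h, sp in enumerate(plan))
    comfort = sum(abs(21.0 - sp) * forecast["occupancy"][h] for h, sp in enumerate(plan))
    return energy, comfort

def select_plan(forecast, n_candidates=500, comfort_weight=2.0):
    """Evaluate many candidate plans and return the one with the lowest weighted cost."""
    def cost(plan):
        energy, comfort = simulate(plan, forecast)
        return energy + comfort_weight * comfort
    return min((random_plan() for _ in range(n_candidates)), key=cost)

forecast = {"outdoor_temp": [5.0] * HOURS,
            "occupancy": [1.0 if 8 <= h < 18 else 0.0 for h in range(HOURS)]}
print("chosen hourly setpoints:", select_plan(forecast))
```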

  • Journal article
    Creswell A, Bharath AA, 2019,

    Denoising adversarial autoencoders

    IEEE Transactions on Neural Networks and Learning Systems, Vol: 30, Pages: 968-984, ISSN: 2162-2388

    Unsupervised learning is of growing interest because it unlocks the potential held in vast amounts of unlabelled data to learn useful representations for inference. Autoencoders, a form of generative model, may be trained by learning to reconstruct unlabelled input data from a latent representation space. More robust representations may be produced by an autoencoder if it learns to recover clean input samples from corrupted ones. Representations may be further improved by introducing regularisation during training to shape the distribution of the encoded data in the latent space. We suggest denoising adversarial autoencoders, which combine denoising and regularisation, shaping the distribution of latent space using adversarial training. We introduce a novel analysis that shows how denoising may be incorporated into the training and sampling of adversarial autoencoders. Experiments are performed to assess the contributions that denoising makes to the learning of representations for classification and sample synthesis. Our results suggest that autoencoders trained using a denoising criterion achieve higher classification performance, and can synthesise samples that are more consistent with the input data than those trained without a corruption process.
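
    Schematically, the training objective combines a denoising reconstruction term with an adversarial term that shapes the latent distribution. The notation below is ours and is only meant to summarise the idea described in the abstract:

```latex
% \tilde{x} is a corrupted copy of x; Enc and Dec are the encoder and decoder.
% The discriminator D is trained to separate prior samples z ~ p(z) from codes
% Enc(\tilde{x}); the encoder is trained both to reconstruct x and to fool D.
\mathcal{L}_{\mathrm{rec}} = \mathbb{E}_{x,\tilde{x}}\,\big\lVert x - \mathrm{Dec}\big(\mathrm{Enc}(\tilde{x})\big) \big\rVert^{2},
\qquad
\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{z\sim p(z)} \log D(z) \;+\; \mathbb{E}_{\tilde{x}} \log\big(1 - D(\mathrm{Enc}(\tilde{x}))\big).
```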

  • Journal article
    Rueda R, Cuéllar M, Molina-Solana M, Guo Y, Pegalajar M, et al., 2019,

    Generalised regression hypothesis induction for energy consumption forecasting

    Energies, Vol: 12, Pages: 1069-1069, ISSN: 1996-1073

    This work addresses the problem of energy consumption time series forecasting. In our approach, a set of time series containing energy consumption data is used to train a single, parameterised prediction model that can be used to predict future values for all the input time series. As a result, the proposed method is able to learn the common behaviour of all time series in the set (i.e., a fingerprint) and use this knowledge to perform the prediction task, and to explain this common behaviour as an algebraic formula. To that end, we use symbolic regression methods trained with both single- and multi-objective algorithms. Experimental results validate this approach to learn and model shared properties of different time series, which can then be used to obtain a generalised regression model encapsulating the global behaviour of different energy consumption time series.
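
    In schematic form (our notation, not the paper's), the method searches for a single symbolic expression g, shared by all M series in the training set, whose lagged values predict the next value of each series:

```latex
% One shared expression g is found by symbolic regression; single-objective
% runs minimise the squared error alone, multi-objective runs additionally
% penalise the structural complexity of g.
\hat{y}_i(t) = g\big(y_i(t-1), \dots, y_i(t-k)\big), \qquad
g^{*} = \arg\min_{g} \sum_{i=1}^{M} \sum_{t} \big( y_i(t) - \hat{y}_i(t) \big)^{2}.
```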

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
