Publications from our Researchers

Several of our current PhD candidates and fellow researchers at the Data Science Institute have published, or are in the process of publishing, papers presenting their research.


  • Conference paper
Gadotti A, Houssiau F, Rocher L, Livshits B, de Montjoye Y-A, et al., 2019,

When the signal is in the noise: exploiting Diffix's sticky noise

    , 28th USENIX Security Symposium (USENIX Security '19), Publisher: USENIX

Anonymized data is highly valuable to both businesses and researchers. A large body of research has however shown the strong limits of the de-identification release-and-forget model, where data is anonymized and shared. This has led to the development of privacy-preserving query-based systems. Based on the idea of “sticky noise”, Diffix has been recently proposed as a novel query-based mechanism satisfying alone the EU Article 29 Working Party’s definition of anonymization. According to its authors, Diffix adds less noise to answers than solutions based on differential privacy while allowing for an unlimited number of queries. This paper presents a new class of noise-exploitation attacks, exploiting the noise added by the system to infer private information about individuals in the dataset. Our first differential attack uses samples extracted from Diffix in a likelihood ratio test to discriminate between two probability distributions. We show that using this attack against a synthetic best-case dataset allows us to infer private information with 89.4% accuracy using only 5 attributes. Our second cloning attack uses dummy conditions that conditionally strongly affect the output of the query depending on the value of the private attribute. Using this attack on four real-world datasets, we show that we can infer private attributes of at least 93% of the users in the dataset with accuracy between 93.3% and 97.1%, issuing a median of 304 queries per user. We show how to optimize this attack, targeting 55.4% of the users and achieving 91.7% accuracy, using a maximum of only 32 queries per user. Our attacks demonstrate that adding data-dependent noise, as done by Diffix, is not sufficient to prevent inference of private attributes. We furthermore argue that Diffix alone fails to satisfy Art. 29 WP’s definition of anonymization. We conclude by discussing how non-provable privacy-preserving systems can be combined with fundamental security principles such as defense-in-depth.
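The first attack in this abstract rests on a standard likelihood ratio test. As a rough sketch of that general principle only (not the paper's actual attack on Diffix), the toy example below decides which of two Gaussian noise distributions a set of noisy answers came from; all numbers are invented for illustration:

```python
import math

def log_likelihood(x, mu, sigma):
    """Log-density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def lr_test(samples, mu0, mu1, sigma):
    """Return the hypothesis (0 or 1) favoured by the log-likelihood ratio."""
    llr = sum(log_likelihood(x, mu1, sigma) - log_likelihood(x, mu0, sigma)
              for x in samples)
    return 1 if llr > 0 else 0

# Hypothetical noisy answers whose hidden mean is 1.0 (hypothesis 1).
samples = [0.9, 1.4, 0.7, 1.2, 1.1]
print(lr_test(samples, mu0=0.0, mu1=1.0, sigma=2.0))  # → 1
```

Even with noise whose standard deviation exceeds the gap between the hypotheses, a handful of samples is enough for the ratio to point to the right one, which is the intuition behind noise-exploitation attacks.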

  • Journal article
    Rocher L, Hendrickx J, de Montjoye Y-A, 2019,

    Estimating the success of re-identifications in incomplete datasets using generative models

    , Nature Communications, ISSN: 2041-1723
  • Report
    Crémer J, de Montjoye Y-A, Schweitzer H, 2019,

    Competition policy for the digital era

    , Competition policy for the digital era, Brussels, Publisher: EU Publications
  • Conference paper
    Jain S, Bensaid E, de Montjoye Y-A, 2019,

    UNVEIL: capture and visualise WiFi data leakages

    , The Web Conference 2019, Publisher: ACM, Pages: 3550-3554

In the past few years, numerous privacy vulnerabilities have been discovered in the WiFi standards and their implementations for mobile devices. These vulnerabilities allow an attacker to collect large amounts of data on the device user, which could be used to infer sensitive information such as religion, gender, and sexual orientation. Solutions for these vulnerabilities are often hard to design and typically require many years to be widely adopted, leaving many devices at risk. In this paper, we present UNVEIL, an interactive and extendable platform to demonstrate the consequences of these attacks. The platform performs passive and active attacks on smartphones to collect and analyze data leaked through WiFi and communicate the analysis results to users through simple and interactive visualizations. The platform currently performs two attacks. First, it captures probe requests sent by nearby devices and combines them with public WiFi location databases to generate a map of locations previously visited by the device users. Second, it creates rogue access points with the SSIDs of popular public WiFis (e.g. Heathrow WiFi, Railways WiFi) and records the resulting internet traffic. This data is then analyzed and presented in a format that highlights the privacy leakage. The platform has been designed to be easily extendable to include more attacks and to be easily deployable in public spaces. We hope that UNVEIL will help raise public awareness of the privacy risks of WiFi networks.

  • Journal article
Gomez-Romero J, Fernandez-Basso CJ, Cambronero MV, Molina-Solana M, Campana JR, Ruiz MD, Martin-Bautista MJ, et al., 2019,

    A probabilistic algorithm for predictive control with full-complexity models in non-residential buildings

    , IEEE Access, Vol: 7, Pages: 38748-38765, ISSN: 2169-3536

Despite the increasing capabilities of information technologies for data acquisition and processing, building energy management systems still require manual configuration and supervision to achieve optimal performance. Model predictive control (MPC) aims to leverage equipment control, particularly heating, ventilation and air conditioning (HVAC), by using a model of the building to capture its dynamic characteristics and to predict its response to alternative control scenarios. Usually, MPC approaches are based on simplified linear models, which support faster computation but also present some limitations regarding interpretability, solution diversification and longer-term optimization. In this work, we propose a novel MPC algorithm that uses a full-complexity grey-box simulation model to optimize HVAC operation in non-residential buildings. Our system generates hundreds of candidate operation plans, typically for the next day, and evaluates them in terms of consumption and comfort by means of a parallel simulator configured according to the expected building conditions (weather, occupancy, etc.). The system has been implemented and tested in an office building in Helsinki, both in a simulated environment and in the real building, yielding energy savings of around 35% during the intermediate winter season and 20% in the whole winter season with respect to the current operation of the heating equipment.
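The generate-and-evaluate loop described in this abstract can be illustrated with a toy sketch. The thermal model below is an invented stand-in for the paper's full-complexity grey-box simulator, and every constant (drift rate, heating gain, comfort setpoint) is hypothetical:

```python
import random

def simulate(plan, outside_temp=5.0, start_temp=20.0):
    """Toy thermal model (an assumption, standing in for the grey-box simulator):
    each hour the room drifts toward the outside temperature, and heating,
    when on, adds 2.0 degrees C and consumes one unit of energy."""
    temp, energy, discomfort = start_temp, 0.0, 0.0
    for heating_on in plan:
        temp += 0.1 * (outside_temp - temp) + (2.0 if heating_on else 0.0)
        energy += 1.0 if heating_on else 0.0
        discomfort += max(0.0, 21.0 - temp)  # penalty for falling below 21 degrees C
    return energy, discomfort

def best_plan(n_candidates=200, horizon=24, comfort_weight=2.0, seed=1):
    """Generate random candidate hourly on/off plans and keep the one with the
    lowest combined energy-plus-discomfort cost."""
    rng = random.Random(seed)
    candidates = [[rng.random() < 0.5 for _ in range(horizon)]
                  for _ in range(n_candidates)]
    def cost(plan):
        energy, discomfort = simulate(plan)
        return energy + comfort_weight * discomfort
    return min(candidates, key=cost)

plan = best_plan()
print(len(plan))  # → 24
```

The paper's system evaluates its candidates with a parallel building simulator rather than a three-line difference equation, but the structure, many candidate plans scored on consumption and comfort, then ranked, is the same.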

  • Journal article
Rueda R, Cuéllar M, Molina-Solana M, Guo Y, Pegalajar M, et al., 2019,

    Generalised regression hypothesis induction for energy consumption forecasting

    , Energies, Vol: 12, Pages: 1069-1069, ISSN: 1996-1073

    This work addresses the problem of energy consumption time series forecasting. In our approach, a set of time series containing energy consumption data is used to train a single, parameterised prediction model that can be used to predict future values for all the input time series. As a result, the proposed method is able to learn the common behaviour of all time series in the set (i.e., a fingerprint) and use this knowledge to perform the prediction task, and to explain this common behaviour as an algebraic formula. To that end, we use symbolic regression methods trained with both single- and multi-objective algorithms. Experimental results validate this approach to learn and model shared properties of different time series, which can then be used to obtain a generalised regression model encapsulating the global behaviour of different energy consumption time series.
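As a rough illustration of fitting one shared parameterised rule to several time series, the sketch below pools transition pairs from every series and fits a single linear update by ordinary least squares. This is only a stand-in for the paper's symbolic regression search, and the example series are invented:

```python
def fit_shared_ar1(series_set):
    """Pool (y_t, y_{t+1}) pairs from every series and fit one shared rule
    y_{t+1} = a * y_t + b by ordinary least squares."""
    xs, ys = [], []
    for s in series_set:
        xs.extend(s[:-1])   # predictors: each value...
        ys.extend(s[1:])    # ...paired with its successor
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Two hypothetical consumption series that share the same doubling behaviour.
series_set = [[1, 2, 4, 8], [3, 6, 12, 24]]
a, b = fit_shared_ar1(series_set)  # a ≈ 2.0, b ≈ 0.0
```

Because both series follow the same rule, a single formula recovered from the pooled data predicts either of them, which is the "fingerprint" idea in miniature; the paper searches over algebraic formulas rather than fixing a linear form.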

  • Journal article
de Montjoye Y-A, Gambs S, Blondel V, Canright G, de Cordes N, Deletaille S, Engø-Monsen K, Garcia-Herranz M, Kendall J, Kerry C, Krings G, Letouzé E, Luengo-Oroz M, Oliver N, Rocher L, Rutherford A, Smoreda Z, Steele J, Wetter E, Pentland AS, Bengtsson L, et al., 2018,

    On the privacy-conscientious use of mobile phone data

    , Scientific Data, Vol: 5, ISSN: 2052-4463

    The breadcrumbs we leave behind when using our mobile phones—who somebody calls, for how long, and from where—contain unprecedented insights about us and our societies. Researchers have compared the recent availability of large-scale behavioral datasets, such as the ones generated by mobile phones, to the invention of the microscope, giving rise to the new field of computational social science.

  • Journal article
Gomez-Romero J, Molina-Solana MJ, Oehmichen A, Guo Y, et al., 2018,

    Visualizing large knowledge graphs: a performance analysis

    , Future Generation Computer Systems, Vol: 89, Pages: 224-238, ISSN: 0167-739X

    Knowledge graphs are an increasingly important source of data and context information in Data Science. A first step in data analysis is data exploration, in which visualization plays a key role. Currently, Semantic Web technologies are prevalent for modelling and querying knowledge graphs; however, most visualization approaches in this area tend to be overly simplified and targeted to small-sized representations. In this work, we describe and evaluate the performance of a Big Data architecture applied to large-scale knowledge graph visualization. To do so, we have implemented a graph processing pipeline in the Apache Spark framework and carried out several experiments with real-world and synthetic graphs. We show that distributed implementations of the graph building, metric calculation and layout stages can efficiently manage very large graphs, even without applying partitioning or incremental processing strategies.
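The graph building and metric calculation stages of such a pipeline can be sketched in a few lines. The paper's implementation runs these stages distributed on Apache Spark; the single-machine toy version below only illustrates the idea, with hypothetical triples:

```python
def build_graph(triples):
    """Graph building: collect an undirected adjacency list from
    (subject, predicate, object) triples."""
    adj = {}
    for s, _p, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    return adj

def degree_metric(adj):
    """Metric calculation: node degree, a typical input to the layout
    stage (e.g. for sizing nodes)."""
    return {node: len(neighbours) for node, neighbours in adj.items()}

triples = [("alice", "knows", "bob"),
           ("bob", "knows", "carol"),
           ("alice", "knows", "carol")]
degrees = degree_metric(build_graph(triples))
print(degrees["alice"])  # → 2
```

The point of the paper is that each of these stages can be expressed as a distributed computation, so the same build/metric/layout structure scales to graphs far too large for a single machine.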

  • Journal article
    Molina-Solana M, Kennedy M, Amador Diaz Lopez J, 2018,

    foo.castr: visualising the future AI workforce

    , Big Data Analytics, Vol: 3, ISSN: 2058-6345

Companies and their HR departments are being profoundly affected by recent advances in computational power and Artificial Intelligence, and this trend is likely to accelerate dramatically in the next few years. This work presents foo.castr, a tool we are developing to visualise, communicate and facilitate understanding of the impact of these advances on the future workforce. It builds upon the idea that particular tasks within job descriptions will progressively be taken over by computers, reshaping human jobs. In its current version, foo.castr presents three different scenarios to help HR departments plan for the potential changes and disruptions brought by the adoption of Artificial Intelligence.

  • Journal article
Dolan D, Jensen H, Martinez Mediano P, Molina-Solana MJ, Rajpal H, Rosas De Andraca F, Sloboda JA, et al., 2018,

    The improvisational state of mind: a multidisciplinary study of an improvisatory approach to classical music repertoire performance

    , Frontiers in Psychology, Vol: 9, ISSN: 1664-1078

The recent re-introduction of improvisation as a professional practice within classical music, however cautious and still rare, allows direct and detailed contemporary comparison between improvised and “standard” approaches to performances of the same composition, comparisons which hitherto could only be inferred from impressionistic historical accounts. This study takes an interdisciplinary multi-method approach to discovering the contrasting nature and effects of prepared and improvised approaches during live chamber-music concert performances of a movement from Franz Schubert’s “Shepherd on the Rock”, given by a professional trio consisting of voice, flute, and piano, in the presence of an invited audience of 22 adults with varying levels of musical experience and training. The improvised performances were found to differ systematically from prepared performances in their timing, dynamic, and timbral features, as well as in the degree of risk-taking and “mind reading” between performers, including during moments of added extemporised notes. Post-performance critical reflection by the performers characterised distinct mental states underlying the two modes of performance. The overall amount of body movement was reduced in the improvised performances, which showed fewer unco-ordinated movements between performers compared with the prepared performances. Audience members, who were told only that the two performances would be different, but not how, rated the improvised version as more emotionally compelling and musically convincing than the prepared version. The size of this effect was not affected by whether or not the audience could see the performers, or by levels of musical training. EEG measurements from 19 scalp locations showed higher levels of Lempel-Ziv complexity (associated with awareness and alertness) in the improvised version in both performers and audience. Results are discussed in terms of their potential

  • Journal article
Gómez-Romero J, Molina-Solana M, Ros M, Ruiz MD, Martin-Bautista MJ, et al., 2018,

    Comfort as a service: a new paradigm for residential environmental quality control

    , Sustainability, Vol: 10, ISSN: 1937-0709

    This paper introduces the concept of Comfort as a Service (CaaS), a new energy supply paradigm for providing comfort to residential customers. CaaS takes into account the available passive and active elements, the external factors that affect energy consumption and associated costs, and occupants' behaviors to generate optimal control strategies for the domestic equipment automatically. As a consequence, it releases building occupants from operating the equipment, which gives rise to a disruption of the traditional model of paying per consumed energy in favor of a model of paying per provided comfort. In the paper, we envision a realization of CaaS based on several technologies such as ambient intelligence, big data, cloud computing and predictive computing. We discuss the opportunities and the barriers of CaaS-centered business and exemplify the potential of CaaS deployments by quantifying the expected energy savings achieved after limiting occupants' control over the air conditioning system in a test scenario.

  • Journal article
    Creswell A, Bharath AA, 2018,

    Denoising adversarial autoencoders

    , IEEE Transactions on Neural Networks and Learning Systems, ISSN: 2162-2388

Unsupervised learning is of growing interest because it unlocks the potential held in vast amounts of unlabelled data to learn useful representations for inference. Autoencoders, a form of generative model, may be trained by learning to reconstruct unlabelled input data from a latent representation space. More robust representations may be produced by an autoencoder if it learns to recover clean input samples from corrupted ones. Representations may be further improved by introducing regularisation during training to shape the distribution of the encoded data in the latent space. We suggest denoising adversarial autoencoders, which combine denoising and regularisation, shaping the distribution of latent space using adversarial training. We introduce a novel analysis that shows how denoising may be incorporated into the training and sampling of adversarial autoencoders. Experiments are performed to assess the contributions that denoising makes to the learning of representations for classification and sample synthesis. Our results suggest that autoencoders trained using a denoising criterion achieve higher classification performance, and can synthesise samples that are more consistent with the input data than those trained without a corruption process.

  • Journal article
Song J, Fan S, Lin W, Mottet L, Woodward H, Wykes MD, Arcucci R, Xiao D, Debay J-E, ApSimon H, Aristodemou E, Birch D, Carpentieri M, Fang F, Herzog M, Hunt GR, Jones RL, Pain C, Pavlidis D, Robins AG, Short CA, Linden PF, et al., 2018,

    Natural ventilation in cities: the implications of fluid mechanics

    , Building Research and Information, Vol: 46, Pages: 809-828, ISSN: 0961-3218
  • Journal article
Jahani E, Sundsøy P, Bjelland J, Bengtsson L, Pentland AS, de Montjoye Y-A, et al., 2017,

    Improving official statistics in emerging markets using machine learning and mobile phone data

    , EPJ Data Science, Vol: 6, ISSN: 2193-1127

Mobile phones are one of the fastest growing technologies in the developing world, with global penetration rates reaching 90%. Mobile phone data, also called CDR, are generated every time phones are used and are recorded by carriers at scale. CDR have generated groundbreaking insights in public health, official statistics, and logistics. However, the fact that most phones in developing countries are prepaid means that the data lacks key information about the user, including gender and other demographic variables. This precludes numerous uses of this data in social science and development economic research. It furthermore severely limits the development of humanitarian applications such as the use of mobile phone data to target aid towards the most vulnerable groups during crises. We developed a framework to extract more than 1400 features from standard mobile phone data and used them to predict useful individual characteristics and group estimates. We here present a systematic cross-country study of the applicability of machine learning for dataset augmentation at low cost. We validate our framework by showing how it can be used to reliably predict gender and other information for more than half a million people in two countries. We show how standard machine learning algorithms trained on only 10,000 users are sufficient to predict individuals' gender with an accuracy ranging from 74.3% to 88.4% in a developed country and from 74.5% to 79.7% in a developing country using only metadata. This is significantly higher than previous approaches and, once calibrated, gives highly accurate estimates of gender balance in groups. Performance suffers only marginally if we reduce the training size to 5,000, but decreases significantly with smaller training sets. We finally show that our indicators capture a large range of behavioural traits using factor analysis and that the framework can be used to predict other indicators of vulnerability such as age or socio-economic status.
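As a toy illustration of the underlying task, predicting a binary attribute from behavioural features, the sketch below trains a plain-Python logistic regression. This stands in for the paper's models, and the two features and the labels are entirely invented:

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    """Logistic regression trained by stochastic gradient descent:
    learn weights mapping feature vectors to a binary label."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability of label 1
            err = p - yi                         # gradient of the log-loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0

# Hypothetical per-user features: [calls per day, mean call duration (min)].
X = [[2.0, 1.0], [1.5, 0.8], [6.0, 4.0], [5.5, 3.5]]
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)
print([predict(w, b, xi) for xi in X])  # → [0, 0, 1, 1]
```

The paper's framework differs in scale and substance, more than 1400 features and standard (rather than hand-rolled) learners, but the shape of the problem, behavioural features in and a demographic label out, is the same.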

  • Journal article
Steele JE, Sundsoy PR, Pezzulo C, Alegana VA, Bird TJ, Blumenstock J, Bjelland J, Engo-Monsen K, de Montjoye YKJV, Iqbal AM, Hadiuzzaman KN, Lu X, Wetter E, Tatem AJ, Bengtsson L, et al., 2017,

    Mapping poverty using mobile phone and satellite data

    , Journal of the Royal Society Interface, Vol: 14, ISSN: 1742-5689

Poverty is one of the most important determinants of adverse health outcomes globally, a major cause of societal instability and one of the largest causes of lost human potential. Traditional approaches to measuring and targeting poverty rely heavily on census data, which in most low- and middle-income countries (LMICs) are unavailable or out-of-date. Alternate measures are needed to complement and update estimates between censuses. This study demonstrates how public and private data sources that are commonly available for LMICs can be used to provide novel insight into the spatial distribution of poverty. We evaluate the relative value of modelling three traditional poverty measures using aggregate data from mobile operators and widely available geospatial data. Taken together, models combining these data sources provide the best predictive power (highest r² = 0.78) and lowest error, but generally models employing mobile data only yield comparable results, offering the potential to measure poverty more frequently and at finer granularity. Stratifying models into urban and rural areas highlights the advantage of using mobile data in urban areas and different data in different contexts. The findings indicate the possibility to estimate and continually monitor poverty rates at high spatial resolution in countries with limited capacity to support traditional methods of data collection.
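The r² figure quoted above is the coefficient of determination, the share of variance in the observed poverty measure that the model explains. It can be computed directly; the regional poverty rates and predictions below are invented for illustration:

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 minus the ratio of residual to
    total sum of squares."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_res / ss_tot

# Hypothetical regional poverty rates and model predictions.
observed  = [0.30, 0.45, 0.10, 0.25]
predicted = [0.28, 0.44, 0.15, 0.23]
print(round(r_squared(observed, predicted), 3))  # → 0.946
```

A value of 1 would mean perfect prediction and 0 means no better than predicting the mean, so the study's 0.78 for the combined-data models indicates substantial, though imperfect, explanatory power.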

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
