Computational Methods Workshop for Massive/Complex Data 2014

Complex Data W/S

We are living in an information age - massive and complex data streams are now the norm in every sector of society. Adapting existing statistical methods to massive/complex data poses enormous computational challenges.

The aim of the workshop was to bring together experts on state-of-the-art data analysis (data mining/machine learning/(bio)statistics) of large-scale datasets, computer scientists that are already focusing on powerful strategies that exploit the nature of the analysis algorithms, numerical analysts that are creating efficient techniques for detailed results that necessitate handled large quantities of data, and database researchers working on accessing huge data sets distributed across multiple large-scale computer nodes.

The workshop covered a broad spectrum of different ares of research ranging from computational architectures to obtain computational speedups in the analysis of complex models for massive datasets to the application on these new computational environments in data mining/machine learning/(bio)statistics.

The workshop took place on the 19-20 June 2014 in the Department of Mathematics at Imperial College London

Further Details


Workshop Location
  • The workshop was held in the Clore Lecture Theatre, Level 2, Huxley Building
Workshop Accommodation
  • Imperial College London Reservation Team offers accommodation at a discounted rate.
Workshop Dinner

The workshop dinner was held in the Senior Common Room, on the evening of 19 June.

Getting to Imperial College South Kensington Campus

Information about travelling to the Imperial College South Kensington Campus

Invited Speakers

Workshop Programme

Day 1 (19 June 2014):

09:00-10:00 Registration
10:00-10:45      Mark Girolami 
"Never Mind the Big Data, here's the Big Models"
10:45-11:15 Coffee Break
11:15-12:00 James Hensman
"Scalable Gaussian Process methods"
12:00-14:00 Lunch Break
14:00-14:45 Marc Suchard
"Massive parallelization for large-scale observational healthcare problems"
14:45-15:30 Lizhen Lin 
"Robust and scalable inference using median posteriors"
15:30-16:00 Coffee Break
16:00-16.45 Hadley Wickham 
"Data tidying and munging"
from 6pm onwards Dinner

Day 2 (20 June 2014):

09:00-09:10                   Yike Guo
09:15-10:00    Hugo Larochelle
"The Neural Autoregressive Distribution Estimator"
10:00-10:45  Iain Murray
"Flexible and deep models for density estimation"
10:45-11:15 Coffee Break
11:15-12:00 Alexander Bertram
"Autoparallellization of R with Renjin: A case study in automatic distribution of calculations across public clusters"
12:00-14:00 Lunch Break
14:00-14:45 Pierre Jacob
"On non-negative unbiased estimators"
14:45-15:30 Christos Bouganis
"MCMC on Parallel Hardware"
15:30-16.00 Coffee Break
16.00-16.45 Joaquín Míguez
"Parallel particle filtering schemes for complex dynamical networks"


"Never Mind the Big Data, here's the Big Models" by Mark Girolami

In addition to the many opportunities and challenges the era of Big Data presents there is another revolution taking place in the quantification of uncertainty with the advent of Big Models. Big Models refers to mathematical models of natural, physical, economic and financial systems at a massive - global - scale. Whilst Big Models provide the potential to learn about and predict complex phenomena such as e.g. weather and global financial market dynamics they bring an attendant range of research issues that goes way beyond the computational. In this talk I will present some recent advances in the quantification of uncertainty in Big Models from my own research group and present emerging methodologies for probabilistically solving and reasoning about such Big Models.

"Scalable Gaussian Process methods" by James Hensman

The Gaussian Process is  an expressive yet tractable distribution over functions, and as such finds wide applicability in machine learning and statistics. Historically, a significant drawback has been that the scalability of the method -- dominated by factorization of the covariance matrix -- is O(N^3).

In the machine learning community, some work has focussed on making Gaussian process scale, and some approximations scale as O(nm^2), where m is a tunable parameter that trades off computational resources and statistical accuracy. Still, large data applications remain challenging.

In parametric models, such as neural networks, scalability is usually achieved by stochastic optimization of a finite parameter vector. In this talk, we show how this can be achieved for the nonparametric Gaussian process. 

"Massive parallelization for large-scale observational healthcare problems" by Marc Suchard

Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products.  National large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns.  In this talk, I discuss how high-performance statistical computation, including graphics processing units, can enable complex inference methods in these massive datasets.  I focus on algorithm restructuring through techniques like block relaxation (Gibbs, cyclic coordinate descent, MM) to exploit increased data/parameter conditional independence within traditional serial structures. I find orders-of-magnitude improvement in overall run-time fitting models involving tens of millions of observations.  These computational approaches also remain effective in seemingly non-parallelizable matrix factorization, and I conclude with one example that has been critical in building predictive models that scale for Big Data.

"Robust and scalable inference using median posteriors" by Lizhen Lin

 Fully Bayesian inference based on Markov chain Monte Carlo (MCMC)  faces well-known problems in scaling to big data.  A natural solution is to divide the data into non-overlapping subsets, with MCMC implemented in parallel on each subset.  However, it is not clear how to combine the samples from these subsets in approximating the posterior distribution conditional on all the available data.  We propose a robust and scalable method using a new alternative to exact Bayesian inference we refer to as the median posterior. We divide the data into subsects, obtain a stochastic approximation to the full data posterior for each subset, and run MCMC to generate samples from this approximation.  The median posterior is defined as the geometric median of the subset-specific approximations.  We show several strong theoretical results for the median posterior, including concentration rates and provable robustness.  We illustrate and validate the method through experiments on simulated and real data.

[Joint work with Stas Minker, Sanvesh Srivastava and David Dunson]

"Data tidying and munging" by Hadley Wickham

Data tidying and munging are unglamorous, but critical parts of the data analysis process. Data tidying gets you from your original source data to something that you can actually do statistics on, and data munging takes care of the many minor transformations that aren't quite statistics. Statistics as a discipline has spent little time wrestling with these challenges, even though they are necessary (but not sufficient) for good data analysis. In this talk, I'll discuss my work in these areas, including a framework for defining tidy data and a unifying set of tools to make messy data tidy; and some new data munging tools that work identically on in-memory (e.g. R data frames) and on-disk data (e.g. SQL databases).

"The Neural Autoregressive Distribution Estimator" by Hugo Larochelle

The problem of estimating the distribution of multivariate data is one of the most general problems addressed by machine learning research. Good models of multivariate distributions can be used to extract meaningful features from unlabeled data, to provide a good prior over some output space in a structured prediction task (e.g. image denoising) or to provide regularization cues to a supervised model (e.g. a neural network classifier).

In this talk, I’ll describe the Neural Autoregressive Distribution Estimator (NADE), at state-of-the-art distribution estimator based on a neural network architecture. The main idea in NADE is to use the probability product rule to decompose the joint distribution p(x) into the sequential product of its conditionals, and model all these conditionals using a single neural network. Thus, the task of distribution modeling is now tackled as the task of predicting the dimensions within x sequentially, in some given (usually random) order.

I’ll describe how to model binary and categorical observations with NADE. We’ll see that NADE achieve's excellent performance on many datasets, including datasets of binarized images and document bags-of-words. I’ll also discuss how NADE can be used successfully for document information retrieval, image classification and scene annotation.

Joint work with Iain Murray, Stanislas Lauly and Yin Zheng.

"Flexible and deep models for density estimation" by Iain Murray

We bring the empirical success of deep neural network models in regression and classification to high-dimensional density estimation.

Using the product rule, density estimation in D-dimensions can be reduced to D regression tasks. We tie these tasks together to improve computational and statistical efficiency, obtaining state-of-the-art fits across a wide range of benchmark tasks. I'll give an example Bayesian data analysis application from cosmology.

Work with Benigno Uria and Hugo Larochelle.

"Autoparallellization of R with Renjin: A case study in automatic distribution of calculations across public clusters" by Alexander Bertram

To keep pace with the exponential growth in available data, researchers must increasingly turn towards architectures that deliver performance gains through concurrency rather than faster single-core performance. 

Though these architectures, from GPUs to distributed commodity clusters, hold great potential, they often require researchers to rewrite existing sequential algorithms into collections of independent tasks that can be run concurrently.

This type of explicit concurrent programming is considerably more complex than writing sequential algorithms, and risks impeding researcher’s access to the computational power needed for today’s growing datasets.

Renjin, a new implementation of the R Language for Statistics, seeks to address this challenge by interpreting a sequence of computations in R as a type of query, which can be analyzed and then dispatched to the most suitable available architecture, in the same way that a modern database executes a SQL statement.

This presentation will examine how Renjin is able to auto-parallelize existing R code and dispatch for execution to Google AppEngine, a platform-as-a-service offered by Google.  We will look at the magnitude of performance gains, the current limitations of the approach, and avenues of future development.

"On non-negative unbiased estimators" by Pierre Jacob

I will introduce / study the existence of algorithms generating almost surely non-negative unbiased estimators of some quantities of interest. I will show that given a non-constant real-valued function f and a sequence of unbiased estimators of some lambda in R, there is no algorithm yielding almost surely non-negative unbiased estimators of f(lambda) in R+. The study is motivated by pseudo-marginal Monte Carlo algorithms that rely on such non-negativ e unbiased estimators. These methods allow "exact inference" in intractable models, in the sense that integrals with respect to a target distribution can be estimated without any systematic error, even though the associated probability density function cannot be evaluated pointwise. I will the consequences on the applicability of those sampling algorithms in various statistical settings, such as inference for diffusions, inference in the regime of large datasets, doubly intractable distributions and posterior distributions arising from reference priors.

Joint work with Alexandre H. Thiery (National University of Singapore).

"MCMC on Parallel Hardware" by Christos Bouganis

Markov Chain Monte Carlo (MCMC) is a widely used method to so lve Bayesian inference problems. Due to the increasing complexity of probabilistic models, as well as the explosion in the amount of data that these models need to process, MCMC samplers can become unacceptably slow when running on desktop PCs. The talk will focus on our recent work on accelerating population-based MCMC samplers using specialised hardware accelerators such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), with emphasis on the challenges and opportunities that exist in the utilisation of the above accelerators

"Parallel particle filtering schemes for complex dynamical networks" by Joaquín Míguez

Considerable effort has been devoted in the past decade to the design of schemes for the parallel, or distributed, implementation of particle filter s. The approaches vary from the totally heuristic to the mathematically well-principled. However, the former are largely based on (often loose) approximations that prevent the claim of any rigorous guarantees of convergence, whereas the latter involve considerable overhead to ensure the proper interaction of particles, which impairs the efficiency of the intended parallelisation. In this paper we investigate the use of possibly the simplest scheme for the parallelisation of the standard particle filter, that consists in splitting the computational budget into M fully independent particle filters with N particles each, and then obtaining the desired estimators by averaging over the M independent outcomes of the filters. This approach minimises the overhead and displays highly desirable theoretical properties. In particular, we prove that the L2 norms of the estimation errors vanish asymptotically with the same rate (1 / √MN) as a standard (centralised) particle filter with the same total number of particles and propose a time-error index to compare the performance of different parallelisation schemes. We provide an illustrative numerical example for the tracking of a dynamical high-dimensional network of FitzHugh-Nagumo stochastic oscillators using a graphical processing unit (GPU) for the implementation of the proposed algorithm.