ABSTRACT

“The high-fidelity resolution of complex flow physics often requires large-scale computational resources and is one of the drivers towards exascale computing. Energy usage and system-level reliability are major concerns for these systems. The anticipated power consumption and poor performance of traditional resilience mechanisms – such as disk-based checkpoint/restart to a parallel filesystem – are likely to make them prohibitive at exascale. Novel approaches are therefore needed which can exploit properties of the numerical algorithms to provide energy-efficient protection in software.

In this talk, Dr. Cantwell will present ongoing efforts to address this challenge in the context of time-dependent PDE solvers. We leverage User-Level Failure Mitigation, a proposed addition to the MPI 4.0 standard, and remote in-memory checkpointing to augment existing software tools with scalable fault-tolerance capabilities in a minimally intrusive way. This approach substantially improves performance over conventional techniques by avoiding the parallel filesystem completely, and allows one or more failed ranks to be restored on the fly from a pool of spare ranks, independently of the surviving processes. The talk will cover the motivation, algorithms and implementation, and an analysis of their performance characteristics, illustrating their application through the Nektar++ spectral/hp element framework.”

BIO

Dr. Chris Cantwell is a Senior Lecturer in the Department of Aeronautics at Imperial College London, United Kingdom. He received an MMath in Mathematics in 2005 and an MSc in Scientific Computing in 2006, both from the University of Warwick. He received a PhD in Scientific Computing in 2009, also from the University of Warwick, investigating the stability and transient growth of perturbations in fluid flow. He then moved to Imperial College London to develop high-order spectral/hp element methods, and subsequently held a discipline-hop award from the British Heart Foundation to develop an interdisciplinary research programme in the field of cardiac electrophysiology. He is a founding member of the ElectroCardioMaths programme, part of the Imperial College Centre for Cardiac Engineering.

Dr. Cantwell’s research centres on developing novel and scalable numerical approaches for efficiently modelling and understanding complex physical processes in the aerodynamics and biomedical domains. Much of his work to date has focused on the efficient implementation and application of spectral/hp element methods for high-fidelity numerical simulation, and on making these tools more accessible to users without a detailed understanding of the numerical methods. His research interests now also extend to the fusion of numerical modelling with statistical methods and machine learning. He is a strong proponent of open-source software and is a project leader of the Nektar++ spectral/hp element framework, which acts as a vehicle for much of his research.

Getting here