Imperial College London

Dr Chris Cantwell

Faculty of Engineering, Department of Aeronautics

Senior Lecturer in Aeronautics

Contact

+44 (0)20 7594 5050
c.cantwell

Location

Department of Aeronautics, Room 219, City and Guilds Building, South Kensington Campus

Publications

Citation

BibTeX format

@article{Benacchio:2021:10.1177/1094342021990433,
author = {Benacchio, T and Bonaventura, L and Altenbernd, M and Cantwell, CD and Duben, PD and Gillard, M and Giraud, L and Goeddeke, D and Raffin, E and Teranishi, K and Wedi, N},
doi = {10.1177/1094342021990433},
journal = {International Journal of High Performance Computing Applications},
pages = {285--311},
title = {Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction},
url = {http://dx.doi.org/10.1177/1094342021990433},
volume = {35},
year = {2021}
}

RIS format (EndNote, RefMan)

TY  - JOUR
AB  - Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
AU  - Benacchio,T
AU  - Bonaventura,L
AU  - Altenbernd,M
AU  - Cantwell,CD
AU  - Duben,PD
AU  - Gillard,M
AU  - Giraud,L
AU  - Goeddeke,D
AU  - Raffin,E
AU  - Teranishi,K
AU  - Wedi,N
DO  - 10.1177/1094342021990433
EP  - 311
PY  - 2021///
SN  - 1094-3420
SP  - 285
TI  - Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
T2  - International Journal of High Performance Computing Applications
UR  - http://dx.doi.org/10.1177/1094342021990433
UR  - http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000627544300001&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=1ba7043ffcc86c417c072aa74d649202
UR  - https://journals.sagepub.com/doi/10.1177/1094342021990433
UR  - http://hdl.handle.net/10044/1/87950
VL  - 35
ER  -