Publications

Ekanayake SD, Reguly IZ, Luporini F, Mudalige GRet al., 2023, Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2, Pages: 380-391

In this paper, we investigate data movement-reducing and communication-avoiding optimizations and their practicable implementation for large-scale unstructured-mesh applications. Utilizing the high-level abstraction of the OP2 DSL for the unstructured-mesh class of codes, we reason about techniques for reduced communications across a consecutive sequence of loops - a loop-chain. The careful trade-off with increased redundant computation in place of data movement is analyzed for distributed-memory parallelization. A new communication-avoiding (CA) back-end for OP2 is designed, codifying these techniques such that they can be applied automatically to any OP2 application. The back-end is extended to operate on a cluster of GPUs, integrating GPU-to-GPU communication with CUDA, in combination with MPI. The new CA back-end is applied automatically to two non-trivial applications, including the OP2 version of Rolls-Royce's production CFD application, Hydra. Performance is investigated on both CPU and GPU clusters on representative problems of 8M and 24M node mesh sizes. Results demonstrate how for select configurations the new CA back-end provides between 30 - 65% runtime reductions for the loop-chains in these applications for the mesh sizes on both an HPE Cray EX system and an NVIDIA V100 GPU cluster. We model and examine the determinants and characteristics of a given unstructured-mesh loop-chain that can lead to performance benefits with CA techniques, providing insights into the general feasibility and profitability of using the optimizations for this class of applications.

Abstract
Cite

Conference paper

Luporini F, Louboutin M, Lange M, Kukreja N, Witte P, Huckelheim J, Yount C, Kelly PHJ, Herrmann FJ, Gorman GJet al., 2020, Architecture and Performance of Devito, a System for Automated Stencil Computation, Publisher: ASSOC COMPUTING MACHINERY

Working paper

Mitchell L, Ham D, Gibson TH, Rathgeber F, Homolya M, McRae ATT, Zhou F, Cotter CJ, Lange M, ksagiyam, Wechsung F, Andrej J, Luporini F, Kirby R, Sun T, APaganini, Poulin FJ, Roy T, Hawkes C, Kärnä T, Salwa T, Kalogirou A, Barral N, Greaves T, Kramer S, Gregory A, Funke SW, Büsing H, tommbendall, McBain Get al., 2019, firedrakeproject/firedrake: an automated finite element system

firedrakeproject/firedrake: an automated finite element system

Abstract
Cite

Software

Rathgeber F, Mitchel L, Luporini F, Ham D, Markall G, Homolya M, Sun T, Barton SN, Dearman H, Gibson TH, gbts, Wechsung F, Kramer S, Funke SW, Shapero D, Kirby R, Lange M, Sato K, Russell F, Zhou Fet al., 2019, OP2/PyOP2: Framework for performance-portable parallel computations on unstructured meshes

OP2/PyOP2: Framework for performance-portable parallel computations on unstructured meshes

Abstract
Cite

Software

Hückelheim J, Kukreja N, Narayanan SHK, Luporini F, Gorman G, Hovland Pet al., 2019, Automatic differentiation for adjoint stencil loops, Publisher: arXiv

Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable.In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.

Working paper

Luporini F, Lange M, Jacobs CT, Gorman GJ, Ramanujam J, Kelly PHJet al., 2019, Automated tiling of unstructured mesh computations with application to seismological modeling, ACM Transactions on Mathematical Software, Vol: 45, ISSN: 0098-3500

Publication rights licensed to ACM. Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]-hence the name “sparse.” One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation-not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.

Journal article

Witte PA, Louboutin M, Kukreja N, Luporini F, Lange M, Gorman GJ, Herrmann FJet al., 2019, A large-scale framework for symbolic implementations of seismic inversion algorithms in Julia, Geophysics, Vol: 84, Pages: F57-F71, ISSN: 0016-8033

Writing software packages for seismic inversion is a very challenging task because problems such as full-waveform inversion or least-squares imaging are algorithmically and computationally demanding due to the large number of unknown parameters and the fact that waves are propagated over many wavelengths. Therefore, software frameworks need to combine versatility and performance to provide geophysicists with the means and flexibility to implement complex algorithms that scale to exceedingly large 3D problems. Following these principles, we have developed the Julia Devito Inversion framework, an open-source software package in Julia for large-scale seismic modeling and inversion based on Devito, a domain-specific language compiler for automatic code generation. The framework consists of matrix-free linear operators for implementing seismic inversion algorithms that closely resemble the mathematical notation, a flexible resilient parallelization, and an interface to Devito for generating optimized stencil code to solve the underlying wave equations. In comparison with many manually optimized industry codes written in low-level languages, our software is built on the idea of independent layers of abstractions and user interfaces with symbolic operators. Through a series of numerical examples, we determined that this allows users to implement a series of increasingly complex algorithms for waveform inversion and imaging as simple Julia scripts that scale to large-scale 3D problems. This illustrates that software based on the paradigms of abstract user interfaces and automatic code generation and makes it possible to manage the complexity of the algorithms and performance optimizations, thus providing a high-performance research and production framework.

Journal article

Louboutin M, Lange M, Luporini F, Kukreja N, Witte PA, Herrmann FJ, Velesko P, Gorman GJet al., 2019, Devito (v3.1.0): An embedded domain-specific language for finite differences and geophysical exploration, Geoscientific Model Development, Vol: 12, Pages: 1165-1187, ISSN: 1991-959X

© Author(s) 2019. We introduce Devito, a new domain-specific language for implementing high-performance finite-difference partial differential equation solvers. The motivating application is exploration seismology for which methods such as full-waveform inversion and reverse-time migration are used to invert terabytes of seismic data to create images of the Earth's subsurface. Even using modern supercomputers, it can take weeks to process a single seismic survey and create a useful subsurface image. The computational cost is dominated by the numerical solution of wave equations and their corresponding adjoints. Therefore, a great deal of effort is invested in aggressively optimizing the performance of these wave-equation propagators for different computer architectures. Additionally, the actual set of partial differential equations being solved and their numerical discretization is under constant innovation as increasingly realistic representations of the physics are developed, further ratcheting up the cost of practical solvers. By embedding a domain-specific language within Python and making heavy use of SymPy, a symbolic mathematics library, we make it possible to develop finite-difference simulators quickly using a syntax that strongly resembles the mathematics. The Devito compiler reads this code and applies a wide range of analysis to generate highly optimized and parallel code. This approach can reduce the development time of a verified and optimized solver from months to days.

Journal article

Rathgeber F, Mitchell L, Luporini F, Ham D, Markall G, Homolya M, tj-sun, Barton N, Dearman H, Gibson T, gbts, Wechsung F, Kramer S, Funke SW, Shapero D, Kirby R, Lange M, Sato K, Russell F, Fangyi Zet al., 2019, OP2/PyOP2: Framework for performance-portable parallel computations on unstructured meshes

This release is specifically created to document the version of PyOP2 used in a particular set of experiments using Firedrake. Please do not cite this as a general source for Firedrake or any of its dependencies. Instead, refer to https://www.firedrakeproject.org/citing.html

Abstract
Cite

Software

Kukreja N, Luporini F, Lange M, Louboutin M, Pandolfo V, Cavalcante L, Gorman G, Kazakas P, Greaves T, Bisbas G, Astic T, Rasal R, McCormick D, Wolff Cet al., 2018, opesci/devito: Devito-3.4

Release notesPreliminary support for MPI (no changes to user code requested)Support for staggered gridsImproved compilation technologyImproved Operator autotuningMore powerful DSL (e.g., take derivatives of entire expressions such as (u+v).dx)More efficient picklingMisc bug fixesNew modeling examples based on the elastic wave equationNew examples describing aspects of the compilation technology

Abstract
Cite

Software

Mitchell L, Ham D, Gibson T, Rathgeber F, Homolya M, McRae ATT, Fangyi Z, Cotter CJ, Lange M, Andrej J, Luporini F, Kirby R, Wechsung F, Paganini A, Poulin FJ, Roy T, Hawkes C, Salwa T, Kalogirou A, Karna T, Greaves T, tj-sun, Gregory A, Kramer S, Funke SW, Busing H, McBain G, Mueller EHet al., 2018, firedrakeproject/firedrake: an automated finite element system

This release is specifically created to document the version of firedrake used in a particular set of experiments using Firedrake. Please do not cite this as a general source for Firedrake or any of its dependencies. Instead, refer to https://www.firedrakeproject.org/citing.html

Abstract
Cite

Software

Ham D, Luporini F, Mitchell L, Homolya M, Gibson TH, Barton N, Rathgeber F, Lange M, Markall G, tj-sun, Blechta J, Russell F, Engwer Cet al., 2018, coneoproject/COFFEE: A Compiler for Fast Expression Evaluation

This release is specifically created to document the version of COFFEE used in a particular set of experiments using Firedrake. Please do not cite this as a general source for Firedrake or any of its dependencies. Instead, refer to https://www.firedrakeproject.org/citing.html

Abstract
Cite

Software

Homolya M, Mitchell L, Luporini F, Ham DAet al., 2018, TSFC: a structure-preserving form compiler, SIAM Journal on Scientific Computing, ISSN: 1064-8275

A form compiler takes a high-level description of the weak form of partialdifferential equations and produces low-level code that carries out the finiteelement assembly. In this paper we present the Two-Stage Form Compiler (TSFC),a new form compiler with the main motivation to maintain the structure of theinput expression as long as possible. This facilitates the application ofoptimizations at the highest possible level of abstraction. TSFC features anovel, structure-preserving method for separating the contributions of a formto the subblocks of the local tensor in discontinuous Galerkin problems. Thisenables us to preserve the tensor structure of expressions longer through thecompilation process than other form compilers. This is also achieved in part bya two-stage approach that cleanly separates the lowering of finite elementconstructs to tensor algebra in the first stage, from the scheduling of thosetensor operations in the second stage. TSFC also efficiently traversescomplicated expressions, and experimental evaluation demonstrates goodcompile-time performance even for highly complex forms.

Journal article

Witte P, Louboutin M, Lensink K, Lange M, Kukreja N, Luporini F, Gorman G, Herrmann FJet al., 2018, Full-waveform inversion, Part 3: Optimization, Leading Edge, Vol: 37, Pages: 142-145, ISSN: 1070-485X

This tutorial is the third part of a full-waveform inversion (FWI) tutorial series with a step-by-step walkthrough of setting up forward and adjoint wave equations and building a basic FWI inversion framework. For discretizing and solving wave equations, we use Devito (http://www.opesci.org/devito-public), a Python-based domain-specific language for automated generation of finite-difference code (Lange et al., 2016). The first two parts of this tutorial (Louboutin et al., 2017, 2018) demonstrated how to solve the acoustic wave equation for modeling seismic shot records and how to compute the gradient of the FWI objective function using the adjoint-state method. With these two key ingredients, we will now build an inversion framework that can be used to minimize the FWI least-squares objective function.

Journal article

Louboutin M, Witte P, Lange M, Kukreja N, Luporini F, Gorman G, Herrmann FJet al., 2018, Full-waveform inversion, Part 2: Adjoint modeling, Leading Edge, Vol: 37, Pages: 69-72, ISSN: 1070-485X

This is the second part of a three-part tutorial series on full-waveform inversion (FWI) in which we provide a step-by-step walk through of setting up forward and adjoint wave equation solvers and an optimization framework for inversion. In Part 1 (Louboutin et al., 2017), we showed how to use Devito (http://www.opesci.org/devito-public) to set up and solve acoustic wave equations with (impulsive) seismic sources and sample wavefields at the receiver locations to forward model shot records. Here in Part 2, we will discuss how to set up and solve adjoint wave equations with Devito and, from that, how we can calculate gradients and function values of the FWI objective function.

Journal article

Louboutin M, Witte P, Lange M, Kukreja N, Luporini F, Gorman G, Herrmann FJet al., 2017, Full-waveform inversion, Part 1: Forward modeling, Leading Edge, Vol: 36, Pages: 1033-1036, ISSN: 1070-485X

Since its reintroduction by Pratt (1999), full-waveform inversion (FWI) has gained a lot of attention in geophysical exploration because of its ability to build high-resolution velocity models more or less automatically in areas of complex geology. While there is an extensive and growing literature on the topic, publications focus mostly on technical aspects, making this topic inaccessible for a broader audience due to the lack of simple introductory resources for newcomers to computational geophysics. We will accomplish this by providing a hands-on walkthrough of FWI using Devito (Lange et al., 2016), a system based on domain-specific languages that automatically generates code for time-domain finite differences.

Journal article

Hückelheim JC, Luo Z, Luporini F, Kukreja N, Lange M, Gorman G, Siegel S, Dwyer M, Hovland Pet al., 2017, Towards self-verification in finite difference code generation, SC17, Publisher: ACM

Code generation from domain-specific languages is becoming increasingly popular as a method to obtain optimised low-level code that performs well on a given platform and for a given problem instance. Ensuring the correctness of generated codes is crucial. At the same time, testing or manual inspection of the code is problematic, as the generated code can be complex and hard to read. Moreover, the generated code may change depending on the problem type, domain size, or target platform, making conventional code review or testing methods impractical. As a solution, we propose the integration of formal verification tools into the code generation process. We present a case study in which the CIVL verification tool is combined with the Devito finite difference framework that generates optimised stencil code for PDE solvers from symbolic equations. We show a selection of properties of the generated code that can be automatically specified and verified during the code generation process. Our approach allowed us to detect a previously unknown bug in the Devito code generation tool.

Conference paper

Luporini F, Ham DA, Kelly PHJ, 2017, An algorithm for the optimization of finite element integration loops, ACM Transactions on Mathematical Software, Vol: 44, ISSN: 0098-3500

We present an algorithm for the optimization of a class of finite element integration loop nests. This algo-rithm, which exploits fundamental mathematical properties of finite element operators, is proven to achievea locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive nu-merical experiments demonstrate significant performance improvements over the state of the art in finiteelement code generation in almost all cases. This validates the effectiveness of the algorithm presented here,and illustrates its limitations.

Journal article

Kukreja N, Louboutin M, Vieira F, Luporini F, Lange M, Gorman Get al., 2017, Devito: Automated fast finite difference computation, Pages: 11-19

Domain specific languages have successfully been used in a variety of fields to cleanly express scientific problems as well as to simplify implementation and performance optimization on different computer architectures. Although a large number of stencil languages are available, finite difference domain specific languages have proved challenging to design because most practical use cases require additional features that fall outside the finite difference abstraction. Inspired by the complexity of real-world seismic imaging problems, we introduce Devito, a domain specific language in which high level equations are expressed using symbolic expressions from the SymPy package. Complex equations are automatically manipulated, optimized, and translated into highly optimized C code that aims to perform comparably or better than hand-tuned code. All this is transparent to users, who only see concise symbolic mathematical expressions.

Abstract
Open Access Link
Cite
Citations: 17

Conference paper

Kukreja N, Louboutin M, Lange M, Luporini F, Gorman Get al., 2017, Rapid development of seismic imaging applications using symbolic math, Pages: 9-12

In this talk, I will discuss our approach to the formulation and the performance optimization of finite difference methods for PDEs arising in FWI. Our framework consists of a stack of domain specific languages and optimizing compilers. The mathematical specification of a finite difference method is translated by a compiler, Devito, into C code, applying a sophisticated sequence of transformations. These include standard loop transformations, such as blocking and vectorization, as well as symbolic manipulations to reduce the unusually high arithmetic intensity of the stencils arising in forward and adjoint operators. These include common subexpressions elimination, factorization, code motion and approximation of transient functions. I will show the impact of these transformations on standard Intel Xeon architectures as well as on Intel Knights Landing. Compelling evidence points in the direction that our stencil kernels are significantly bound by the L1 cache. I will conclude discussing future challenges and goals of our work.

Abstract
Cite
Citations: 2

Conference paper

Mitchell L, Ham DA, McRae ATT, Rathgeber F, Lange M, Luporini F, Bercea G-T, Markall G, Kelly PHJet al., 2017, Firedrake: automating the finite element method by composing abstractions, ACM Transactions on Mathematical Software, Vol: 43, Pages: 1-27, ISSN: 1557-7295

Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrakeadopts the domain-specific language for the finite element method of the FEniCS project, but with a purePython runtime-only implementation centred on the composition of several existing and new abstractions forparticular aspects of scientific computing. The result is a more complete separation of concerns which easesthe incorporation of separate contributions from computer scientists, numerical analysts and applicationspecialists. These contributions may add functionality, or improve performance.Firedrake benefits from automatically applying new optimisations. This includes factorising mixed functionspaces, transforming and vectorising inner loops, and intrinsically supporting block matrix operations.Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users toimplement common operations that fall outside pure variational formulations, such as flux-limiters.

Journal article

Bercea G, McRae ATT, Ham DA, Mitchell L, Rathgeber F, Nardi L, Luporini F, Kelly PHJet al., 2016, A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake, Geoscientific Model Development, Vol: 9, Pages: 3803-3815, ISSN: 1991-9603

We present a generic algorithm for numbering and then efﬁciently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of 3D high aspect ratio domains employed by geophysical ﬁnite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake ﬁnite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10-20 layers as long as the underlying mesh is well-ordered. We characterise the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretisations. On meshes with realistic numbers of layers the performance achieved is between 70% and 90% of a theoretical hardware-speciﬁc limit.

Journal article

Gorman GJ, Luporini F, Kukreja N, 2016, Devito

Devito is a prototype Domain-specific Language (DSL) and code generation framework for the design of highly optimised finite difference kernels for use in inversion methods. Devito utilises SymPy to allow the definition of operators from high-level symbolic equations and generates optimised and automatically tuned code specific to a given target architecture.

Abstract
Cite

Software

Lange M, Kukreja N, Louboutin M, Luporini F, Vieira F, Pandolfo V, Velesko P, Kazakas P, Gorman Get al., 2016, Devito: Towards a generic Finite Difference DSL using Symbolic Python, 6th Workshop on Python for High-Performance and Scientific Computing (PyHPC), Publisher: IEEE, Pages: 67-75

Conference paper

Luporini F, Varbanescu AL, Rathgeber F, Bercea D, Rananujam J, Ham DA, Kelly PHJet al., 2015, Cross-loop optimization of arithmetic intensity for finite element local assembly, ACM Transactions on Architecture and Code Optimization, Vol: 11, ISSN: 1544-3973

We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Their performance optimization is indeed a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of successful transformations. Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Eventually, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvement is achieved. The generality of the approach and the applicability of the proposed code transformations to other domains is also discussed.

Journal article

Strout MM, Luporini F, Krieger CD, Bertolli C, Bercea GT, Olschanowsky C, Ramanujam J, Kelly PHJet al., 2014, Generalizing Run-Time Tiling with the Loop Chain Abstraction, 28th IEEE International Parallel & Distributed Processing Symposium, Publisher: IEEE Press, Pages: 1136-1145, ISSN: 1530-2075

Many scientific applications are organized in a data parallel way: as sequences of parallel and/or reduction loops. This exposes parallelism well, but does not convert data reuse between loops into data locality. This paper focuses on this issue in parallel loops whose loop-to-loop dependence structure is data-dependent due to indirect references such as A[B[i]]. Such references are a common occurrence in sparse matrix computations, molecu- lar dynamics simulations, and unstructured-mesh computational fluid dynamics (CFD). Previously, sparse tiling approaches were developed for individual benchmarks to group iterations across such loops to improve data locality. These approaches were shown to benefit applications such as moldyn, Gauss-Seidel, and the matrix powers kernel, however the run-time routines for performing sparse tiling were hand coded per application. In this paper, we present a generalized full sparse tiling algorithm that uses the newly developed loop chain abstraction as input, improves inter-loop data locality, and creates a task graph to expose shared-memory parallelism at runtime. We evaluate the overhead and performance impact of the generalized full sparse tiling algorithm on two codes: a sparse Jacobi iterative solver and the Airfoil CFD benchmark.

Abstract
Cite

Conference paper

Fabio Luporini

Contact

Location

Summary