Bisbas G, Luporini F, Louboutin M, et al., 2021, Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources, 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Publisher: IEEE COMPUTER SOC, Pages: 497-506, ISSN: 1530-2075
Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to practical applications' stencils remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid (“off-the-grid”). Our work is motivated by modelling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the gridded wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that this enables substantial performance improvement through temporal blocking over highly-optimized vectorized spatially-blocked code of up to 1.6x.
Kukreja N, Hückelheim J, Louboutin M, et al., 2020, Lossy Checkpoint Compression in Full Waveform Inversion, Geoscientific Model Development, ISSN: 1991-959X
This paper proposes a new method that combines checkpointing methods with error-controlled lossy compression for large-scale high-performance Full-Waveform Inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory. In the Exascale computing era, frequent data transfer (e.g., memory bandwidth, PCIe bandwidth for GPUs, or network) is the performance bottleneck rather than the peak FLOPS of the processing unit. Like many other adjoint-based optimization problems, FWI is costly in terms of the number of floating-point operations, large memory footprint during backpropagation, and data transfer overheads. Past work for adjoint methods has developed checkpointing methods that reduce the peak memory requirements during backpropagation at the cost of additional floating-point computations. Combining this traditional checkpointing with error-controlled lossy compression, we explore the three-way tradeoff between memory, precision, and time to solution. We investigate how approximation errors introduced by lossy compression of the forward solution impact the objective function gradient and final inverted solution. Empirical results from these numerical experiments indicate that high lossy-compression rates (compression factors ranging up to 100) have a relatively minor impact on convergence rates and the quality of the final solution.
Kramer S, Wilson C, Davies R, et al., 2020, FluidityProject/fluidity: New test cases "Analytical solutions for mantle flow in cylindrical and spherical shells"
This release adds new test cases described in the GMD paper "Analytical solutions for mantle flow in cylindrical and spherical shells"
Luporini F, Louboutin M, Lange M, et al., 2020, devitocodes/devito: v4.2.3
Synopsis: Performance optimizations in the symbolic layer and generated code for x86, GPU and MPI. Various minor correctness and performance bug fixes. Improvements to application developer API. Added new tutorial notebooks. Increased test coverage, particularly for MPI and GPUs.
Louboutin M, Luporini F, Bisbas G, et al., 2020, mloubout/SC20Paper: First release
SC20 in Atlanta submission
Luporini F, Lange M, Louboutin M, et al., 2020, Architecture and performance of Devito, a system for automated stencil computation, ACM Transactions on Mathematical Software, Vol: 46, Pages: 1-24, ISSN: 0098-3500
Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process, from mathematical equations down to C++ code, is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.
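For readers unfamiliar with the term, a stencil update can be sketched in a few lines. The example below is a minimal hand-written illustration, not Devito output and not taken from the paper: the function name `diffuse_step`, the coefficient `alpha`, and the fixed-boundary treatment are all assumptions chosen for brevity. It applies a 3-point finite-difference stencil for one explicit time step of 1D diffusion.

```python
def diffuse_step(u, alpha):
    """Apply one explicit time step of 1D diffusion with a 3-point stencil.

    Hypothetical helper for illustration only. Each interior point is
    updated from itself and its two neighbours; the two boundary values
    are left unchanged.
    """
    new = list(u)  # copy so every update reads the old time level
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


# A unit spike in the middle of a small grid spreads to its neighbours.
u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
u1 = diffuse_step(u0, 0.25)
print(u1)  # [0.0, 0.25, 0.5, 0.25, 0.0]
```

Every point is computed from a fixed neighbourhood of the previous time level, which is the access pattern that optimizations such as tiling and parallelization exploit.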
Louboutin M, Luporini F, Witte P, et al., 2020, Scaling through abstractions -- high-performance vectorial wave simulations for seismic inversion with Devito, Publisher: arXiv
[Devito] is an open-source Python project based on domain-specific language and compiler technology. Driven by the requirements of rapid HPC applications development in exploration seismology, the language and compiler have evolved significantly since inception. Sophisticated boundary conditions, tensor contractions, sparse operations and features such as staggered grids and sub-domains are all supported; operators of essentially arbitrary complexity can be generated. To accommodate this flexibility whilst ensuring performance, data dependency analysis is utilized to schedule loops and detect computational-properties such as parallelism. In this article, the generation and simulation of MPI-parallel propagators (along with their adjoints) for the pseudo-acoustic wave-equation in tilted transverse isotropic media and the elastic wave-equation are presented. Simulations are carried out on industry scale synthetic models in an HPC Cloud system and reach a performance of 28 TFLOP/s, hence demonstrating Devito's suitability for production-grade seismic inversion problems.
Luporini F, Louboutin M, Lange M, et al., 2020, Architecture and Performance of Devito, a System for Automated Stencil Computation, Publisher: ASSOC COMPUTING MACHINERY
Luporini F, Nelson R, Burgess T, et al., 2020, Automated distributed-memory parallelism from symbolic specification in devito
Automated Distributed-memory Parallelism (DMP) has been added to Devito, a rapidly evolving framework adopted by a dynamic, heterogeneous and fast-growing community. The key innovations are the abstractions provided to the user and the compiler-based implementation approach, which we consider invaluable for long-term sustainable software to replace (partly or fully) obsolete, impenetrable, hardly extendable and often inefficient legacy code. The auto-tuner, which determines, among other things, the best block shape for each tiled loop nest in an Operator, has already been tweaked to support DMP. Single-node multi-socket (one MPI process per socket) as well as multi-node experiments, both weak and strong scaling, are planned for the near future.
Luporini F, Louboutin M, Lange M, et al., 2019, opesci/devito: Devito-4.0
Tensor algebra support (#873): VectorFunction and VectorTimeFunction; (2nd order) TensorFunction and TensorTimeFunction; full support for FD and related operations (derivatives, shortcuts, solve, ...); differential operators such as div, grad and curl. FD extensions: custom FD with user-supplied coefficients as Function (#964). Extended and more rigorous support for staggered grids (#873): true half-grid staggering (u(x + h_x/2)); automatic evaluation at half-nodes (averaging only); automatic staggered FD of any order.
Rodrigues VHM, Cavalcante L, Pereira MB, et al., 2019, GPU support for automatic generation of finite-differences stencil Kernels, Publisher: arXiv
The growth of data to be processed in the Oil & Gas industry matches the requirements imposed by evolving algorithms based on stencil computations, such as Full Waveform Inversion and Reverse Time Migration. Graphical processing units (GPUs) are an attractive architectural target for stencil computations because of their high degree of data parallelism. However, the rapid architectural and technological progression makes it difficult for even the most proficient programmers to remain up-to-date with the technological advances at a micro-architectural level. In this work, we present an extension for Devito, an open source compiler designed to produce highly optimized finite difference kernels for use in inversion methods. We embed it with the Oxford Parallel Domain Specific Language (OP-DSL) in order to enable automatic code generation for GPU architectures from a high-level representation. We aim to enable users coding in a symbolic representation level to effortlessly get their implementations leveraged by the processing capacities of GPU architectures. The implemented backend is evaluated on an NVIDIA GTX Titan Z, and on an NVIDIA Tesla V100 in terms of operational intensity through the roof-line model for varying space-order discretization levels of 3D acoustic isotropic wave propagation stencil kernels with and without symbolic optimizations. It achieves approximately 63% of V100's peak performance and 24% of Titan Z's peak performance for stencil kernels over grids with 256 points. Our study reveals that improving memory usage should be the most efficient strategy for leveraging the performance of the implemented solution on the evaluated architectures.
Witte PA, Louboutin M, Luporini F, et al., 2019, Compressive least-squares migration with on-the-fly Fourier transforms, GEOPHYSICS, Vol: 84, Pages: R655-R672, ISSN: 0016-8033
Luporini F, Lange M, Louboutin M, et al., 2019, opesci/devito: Devito-3.5
Release notes. MPI support: Python-level: MPI-distributed NumPy arrays. C-level: code generation for sub-domains, staggered grids, operators with coupled PDEs. C-level: performance optimizations (e.g., computation-communication overlap). Lazy evaluation of derivatives. Revisited staggered grids API (now Dimension-based, previously mask-based). Re-engineered clustering (which means smarter loop fusion/fission). DSE: Improved aliases detection. DLE: OpenMP nested parallelism; hierarchical loop blocking. Auto-padding for Functions/TimeFunctions. Improved data dependency analysis. Smarter Operator auto-tuning. New tutorials: Operator application, MPI, new propagators, custom stencils, and more. Revisited benchmarking scripts. Revisited examples, new models and propagators (e.g., visco-elastic). Smarter continuous integration: now Travis sided by Azure Pipelines; dropped Jenkins. Misc bug fixes. Hundreds of tests added. More sophisticated platform auto-detection.
Hückelheim J, Kukreja N, Narayanan SHK, et al., 2019, Automatic differentiation for adjoint stencil loops, Publisher: arXiv
Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.
Luporini F, Lange M, Jacobs CT, et al., 2019, Automated tiling of unstructured mesh computations with application to seismological modeling, ACM Transactions on Mathematical Software, Vol: 45, ISSN: 0098-3500
Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]], hence the name “sparse.” One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation: not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.
Witte PA, Louboutin M, Kukreja N, et al., 2019, A large-scale framework for symbolic implementations of seismic inversion algorithms in Julia, Geophysics, Vol: 84, Pages: F57-F71, ISSN: 0016-8033
Writing software packages for seismic inversion is a very challenging task because problems such as full-waveform inversion or least-squares imaging are algorithmically and computationally demanding due to the large number of unknown parameters and the fact that waves are propagated over many wavelengths. Therefore, software frameworks need to combine versatility and performance to provide geophysicists with the means and flexibility to implement complex algorithms that scale to exceedingly large 3D problems. Following these principles, we have developed the Julia Devito Inversion framework, an open-source software package in Julia for large-scale seismic modeling and inversion based on Devito, a domain-specific language compiler for automatic code generation. The framework consists of matrix-free linear operators for implementing seismic inversion algorithms that closely resemble the mathematical notation, a flexible resilient parallelization, and an interface to Devito for generating optimized stencil code to solve the underlying wave equations. In comparison with many manually optimized industry codes written in low-level languages, our software is built on the idea of independent layers of abstractions and user interfaces with symbolic operators. Through a series of numerical examples, we determined that this allows users to implement a series of increasingly complex algorithms for waveform inversion and imaging as simple Julia scripts that scale to large-scale 3D problems. This illustrates that software based on the paradigms of abstract user interfaces and automatic code generation makes it possible to manage the complexity of the algorithms and performance optimizations, thus providing a high-performance research and production framework.
Louboutin M, Lange M, Luporini F, et al., 2019, Devito (v3.1.0): An embedded domain-specific language for finite differences and geophysical exploration, Geoscientific Model Development, Vol: 12, Pages: 1165-1187, ISSN: 1991-959X
We introduce Devito, a new domain-specific language for implementing high-performance finite-difference partial differential equation solvers. The motivating application is exploration seismology for which methods such as full-waveform inversion and reverse-time migration are used to invert terabytes of seismic data to create images of the Earth's subsurface. Even using modern supercomputers, it can take weeks to process a single seismic survey and create a useful subsurface image. The computational cost is dominated by the numerical solution of wave equations and their corresponding adjoints. Therefore, a great deal of effort is invested in aggressively optimizing the performance of these wave-equation propagators for different computer architectures. Additionally, the actual set of partial differential equations being solved and their numerical discretization is under constant innovation as increasingly realistic representations of the physics are developed, further ratcheting up the cost of practical solvers. By embedding a domain-specific language within Python and making heavy use of SymPy, a symbolic mathematics library, we make it possible to develop finite-difference simulators quickly using a syntax that strongly resembles the mathematics. The Devito compiler reads this code and applies a wide range of analyses to generate highly optimized and parallel code. This approach can reduce the development time of a verified and optimized solver from months to days.
Kukreja N, Shilova A, Beaumont O, et al., 2019, Training on the Edge: The why and the how, Publisher: arXiv
Edge computing is the natural progression from Cloud computing, where, instead of collecting all data and processing it centrally, like in a cloud computing environment, we distribute the computing power and try to do as much processing as possible, close to the source of the data. There are various reasons this model is being adopted quickly, including privacy, and reduced power and bandwidth requirements on the Edge nodes. While it is common to see inference being done on Edge nodes today, it is much less common to do training on the Edge. The reasons for this range from computational limitations, to it not being advantageous in reducing communications between the Edge nodes. In this paper, we explore some scenarios where it is advantageous to do training on the Edge, as well as the use of checkpointing strategies to save memory.
Kukreja N, Hückelheim J, Louboutin M, et al., 2019, Combining Checkpointing and Data Compression to Accelerate Adjoint-Based Optimization Problems, Pages: 87-100, ISSN: 0302-9743
Seismic inversion and imaging are adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory requirement by a certain factor, particularly if some loss in accuracy is acceptable. A popular alternative is checkpointing, where data is stored at selected points in time, and values at other times are recomputed as needed from the last stored state. This allows arbitrarily large adjoint computations with limited memory, at the cost of additional recomputations. In this paper, we combine compression and checkpointing for the first time to compute a realistic seismic inversion. The combination of checkpointing and compression allows larger adjoint computations compared to using only compression, and reduces the recomputation overhead significantly compared to using only checkpointing.
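The checkpointing idea described in this abstract (store the state at selected times, recompute the rest from the last stored state) can be sketched as follows. This is a simplified illustration under stated assumptions, not the scheme used in the paper: `forward_step` is a hypothetical stand-in for one time step of a forward solver, checkpoints are placed at a fixed `interval`, and no compression is applied.

```python
def forward_step(u):
    """Hypothetical stand-in for one time step of a forward solver."""
    return 0.5 * u + 1.0


def states_in_reverse(u0, nsteps, interval):
    """Yield (t, u_t) for t = nsteps-1 down to 0, as an adjoint sweep needs.

    Only every `interval`-th forward state is kept in memory; any other
    state is recomputed from the nearest earlier checkpoint.
    """
    checkpoints = {}
    u = u0
    for t in range(nsteps):          # forward sweep: store selected states
        if t % interval == 0:
            checkpoints[t] = u
        u = forward_step(u)
    for t in reversed(range(nsteps)):  # reverse sweep: restore or recompute
        base = (t // interval) * interval
        u = checkpoints[base]
        for _ in range(t - base):
            u = forward_step(u)
        yield t, u


# Reference: keep every state in memory, then read them backwards.
nsteps = 10
full = [1.0]
for _ in range(nsteps - 1):
    full.append(forward_step(full[-1]))

recomputed = list(states_in_reverse(1.0, nsteps, interval=4))
print(recomputed == [(t, full[t]) for t in reversed(range(nsteps))])
```

With `interval=4` only the states at t = 0, 4 and 8 are stored instead of all ten, at the price of up to three extra forward steps per reverse step. A larger interval saves more memory and costs more recomputation.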
Kukreja N, Luporini F, Lange M, et al., 2018, opesci/devito: Devito-3.4
Release notes. Preliminary support for MPI (no changes to user code requested). Support for staggered grids. Improved compilation technology. Improved Operator autotuning. More powerful DSL (e.g., take derivatives of entire expressions such as (u+v).dx). More efficient pickling. Misc bug fixes. New modeling examples based on the elastic wave equation. New examples describing aspects of the compilation technology.
This tutorial is the third part of a full-waveform inversion (FWI) tutorial series with a step-by-step walkthrough of setting up forward and adjoint wave equations and building a basic FWI inversion framework. For discretizing and solving wave equations, we use Devito (http://www.opesci.org/devito-public), a Python-based domain-specific language for automated generation of finite-difference code (Lange et al., 2016). The first two parts of this tutorial (Louboutin et al., 2017, 2018) demonstrated how to solve the acoustic wave equation for modeling seismic shot records and how to compute the gradient of the FWI objective function using the adjoint-state method. With these two key ingredients, we will now build an inversion framework that can be used to minimize the FWI least-squares objective function.
This is the second part of a three-part tutorial series on full-waveform inversion (FWI) in which we provide a step-by-step walk through of setting up forward and adjoint wave equation solvers and an optimization framework for inversion. In Part 1 (Louboutin et al., 2017), we showed how to use Devito (http://www.opesci.org/devito-public) to set up and solve acoustic wave equations with (impulsive) seismic sources and sample wavefields at the receiver locations to forward model shot records. Here in Part 2, we will discuss how to set up and solve adjoint wave equations with Devito and, from that, how we can calculate gradients and function values of the FWI objective function.
Since its reintroduction by Pratt (1999), full-waveform inversion (FWI) has gained a lot of attention in geophysical exploration because of its ability to build high-resolution velocity models more or less automatically in areas of complex geology. While there is an extensive and growing literature on the topic, publications focus mostly on technical aspects, making this topic inaccessible for a broader audience due to the lack of simple introductory resources for newcomers to computational geophysics. We will accomplish this by providing a hands-on walkthrough of FWI using Devito (Lange et al., 2016), a system based on domain-specific languages that automatically generates code for time-domain finite differences.
Hückelheim JC, Luo Z, Luporini F, et al., 2017, Towards self-verification in finite difference code generation, SC17, Publisher: ACM
Code generation from domain-specific languages is becoming increasingly popular as a method to obtain optimised low-level code that performs well on a given platform and for a given problem instance. Ensuring the correctness of generated codes is crucial. At the same time, testing or manual inspection of the code is problematic, as the generated code can be complex and hard to read. Moreover, the generated code may change depending on the problem type, domain size, or target platform, making conventional code review or testing methods impractical. As a solution, we propose the integration of formal verification tools into the code generation process. We present a case study in which the CIVL verification tool is combined with the Devito finite difference framework that generates optimised stencil code for PDE solvers from symbolic equations. We show a selection of properties of the generated code that can be automatically specified and verified during the code generation process. Our approach allowed us to detect a previously unknown bug in the Devito code generation tool.
Lange M, Kukreja N, Luporini F, et al., 2017, Optimised finite difference computation from symbolic equations, 15th Python in Science Conference (SciPy 2017), Pages: 89-96
Domain-specific high-productivity environments are playing an increasingly important role in scientific computing due to the levels of abstraction and automation they provide. In this paper we introduce Devito, an open-source domain-specific framework for solving partial differential equations from symbolic problem definitions by the finite difference method. We highlight the generation and automated execution of highly optimized stencil code from only a few lines of high-level symbolic Python for a set of scientific equations, before exploring the use of Devito operators in seismic inversion problems.
Louboutin M, Lange M, Herrmann FJ, et al., 2017, Performance prediction of finite-difference solvers for different computer architectures, Computers & Geosciences, Vol: 105, Pages: 148-157, ISSN: 0098-3004
The life-cycle of a partial differential equation (PDE) solver is often characterized by three development phases: the development of a stable numerical discretization; development of a correct (verified) implementation; and the optimization of the implementation for different computer architectures. Often it is only after significant time and effort have been invested that the performance bottlenecks of a PDE solver are fully understood, and the precise details vary between different computer architectures. One way to mitigate this issue is to establish a reliable performance model that allows a numerical analyst to make reliable predictions of how well a numerical method would perform on a given computer architecture, before embarking upon potentially long and expensive implementation and optimization phases. The availability of a reliable performance model also saves developer effort as it both informs the developer on what kind of optimisations are beneficial, and when the maximum expected performance has been reached and optimisation work should stop. We show how discretization of a wave-equation can be theoretically studied to understand the performance limitations of the method on modern computer architectures. We focus on the roofline model, now broadly used in the high-performance computing community, which considers the achievable performance in terms of the peak memory bandwidth and peak floating point performance of a computer with respect to algorithmic choices. A first principles analysis of operational intensity for key time-stepping finite-difference algorithms is presented. With this information available at the time of algorithm design, the expected performance on target computer systems can be used as a driver for algorithm design.
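The roofline bound mentioned in this abstract has a standard closed form: attainable performance is the smaller of the peak floating-point rate and the product of peak memory bandwidth and operational intensity. The sketch below illustrates that bound only. The function names and the machine figures are made-up assumptions, not values from the paper.

```python
def roofline_gflops(operational_intensity, peak_gflops, peak_bandwidth_gbs):
    """Attainable performance in GFLOP/s under the roofline model.

    operational_intensity is in FLOPs per byte moved from main memory,
    peak_bandwidth_gbs in GB/s.
    """
    return min(peak_gflops, peak_bandwidth_gbs * operational_intensity)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Operational intensity above which a kernel becomes compute-bound."""
    return peak_gflops / peak_bandwidth_gbs


# Hypothetical machine, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 1000.0
PEAK_BANDWIDTH_GBS = 100.0

memory_bound = roofline_gflops(2.0, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
compute_bound = roofline_gflops(20.0, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
ridge = ridge_point(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
print(memory_bound, compute_bound, ridge)  # 200.0 1000.0 10.0
```

On this hypothetical machine a kernel with 2 FLOPs per byte is limited by memory bandwidth to 200 GFLOP/s, so reducing memory traffic helps more than reducing arithmetic. Above the ridge point of 10 FLOPs per byte the opposite holds.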
Kukreja N, Louboutin M, Vieira F, et al., 2017, Devito: Automated fast finite difference computation, Pages: 11-19
Domain specific languages have successfully been used in a variety of fields to cleanly express scientific problems as well as to simplify implementation and performance optimization on different computer architectures. Although a large number of stencil languages are available, finite difference domain specific languages have proved challenging to design because most practical use cases require additional features that fall outside the finite difference abstraction. Inspired by the complexity of real-world seismic imaging problems, we introduce Devito, a domain specific language in which high level equations are expressed using symbolic expressions from the SymPy package. Complex equations are automatically manipulated, optimized, and translated into highly optimized C code that aims to perform comparably or better than hand-tuned code. All this is transparent to users, who only see concise symbolic mathematical expressions.
Kukreja N, Louboutin M, Lange M, et al., 2017, Rapid development of seismic imaging applications using symbolic math, Pages: 9-12
In this talk, I will discuss our approach to the formulation and the performance optimization of finite difference methods for PDEs arising in FWI. Our framework consists of a stack of domain specific languages and optimizing compilers. The mathematical specification of a finite difference method is translated by a compiler, Devito, into C code, applying a sophisticated sequence of transformations. These include standard loop transformations, such as blocking and vectorization, as well as symbolic manipulations to reduce the unusually high arithmetic intensity of the stencils arising in forward and adjoint operators. These include common subexpressions elimination, factorization, code motion and approximation of transient functions. I will show the impact of these transformations on standard Intel Xeon architectures as well as on Intel Knights Landing. Compelling evidence points in the direction that our stencil kernels are significantly bound by the L1 cache. I will conclude discussing future challenges and goals of our work.
Lange M, Mitchell L, Knepley M, et al., 2016, Efficient Mesh Management in Firedrake Using PETSc DMPlex, SIAM Journal on Scientific Computing, Vol: 38, Pages: S143-S155, ISSN: 1095-7197
The use of composable abstractions allows the application of new and established algorithms to a wide range of problems, while automatically inheriting the benefits of well-known performance optimizations. This work highlights the composition of the PETSc DMPlex domain topology abstraction with the Firedrake automated finite element system to create a PDE solving environment that combines expressiveness, flexibility, and high performance. We describe how Firedrake utilizes DMPlex to provide the indirection maps required for finite element assembly, while supporting various mesh input formats and runtime domain decomposition. In particular, we describe how DMPlex and its accompanying data structures allow the generic creation of user-defined discretizations, while utilizing data layout optimizations that improve cache coherency and ensure overlapped communication during assembly computation.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.