Publications
Mudalige GR, Giles MB, Thiyagalingam J, et al., 2013, Design and Initial Performance of a High-level Unstructured Mesh Framework on Heterogeneous Parallel Systems, Parallel Computing, Vol: n/a, ISSN: 0167-8191
OP2 is a high-level domain specific library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2’s recent developments facilitating code generation and execution on distributed memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications. These include handling data dependencies in accessing indirectly referenced data and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are utilized to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems including a large scale Cray XE6 system and a large GPU cluster. A range of performance metrics are benchmarked including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks and systems energy consumption. We demonstrate that an application written once at a high-level using the OP2 API is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
Russell FP, Kelly PHJ, 2013, Optimized Code Generation for Finite Element Local Assembly Using Symbolic Manipulation, ACM Transactions on Mathematical Software, ISSN: 0098-3500
Automated code generators for finite element local assembly have facilitated the exploration of various alternative implementation strategies within the generated code. However, even with respect to a theoretical performance indicator such as operation count, an optimal strategy for performing local assembly is currently unknown. We explore a code generation strategy based on symbolic integration and polynomial common sub-expression elimination (CSE), designed to expose optimization opportunities. We present our implementation of a finite element assembly code generator using these techniques. We systematically evaluate our approach measuring operation count, execution time and numerical error using a benchmark suite of synthetic variational forms, comparing against the FEniCS Form Compiler (FFC). Our benchmark forms span complexities chosen to expose the performance characteristics of different code generation approaches. At additional computation cost in code generation, we show that it is possible to consistently achieve much of, and sometimes substantially exceed, the performance of other competing approaches without compromising numerical accuracy. Although the approach of using symbolic integration and CSE for optimizing finite element assembly is not new, we distinguish our work through our strategies for maintaining numerical precision and detecting common sub-expressions. We also discuss the benefits of the symbolic approach with respect to inferring numerical relationships for optimization. Finally we discuss the potential for further optimisations, and the relationship between this approach and other proposed techniques which also involve greater computational complexity than those utilized by FFC.
Kelly PH, Cohen A, Grosser T, et al., 2013, Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles, 6th Workshop on General Purpose Processor Using Graphics Processing Units, Publisher: ACM Press
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism. One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.
Spacey S, Luk W, Kuhn D, et al., 2013, Parallel partitioning for distributed systems using sequential assignment, Journal of Parallel and Distributed Computing, Vol: 73, Pages: 207-219
This paper introduces a method to combine the advantages of both task parallelism and fine-grained co-design specialisation to achieve faster execution times than either method alone on distributed heterogeneous architectures. The method uses a novel mixed integer linear programming formalisation to assign code sections from parallel tasks to share computational components with the optimal trade-off between acceleration from component specialism and serialisation delay. The paper provides results for software benchmarks partitioned using the method and formal implementations of previous alternatives to demonstrate both the practical tractability of the linear programming approach and the increase in program acceleration potential deliverable.
Krieger CD, Strout MM, Olschanowsky C, et al., 2013, Loop chaining: A programming abstraction for balancing locality and parallelism, Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013, Pages: 375-384
There is a significant, established code base in the scientific computing community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependencies that, as part of the abstraction, are exposed, thereby avoiding inter-procedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or run-time system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark, a domain-specific library, OP2, used in full applications with unstructured grids, and a domain-specific library, Chombo, used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop chain enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores. © 2013 IEEE.
Birch D, Liang H, Ko J, et al., 2013, Multidisciplinary Engineering Models: Methodology and Case Study in Spreadsheet Analytics, European Spreadsheet Risks Interest Group 14th Annual Conference (EuSpRIG 2013), Publisher: EuSpRIG, Pages: 1-12
Birch D, Kelly P, Field A, et al., 2013, Computationally Unifying Urban Masterplanning, CF 2013 Proceedings of the ACM International Conference on Computing Frontiers, Pages: 1-10
Markall GR, Rathgeber F, Mitchell L, et al., 2013, Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS, International Supercomputing Conference (ISC), Publisher: Springer, Pages: 279-289, ISSN: 0302-9743
We describe a toolchain that provides a fully automated compilation pathway from a finite element domain-specific language to low-level code for multicore and GPGPU platforms. We demonstrate that the generated code exceeds the performance of the best available alternatives, without requiring manual tuning or modification of the generated code. The toolchain can easily be integrated with existing finite element solvers, providing a means to add performance portable methods without having to rebuild an entire complex implementation from scratch.
Bertolli C, Betts A, Mudalige GR, et al., 2013, Compiler optimizations for industrial unstructured mesh CFD applications on GPUs, International Workshop on Languages and Compilers for Parallel Computing (LCPC), Publisher: Springer, Pages: 112-126
Salas-Moreno RF, Newcombe RA, Strasdat H, et al., 2013, SLAM++: Simultaneous Localisation and Mapping at the Level of Objects, Computer Vision and Pattern Recognition, Publisher: IEEE Press, Pages: 1352-1359, ISSN: 1063-6919
Mudalige GR, Giles MB, Reguly I, et al., 2012, OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures, 2012 Innovative Parallel Computing, InPar 2012
OP2 is an "active" library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 library, and investigate its capabilities in achieving performance portability, near-optimal performance, and scaling on modern multi-core and many-core processor based systems. A key feature of this work is OP2's recent extension facilitating the development and execution of applications on a distributed memory cluster of GPUs. We discuss the main design issues in parallelizing unstructured mesh based applications on heterogeneous platforms. These include handling data dependencies in accessing indirectly referenced data, the impact of unstructured mesh data layouts (array of structs vs. struct of arrays) and design considerations in generating code for execution on a cluster of GPUs. A representative CFD application written using the OP2 framework is utilized to provide a contrasting benchmarking and performance analysis study on a range of multi-core/many-core systems. These include multi-core CPUs from Intel (Westmere and Sandy Bridge) and AMD (Magny-Cours), GPUs from NVIDIA (GTX560Ti, Tesla C2070), a distributed memory CPU cluster (Cray XE6) and a distributed memory GPU cluster (Tesla C2050 GPUs with InfiniBand). OP2's design choices are explored with quantitative insights into their contributions to performance. We demonstrate that an application written once at a high-level using the OP2 API is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer. © 2012 IEEE.
Spacey S, Luk W, Kelly PHJ, et al., 2012, Improving communication latency with the write-only architecture, JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, Vol: 72, Pages: 1617-1627, ISSN: 0743-7315
Franke H, Kelly PHJ, Trancoso P, 2012, Guest Editorial: Computing Frontiers, INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, Vol: 40, Pages: 551-552, ISSN: 0885-7458
Collingbourne P, Cadar C, Kelly PHJ, 2012, Symbolic testing of OpenCL code, Pages: 203-218, ISSN: 0302-9743
We present an effective technique for crosschecking a C or C++ program against an accelerated OpenCL version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a symbolic execution engine based on KLEE and KLEE-FP that supports symbolic reasoning on the equivalence between symbolic values. Our approach is to symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. Using this model we are able to run OpenCL programs symbolically, keeping track of memory accesses for the purpose of race detection. We then compare the symbolic result against the plain C or C++ implementation in order to detect mismatches between the two versions. We applied KLEE-CL to the Parboil benchmark suite, the Bullet physics library and the OP2 library, in which we were able to find a total of seven errors: two mismatches between the OpenCL and C implementations, three memory errors, one OpenCL compiler bug and one race condition. © 2012 Springer-Verlag Berlin Heidelberg.
Mudalige GR, Giles MB, Bertolli C, et al., 2012, Predictive modeling and analysis of OP2 on distributed memory GPU clusters, ACM SIGMETRICS Performance Evaluation Review, Vol: 40, Pages: 61-67, ISSN: 0163-5999
OP2 is an "active" library framework for the development and solution of unstructured mesh based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the backend to different multi-core/many-core hardware. This paper presents a predictive performance analysis and benchmarking study of OP2 on heterogeneous cluster systems. We first present the design of a new OP2 back-end that enables the execution of applications on distributed memory clusters, and benchmark its performance during the solution of a 1.5M and 26M edge-based CFD application written using OP2. Benchmark systems include a large-scale Cray XE6 system and an Intel Westmere/InfiniBand cluster. We then apply performance modeling to predict the application's performance on an NVIDIA Tesla C2070 based GPU cluster, enabling us to compare OP2's performance capabilities on emerging distributed memory heterogeneous systems. Results illustrate the performance benefits that can be gained through many-core solutions both on single-node and heterogeneous configurations in comparison to traditional homogeneous cluster systems for this class of applications.
Bertolli C, Betts A, Kelly PHJ, et al., 2012, Mesh independent loop fusion for unstructured mesh applications, Pages: 43-52
Applications based on unstructured meshes are typically compute intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used for their acceleration but these esoteric architectures require specialised knowledge to achieve optimal performance. OP2 is a parallel programming layer which attempts to ease this programming burden by allowing programmers to express parallel iterations over elements in the unstructured mesh through an API call, a so-called OP2-loop. The OP2 compiler infrastructure then uses source-to-source transformations to realise a parallel implementation of each OP2-loop and discover opportunities for optimisation. In this paper, we describe how several compiler techniques can be effectively utilised in tandem to increase the performance of unstructured mesh applications. In particular, we show how whole-program analysis, which is often inhibited due to the size of the control flow graph, often becomes feasible as a result of the OP2 programming model, facilitating aggressive optimisation. We subsequently show how whole-program analysis then becomes an enabler to OP2-loop optimisations. Based on this, we show how a classical technique, namely loop fusion, which is typically difficult to apply to unstructured mesh applications, can be defined at compile-time. We examine the limits of its application and show experimental results on a computational fluid dynamic application benchmark, assessing the performance gains due to loop fusion. © 2012 ACM.
Hammond K, Kelly PHJ, 2012, Introduction to the Special Issue on Automatic Program Generation for Embedded Systems, SCIENCE OF COMPUTER PROGRAMMING, Vol: 77, Pages: 81-82, ISSN: 0167-6423
Giles MB, Mudalige GR, Bertolli C, et al., 2012, An Analytical Study of Loop Tiling for a Large-scale Unstructured Mesh Application, 2012 SC COMPANION: HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SCC), Pages: 477-482
Rathgeber F, Markall GR, Mitchell L, et al., 2012, PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes, High Performance Computing, Networking, Storage and Analysis, SC Companion, Publisher: IEEE Computer Society, Pages: 1116-1123
Emerging many-core platforms are very difficult to program in a performance portable manner whilst achieving high efficiency on a diverse range of architectures. We present work in progress on PyOP2, a high-level embedded domain-specific language for mesh-based simulation codes that executes numerical kernels in parallel over unstructured meshes. Just-in-time kernel compilation and parallel scheduling are delayed until runtime, when problem-specific parameters are available. Using generative metaprogramming, performance portability is achieved, while details of the parallel implementation are abstracted from the programmer. PyOP2 kernels for finite element computations can be generated automatically from equations given in the domain-specific Unified Form Language. Interfacing to the multi-phase CFD code Fluidity through a very thin layer on top of PyOP2 yields a general purpose finite element solver with an input notation very close to mathematical formulae. Preliminary performance figures show speedups of up to 3.4x compared to Fluidity's built-in solvers when running in parallel.
Markall GR, Slemmer A, Ham DA, et al., 2012, Finite element assembly strategies on multi- and many-core architectures, International Journal for Numerical Methods in Fluids
Gorman GJ, Southern J, Farrell PE, et al., 2012, Hybrid OpenMP/MPI anisotropic mesh smoothing, International Conference on Computational Science (ICCS), Publisher: ELSEVIER SCIENCE BV, Pages: 1513-1522, ISSN: 1877-0509
Mudalige GR, Giles MB, Bertolli C, et al., 2011, Predictive modeling and analysis of OP2 on distributed memory GPU clusters, PMBS'11 - Proceedings of the 2nd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, Co-located with SC'11, Pages: 3-4
OP2 is an "active" library framework for the development and solution of unstructured mesh based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents a summary of a predictive performance analysis and benchmarking study of OP2 on heterogeneous cluster systems. In this work, an industrially representative CFD application written using the OP2 framework is benchmarked during the solution of unstructured meshes of 1.5M and 26M edges. Benchmark systems include a large-scale Cray XE6 system and an Intel Westmere/InfiniBand cluster. Performance modeling is then used to predict the application's performance on an NVIDIA Tesla C2070 based GPU cluster, enabling the comparison of OP2's performance capabilities on emerging distributed memory heterogeneous systems. Results illustrate the performance benefits that can be gained through many-core solutions both on single-node and heterogeneous configurations in comparison to traditional homogeneous cluster systems for this class of applications.
Court C, Kelly PHJ, 2011, Loop-Directed Mothballing: Power Gating Execution Units Using Runtime Loop Analysis, IEEE Micro, Vol: 31, Pages: 29-38
Court CA, Kelly PHJ, 2011, Loop-Directed Mothballing: Power-gating execution units using fast analysis of inner loops, IEEE Symposium on Low-Power and High-Speed Chips - 2011 IEEE COOL Chips XIV, Proceedings
Static power dissipation has been identified as a limiting factor in future microprocessor technologies. This paper presents Loop-Directed Mothballing (LDM) to reduce static power by power-gating execution units. The method accurately predicts the resource requirements and limits performance degradation by focussing on inner loops. In simulation, the energy-delay product (EDP) of the processor is reduced by 10.3%. Two prior methods show worse EDP despite having greater power savings. © 2011 IEEE.
Giles MB, Mudalige GR, Sharif Z, et al., 2011, Performance Analysis and Optimisation of the OP2 Framework on Many-core Architectures, The Computer Journal, Vol: 55
Russell FP, Mellor MR, Kelly PHJ, et al., 2011, DESOLA: An active linear algebra library using delayed evaluation and runtime code generation, SCIENCE OF COMPUTER PROGRAMMING, Vol: 76, Pages: 227-242, ISSN: 0167-6423
Giles MB, Mudalige GR, Sharif Z, et al., 2011, Performance analysis of the OP2 framework on many-core architectures, ACM SIGMETRICS Performance Evaluation Review, Vol: 38, Pages: 9-15, ISSN: 0163-5999
We present a performance analysis and benchmarking study of the OP2 "active" library, which provides an abstraction framework for the solution of parallel unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, achieving code longevity and near-optimal performance through re-targeting the back-end to different hardware. Runtime performance results are presented for a representative unstructured mesh application written using OP2 on a variety of many-core processor systems, including the traditional x86 architectures from Intel (Xeon based on the older Penryn and current Nehalem micro-architectures) and GPU offerings from NVIDIA (GTX260, Tesla C2050). Our analysis demonstrates the contrasting performance between the use of CPU (OpenMP) and GPU (CUDA) parallel implementations for the solution on an industrial-sized unstructured mesh consisting of about 1.5 million edges. Results show the significance of choosing the correct partition and thread-block configuration, the factors limiting the GPU performance and insights into optimizations for improved performance.
Cantwell CD, Sherwin SJ, Kirby RM, et al., 2011, From h to p Efficiently: Selecting the Optimal Spectral/hp Discretisation in Three Dimensions, MATHEMATICAL MODELLING OF NATURAL PHENOMENA, Vol: 6, Pages: 84-96, ISSN: 0973-5348
Rokos G, Gorman G, Kelly PHJ, 2011, Accelerating Anisotropic Mesh Adaptivity on nVIDIA's CUDA Using Texture Interpolation, 17th International Euro-Par Conference on Parallel Processing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 387-398, ISSN: 0302-9743
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.