Imperial College London

Professor Paul Kelly

Faculty of Engineering, Department of Computing

Professor of Software Technology
 
 
 

Contact

 

+44 (0)20 7594 8332
p.kelly
Website

 
 

Location

 

Level 3 (upstairs), William Penney Building, room 304
William Penney Laboratory
South Kensington Campus



Publications


189 results found

Mudalige GR, Giles MB, Thiyagalingam J, Reguly I, Bertolli C, Kelly PHJ, Trefethen AE et al., 2013, Design and Initial Performance of a High-level Unstructured Mesh Framework on Heterogeneous Parallel Systems, Parallel Computing, Vol: n/a, ISSN: 0167-8191

OP2 is a high-level domain specific library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2’s recent developments facilitating code generation and execution on distributed memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications. These include handling data dependencies in accessing indirectly referenced data and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are utilized to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems including a large scale Cray XE6 system and a large GPU cluster. A range of performance metrics are benchmarked including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks and systems energy consumption. We demonstrate that an application written once at a high-level using the OP2 API is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.

Journal article

Russell FP, Kelly PHJ, 2013, Optimized Code Generation for Finite Element Local Assembly Using Symbolic Manipulation, ACM Transactions on Mathematical Software, ISSN: 0098-3500

Automated code generators for finite element local assembly have facilitated the exploration of various alternative implementation strategies within the generated code. However, even with respect to a theoretical performance indicator such as operation count, an optimal strategy for performing local assembly is currently unknown. We explore a code generation strategy based on symbolic integration and polynomial common sub-expression elimination (CSE), designed to expose optimization opportunities. We present our implementation of a finite element assembly code generator using these techniques. We systematically evaluate our approach measuring operation count, execution time and numerical error using a benchmark suite of synthetic variational forms, comparing against the FEniCS Form Compiler (FFC). Our benchmark forms span complexities chosen to expose the performance characteristics of different code generation approaches. At additional computation cost in code generation, we show that it is possible to consistently achieve much of, and sometimes substantially exceed, the performance of other competing approaches without compromising numerical accuracy. Although the approach of using symbolic integration and CSE for optimizing finite element assembly is not new, we distinguish our work through our strategies for maintaining numerical precision and detecting common sub-expressions. We also discuss the benefits of the symbolic approach with respect to inferring numerical relationships for optimization. Finally we discuss the potential for further optimisations, and the relationship between this approach and other proposed techniques which also involve greater computational complexity than those utilized by FFC.

Journal article

Kelly PH, Cohen A, Grosser T, Ramanujam J, Sadayappan P, Verdoolaege S et al., 2013, Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles, 6th Workshop on General Purpose Processor Using Graphics Processing Units, Publisher: ACM Press

Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism. One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.

Conference paper

Spacey S, Luk W, Kuhn D, Kelly PHJ et al., 2013, Parallel partitioning for distributed systems using sequential assignment, Journal of Parallel and Distributed Computing, Vol: 73, Pages: 207-219

This paper introduces a method to combine the advantages of both task parallelism and fine-grained co-design specialisation to achieve faster execution times than either method alone on distributed heterogeneous architectures. The method uses a novel mixed integer linear programming formalisation to assign code sections from parallel tasks to share computational components with the optimal trade-off between acceleration from component specialism and serialisation delay. The paper provides results for software benchmarks partitioned using the method and formal implementations of previous alternatives to demonstrate both the practical tractability of the linear programming approach and the increase in program acceleration potential deliverable.

Journal article

Krieger CD, Strout MM, Olschanowsky C, Stone A, Guzik S, Gao X, Bertolli C, Kelly PHJ, Mudalige G, Van Straalen B, Williams S et al., 2013, Loop chaining: A programming abstraction for balancing locality and parallelism, Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013, Pages: 375-384

There is a significant, established code base in the scientific computing community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependencies that, as part of the abstraction, are exposed, thereby avoiding inter-procedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or run-time system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark, a domain-specific library, OP2, used in full applications with unstructured grids, and a domain-specific library, Chombo, used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop chain enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores. © 2013 IEEE.

Journal article

Birch D, Liang H, Ko J, Kelly P, Field A, Mullineux G, Simondetti A et al., 2013, Multidisciplinary Engineering Models: Methodology and Case Study in Spreadsheet Analytics, European Spreadsheet Risks Interest Group 14th Annual Conference (EuSpRIG 2013), Publisher: EuSpRIG, Pages: 1-12

Conference paper

Birch D, Kelly P, Field A, Simondetti A et al., 2013, Computationally Unifying Urban Masterplanning, CF 2013 Proceedings of the ACM International Conference on Computing Frontiers, Pages: 1-10

Journal article

Markall GR, Rathgeber F, Mitchell L, Loriant N, Bertolli C, Kelly PHJ et al., 2013, Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS, International Supercomputing Conference (ISC), Publisher: Springer, Pages: 279-289, ISSN: 0302-9743

We describe a toolchain that provides a fully automated compilation pathway from a finite element domain-specific language to low-level code for multicore and GPGPU platforms. We demonstrate that the generated code exceeds the performance of the best available alternatives, without requiring manual tuning or modification of the generated code. The toolchain can easily be integrated with existing finite element solvers, providing a means to add performance portable methods without having to rebuild an entire complex implementation from scratch.

Conference paper

Bertolli C, Betts A, Mudalige GR, Loriant N, Ham D, Giles MB, Kelly PHJ et al., 2013, Compiler optimizations for industrial unstructured mesh CFD applications on GPUs, International Workshop on Languages and Compilers for Parallel Computing (LCPC), Publisher: Springer, Pages: 112-126

Conference paper

Salas-Moreno RF, Newcombe RA, Strasdat H, Kelly PHJ, Davison AJ et al., 2013, SLAM++: Simultaneous Localisation and Mapping at the Level of Objects, Computer Vision and Pattern Recognition, Publisher: IEEE Press, Pages: 1352-1359, ISSN: 1063-6919

Conference paper

Mudalige GR, Giles MB, Reguly I, Bertolli C, Kelly PHJ et al., 2012, OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures, 2012 Innovative Parallel Computing, InPar 2012

OP2 is an "active" library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 library, and investigate its capabilities in achieving performance portability, near-optimal performance, and scaling on modern multi-core and many-core processor based systems. A key feature of this work is OP2's recent extension facilitating the development and execution of applications on a distributed memory cluster of GPUs. We discuss the main design issues in parallelizing unstructured mesh based applications on heterogeneous platforms. These include handling data dependencies in accessing indirectly referenced data, the impact of unstructured mesh data layouts (array of structs vs. struct of arrays) and design considerations in generating code for execution on a cluster of GPUs. A representative CFD application written using the OP2 framework is utilized to provide a contrasting benchmarking and performance analysis study on a range of multi-core/many-core systems. These include multi-core CPUs from Intel (Westmere and Sandy Bridge) and AMD (Magny-Cours), GPUs from NVIDIA (GTX560Ti, Tesla C2070), a distributed memory CPU cluster (Cray XE6) and a distributed memory GPU cluster (Tesla C2050 GPUs with InfiniBand). OP2's design choices are explored with quantitative insights into their contributions to performance. We demonstrate that an application written once at a high-level using the OP2 API is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer. © 2012 IEEE.

Journal article

Spacey S, Luk W, Kelly PHJ, Kuhn D et al., 2012, Improving communication latency with the write-only architecture, Journal of Parallel and Distributed Computing, Vol: 72, Pages: 1617-1627, ISSN: 0743-7315

Journal article

Franke H, Kelly PHJ, Trancoso P, 2012, Guest Editorial: Computing Frontiers, International Journal of Parallel Programming, Vol: 40, Pages: 551-552, ISSN: 0885-7458

Journal article

Collingbourne P, Cadar C, Kelly PHJ, 2012, Symbolic testing of OpenCL code, Pages: 203-218, ISSN: 0302-9743

We present an effective technique for crosschecking a C or C++ program against an accelerated OpenCL version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a symbolic execution engine based on KLEE and KLEE-FP that supports symbolic reasoning on the equivalence between symbolic values. Our approach is to symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. Using this model we are able to run OpenCL programs symbolically, keeping track of memory accesses for the purpose of race detection. We then compare the symbolic result against the plain C or C++ implementation in order to detect mismatches between the two versions. We applied KLEE-CL to the Parboil benchmark suite, the Bullet physics library and the OP2 library, in which we were able to find a total of seven errors: two mismatches between the OpenCL and C implementations, three memory errors, one OpenCL compiler bug and one race condition. © 2012 Springer-Verlag Berlin Heidelberg.

Conference paper

Mudalige GR, Giles MB, Bertolli C, Kelly PHJ et al., 2012, Predictive modeling and analysis of OP2 on distributed memory GPU clusters, ACM SIGMETRICS Performance Evaluation Review, Vol: 40, Pages: 61-67, ISSN: 0163-5999

OP2 is an "active" library framework for the development and solution of unstructured mesh based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents a predictive performance analysis and benchmarking study of OP2 on heterogeneous cluster systems. We first present the design of a new OP2 back-end that enables the execution of applications on distributed memory clusters, and benchmark its performance during the solution of a CFD application, written using OP2, on unstructured meshes of 1.5M and 26M edges. Benchmark systems include a large-scale Cray XE6 system and an Intel Westmere/InfiniBand cluster. We then apply performance modeling to predict the application's performance on an NVIDIA Tesla C2070 based GPU cluster, enabling us to compare OP2's performance capabilities on emerging distributed memory heterogeneous systems. Results illustrate the performance benefits that can be gained through many-core solutions both on single-node and heterogeneous configurations in comparison to traditional homogeneous cluster systems for this class of applications.

Journal article

Bertolli C, Betts A, Kelly PHJ, Mudalige GR, Giles MB et al., 2012, Mesh independent loop fusion for unstructured mesh applications, Pages: 43-52

Applications based on unstructured meshes are typically compute intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used for their acceleration but these esoteric architectures require specialised knowledge to achieve optimal performance. OP2 is a parallel programming layer which attempts to ease this programming burden by allowing programmers to express parallel iterations over elements in the unstructured mesh through an API call, a so-called OP2-loop. The OP2 compiler infrastructure then uses source-to-source transformations to realise a parallel implementation of each OP2-loop and discover opportunities for optimisation. In this paper, we describe how several compiler techniques can be effectively utilised in tandem to increase the performance of unstructured mesh applications. In particular, we show how whole-program analysis - which is often inhibited due to the size of the control flow graph - often becomes feasible as a result of the OP2 programming model, facilitating aggressive optimisation. We subsequently show how whole-program analysis then becomes an enabler to OP2-loop optimisations. Based on this, we show how a classical technique, namely loop fusion, which is typically difficult to apply to unstructured mesh applications, can be defined at compile-time. We examine the limits of its application and show experimental results on a computational fluid dynamic application benchmark, assessing the performance gains due to loop fusion. © 2012 ACM.

Conference paper

Hammond K, Kelly PHJ, 2012, Introduction to the Special Issue on Automatic Program Generation for Embedded Systems, Science of Computer Programming, Vol: 77, Pages: 81-82, ISSN: 0167-6423

Journal article

Giles MB, Mudalige GR, Bertolli C, Kelly PHJ, Laszlo E, Reguly I et al., 2012, An Analytical Study of Loop Tiling for a Large-scale Unstructured Mesh Application, 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC), Pages: 477-482

Journal article

Rathgeber F, Markall GR, Mitchell L, Loriant N, Ham DA, Bertolli C, Kelly PHJ et al., 2012, PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes, High Performance Computing, Networking, Storage and Analysis, SC Companion, Publisher: IEEE Computer Society, Pages: 1116-1123

Emerging many-core platforms are very difficult to program in a performance portable manner whilst achieving high efficiency on a diverse range of architectures. We present work in progress on PyOP2, a high-level embedded domain-specific language for mesh-based simulation codes that executes numerical kernels in parallel over unstructured meshes. Just-in-time kernel compilation and parallel scheduling are delayed until runtime, when problem-specific parameters are available. Using generative metaprogramming, performance portability is achieved, while details of the parallel implementation are abstracted from the programmer. PyOP2 kernels for finite element computations can be generated automatically from equations given in the domain-specific Unified Form Language. Interfacing to the multi-phase CFD code Fluidity through a very thin layer on top of PyOP2 yields a general purpose finite element solver with an input notation very close to mathematical formulae. Preliminary performance figures show speedups of up to 3.4x compared to Fluidity's built-in solvers when running in parallel.

Conference paper

Markall GR, Slemmer A, Ham DA, Kelly PHJ, Cantwell CD, Sherwin SJ et al., 2012, Finite element assembly strategies on multi- and many-core architectures, International Journal for Numerical Methods in Fluids

Journal article

Gorman GJ, Southern J, Farrell PE, Piggott MD, Rokos G, Kelly PHJ et al., 2012, Hybrid OpenMP/MPI anisotropic mesh smoothing, International Conference on Computational Science (ICCS), Publisher: Elsevier, Pages: 1513-1522, ISSN: 1877-0509

Conference paper

Mudalige GR, Giles MB, Bertolli C, Kelly PHJ et al., 2011, Predictive modeling and analysis of OP2 on distributed memory GPU clusters, PMBS'11 - Proceedings of the 2nd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, Co-located with SC'11, Pages: 3-4

OP2 is an "active" library framework for the development and solution of unstructured mesh based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents a summary of a predictive performance analysis and benchmarking study of OP2 on heterogeneous cluster systems. In this work, an industrial representative CFD application written using the OP2 framework is benchmarked during the solution of unstructured meshes of 1.5M and 26M edges. Benchmark systems include a large-scale Cray XE6 system and an Intel Westmere/InfiniBand cluster. Performance modeling is then used to predict the application's performance on an NVIDIA Tesla C2070 based GPU cluster, enabling the comparison of OP2's performance capabilities on emerging distributed memory heterogeneous systems. Results illustrate the performance benefits that can be gained through many-core solutions both on single-node and heterogeneous configurations in comparison to traditional homogeneous cluster systems for this class of applications.

Journal article

Court C, Kelly PHJ, 2011, Loop-Directed Mothballing: Power Gating Execution Units Using Runtime Loop Analysis, IEEE Micro, Vol: 31, Pages: 29-38

Journal article

Court CA, Kelly PHJ, 2011, Loop-Directed Mothballing: Power-gating execution units using fast analysis of inner loops, IEEE Symposium on Low-Power and High-Speed Chips - 2011 IEEE COOL Chips XIV, Proceedings

Static power dissipation has been identified as a limiting factor in future microprocessor technologies. This paper presents Loop-Directed Mothballing (LDM) to reduce static power by power-gating execution units. The method accurately predicts the resource requirements and limits performance degradation by focussing on inner loops. In simulation, the energy-delay product (EDP) of the processor is reduced by 10.3%. Two prior methods show worse EDP despite having greater power savings. © 2011 IEEE.

Journal article

Giles MB, Mudalige GR, Sharif Z, Markall GR, Kelly PHJ et al., 2011, Performance Analysis and Optimisation of the OP2 Framework on Many-core Architectures, The Computer Journal, Vol: 55

Journal article

Russell FP, Mellor MR, Kelly PHJ, Beckmann O et al., 2011, DESOLA: An active linear algebra library using delayed evaluation and runtime code generation, Science of Computer Programming, Vol: 76, Pages: 227-242, ISSN: 0167-6423

Journal article

Giles MB, Mudalige GR, Sharif Z, Markall G, Kelly PHJ et al., 2011, Performance analysis of the OP2 framework on many-core architectures, ACM SIGMETRICS Performance Evaluation Review, Vol: 38, Pages: 9-15, ISSN: 0163-5999

We present a performance analysis and benchmarking study of the OP2 "active" library, which provides an abstraction framework for the solution of parallel unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, achieving code longevity and near-optimal performance through re-targeting the back-end to different hardware.

Runtime performance results are presented for a representative unstructured mesh application written using OP2 on a variety of many-core processor systems, including the traditional X86 architectures from Intel (Xeon based on the older Penryn and current Nehalem micro-architectures) and GPU offerings from NVIDIA (GTX260, Tesla C2050). Our analysis demonstrates the contrasting performance between the use of CPU (OpenMP) and GPU (CUDA) parallel implementations for the solution on an industrial sized unstructured mesh consisting of about 1.5 million edges.

Results show the significance of choosing the correct partition and thread-block configuration, the factors limiting the GPU performance and insights into optimizations for improved performance.

Journal article

Cantwell CD, Sherwin SJ, Kirby RM, Kelly PHJ et al., 2011, From h to p Efficiently: Selecting the Optimal Spectral/hp Discretisation in Three Dimensions, Mathematical Modelling of Natural Phenomena, Vol: 6, Pages: 84-96, ISSN: 0973-5348

Journal article

Rokos G, Gorman G, Kelly PHJ, 2011, Accelerating Anisotropic Mesh Adaptivity on nVIDIA's CUDA Using Texture Interpolation, 17th International Euro-Par Conference on Parallel Processing, Publisher: Springer-Verlag Berlin, Pages: 387-398, ISSN: 0302-9743

Conference paper

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
