Imperial College London

ProfessorPaulKelly

Faculty of EngineeringDepartment of Computing

Professor of Software Technology
 
 
 
//

Contact

 

+44 (0)20 7594 8332p.kelly Website

 
 
//

Location

 

Level 3 (upstairs), William Penney Building, room 304William Penney LaboratorySouth Kensington Campus

//

Summary

 

Publications

Publication Type
Year
to

163 results found

Luporini F, Lange M, Jacobs CT, Gorman GJ, Ramanujam J, Kelly PHJet al., 2019, Automated tiling of unstructured mesh computations with application to seismological modeling, ACM Transactions on Mathematical Software, Vol: 45, ISSN: 0098-3500

© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]-hence the name “sparse.” One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation-not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.

JOURNAL ARTICLE

Debrunner T, Saeedi Gharahbolagh S, Kelly P, 2019, AUKE: Automatic Kernel Code Generation for an analogue SIMD Focal-plane Sensor-Processor Array, ACM Transactions on Architecture and Code Optimization, Vol: 15, ISSN: 1544-3973

Focal-plane Sensor-Processor Arrays (FPSPs) are new imaging devices with parallel Single Instruction Multiple Data (SIMD) computational capabilities built into every pixel. Compared to traditional imaging devices, FPSPs allow for massive pixel-parallel execution of image processing algorithms. This enables the application of certain algorithms at extreme frame rates (>10,000 frames per second). By performing some early-stage processing in-situ, systems incorporating FPSPs can consume less power compared to conventional approaches using standard digital cameras. In this article, we explore code generation for an FPSP whose 256 × 256 processors operate on analogue signal data, leading to further opportunities for power reduction—and additional code synthesis challenges.While rudimentary image processing algorithms have been demonstrated on FPSPs before, progress with higher-level computer vision algorithms has been sparse due to the unique architecture and limits of the devices. This article presents a code generator for convolution filters for the SCAMP-5 FPSP, with applications in many high-level tasks such as convolutional neural networks, pose estimation, and so on. The SCAMP-5 FPSP has no effective multiply operator. Convolutions have to be implemented through sequences of more primitive operations such as additions, subtractions, and multiplications/divisions by two. We present a code generation algorithm to optimise convolutions by identifying common factors in the different weights and by determining an optimised pattern of pixel-to-pixel data movements to exploit them. We present evaluation in terms of both speed and energy consumption for a suite of well-known convolution filters. Furthermore, an application of the method is shown by the implementation of a Viola-Jones face detection algorithm.

JOURNAL ARTICLE

Kelly PHJ, 2018, IEEE Cluster 2018 Message from the Program Chair, ISSN: 1552-5244

CONFERENCE PAPER

Saeedi Gharahbolagh S, Bodin B, Wagstaff H, Nisbet A, Nardi L, Mawer J, Melot N, Palomar O, Vespa E, Gorgovan C, Webb A, Clarkson J, Tomusk E, Debrunner T, Kaszyk K, Gonzalez P, Rodchenko A, Riley G, Kotselidis C, Franke B, OBoyle M, Davison A, Kelly P, Lujan M, Furber Set al., 2018, Navigating the landscape for real-time localisation and mapping for robotics, virtual and augmented reality, Proceedings of the IEEE, ISSN: 0018-9219

Visual understanding of 3-D environments in real time, at low power, is a huge computational challenge. Often referred to as simultaneous localization and mapping (SLAM), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are: 1) tools and methodology for systematic quantitative evaluation of SLAM algorithms; 2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives; 3) end-to-end simulation tools to enable optimization of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches; and 4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.

JOURNAL ARTICLE

Bodin B, Nardi L, Wagstaff H, Kelly PHJ, O'Boyle Met al., 2018, Algorithmic Performance-Accuracy Trade-off in 3D Vision Applications, Pages: 123-124

© 2018 IEEE. Simultaneous Localisation And Mapping (SLAM) is a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is particularly true when it comes to evaluate the potential trade-offs between computation speed, accuracy, and power consumption. SLAMBench is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. SLAMBench is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption across SLAM systems. In this poster we give an overview of SLAMBench and in particular we show how this framework can be used within Design Space Exploration and large-scale performance evaluation on mobile phones.

CONFERENCE PAPER

Bodin B, Wagstaff H, Saeedi S, Nardi L, Vespa E, Mawer J, Nisbet A, Lujan M, Furber S, Davison AJ, Kelly PHJ, O'Boyle MFPet al., 2018, SLAMBench2: multi-objective head-to-head benchmarking for visual SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 3637-3644, ISSN: 1050-4729

SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example, a mobile phone-based AR application has a tight energy budget, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework to evaluate existing and future SLAM systems, both open and close source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2, OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly-available software framework which represents a starting point for quantitative, comparable and val-idatable experimental research to investigate trade-offs across SLAM systems.

CONFERENCE PAPER

Vespa E, Nikolov N, Grimm M, Nardi L, Kelly PH, Leutenegger Set al., 2018, Efficient octree-based volumetric SLAM supporting signed-distance and occupancy mapping, IEEE Robotics and Automation Letters, Vol: 3, Pages: 1144-1151, ISSN: 2377-3766

We present a dense volumetric simultaneous localisation and mapping (SLAM) framework that uses an octree representation for efficient fusion and rendering of either a truncated signed distance field (TSDF) or an occupancy map. The primary aim of this letter is to use one single representation of the environment that can be used not only for robot pose tracking and high-resolution mapping, but seamlessly for planning. We show that our highly efficient octree representation of space fits SLAM and planning purposes in a real-time control loop. In a comprehensive evaluation, we demonstrate dense SLAM accuracy and runtime performance on-par with flat hashing approaches when using TSDF-based maps, and considerable speed-ups when using occupancy mapping compared to standard occupancy maps frameworks. Our SLAM system can run at 10-40 Hz on a modern quadcore CPU, without the need for massive parallelization on a GPU. We, furthermore, demonstrate a probabilistic occupancy mapping as an alternative to TSDF mapping in dense SLAM and show its direct applicability to online motion planning, using the example of informed rapidly-exploring random trees (RRT*).

JOURNAL ARTICLE

Nica A, Vespa E, González de Aledo P, Kelly PHJet al., 2018, Investigating automatic vectorization for real-time 3D scene understanding

© 2018 Association for Computing Machinery. Simultaneous Localization And Mapping (SLAM) is the problem of building a representation of a geometric space while simultaneously estimating the observer’s location within the space. While this seems to be a chicken-and-egg problem, several algorithms have appeared in the last decades that approximately and iteratively solve this problem. SLAM algorithms are tailored to the available resources, hence aimed at balancing the precision of the map with the constraints that the computational platform imposes and the desire to obtain real-time results. Working with KinectFusion, an established SLAM implementation, we explore in this work the vectorization opportunities present in this scenario, with the goal of using the CPU to its full potential. Using ISPC, an automatic vectorization tool, we produce a partially vectorized version of KinectFusion. Along the way we explore a number of optimization strategies, among which tiling to exploit ray-coherence and outer loop vectorization, obtaining up to 4x speed-up over the baseline on an 8-wide vector machine.

CONFERENCE PAPER

Luporini F, Ham DA, Kelly PHJ, 2017, An algorithm for the optimization of finite element integration loops, ACM Transactions on Mathematical Software, Vol: 44, ISSN: 0098-3500

We present an algorithm for the optimization of a class of finite element integration loop nests. This algo-rithm, which exploits fundamental mathematical properties of finite element operators, is proven to achievea locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive nu-merical experiments demonstrate significant performance improvements over the state of the art in finiteelement code generation in almost all cases. This validates the effectiveness of the algorithm presented here,and illustrates its limitations.

JOURNAL ARTICLE

Unat D, Dubey A, Hoefler T, Shalf J, Abraham M, Bianco M, Chamberlain BL, Cledat R, Edwards HC, Finkel H, Fuerlinger K, Hannig F, Jeannot E, Kamil A, Keasler J, Kelly PHJ, Leung V, Ltaief H, Maruyama N, Newburn CJ, Pericas Met al., 2017, Trends in Data Locality Abstractions for HPC Systems, IEEE Transactions on Parallel and Distributed Systems, Vol: 28, Pages: 3007-3020, ISSN: 1045-9219

The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. them However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.

JOURNAL ARTICLE

Bolten M, Franchetti F, Kelly PHJ, Lengauer C, Mohr Met al., 2017, Algebraic description and automatic generation of multigrid methods in SPIRAL, Concurrency and Computation: Practice and Experience, Vol: 29, ISSN: 1532-0634

SPIRAL is an autotuning, program generation and code synthesis system that offers a fully automaticgeneration of highly optimized target codes, customized for the specific execution platform at hand. Initially,SPIRAL was targeted at problem domains in digital signal processing, later also at basic linear algebra.We open SPIRAL up to a new, practically relevant and challenging domain: multigrid solvers. SPIRAL isdriven by algebraic transformation rules. We specify a set of such rules for a simple multigrid solver with aRichardson smoother for a discretized square 2D Poisson equation with Dirichlet boundary conditions. Wepresent the target code that SPIRAL generates in static single-assignment form and discuss its performance.While this example required no changes of or extensions to the SPIRAL system, more complex multigridsolvers may require small adaptations.

JOURNAL ARTICLE

Nardi L, Bodin B, Saeedi S, Vespa E, Davison AJ, Kelly Pet al., Algorithmic performance-accuracy trade-off in 3D vision applications using hypermapper, IPDPS, Publisher: IEEE

In this paper we investigate an emerging appli-cation, 3D scene understanding, likely to be significant in themobile space in the near future. The goal of this explorationis to reduce execution time while meeting our quality of resultobjectives. In previous work, we showed for the first time thatit is possible to map this application to power constrainedembedded systems, highlighting that decision choices made atthe algorithmic design-level have the most significant impact.As the algorithmic design space is too large to be exhaus-tively evaluated, we use a previously introduced multi-objectiverandom forest active learning prediction framework dubbedHyperMapper, to find good algorithmic designs. We showthat HyperMapper generalizes on a recent cutting edge 3Dscene understanding algorithm and on a modern GPU-basedcomputer architecture. HyperMapper is able to beat an experthuman hand-tuning the algorithmic parameters of the classof computer vision applications taken under consideration inthis paper automatically. In addition, we use crowd-sourcingusing a 3D scene understanding Android app to show that thePareto front obtained on an embedded system can be used toaccelerate the same application on all the 83 smart-phones andtablets with speedups ranging from 2x to over 12x.

CONFERENCE PAPER

Mitchell L, Ham DA, McRae ATT, Rathgeber F, Lange M, Luporini F, Kelly PHJ, Bercea G-T, Markall Get al., 2017, Firedrake: automating the finite element method by composing abstractions, ACM Transactions on Mathematical Software, Vol: 43, ISSN: 1557-7295

Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrakeadopts the domain-specific language for the finite element method of the FEniCS project, but with a purePython runtime-only implementation centred on the composition of several existing and new abstractions forparticular aspects of scientific computing. The result is a more complete separation of concerns which easesthe incorporation of separate contributions from computer scientists, numerical analysts and applicationspecialists. These contributions may add functionality, or improve performance.Firedrake benefits from automatically applying new optimisations. This includes factorising mixed functionspaces, transforming and vectorising inner loops, and intrinsically supporting block matrix operations.Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users toimplement common operations that fall outside pure variational formulations, such as flux-limiters.

JOURNAL ARTICLE

Saeedi Gharahbolagh S, Nardi L, Johns E, Bodin B, Kelly PHJ, Davison AJet al., Application-oriented Design Space Exploration for SLAM Algorithms, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE

In visual SLAM, there are many software and hardware parameters, such as algorithmic thresholds and GPU frequency, that need to be tuned; however, this tuning should also take into account the structure and motion of the camera. In this paper, we determine the complexity of the structure and motion with a few parameters calculated using information theory. Depending on this complexity and the desired performance metrics, suitable parameters are explored and determined. Additionally, based on the proposed structure and motion parameters, several applications are presented, including a novel active SLAM approach which guides the camera in such a way that the SLAM algorithm achieves the desired performance metrics. Real-world and simulated experimental results demonstrate the effectiveness of the proposed design space and its applications.

CONFERENCE PAPER

Bodin B, Nardi L, Kelly PHJ, O’Boyle MFPet al., 2016, Diplomat: Mapping of Multi-kernel Applications Using a Static Dataflow Abstraction, Modelling, Analysis and Simulation of Computer and Telecommunication Systems, Publisher: IEEE, ISSN: 2375-0227

In this paper we propose a novel approach toheterogeneous embedded systems programmability using a taskgraphbased framework called Diplomat. Diplomat is a taskgraphframework that exploits the potential of static dataflowmodeling and analysis to deliver performance estimation andCPU/GPU mapping. An application has to be specified once, andthen the framework can automatically propose good mappings.We evaluate Diplomat with a computer vision application on twoembedded platforms. Using the Diplomat generation we observeda 16% performance improvement on average and up to a 30%improvement over the best existing hand-coded implementation.

CONFERENCE PAPER

Bodin B, Nardi L, Zia MZ, Wagstaff H, Shenoy GS, Emani M, Mawer J, Kotselidis C, Nisbet A, Lujan M, Franke B, Kelly PHJ, O’Boyle Met al., 2016, Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding, International conference on Parallel Architectures and Compilation Techniques, Publisher: IEEE

System designers typically use well-studied benchmarks toevaluate and improve new architectures and compilers. Wedesign tomorrow's systems based on yesterday's applications.In this paper we investigate an emerging application,3D scene understanding, likely to be signi cant in the mobilespace in the near future. Until now, this application couldonly run in real-time on desktop GPUs. In this work, weexamine how it can be mapped to power constrained embeddedsystems. Key to our approach is the idea of incrementalco-design exploration, where optimization choices that concernthe domain layer are incrementally explored togetherwith low-level compiler and architecture choices. The goalof this exploration is to reduce execution time while minimizingpower and meeting our quality of result objective.As the design space is too large to exhaustively evaluate,we use active learning based on a random forest predictorto nd good designs. We show that our approach can, forthe rst time, achieve dense 3D mapping and tracking in thereal-time range within a 1W power budget on a popular embeddeddevice. This is a 4.8x execution time improvementand a 2.8x power reduction compared to the state-of-the-art.

CONFERENCE PAPER

Bercea G, McRae ATT, Ham DA, Mitchell L, Rathgeber F, Nardi L, Luporini F, Kelly PHJet al., 2016, A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake, Geoscientific Model Development, Vol: 9, Pages: 3803-3815, ISSN: 1991-9603

We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of 3D high aspect ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10-20 layers as long as the underlying mesh is well-ordered. We characterise the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretisations. On meshes with realistic numbers of layers the performance achieved is between 70% and 90% of a theoretical hardware-specific limit.

JOURNAL ARTICLE

Zia MZ, Nardi L, Jack A, Vespa E, Bodin B, Kelly PHJ, Davison AJet al., 2016, Comparative design space exploration of dense and semi-dense SLAM, 2016 IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 1292-1299, ISSN: 1050-4729

SLAM has matured significantly over the past few years, and is beginning to appear in serious commercial products. While new SLAM systems are being proposed at every conference, evaluation is often restricted to qualitative visualizations or accuracy estimation against a ground truth. This is due to the lack of benchmarking methodologies which can holistically and quantitatively evaluate these systems. Further investigation at the level of individual kernels and parameter spaces of SLAM pipelines is non-existent, which is absolutely essential for systems research and integration. We extend the recently introduced SLAMBench framework to allow comparing two state-of-the-art SLAM pipelines, namely KinectFusion and LSD-SLAM, along the metrics of accuracy, energy consumption, and processing frame rate on two different hardware platforms, namely a desktop and an embedded device. We also analyze the pipelines at the level of individual kernels and explore their algorithmic and hardware design spaces for the first time, yielding valuable insights.

CONFERENCE PAPER

Wozniak BD, Witherden FD, Russell FP, Vincent PE, Kelly PHJet al., 2016, GiMMiK - Generating Bespoke Matrix Multiplication Kernels for Accelerators: Application to High-Order Computational Fluid Dynamics, Computer Physics Communications, Vol: 202, Pages: 12-22, ISSN: 0010-4655

Matrix multiplication is a fundamental linear algebra routineubiquitous in all areas of science and engineering. Highly optimised BLAS libraries (cuBLAS and clBLAS on GPUs) are the most popular choices for an implementation of the General Matrix Multiply (GEMM) in software. In this paper we present GiMMiK - a generator of bespoke matrix multiplication kernels for the CUDA and OpenCL platforms. GiMMiK exploits a prior knowledge of the operator matrix to generate highly performant code. The performance of GiMMiK's kernels is particularly apparent in a block-bypanel type of matrix multiplication, where the block matrix is typically small (e.g. dimensions of 96 × 64). Such operations are characteristic to our motivating application in PyFR - an implementation of Flux Reconstruction schemes for high-order fluid flow simulations on mixed unstructured meshes. GiMMiK fully unrolls the matrix-vector product and embeds matrix entries directly in the code to benefit from the use of the constant cache and compiler optimisations. Further, it reduces the number of floating-point operations by removing multiplications by zeros. Together with the ability of our kernels to avoid the poorly optimised cleanup code, executed by library GEMM, we are able to outperform cuBLAS on two NVIDIA GPUs: GTX 780 Ti and Tesla K40c. We observe speedups of our kernels over cuBLAS GEMM of up to 9.98 and 63.30 times for a 294 × 1029 99% sparse PyFR matrix in double precision on the Tesla K40c and GTX 780 Ti correspondingly. In single precision, observed speedups reach 12.20 and 13.07 times for a 4 × 8 50% sparse PyFR matrix on the two aforementioned cards. Using GiMMiK as the matrix multiplication kernel provider allows us to achieve a speedup of up to 1.70 (2.19) for a simulation of an unsteady flow over a cylinder executed with PyFR in double (single) precision on the Tesla K40c. All results were generated with GiMMiK version 1.0.

JOURNAL ARTICLE

Rokos G, Gorman G, Kelly PHJ, 2015, A fast and scalable graph coloring algorithm for multi-core and many-core architectures, 21st International Conference on Parallel and Distributed Computing, Publisher: Springer, Pages: 414-425, ISSN: 0302-9743

Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Çatalyürek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically with less thread synchronization and reduced number of conflicts compared to Çatalyürek et al.’s algorithm.We show that the new technique scales better on multicore and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels.

CONFERENCE PAPER

Kelly PHJ, Reguly IZ, Mudalige GR, Bertolli C, Giles MB, Betts A, Radford Det al., 2015, Acceleration of a Full-scale Industrial CFD Application with OP2, IEEE Transactions on Parallel and Distributed Systems, Vol: 27, Pages: 1265-1278, ISSN: 1558-2183

Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls Royce plc., capable ofperforming complex simulations over highly detailed unstructured mesh geometries. Hydra presents major challenges in dataorganization and movement that need to be overcome for continued high performance on emerging platforms. We present research inachieving this goal through the OP2 domain-specific high-level framework, demonstrating the viability of such a high-level programmingapproach. OP2 targets the domain of unstructured mesh problems and enables execution on a range of back-end hardware platforms.We chart the conversion of Hydra to OP2, and map out the key difficulties encountered in the process. Specifically we show howdifferent parallel implementations can be achieved with an active library framework, even for a highly complicated industrial applicationand how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly,reducing developer effort and increasing code longevity. Performance results demonstrate that not only the same runtime performanceas that of the hand-tuned original code could be achieved, but it can be significantly improved on conventional processor systems, andmany-core systems. Our results provide evidence of how high-level frameworks such as OP2 enable portability across a wide range ofcontrasting platforms and their significant utility in achieving near-optimal performance without the intervention of the applicationprogrammer.

JOURNAL ARTICLE

Gorman GJ, Rokos G, Southern J, Kelly PHJet al., 2015, Thread-parallel anisotropic mesh adaptation, SEMA SIMAI Springer Series, Vol: 5, Pages: 113-137, ISSN: 2199-3041

© 2015, Springer International Publishing Switzerland. Mesh adaptation is a powerful way to minimise the computational cost of mesh based computation. It is particularly successful for multi-scale problems where the required mesh resolution can vary by orders of magnitude across the domain. The end result is local control over solution accuracy and reduced time to solution. In the case of large scale simulations, where the time to solution is unacceptable or the memory requirements exceeds available RAM, mesh based computation is typically parallelised using domain decomposition methods using the Message Passing Interface (MPI). This allows a simulation to run in parallel on a distributed memory computer. While this has been a high successful strategy up until now, the drive towards low power multi- and many-core architectures means that an even higher degree of parallelism is required and the memory hierarchy exploited to maximise memory bandwidth. For this reason application codes are increasingly adopting a hybrid parallel approach whereby decomposition methods, implemented using the Message Passing Interface (MPI), are applied for inter-node parallelisation, while a threaded programming model is used for intra-node parallelisation. Mesh adaptivity has been successfully parallelised using MPI by a number of groups, and can be implemented efficiently with few modifications to the serial code. However, thread-level parallelism is significantly more challenging because each thread modifies the mesh data and therefore must be carefully marshalled to avoid data races while still ensuring enough parallelism is exposed to achieve good parallel efficiency. Here we describe a new thread-parallel algorithm for anisotropic mesh adaptation algorithms. For each mesh optimisation phase (refinement, coarsening, swapping and smoothing) we describe how independent sets of tasks are defined. We show how a deferred updates strategy can be used to update the mesh data structure

JOURNAL ARTICLE

Nardi L, Bodin B, Zia MZ, Mawer J, Nisbet A, Kelly PHJ, Davison AJ, Luján M, O'Boyle MFP, Riley G, Topham N, Furber Set al., 2015, Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM, 2015 IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 5783-5790, ISSN: 1050-4729

Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementation and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives.

CONFERENCE PAPER

, 2015,

CONFERENCE PAPER

Luporini F, Varbanescu AL, Rathgeber F, Bercea D, Rananujam J, Ham DA, Kelly PHJet al., 2015, Cross-loop optimization of arithmetic intensity for finite element local assembly, ACM Transactions on Architecture and Code Optimization, Vol: 11, ISSN: 1544-3973

We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Their performance optimization is indeed a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of successful transformations. Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Eventually, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvement is achieved. The generality of the approach and the applicability of the proposed code transformations to other domains is also discussed.

JOURNAL ARTICLE

Popovici T, Russell FP, Wilkinson KA, Skylaris CK, Kelly PHJ, Franchetti Fet al., Generating Optimized Fourier Interpolation Routines for Density Function Theory Using SPIRAL, IEEE International Parallel & Distributed Processing Symposium (IPDPS)

CONFERENCE PAPER

Kelly PHJ, Russell FP, Wilkinson KA, Skylaris CKet al., 2014, Optimised three-dimensional Fourier interpolation: An analysis of techniques and application to a linear-scaling density functional theory code, Computer Physics Communications, Pages: 8-19, ISSN: 1879-2944

The Fourier interpolation of 3D data-sets is a performance critical operation in many fields, includingcertain forms of image processing and density functional theory (DFT) quantum chemistry codes basedon plane wave basis sets, to which this paper is targeted. In this paper we describe three differentalgorithms for performing this operation built from standard discrete Fourier transform operations,and derive theoretical operation counts. The algorithms compared consist of the most straightforwardimplementation and two that exploit techniques such as phase-shifts and knowledge of zero paddingto reduce computational cost. Through a library implementation (tintl) we explore the performancecharacteristics of these algorithms and the performance impact of different implementation choices onactual hardware. We present comparisons within the linear-scaling DFT code ONETEP where we replacethe existing interpolation implementation with our library implementation configured to choose the mostefficient algorithm. Within the ONETEP Fourier interpolation stages, we demonstrate speed-ups of over1.55×.

JOURNAL ARTICLE

Salas-Moreno R, Glocker B, Kelly P, Davison Aet al., 2014, Dense Planar SLAM, ISMAR

CONFERENCE PAPER

Salas-Moreno RF, Glocken B, Kelly PHJ, Davison AJet al., 2014, Dense planar SLAM, 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Publisher: IEEE

CONFERENCE PAPER

Collingbourne P, Cadar C, Kelly PHJ, 2014, Symbolic Crosschecking of Data-Parallel Floating-Point Code, IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, Vol: 40, Pages: 710-737, ISSN: 0098-5589

JOURNAL ARTICLE

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.

Request URL: http://wlsprd.imperial.ac.uk:80/respub/WEB-INF/jsp/search-html.jsp Request URI: /respub/WEB-INF/jsp/search-html.jsp Query String: respub-action=search.html&id=00003206&limit=30&person=true