Imperial College London

ProfessorPaulKelly

Faculty of EngineeringDepartment of Computing

Professor of Software Technology
 
 
 
//

Contact

 

+44 (0)20 7594 8332p.kelly Website

 
 
//

Location

 

Level 3 (upstairs), William Penney Building, room 304Huxley BuildingSouth Kensington Campus

//

Summary

 

Publications

Publication Type
Year
to

149 results found

Bodin B, Nardi L, Zia MZ, Wagstaff H, Shenoy GS, Emani M, Mawer J, Kotselidis C, Nisbet A, Lujan M, Franke B, Kelly PHJ, O’Boyle Met al., Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding, International conference on Parallel Architectures and Compilation Techniques

System designers typically use well-studied benchmarks toevaluate and improve new architectures and compilers. Wedesign tomorrow's systems based on yesterday's applications.In this paper we investigate an emerging application,3D scene understanding, likely to be signi cant in the mobilespace in the near future. Until now, this application couldonly run in real-time on desktop GPUs. In this work, weexamine how it can be mapped to power constrained embeddedsystems. Key to our approach is the idea of incrementalco-design exploration, where optimization choices that concernthe domain layer are incrementally explored togetherwith low-level compiler and architecture choices. The goalof this exploration is to reduce execution time while minimizingpower and meeting our quality of result objective.As the design space is too large to exhaustively evaluate,we use active learning based on a random forest predictorto nd good designs. We show that our approach can, forthe rst time, achieve dense 3D mapping and tracking in thereal-time range within a 1W power budget on a popular embeddeddevice. This is a 4.8x execution time improvementand a 2.8x power reduction compared to the state-of-the-art.

CONFERENCE PAPER

Bolten M, Franchetti F, Kelly PHJ, Lengauer C, Mohr Met al., Algebraic description and automatic generation of multigrid methods in SPIRAL, Concurrency and Computation: Practice and Experience, ISSN: 1532-0634

SPIRAL is an autotuning, program generation and code synthesis system that offers a fully automaticgeneration of highly optimized target codes, customized for the specific execution platform at hand. Initially,SPIRAL was targeted at problem domains in digital signal processing, later also at basic linear algebra.We open SPIRAL up to a new, practically relevant and challenging domain: multigrid solvers. SPIRAL isdriven by algebraic transformation rules. We specify a set of such rules for a simple multigrid solver with aRichardson smoother for a discretized square 2D Poisson equation with Dirichlet boundary conditions. Wepresent the target code that SPIRAL generates in static single-assignment form and discuss its performance.While this example required no changes of or extensions to the SPIRAL system, more complex multigridsolvers may require small adaptations.

JOURNAL ARTICLE

Kelly PHJ, Cadar, collingbourne, Symbolic Crosschecking of Data-Parallel Floating-Point Code, IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, ISSN: 0098-5589

JOURNAL ARTICLE

Luporini F, Ham DA, Kelly PHJ, An algorithm for the optimization of finite element integration loops, ACM Transactions on Mathematical Software, ISSN: 0098-3500

We present an algorithm for the optimization of a class of finite element integration loop nests. This algo-rithm, which exploits fundamental mathematical properties of finite element operators, is proven to achievea locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive nu-merical experiments demonstrate significant performance improvements over the state of the art in finiteelement code generation in almost all cases. This validates the effectiveness of the algorithm presented here,and illustrates its limitations.

JOURNAL ARTICLE

Nardi L, Bodin B, Saeedi S, Vespa E, Davison AJ, Kelly Pet al., International Workshop on Automatic Performance Tuning (iWAPT), IPDPS

CONFERENCE PAPER

Saeedi Gharahbolagh S, Nardi L, Johns E, Bodin B, Kelly PHJ, Davison AJet al., Application-oriented Design Space Exploration for SLAM Algorithms, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE

In visual SLAM, there are many software and hardware parameters, such as algorithmic thresholds and GPU frequency, that need to be tuned; however, this tuning should also take into account the structure and motion of the camera. In this paper, we determine the complexity of the structure and motion with a few parameters calculated using information theory. Depending on this complexity and the desired performance metrics, suitable parameters are explored and determined. Additionally, based on the proposed structure and motion parameters, several applications are presented, including a novel active SLAM approach which guides the camera in such a way that the SLAM algorithm achieves the desired performance metrics. Real-world and simulated experimental results demonstrate the effectiveness of the proposed design space and its applications.

CONFERENCE PAPER

Rathgeber F, Ham DA, Mitchell L, Lange M, Luporini F, Mcrae ATT, Bercea G-T, Markall GR, Kelly PHJet al., 2017, Firedrake: Automating the Finite Element Method by Composing Abstractions, ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, Vol: 43, ISSN: 0098-3500

JOURNAL ARTICLE

Bercea G-T, McRae ATT, Ham DA, Mitchell L, Rathgeber F, Nardi L, Luporini F, Kelly PHJet al., 2016, A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake, GEOSCIENTIFIC MODEL DEVELOPMENT, Vol: 9, Pages: 3803-3815, ISSN: 1991-959X

JOURNAL ARTICLE

Bodin B, Nardi L, Kelly PHJ, O'Boyle MFPet al., 2016, Diplomat: Mapping of Multi-kernel Applications Using a Static Dataflow Abstraction, 24th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Publisher: IEEE, Pages: 241-250, ISSN: 1526-7539

CONFERENCE PAPER

Reguly IZ, Mudalige GR, Bertolli C, Giles MB, Betts A, Kelly PHJ, Radford Det al., 2016, Acceleration of a Full-Scale Industrial CFD Application with OP2, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, Vol: 27, Pages: 1265-1278, ISSN: 1045-9219

JOURNAL ARTICLE

Wozniak BD, Witherden FD, Russell FP, Vincent PE, Kelly PHJet al., 2016, GiMMiK-Generating bespoke matrix multiplication kernels for accelerators: Application to high-order Computational Fluid Dynamics, COMPUTER PHYSICS COMMUNICATIONS, Vol: 202, Pages: 12-22, ISSN: 0010-4655

JOURNAL ARTICLE

Zia MZ, Nardi L, Jack A, Vespa E, Bodin B, Kelly PHJ, Davison AJet al., 2016, Comparative Design Space Exploration of Dense and Semi-Dense SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 1292-1299, ISSN: 1050-4729

CONFERENCE PAPER

Nardi L, Bodin B, Zia MZ, Mawer J, Nisbet A, Kelly PHJ, Davison AJ, Lujan M, O'Boyle MEP, Riley G, Topham N, Furber Set al., 2015, Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE COMPUTER SOC, Pages: 5783-5790, ISSN: 1050-4729

CONFERENCE PAPER

Popovici DT, Russell FP, Wilkinson K, Skylaris C-K, Kelly PHJ, Franchetti Fet al., 2015, Generating Optimized Fourier Interpolation Routines for Density Functional Theory using SPIRAL, 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Publisher: IEEE, Pages: 743-752, ISSN: 1530-2075

CONFERENCE PAPER

Rokos G, Gorman G, Kelly PHJ, 2015, A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures, 21st International Conference on Parallel and Distributed Computing (Euro-Par), Publisher: SPRINGER-VERLAG BERLIN, Pages: 414-425, ISSN: 0302-9743

CONFERENCE PAPER

Russell FP, Wilkinson KA, Kelly PHJ, Skylaris C-Ket al., 2015, Optimised three-dimensional Fourier interpolation: An analysis of techniques and application to a linear-scaling density functional theory code, COMPUTER PHYSICS COMMUNICATIONS, Vol: 187, Pages: 8-19, ISSN: 0010-4655

JOURNAL ARTICLE

Collingbourne P, Cadar C, Kelly PHJ, 2014, Symbolic Crosschecking of Data-Parallel Floating-Point Code, IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, Vol: 40, Pages: 710-737, ISSN: 0098-5589

JOURNAL ARTICLE

Konstantinidis A, Kelly PHJ, Ramanujam J, Sadayappan Pet al., 2014, Parametric GPU Code Generation for Affine Loop Programs, 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC), Publisher: SPRINGER-VERLAG BERLIN, Pages: 136-151, ISSN: 0302-9743

CONFERENCE PAPER

Luporini F, Varbanescu AL, Rathgeber F, Bercea G-T, Ramanujam J, Ham DA, Kelly PHJet al., 2014, Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly, ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, Vol: 11, ISSN: 1544-3566

JOURNAL ARTICLE

Salas-Moreno RF, Glocker B, Kelly PHJ, Davison AJet al., 2014, Dense Planar SLAM, IEEE International Symposium on Mixed and Augmented Reality (ISMAR) - Science and Technology, Publisher: IEEE, Pages: 157-164, ISSN: 1554-7868

CONFERENCE PAPER

Salas-Moreno RF, Glocker B, Kelly PHJ, Davison AJet al., 2014, [DEMO] Dense Planar SLAM, IEEE International Symposium on Mixed and Augmented Reality (ISMAR) - Science and Technology, Publisher: IEEE, Pages: 367-368, ISSN: 1554-7868

CONFERENCE PAPER

Strout MM, Luporini F, Krieger CD, Bertolli C, Bercea G-T, Olschanowsky C, Ramanujam J, Kelly PHJet al., 2014, Generalizing Run-time Tiling with the Loop Chain Abstraction, IEEE 28th International Parallel & Distributed Processing Symposium (IPDPS), Publisher: IEEE, ISSN: 1530-2075

CONFERENCE PAPER

Bertolli C, Betts A, Loriant N, Mudalige GR, Radford D, Ham DA, Giles MB, Kelly PHJet al., 2013, Compiler optimizations for industrial unstructured mesh CFD applications on GPUs, Pages: 112-126, ISSN: 0302-9743

Graphical Processing Units (GPUs) have shown acceleration factors over multicores for structured mesh-based Computational Fluid Dynamics (CFD). However, the value remains unclear for dynamic and irregular applications. Our motivating example is HYDRA, an unstructured mesh application used in production at Rolls-Royce for the simulation of turbomachinery components of jet engines. We describe three techniques for GPU optimization of unstructured mesh applications: a technique able to split a highly complex loop into simpler loops, a kernel specific alternative code synthesis, and configuration parameter tuning. Using these optimizations systematically on HYDRA improves the GPU performance relative to the multicore CPU. We show how these optimizations can be automated in a compiler, through user annotations. Performance analysis of a large number of complex loops enables us to study the relationship between optimizations and resource requirements of loops, in terms of registers and shared memory, which directly affect the loop performance. © Springer-Verlag Berlin Heidelberg 2013.

CONFERENCE PAPER

Birch D, Kelly PHJ, Field AJ, Simondetti Aet al., 2013, Computationally unifying urban masterplanning, Proceedings of the ACM International Conference on Computing Frontiers, CF 2013

Architectural design, particularly in large scale master planning projects, has yet to fully undergo the computational revolution experienced by other design-led industries such as automotive and aerospace. These industries use computational frameworks to undertake automated design analysis and design space exploration. However, within the Architectural, Engineering and Construction (AEC) industries wend no such computational platforms. This precludes the rapid analysis needed for quantitative design iteration which is required for sustainable design. This is a current computing frontier. This paper considers the computational solutions to the challenges preventing such advances to improve architectural design performance for a more sustainable future. We present a practical discussion of the computational challenges and opportunities in this industry and present a computational framework "HierSynth" with a data model designed to the needs of this industry. We report the results and lessons learned from applying this framework to a major commercial urban master planning project. This framework was used to automate and augment existing practice and was used to undertake previously infeasible, designer lead, design space exploration. During the casestudy an order of magnitude more analysis cycles were undertaken than literature suggests is normal; each occurring in hours not days.

JOURNAL ARTICLE

Birch D, Liang H, Ko J, Kelly P, Field A, Mullineux G, Simondetti Aet al., 2013, Multidisciplinary Engineering Models: Methodology and Case Study in Spreadsheet Analytics, European Spreadsheet Risks Interest Group 14th Annual Conference (EuSpRIG 2013), Publisher: EuSpRIG, Pages: 1-12

CONFERENCE PAPER

Chong N, Donaldson AF, Kelly PHJ, Ketema J, Qadeer Set al., 2013, Barrier Invariants: A Shared State Abstraction for the Analysis of Data-Dependent GPU Kernels, 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications (OOPSLA'13), Publisher: ASSOC COMPUTING MACHINERY, Pages: 605-621, ISSN: 0362-1340

CONFERENCE PAPER

Grosser T, Cohen A, Kelly PHJ, Ramanujam J, Sadayappan P, Verdoolaege Set al., 2013, Split tiling for GPUs: Automatic parallelization using trapezoidal tiles, Pages: 24-31

Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism. One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs. Copyright 2013 ACM.

CONFERENCE PAPER

Kelly PH, 2013, Split Tiling for GPUs: Automatic Parallelization Using Trapezoidal Tiles, 6th Workshop on General Purpose Processor Using Graphics Processing Units, Publisher: ACM Press

Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism.One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.

CONFERENCE PAPER

Krieger CD, Strout MM, Olschanowsky C, Stone A, Guzik S, Gao X, Bertolli C, Kelly PHJ, Mudalige G, Van Straalen B, Williams Set al., 2013, Loop chaining: A programming abstraction for balancing locality and parallelism, Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013, Pages: 375-384

There is a significant, established code base in the scientific computing community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependencies that, as part of the abstraction, are exposed, thereby avoiding inter-procedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or run-time system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark, a domain-specific library, OP2, used in full applications with unstructured grids, and a domain-specific library, Chombo, used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop chain enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores. © 2013 IEEE.

JOURNAL ARTICLE

Markall GR, Rathgeber F, Mitchell L, Loriant N, Bertolli C, Ham DA, Kelly PHJet al., 2013, Performance-portable finite element assembly using PyOP2 and FEniCS, Pages: 279-289, ISSN: 0302-9743

We describe a toolchain that provides a fully automated compilation pathway from a finite element domain-specific language to low-level code for multicore and GPGPU platforms. We demonstrate that the generated code exceeds the performance of the best available alternatives, without requiring manual tuning or modification of the generated code. The toolchain can easily be integrated with existing finite element solvers, providing a means to add performance portable methods without having to rebuild an entire complex implementation from scratch. © 2013 Springer-Verlag.

CONFERENCE PAPER

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.

Request URL: http://wlsprd.imperial.ac.uk:80/respub/WEB-INF/jsp/search-html.jsp Request URI: /respub/WEB-INF/jsp/search-html.jsp Query String: respub-action=search.html&id=00003206&limit=30&person=true