Publications

Bertolli C, Betts A, Mudalige GR, Giles MB, Kelly PHJet al., 2011, Design and Performance of the OP2 Library for Unstructured Mesh Applications., Publisher: Springer, Pages: 191-200

Conference paper

Collingbourne P, Cadar C, Kelly PHJ, 2011, Symbolic crosschecking of floating-point and SIMD code, New York, NY, USA, Publisher: ACM, Pages: 315-328

Conference paper

Pearce DJ, Kelly PHJ, 2010, A batch algorithm for maintaining a topological order, Conferences in Research and Practice in Information Technology Series, Vol: 102, Pages: 79-87, ISSN: 1445-1336

The dynamic topological order problem is that of efficiently updating a topological order after some edge(s) are inserted into a graph. Much prior work exists on the unit-change version of this problem, where the order is updated after every single insertion. No previous (non-trivial) algorithms are known for the batch version of the problem, where the order is updated after every batch of insertions. We present the first such algorithm. This requires O(min{κ · (v+e), ve}) time to process any sequence of κ insertion batches. This is achieved by only re-computing those region(s) of the order affected by the inserted edges. In many cases, our algorithm will only traverse small portions of the graph when processing a batch. We empirically evaluate our algorithm against previous algorithms for this problem, and find that it performs well when the batch size is sufficiently large. Copyright © 2010, Australian Computer Society, Inc.

Abstract
Cite
Citations: 2

Journal article

Cantwell CD, Sherwin SJ, Kirby RM, Kelly PHJet al., 2010, From h to p efficiently: Strategy selection for operator evaluation on hexahedral and tetrahedral elements, Computers and Fluids, Vol: 43, Pages: 23-28

Journal article

Franke H, Kelly PHJ, 2010, Message from the program co-chairs, CF 2010 - Proceedings of the 2010 Computing Frontiers Conference

Cite

Journal article

Collingbourne P, Kelly PHJ, 2010, Inference of Session Types From Control Flow, Electronic Notes in Theoretical Computer Science, Vol: 238, Pages: 15-40, ISSN: 1571-0661

This is a study of a technique for deriving the session type of a program written in a statically typed imperative language from its control flow. We impose on our unlabelled session type syntax a well-formedness constraint based upon normalisation and explore the effects thereof. We present our inference algorithm declaratively and in a form suitable for implementation, and illustrate it with examples. We then present an implementation of the algorithm using a program analysis and transformation toolkit. © 2010.

Abstract
Cite
Citations: 2

Journal article

Howes L, Lokhmotov A, Donaldson AF, Kelly PHJet al., 2010, Towards Metaprogramming for Parallel Systems on a Chip, 15th International Euro-Par Conference on Parallel Computing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 36-+, ISSN: 0302-9743

Author Web Link
Cite
Citations: 2

Conference paper

Markall GR, Ham DA, Kelly PHJ, 2010, Generating Optimised Finite Element Solvers for GPU Architectures, International Conference on Numerical Analysis and Applied Mathematics, Publisher: AMER INST PHYSICS, Pages: 787-790, ISSN: 0094-243X

We show that optimal implementations of a finite element solver written for a Graphics Processing Unit and a multicore CPU require the use of different algorithms and data formats. This motivates the use of code generation in order to produce efficient, maintainable implementations of the finite element method for GPU architectures.

Abstract
Cite

Conference paper

Markall GR, Ham DA, Kelly PHJ, 2010, Towards generating optimised finite element solvers for GPUs from high-level specifications, Pages: 1809 - 1817-1809 - 1817, ISSN: 1877-0509

Conference paper

Markall GR, Ham DA, Kelly PHJ, 2010, Generating Optimised Finite Element Solvers for GPU Architectures, International Conference on Numerical Analysis and Applied Mathematics, Publisher: AMER INST PHYSICS, Pages: 787-+, ISSN: 0094-243X

Author Web Link
Cite
Citations: 2

Conference paper

Cornwall JLT, Howes LW, Kelly PHJ, Parsonage P, Nicoletti Bet al., 2009, High-performance SIMT code generation in an active visual effects library, ACM Computing Frontiers, Publisher: ACM Press, Pages: 175-184

Cite

Conference paper

Spacey SA, Luk W, Kelly PHJ, Kuhn Det al., 2009, RAPID DESIGN SPACE VISUALISATION THROUGH HARDWARE/SOFTWARE PARTITIONING, 5th Southern Conference on Programmable Logic, Publisher: IEEE, Pages: 159-164

Author Web Link
Cite
Citations: 6

Conference paper

Howes LW, Lokhmotov A, Donaldson AE, Kelly PHJet al., 2009, Deriving Efficient Data Movement from Decoupled Access/Execute Specifications, 4th International Conference on High Performance Embedded Architectures and Compilers, Publisher: SPRINGER-VERLAG BERLIN, Pages: 168-+, ISSN: 0302-9743

Author Web Link
Cite
Citations: 17

Conference paper

Pop A, Pop S, Jagasia H, Sjödin J, Kelly PHJet al., 2008, Improving GNU compiler collection infrastructure for streamization, Proceedings of the GCC Developers' Summit 2008, Pages: 77-86

The GNU Compiler Collection (GCC) needs a strategy to support future multicore architectures, which will probably include heterogeneous accelerator-like designs with explicit management of scratchpad memories. Some have further restrictions; for example, SIMD has limited synchronization capabilities. Some platforms will probably offer hardware support for streaming, transactions, and speculation. The purpose of this paper is to survey and evaluate some automatic and manual techniques for improving support for such targets in GCC. We focus on translation of sequential code for such platforms, i.e., the translation to task graphs and their communication and memory access operations. The paper provides an evaluation of the communication library support on an AMD Phenom™ X4 9550 quad-core processor. We use these experiments to tune the automatic task-partitioning algorithm implemented in GCC. The paper concludes with recommendations for strategic developments of GCC to support a stream programming language and improve the automatic generation of streamized tasks.

Abstract
Cite
Citations: 3

Journal article

Howes L, Lokhmotov A, Kelly P, Field Tet al., 2008, Optimising component composition using indexed dependence metadata, First International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC), Publisher: Karlsruhe University Press (KIT Scientific Publishing), Pages: 39-46

This paper explores the use of dependence metadata for optimising composition in component-based parallel programs. The idea is for each component to carry additional information about how points in its iteration space map to memory locations associated with its input and output data structures. When two components are composed this information can be used to implement optimisations that would otherwise require expensive analysis of the components' code at the time of composition. This dependence metadata facilitates a number of cross-component optimisations -- in this paper we focus on loop fusion and array contraction. We describe a prototype framework, based on the CLooG loop generator tool, that embodies these ideas and report experimental performance results for three non-trivial parallel benchmarks. Our results show execution time reductions of up to 50% using the proposed framework on an eight-core Intel Xeon system.

Abstract
Cite

Conference paper

Cornwall JLT, Kelly PHJ, Parsonage P, Nicoletti Bet al., 2008, Explicit dependence metadata in an active visual effects library, 20th International Workshop on Languages and Compilers for Parallel Computing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 172-+, ISSN: 0302-9743

Author Web Link
Cite
Citations: 2

Conference paper

Pearce DJ, Kelly PHJ, Hankin CL, 2007, Efficient field-sensitive pointer analysis of C., ACM Transactions on Programming Languages and Systems (TOPLAS), Vol: 30

Cite

Journal article

Falconer H, Kelly, P H J, Ingram D M, Mellor, Michael R, Field A J, Beckmann Oet al., 2007, A declarative framework for analysis and optimization, Compiler Construction (CC07), Publisher: Springer LNCS

Cite

Conference paper

Cornwall JLT, Kelly PHJ, 2006, Scheduling challenges in visual effects processing with heterogeneous architectures, IET Conference Publications

We are developing fast GPU-assisted implementations of a suite of visual effects processing primitives. We show that performance critically depends on the careful management of data movement (potential slowdown up to 2.7x) and resource contention (potential slowdown up to 2.5x). We are developing scheduling techniques that account for these effects.

Abstract
Cite

Journal article

Pearce D J, Webster M, Berry R, Kelly, P H Jet al., 2006, Profiling with AspectJ, Software: Practice and Experience

Cite

Journal article

Russell F P, Mellor M R, Kelly, P H J, Beckmann Oet al., 2006, An active linear algebra library using delayed evaluation and runtime code generation, Library-Centric Software Design LCSD'06, Publisher: Published online at http://sms.cs.chalmers.se/bibliography/proceedings/2006-LCSD.pdf, Pages: 1-8

Conference paper

Burton AN, Kelly PHJ, 2006, Performance prediction of paging workloads using lightweight tracing, International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems, Publisher: ELSEVIER, Pages: 784-793, ISSN: 0167-739X

Author Web Link
Cite
Citations: 5

Conference paper

Cornwall JLT, Beckmann O, Kelly PHJ, 2006, Automatically translating a general purpose C++ image processing library for GPUs, 20th International Parallel and Distributed Processing Symposium, IPDPS 2006, Vol: 2006

This paper presents work-in-progress towards a C++ source-to-source translator that automatically seeks parallelisable code fragments and replaces them with code for a graphics co-processor. We report on our experience with accelerating an industrial image processing library. To increase the effectiveness of our approach, we exploit some domain-specific knowledge of the library's semantics. We outline the architecture of our translator and how it uses the ROSE source-to-source transformation library to overcome complexities in the C++ language. Techniques for parallel analysis and source transformation are presented in light of their uses in GPU code generation. We conclude with results from a performance evaluation of two examples, image blending and an erosion filter, hand-translated with our parallelisation techniques. We show that our approach has potential and explain some of the remaining challenges in building an effective tool. © 2006 IEEE.

Abstract
Cite
Citations: 1

Journal article

Pearce DJ, Kelly PHJ, 2006, A dynamic topological sort algorithm for directed acyclic graphs, ACM Journal of Experimental Algorithmics, Vol: 11, ISSN: 1084-6654

We consider the problem of maintaining the topological order of a directed acyclic graph (DAG) in the presence of edge insertions and deletions. We present a new algorithm and, although this has inferior time complexity compared with the best previously known result, we find that its simplicity leads to better performance in practice. In addition, we provide an empirical comparison against the three main alternatives over a large number of random DAGs. The results show our algorithm is the best for sparse digraphs and only a constant factor slower than the best on dense digraphs.

Abstract
Cite
Citations: 64

Journal article

Osmond K, Beckmann O, Field T, Kelly Pet al., 2006, Parallel Visualization using the Domain-Specific Interpreter Pattern, Workshop on Compilers for Parallel Computing

Conference paper

Kelly P, Beckmann O, Jeyarajan T, 2006, Is Morton layout competitive for large two-dimensional arrays yet?, Publisher: wiley

Two-dimensional arrays are generally arranged in memory in row-major order or column-major order. Traversing a row-major array in column-major order, or vice versa, leads to poor spatial locality. With large arrays the performance loss can be a factor of 10 or more. This paper explores the Morton storage layout, which has substantial spatial locality whether traversed in row-major or column-major order. Using a small suite of dense kernels working on two-dimensional arrays, we have carried out an extensive study of the impact of poor array layout and of whether Morton layout can offer an attractive compromise. We show that Morton layout can lead to better performance than the worse of the two canonical layouts; however, the performance of Morton layout compared to the better choice of canonical layout is often disappointing. We further study one simple improvement of the basic Morton scheme: we show that choosing the correct alignment for the base address of an array in Morton layout can sometimes significantly improve the competitiveness of this layout

Abstract
Cite

Conference paper

Cornwall JL, Beckmann O, Kelly P, 2006, Accelerating a C++ Image Processing Library with a GPU, POHLL 2006: Workshop on Performance Optimization for High-Level Languages and Libraries (colocated with IPDPS06, Rhodes)., Publisher: IEEE Press

Conference paper

Osmond K, Beckmann O, Kelly P, 2005, A Domain-Specific Interpreter for Parallelizing a Large Mixed-Language Visualisation Application, LCPC 2005, Publisher: Springer Verlag

Conference paper

Osmond K, Beckmann O, Kelly P, 2005, A Domain-Specific Interpreter for Parallelizing a Large Mixed-Language Visualisation Application, LCPC 2005, Publisher: Springer Verlag

Conference paper

Kelly P, Beckmann O, 2005, Generative and Adaptive Methods in Performance Programming, Parallel Processing Letters, Vol: 15, Pages: 239-256

Journal article

ProfessorPaulKelly

Contact

Location

Summary