Bodin B, Nardi L, Zia MZ, et al., 2016, Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding, International Conference on Parallel Architectures and Compilation Techniques (PACT)
System designers typically use well-studied benchmarks to evaluate and improve new architectures and compilers. We design tomorrow's systems based on yesterday's applications. In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. Until now, this application could only run in real-time on desktop GPUs. In this work, we examine how it can be mapped to power-constrained embedded systems. Key to our approach is the idea of incremental co-design exploration, where optimization choices that concern the domain layer are incrementally explored together with low-level compiler and architecture choices. The goal of this exploration is to reduce execution time while minimizing power and meeting our quality-of-result objective. As the design space is too large to exhaustively evaluate, we use active learning based on a random forest predictor to find good designs. We show that our approach can, for the first time, achieve dense 3D mapping and tracking in the real-time range within a 1W power budget on a popular embedded device. This is a 4.8x execution time improvement and a 2.8x power reduction compared to the state-of-the-art.
Nardi L, Bodin B, Saeedi S, et al., Algorithmic performance-accuracy trade-off in 3D vision applications using hypermapper, IPDPS, Publisher: IEEE
In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work, we showed for the first time that it is possible to map this application to power-constrained embedded systems, highlighting that decision choices made at the algorithmic design level have the most significant impact. As the algorithmic design space is too large to be exhaustively evaluated, we use a previously introduced multi-objective random forest active learning prediction framework, dubbed HyperMapper, to find good algorithmic designs. We show that HyperMapper generalizes to a recent cutting-edge 3D scene understanding algorithm and to a modern GPU-based computer architecture. HyperMapper automatically beats an expert human hand-tuning the algorithmic parameters of the class of computer vision applications considered in this paper. In addition, we use crowd-sourcing via a 3D scene understanding Android app to show that the Pareto front obtained on an embedded system can be used to accelerate the same application on all 83 of the smart-phones and tablets tested, with speedups ranging from 2x to over 12x.
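The Pareto front mentioned in the abstract is just the set of non-dominated design points. A minimal, self-contained sketch of that filtering step (the runtime/power numbers below are hypothetical, not data from the paper):

```python
def pareto_front(points):
    """Return the (runtime, power) points not dominated by any other point.

    Both objectives are minimised; q dominates p if q is no worse in both
    and strictly better in at least one.  Assumes distinct points.
    """
    front = []
    for p in points:
        if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points):
            front.append(p)
    return sorted(front)

# Hypothetical (runtime_ms, power_w) design points, for illustration only.
designs = [(120, 1.0), (80, 1.4), (80, 1.2), (200, 0.8), (150, 0.8), (60, 2.5)]
print(pareto_front(designs))  # → [(60, 2.5), (80, 1.2), (120, 1.0), (150, 0.8)]
```

Every point kept offers a trade-off no other design strictly improves on; this is the curve a tuner such as HyperMapper reports to the user.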
Saeedi Gharahbolagh S, Nardi L, Johns E, et al., Application-oriented Design Space Exploration for SLAM Algorithms, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE
In visual SLAM, there are many software and hardware parameters, such as algorithmic thresholds and GPU frequency, that need to be tuned; however, this tuning should also take into account the structure and motion of the camera. In this paper, we determine the complexity of the structure and motion with a few parameters calculated using information theory. Depending on this complexity and the desired performance metrics, suitable parameters are explored and determined. Additionally, based on the proposed structure and motion parameters, several applications are presented, including a novel active SLAM approach which guides the camera in such a way that the SLAM algorithm achieves the desired performance metrics. Real-world and simulated experimental results demonstrate the effectiveness of the proposed design space and its applications.
Bolten M, Franchetti F, Kelly PHJ, et al., 2017, Algebraic description and automatic generation of multigrid methods in SPIRAL, Concurrency Computation, Pages: e4105-e4105, ISSN: 1532-0626
© 2017 John Wiley & Sons, Ltd. SPIRAL is an autotuning, program generation, and code synthesis system that offers a fully automatic generation of highly optimized target codes, customized for the specific execution platform at hand. Initially, SPIRAL was targeted at problem domains in digital signal processing, later also at basic linear algebra. We open SPIRAL up to a new, practically relevant and challenging domain: multigrid solvers. SPIRAL is driven by algebraic transformation rules. We specify a set of such rules for a simple multigrid solver with a Richardson smoother for a discretized square 2D Poisson equation with Dirichlet boundary conditions. We present the target code that SPIRAL generates in static single-assignment form and discuss its performance. While this example required no changes of or extensions to the SPIRAL system, more complex multigrid solvers may require small adaptations.
Luporini F, Ham DA, Kelly PHJ, et al., 2017, An Algorithm for the Optimization of Finite Element Integration Loops, ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, Vol: 44, Pages: 1-26, ISSN: 0098-3500
© 2017 ACM. We present an algorithm for the optimization of a class of finite-element integration loop nests. This algorithm, which exploits fundamental mathematical properties of finite-element operators, is proven to achieve a locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive numerical experiments demonstrate significant performance improvements over the state of the art in finite-element code generation in almost all cases. This validates the effectiveness of the algorithm presented here and illustrates its limitations.
Rathgeber F, Ham DA, Mitchell L, et al., 2017, Firedrake: Automating the Finite Element Method by Composing Abstractions, ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, Vol: 43, Pages: 1-27, ISSN: 0098-3500
Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centered on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separation of concerns that eases the incorporation of separate contributions from computer scientists, numerical analysts, and application specialists. These contributions may add functionality or improve performance. Firedrake benefits from automatically applying new optimizations. This includes factorizing mixed function spaces, transforming and vectorizing inner loops, and intrinsically supporting block matrix operations. Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users to implement common operations that fall outside of pure variational formulations, such as flux limiters.
Bercea G-T, McRae ATT, Ham DA, et al., 2016, A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake, GEOSCIENTIFIC MODEL DEVELOPMENT, Vol: 9, Pages: 3803-3815, ISSN: 1991-959X
We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of three-dimensional high aspect ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10-20 layers as long as the underlying mesh is well ordered. We characterize the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretizations. On meshes with realistic numbers of layers the performance achieved is between 70 and 90% of a theoretical hardware-specific limit.
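The performance argument above rests on paying the unstructured-mesh indirection once per column and then walking the extruded layers at a constant stride. A toy sketch of that numbering idea (hypothetical layout, not Firedrake's actual one):

```python
# Sketch of the extruded-mesh numbering idea: the data for layer L of
# column c lives at base[c] + L * stride, so the indirect lookup through
# the unstructured base mesh is paid once per column, not once per cell.

def column_sum(values, base, stride, num_layers, column):
    start = base[column]                # single indirect access per column
    total = 0.0
    for layer in range(num_layers):     # direct, constant-stride walk
        total += values[start + layer * stride]
    return total

num_columns, num_layers, stride = 3, 4, 1
# Toy layout: columns stored contiguously, column c starts at c * num_layers.
base = [c * num_layers for c in range(num_columns)]
values = [float(i) for i in range(num_columns * num_layers)]
print(column_sum(values, base, stride, num_layers, column=1))  # 4+5+6+7 → 22.0
```

The longer the column (more layers), the more the cost of the one indirect lookup is amortized, which is the 10-20 layer break-even the experiments report.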
Bodin B, Nardi L, Kelly PHJ, et al., 2016, Diplomat: Mapping of Multi-kernel Applications Using a Static Dataflow Abstraction, 24th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Publisher: IEEE, Pages: 241-250, ISSN: 1526-7539
© 2016 IEEE. In this paper we propose a novel approach to heterogeneous embedded systems programmability using a task-graph based framework called Diplomat. Diplomat exploits the potential of static dataflow modeling and analysis to deliver performance estimation and CPU/GPU mapping. An application has to be specified only once, and the framework can then automatically propose good mappings. We evaluate Diplomat with a computer vision application on two embedded platforms. Using Diplomat's generated mappings, we observed a 16% performance improvement on average, and up to a 30% improvement, over the best existing hand-coded implementation.
Reguly IZ, Mudalige GR, Bertolli C, et al., 2016, Acceleration of a Full-Scale Industrial CFD Application with OP2, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, Vol: 27, Pages: 1265-1278, ISSN: 1045-9219
© 1990-2012 IEEE. Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls Royce plc., capable of performing complex simulations over highly detailed unstructured mesh geometries. Hydra presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging platforms. We present research in achieving this goal through the OP2 domain-specific high-level framework, demonstrating the viability of such a high-level programming approach. OP2 targets the domain of unstructured mesh problems and enables execution on a range of back-end hardware platforms. We chart the conversion of Hydra to OP2, and map out the key difficulties encountered in the process. Specifically, we show how different parallel implementations can be achieved with an active library framework, even for a highly complicated industrial application, and how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly, reducing developer effort and increasing code longevity. Performance results demonstrate not only that the same runtime performance as the hand-tuned original code can be achieved, but that it can be significantly improved upon on both conventional processor systems and many-core systems. Our results provide evidence of how high-level frameworks such as OP2 enable portability across a wide range of contrasting platforms, and of their significant utility in achieving high performance without the intervention of the application programmer.
Wozniak BD, Witherden FD, Russell FP, et al., 2016, GiMMiK-Generating bespoke matrix multiplication kernels for accelerators: Application to high-order Computational Fluid Dynamics, COMPUTER PHYSICS COMMUNICATIONS, Vol: 202, Pages: 12-22, ISSN: 0010-4655
© 2016 The Authors. Matrix multiplication is a fundamental linear algebra routine ubiquitous in all areas of science and engineering. Highly optimised BLAS libraries (cuBLAS and clBLAS on GPUs) are the most popular choices for an implementation of the General Matrix Multiply (GEMM) in software. In this paper we present GiMMiK - a generator of bespoke matrix multiplication kernels for the CUDA and OpenCL platforms. GiMMiK exploits a prior knowledge of the operator matrix to generate highly performant code. The performance of GiMMiK's kernels is particularly apparent in a block-by-panel type of matrix multiplication, where the block matrix is typically small (e.g. dimensions of 96×64). Such operations are characteristic of our motivating application in PyFR - an implementation of Flux Reconstruction schemes for high-order fluid flow simulations on mixed unstructured meshes. GiMMiK fully unrolls the matrix-vector product and embeds matrix entries directly in the code to benefit from the use of the constant cache and compiler optimisations. Further, it reduces the number of floating-point operations by removing multiplications by zeros. Together with the ability of our kernels to avoid the poorly optimised cleanup code, executed by library GEMM, we are able to outperform cuBLAS on two NVIDIA GPUs: GTX 780 Ti and Tesla K40c. We observe speedups of our kernels over cuBLAS GEMM of up to 9.98 and 63.30 times for a 294×1029 99% sparse PyFR matrix in double precision on the Tesla K40c and GTX 780 Ti respectively. In single precision, observed speedups reach 12.20 and 13.07 times for a 4×8 50% sparse PyFR matrix on the two aforementioned cards. Using GiMMiK as the matrix multiplication kernel provider allows us to achieve a speedup of up to 1.70 (2.19) for a simulation of an unsteady flow over a cylinder executed with PyFR in double (single) precision on the Tesla K40c. All results were generated with GiMMiK version 1.0.
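The core GiMMiK trick described above — embed the constant operator matrix in generated source, fully unroll, and drop multiplications by zero — can be sketched in a few lines. Here the generator emits Python rather than CUDA/OpenCL, purely for illustration:

```python
def generate_unrolled_matvec(a):
    """Emit source for y = a @ x with the constant matrix `a` embedded,
    fully unrolled, and with multiplications by zero removed -- a toy
    version of the GiMMiK approach, generating Python instead of CUDA."""
    lines = ["def matvec(x):", "    y = [0.0] * %d" % len(a)]
    for i, row in enumerate(a):
        terms = ["%r * x[%d]" % (v, j) for j, v in enumerate(row) if v != 0]
        if terms:
            lines.append("    y[%d] = %s" % (i, " + ".join(terms)))
    lines.append("    return y")
    return "\n".join(lines)

# Hypothetical sparse operator matrix; column 1 is all zeros, so no
# generated statement ever touches x[1].
a = [[2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0],
     [1.0, 0.0, 4.0]]
namespace = {}
exec(generate_unrolled_matvec(a), namespace)
print(namespace["matvec"]([1.0, 5.0, 2.0]))  # → [2.0, 6.0, 9.0]
```

In the real system the emitted kernel is CUDA/OpenCL C and the embedded constants land in the GPU constant cache; the zero-skipping shown here is what cuts the floating-point operation count for sparse PyFR operators.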
Zia MZ, Nardi L, Jack A, et al., 2016, Comparative Design Space Exploration of Dense and Semi-Dense SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 1292-1299, ISSN: 1050-4729
© 2016 IEEE. SLAM has matured significantly over the past few years, and is beginning to appear in serious commercial products. While new SLAM systems are being proposed at every conference, evaluation is often restricted to qualitative visualizations or accuracy estimation against a ground truth. This is due to the lack of benchmarking methodologies which can holistically and quantitatively evaluate these systems. Further investigation at the level of individual kernels and parameter spaces of SLAM pipelines is non-existent, which is absolutely essential for systems research and integration. We extend the recently introduced SLAMBench framework to allow comparing two state-of-the-art SLAM pipelines, namely KinectFusion and LSD-SLAM, along the metrics of accuracy, energy consumption, and processing frame rate on two different hardware platforms, namely a desktop and an embedded device. We also analyze the pipelines at the level of individual kernels and explore their algorithmic and hardware design spaces for the first time, yielding valuable insights.
Nardi L, Bodin B, Zia MZ, et al., 2015, Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE COMPUTER SOC, Pages: 5783-5790, ISSN: 1050-4729
© 2015 IEEE. Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementation and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives.
Popovici DT, Russell FP, Wilkinson K, et al., 2015, Generating Optimized Fourier Interpolation Routines for Density Functional Theory using SPIRAL, 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Publisher: IEEE, Pages: 743-752, ISSN: 1530-2075
© 2015 IEEE. Upsampling of a multi-dimensional data-set is an operation with wide application in image processing and quantum mechanical calculations using density functional theory. For small upsampling factors, as seen in the quantum chemistry code ONETEP, a time-shift based implementation that shifts samples by a fraction of the original grid spacing to fill in the intermediate values using a frequency domain Fourier property can be a good choice. Readily available highly optimized multidimensional FFT implementations are leveraged at the expense of extra passes through the entire working set. In this paper we present an optimized variant of the time-shift based upsampling. Since ONETEP handles threading, we address the memory hierarchy and SIMD vectorization, and focus on problem dimensions relevant for ONETEP. We present a formalization of this operation within the SPIRAL framework and demonstrate auto-generated and auto-tuned interpolation libraries. We compare the performance of our generated code against the previous best implementations using highly optimized FFT libraries (FFTW and MKL). We demonstrate speed-ups in isolation averaging 3x and within ONETEP of up to 15%.
Rokos G, Gorman G, Kelly PHJ, et al., 2015, A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures, 21st International Conference on Parallel and Distributed Computing (Euro-Par), Publisher: SPRINGER-VERLAG BERLIN, Pages: 414-425, ISSN: 0302-9743
© Springer-Verlag Berlin Heidelberg 2015. Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application, whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Çatalyürek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically, with less thread synchronization and a reduced number of conflicts compared to Çatalyürek et al.'s algorithm. We show that the new technique scales better on multicore and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels.
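The speculate-then-resolve structure described above can be sketched sequentially: every queued vertex greedily takes the smallest colour consistent with what it currently sees of its neighbours, then a detection pass re-queues conflicting vertices. This is a simplified sketch — the published algorithms run the assignment pass in parallel (which is what creates conflicts) and re-colour only one endpoint of each conflicting edge:

```python
def speculative_coloring(adj):
    """Speculate-and-resolve greedy colouring (sequential sketch).

    In the parallel setting, the racy first round can give two adjacent
    vertices the same colour; the detection pass finds such conflicts and
    the loop repeats until none remain.  Run sequentially, the first round
    already yields a proper colouring."""
    color = {v: None for v in adj}
    worklist = list(adj)
    while worklist:
        for v in worklist:                      # speculative assignment
            used = {color[u] for u in adj[v]}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        worklist = [v for v in worklist         # conflict detection
                    if any(color[u] == color[v] for u in adj[v])]
    return color

# Tiny undirected graph: triangle 0-1-2 plus pendant vertex 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
colors = speculative_coloring(adj)
assert all(colors[u] != colors[v] for u in adj for v in adj[u])
print(colors)  # → {0: 0, 1: 1, 2: 2, 3: 0}
```

The improvement the paper reports comes from running these rounds with fewer synchronisation points, not from changing this basic speculate/detect shape.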
Russell FP, Wilkinson KA, Kelly PHJ, et al., 2015, Optimised three-dimensional Fourier interpolation: An analysis of techniques and application to a linear-scaling density functional theory code, COMPUTER PHYSICS COMMUNICATIONS, Vol: 187, Pages: 8-19, ISSN: 0010-4655
© 2014 The Authors. The Fourier interpolation of 3D data-sets is a performance critical operation in many fields, including certain forms of image processing and density functional theory (DFT) quantum chemistry codes based on plane wave basis sets, to which this paper is targeted. In this paper we describe three different algorithms for performing this operation built from standard discrete Fourier transform operations, and derive theoretical operation counts. The algorithms compared consist of the most straightforward implementation and two that exploit techniques such as phase-shifts and knowledge of zero padding to reduce computational cost. Through a library implementation (tintl) we explore the performance characteristics of these algorithms and the performance impact of different implementation choices on actual hardware. We present comparisons within the linear-scaling DFT code ONETEP where we replace the existing interpolation implementation with our library implementation configured to choose the most efficient algorithm. Within the ONETEP Fourier interpolation stages, we demonstrate speed-ups of over 1.55×.
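The zero-padding technique the paper builds on is easiest to see in one dimension: transform, insert zeros between the low- and high-frequency halves of the spectrum, and inverse-transform. A dependency-free sketch using a toy O(n²) DFT (real codes, including tintl, use optimized FFTs):

```python
import cmath

def dft(x, sign):
    """Unnormalised DFT, sign=-1 forward, sign=+1 inverse (toy O(n^2))."""
    n = len(x)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t, v in enumerate(x))
            for k in range(n)]

def fourier_interpolate(x, m):
    """Upsample len(x) -> m points by zero-padding the spectrum: keep the
    low-frequency bins, insert zeros in the middle, inverse-transform and
    renormalise by the original length."""
    n = len(x)
    spec = dft(x, -1)
    padded = spec[:n // 2] + [0j] * (m - n) + spec[n // 2:]
    return [v.real / n for v in dft(padded, +1)]

n, m = 8, 16
coarse = [cmath.cos(2 * cmath.pi * t / n).real for t in range(n)]
fine = fourier_interpolate(coarse, m)
# The fine grid passes through the original samples at even indices.
print(max(abs(fine[2 * t] - coarse[t]) for t in range(n)))
```

The phase-shift variant compared in the paper reaches the same result by shifting samples by fractions of the grid spacing in the frequency domain, trading extra small transforms for better use of existing FFT libraries.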
Collingbourne P, Cadar C, Kelly PHJ, et al., 2014, Symbolic Crosschecking of Data-Parallel Floating-Point Code, IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, Vol: 40, Pages: 710-737, ISSN: 0098-5589
We present a symbolic execution-based technique for cross-checking programs accelerated using SIMD or OpenCL against an unaccelerated version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a tool based on the symbolic execution engine KLEE that supports symbolic reasoning on the equivalence between expressions involving both integer and floating-point operations. While the current generation of constraint solvers provide effective support for integer arithmetic, the situation is different for floating-point arithmetic, due to the complexity inherent in such computations. The key insight behind our approach is that floating-point values are only reliably equal if they are essentially built by the same operations. This allows us to use an algorithm based on symbolic expression matching augmented with canonicalisation rules to determine path equivalence. Under symbolic execution, we have to verify equivalence along every feasible control-flow path. We reduce the branching factor of this process by aggressively merging conditionals, if-converting branches into select operations via an aggressive phi-node folding transformation. To support the Intel Streaming SIMD Extension (SSE) instruction set, we lower SSE instructions to equivalent generic vector operations, which in turn are interpreted in terms of primitive integer and floating-point operations. To support OpenCL programs, we symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. We detect data races by keeping track of all memory accesses using a memory log, and reporting a race whenever we detect that two accesses conflict. By representing the memory log symbolically, we are also able to detect races associated with symbolically-indexed accesses of memory objects. We used KLEE-CL to prove the bounded equivalence between scalar and data-parallel versions of floating-point programs and find a number of …
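The key insight above — floating-point values are only reliably equal if built by the same operations — can be illustrated with a miniature canonicaliser: operands of commutative operators are sorted so that reordered expressions match, but no algebraic rewriting is attempted. This is a toy sketch, not KLEE-CL's implementation:

```python
def canon(expr):
    """Canonicalise an expression tree (nested tuples, leaves are names):
    sort the operands of commutative operators so syntactically shuffled
    but identically-built expressions compare equal.  Deliberately does
    no algebra -- distributing a product changes the FP operations."""
    if isinstance(expr, tuple):
        op, *args = expr
        args = [canon(a) for a in args]
        if op in ("+", "*"):            # commutative: operand order irrelevant
            args.sort(key=repr)
        return (op, *args)
    return expr

scalar = ("*", ("+", "a", "b"), "c")    # (a + b) * c
simd   = ("*", "c", ("+", "b", "a"))    # c * (b + a), e.g. after vectorisation
print(canon(scalar) == canon(simd))     # True: same operations, reordered

distributed = ("+", ("*", "a", "c"), ("*", "b", "c"))  # a*c + b*c
print(canon(scalar) == canon(distributed))  # False: needs algebra, not matching
```

Rejecting the distributed form is the point, not a limitation: in floating-point arithmetic `(a + b) * c` and `a*c + b*c` genuinely may differ, so matching-with-canonicalisation is sound where solver-level FP reasoning is impractical.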
Konstantinidis A, Kelly PHJ, Ramanujam J, et al., 2014, Parametric GPU Code Generation for Affine Loop Programs, 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC), Publisher: SPRINGER-VERLAG BERLIN, Pages: 136-151, ISSN: 0302-9743
© Springer International Publishing Switzerland 2014. Partitioning a parallel computation into finite-sized chunks for effective mapping onto a parallel machine is a critical concern for source-to-source compilation. In the context of OpenCL and CUDA, this translates to the definition of a uniform hyper-rectangular partitioning of the parallel execution space where each partition is subject to a fine-grained distribution of resources that has a direct yet hard to estimate impact on performance. This paper develops the first compilation scheme for generating parametrically tiled codes for affine loop programs on GPUs, which facilitates run-time exploration of partitioning parameters as a fast and portable way of finding the ones that yield maximum performance. Our approach is based on a parametric tiling scheme for producing wavefronts of parallel rectangular partitions of parametric size and a novel runtime system that manages wavefront execution and local memory usage dynamically through an inspector-executor mechanism. An experimental evaluation demonstrates the effectiveness of our approach for wavefront as well as rectangularly-parallel partitionings.
Luporini F, Varbanescu AL, Rathgeber F, et al., 2014, Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly, ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, Vol: 11, Pages: 1-25, ISSN: 1544-3566
© 2014 ACM. We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Their performance optimization is indeed a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of successful transformations. Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Eventually, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvement is achieved. The generality of the approach and the applicability of the proposed code transformations to other domains is also discussed.
Salas-Moreno RF, Glocker B, Kelly PHJ, et al., 2014, [DEMO] Dense Planar SLAM, IEEE International Symposium on Mixed and Augmented Reality (ISMAR) - Science and Technology, Publisher: IEEE, Pages: 367-368, ISSN: 1554-7868
Salas-Moreno RF, Glocker B, Kelly PHJ, et al., 2014, Dense Planar SLAM, IEEE International Symposium on Mixed and Augmented Reality (ISMAR) - Science and Technology, Publisher: IEEE, Pages: 157-164, ISSN: 1554-7868
© 2014 IEEE. Using higher-level entities during mapping has the potential to improve camera localisation performance and give substantial perception capabilities to real-time 3D SLAM systems. We present an efficient new real-time approach which densely maps an environment using bounded planes and surfels extracted from depth images (like those produced by RGB-D sensors or dense multi-view stereo reconstruction). Our method offers the every-pixel descriptive power of the latest dense SLAM approaches, but takes advantage directly of the planarity of many parts of real-world scenes via a data-driven process to directly regularize planar regions and represent their accurate extent efficiently using an occupancy approach with on-line compression. Large areas can be mapped efficiently and with useful semantic planar structure which enables intuitive and useful AR applications such as using any wall or other planar surface in a scene to display a user's content.
Strout MM, Luporini F, Krieger CD, et al., 2014, Generalizing Run-time Tiling with the Loop Chain Abstraction, IEEE 28th International Parallel & Distributed Processing Symposium (IPDPS), Publisher: IEEE, Pages: 1136-1145, ISSN: 1530-2075
Many scientific applications are organized in a data parallel way: as sequences of parallel and/or reduction loops. This exposes parallelism well, but does not convert data reuse between loops into data locality. This paper focuses on this issue in parallel loops whose loop-to-loop dependence structure is data-dependent due to indirect references such as A[B[i]]. Such references are a common occurrence in sparse matrix computations, molecular dynamics simulations, and unstructured-mesh computational fluid dynamics (CFD). Previously, sparse tiling approaches were developed for individual benchmarks to group iterations across such loops to improve data locality. These approaches were shown to benefit applications such as moldyn, Gauss-Seidel, and the sparse matrix powers kernel; however, the run-time routines for performing sparse tiling were hand-coded per application. In this paper, we present a generalized full sparse tiling algorithm that uses the newly developed loop chain abstraction as input, improves inter-loop data locality, and creates a task graph to expose shared-memory parallelism at runtime. We evaluate the overhead and performance impact of the generalized full sparse tiling algorithm on two codes: a sparse Jacobi iterative solver and the Airfoil CFD benchmark. © 2014 IEEE.
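The shape of an inspector/executor sparse tiling can be sketched for the simplest case: two loops linked by an indirect reference y[B[j]]. This is heavily simplified and sequential (the paper's algorithm handles whole loop chains and builds a parallel task graph); the kernels and names here are hypothetical:

```python
def sparse_tile(n_iter1, deps, width):
    """Inspector: partition loop 1 into tiles of `width` iterations and
    attach each loop-2 iteration to the earliest tile after which all of
    its indirect dependencies are available."""
    tiles = [{"loop1": list(range(s, min(s + width, n_iter1))), "loop2": []}
             for s in range(0, n_iter1, width)]
    for j, b in enumerate(deps):
        tiles[max(b) // width]["loop2"].append(j)
    return tiles

# Hypothetical loop chain: loop 1 produces y, loop 2 reads it indirectly.
x = [float(i) for i in range(8)]
B = [[0, 1], [6, 2], [3, 3], [7, 5]]           # indirect references y[B[j][k]]

y = [0.0] * 8
z = [0.0] * 4
for tile in sparse_tile(8, B, width=4):        # executor: run tile-by-tile
    for i in tile["loop1"]:
        y[i] = 2.0 * x[i]                      # loop 1 body
    for j in tile["loop2"]:
        z[j] = y[B[j][0]] + y[B[j][1]]         # loop 2 body, runs while y is hot
print(z)  # → [2.0, 16.0, 12.0, 24.0]
```

The inspector runs once at runtime (its cost is the overhead the paper measures); the executor then interleaves pieces of the two loops so y is consumed shortly after it is produced, instead of after a full sweep.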
Bertolli C, Betts A, Loriant N, et al., 2013, Compiler optimizations for industrial unstructured mesh CFD applications on GPUs, International Workshop on Languages and Compilers for Parallel Computing (LCPC), Publisher: Springer, Pages: 112-126, ISSN: 0302-9743
Graphical Processing Units (GPUs) have shown acceleration factors over multicores for structured mesh-based Computational Fluid Dynamics (CFD). However, the value remains unclear for dynamic and irregular applications. Our motivating example is HYDRA, an unstructured mesh application used in production at Rolls-Royce for the simulation of turbomachinery components of jet engines. We describe three techniques for GPU optimization of unstructured mesh applications: a technique able to split a highly complex loop into simpler loops, a kernel specific alternative code synthesis, and configuration parameter tuning. Using these optimizations systematically on HYDRA improves the GPU performance relative to the multicore CPU. We show how these optimizations can be automated in a compiler, through user annotations. Performance analysis of a large number of complex loops enables us to study the relationship between optimizations and resource requirements of loops, in terms of registers and shared memory, which directly affect the loop performance. © Springer-Verlag Berlin Heidelberg 2013.
Architectural design, particularly in large-scale master planning projects, has yet to fully undergo the computational revolution experienced by other design-led industries such as automotive and aerospace. These industries use computational frameworks to undertake automated design analysis and design space exploration. Within the Architectural, Engineering and Construction (AEC) industries, however, no such computational platforms exist. This precludes the rapid analysis needed for the quantitative design iteration which is required for sustainable design. This is a current computing frontier. This paper considers the computational solutions to the challenges preventing such advances, to improve architectural design performance for a more sustainable future. We present a practical discussion of the computational challenges and opportunities in this industry and present a computational framework, "HierSynth", with a data model designed for the needs of this industry. We report the results and lessons learned from applying this framework to a major commercial urban master planning project. The framework was used to automate and augment existing practice and to undertake previously infeasible, designer-led, design space exploration. During the case study an order of magnitude more analysis cycles were undertaken than literature suggests is normal, each occurring in hours not days.
Birch D, Liang H, Ko J, et al., 2013, Multidisciplinary Engineering Models: Methodology and Case Study in Spreadsheet Analytics, European Spreadsheet Risks Interest Group 14th Annual Conference (EuSpRIG 2013), Publisher: EuSpRIG, Pages: 1-12
Chong N, Donaldson AF, Kelly PHJ, et al., 2013, Barrier Invariants: A Shared State Abstraction for the Analysis of Data-Dependent GPU Kernels, 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications (OOPSLA'13), Publisher: ASSOC COMPUTING MACHINERY, Pages: 605-621, ISSN: 0362-1340
Data-dependent GPU kernels, whose data or control flow depend on the input of the program, are difficult to verify because they require reasoning about shared state manipulated by many parallel threads. Existing verification techniques for GPU kernels achieve soundness and scalability by using a two-thread reduction and making the contents of the shared state nondeterministic each time threads synchronise at a barrier, to account for all possible thread interactions. This coarse abstraction prohibits verification of data-dependent kernels. We present barrier invariants, a novel abstraction technique which allows key properties about the shared state of a kernel to be preserved across barriers during formal reasoning. We have integrated barrier invariants with the GPUVerify tool, and present a detailed case study showing how they can be used to verify three prefix sum algorithms, allowing efficient modular verification of a stream compaction kernel, a key building block for GPU programming. This analysis goes significantly beyond what is possible using existing verification techniques for GPU kernels.
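To make the barrier-invariant idea concrete, the sketch below models a Hillis–Steele inclusive prefix sum sequentially in Python, with each round standing in for the code between two GPU barriers. The assertion after each round is the kind of shared-state property a barrier invariant lets a verifier carry across the barrier instead of treating the array as arbitrary. This is an illustrative sketch, not GPUVerify's actual specification language.

```python
def inclusive_scan_with_invariants(xs):
    """Hillis-Steele inclusive scan; asserts a 'barrier invariant'
    on the shared array after every round."""
    x = list(xs)
    n = len(x)
    offset = 1
    while offset < n:
        # compute phase: thread i adds the element `offset` to its left
        # (double-buffered here by building a fresh list)
        x = [x[i] + (x[i - offset] if i >= offset else 0) for i in range(n)]
        # barrier invariant: after this round, x[i] holds the sum of the
        # last 2*offset input elements ending at position i
        for i in range(n):
            lo = max(0, i - 2 * offset + 1)
            assert x[i] == sum(xs[lo:i + 1])
        offset *= 2
    return x
```

Because the invariant pins down the shared array's contents at every barrier, a per-thread (two-thread-reduced) analysis can still conclude that the final array is the full prefix sum.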
Grosser T, Cohen A, Kelly PHJ, et al., 2013, Split tiling for GPUs: Automatic parallelization using trapezoidal tiles, 6th Workshop on General Purpose Processor Using Graphics Processing Units, Publisher: ACM Press, Pages: 24-31
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism. One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.
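The trapezoidal decomposition can be illustrated on a 1-D 3-point stencil. The Python sketch below (illustrative sizes and stencil, not from the paper; it assumes tile width `W >= 2*T`) runs phase 1, where each tile independently computes a shrinking "upright" trapezoid, then phase 2, where the inverted trapezoids between tiles are filled in. Each phase is embarrassingly parallel across tiles or tile boundaries, which is the point of split tiling.

```python
def step(vals, t, i, n):
    """One stencil update at time t, point i (fixed boundaries)."""
    if i == 0 or i == n - 1:
        vals[t][i] = vals[t - 1][i]
    else:
        vals[t][i] = vals[t - 1][i - 1] + vals[t - 1][i] + vals[t - 1][i + 1]

def split_tiled(initial, T, W):
    """T time steps via split tiling with tile width W (W >= 2*T)."""
    n = len(initial)
    vals = [[None] * n for _ in range(T + 1)]
    vals[0] = list(initial)
    # phase 1: upright trapezoids, one per tile, no inter-tile deps
    for s in range(0, n, W):
        e = s + W
        for k in range(1, T + 1):
            lo = s if s == 0 else s + k     # domain edges do not shrink
            hi = e if e == n else e - k
            for i in range(lo, hi):
                step(vals, k, i, n)
    # phase 2: inverted trapezoids, one per interior tile boundary
    for b in range(W, n, W):
        for k in range(1, T + 1):
            for i in range(b - k, b + k):
                step(vals, k, i, n)
    return vals[T]

def naive(initial, T):
    """Reference: T full sweeps of the same stencil."""
    n = len(initial)
    cur = list(initial)
    for _ in range(T):
        cur = [cur[i] if i in (0, n - 1)
               else cur[i - 1] + cur[i] + cur[i + 1] for i in range(n)]
    return cur
```

On a GPU, phase 1 maps to one kernel launch with one thread block per tile and phase 2 to a second launch per gap, replacing per-time-step global synchronisation.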
Krieger CD, Strout MM, Olschanowsky C, et al., 2013, Loop chaining: A programming abstraction for balancing locality and parallelism, Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013, Pages: 375-384
There is a significant, established code base in the scientific computing community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependencies that, as part of the abstraction, are exposed, thereby avoiding inter-procedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or run-time system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark; OP2, a domain-specific library used in full applications with unstructured grids; and Chombo, a domain-specific library used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop-chain-enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores.
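The kind of cross-loop scheduling a chain enables can be sketched in Python. Below, two loops that share array `b` are executed tile-by-tile rather than loop-by-loop, so each tile of `b` is still cache-resident when the second loop consumes it; the second loop reads a 3-point neighbourhood, so each tile of loop 1 is grown by a one-element halo. All names and the halo-recomputation scheme are illustrative assumptions, not the paper's API or its full sparse tiling transformation.

```python
def chained_execution(a, tile):
    """Run two data-sharing loops tile-by-tile for locality."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for s in range(0, n, tile):
        e = min(s + tile, n)
        # loop 1 on this tile, grown by the halo loop 2 needs
        for i in range(max(0, s - 1), min(n, e + 1)):
            b[i] = 2.0 * a[i]
        # loop 2 on this tile, while b's tile is hot in cache
        for i in range(s, e):
            lo, hi = max(0, i - 1), min(n - 1, i + 1)
            c[i] = b[lo] + b[i] + b[hi]
    return c

def unchained_execution(a):
    """Reference: each loop sweeps the whole array before the next."""
    n = len(a)
    b = [2.0 * a[i] for i in range(n)]
    return [b[max(0, i - 1)] + b[i] + b[min(n - 1, i + 1)]
            for i in range(n)]
```

The chain abstraction exposes exactly the dependency used here (loop 2's iteration `i` needs loop 1's iterations `i-1..i+1`), which is what lets a compiler or runtime derive such a tile-wise schedule without inter-procedural analysis.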
Markall GR, Rathgeber F, Mitchell L, et al., 2013, Performance-portable finite element assembly using PyOP2 and FEniCS, International Supercomputing Conference (ISC), Publisher: Springer, Pages: 279-289, ISSN: 0302-9743
We describe a toolchain that provides a fully automated compilation pathway from a finite element domain-specific language to low-level code for multicore and GPGPU platforms. We demonstrate that the generated code exceeds the performance of the best available alternatives, without requiring manual tuning or modification of the generated code. The toolchain can easily be integrated with existing finite element solvers, providing a means to add performance portable methods without having to rebuild an entire complex implementation from scratch.