Papaphilippou P, Kelly PHJ, Luk W, 2021, Extending the RISC-V ISA for exploring advanced reconfigurable SIMD instructions
This paper presents a novel, non-standard set of vector instruction types for exploring custom SIMD instructions in a softcore. The new types allow simultaneous access to a relatively high number of operands, reducing the instruction count where applicable. Additionally, a high-performance open-source RISC-V (RV32IM) softcore is introduced, optimised for exploring custom SIMD instructions and streaming performance. By providing instruction templates for instruction development in HDL/Verilog, efficient FPGA-based instructions can be developed with few low-level lines of code. In order to improve custom SIMD instruction performance, the softcore's cache hierarchy is optimised for bandwidth, such as with very wide blocks for the last-level cache. The approach is demonstrated on example memory-intensive applications on an FPGA. Although the exploration is based on the softcore, the goal is to provide a means to experiment with advanced SIMD instructions which could be loaded in future CPUs that feature reconfigurable regions as custom instructions. Finally, we provide some insights on the challenges and effectiveness of such future micro-architectures.
Papaphilippou P, Kelly PHJ, Luk W, 2021, Demonstrating custom SIMD instruction development for a RISC-V softcore, FPL2021. The International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE
This demo elaborates on the programmability aspect of Simodense, a recently released open-source softcore, optimised for evaluating custom SIMD instructions. CPUs featuring small reconfigurable areas for implementing custom instructions are an alternative path in computer architecture that can help with the challenges found in today’s FPGAs. By providing RTL-based programmability for implementing custom SIMD instructions, highly-integrated accelerators can be developed, while benefiting from the pre-existing CPU logic, such as the caches and their high memory throughput to main memory.
Papaphilippou P, Kelly PHJ, Luk W, 2021, Simodense: a RISC-V softcore optimised for exploring custom SIMD instructions, FPL2021. The International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE
Simodense is a high-performance open-source RISC-V (RV32IM) softcore, optimised for exploring custom SIMD instructions. In order to maximise SIMD instruction performance, the design’s memory system is optimised for streaming bandwidth, such as with very wide blocks for the last-level cache. The approach is demonstrated on example memory-intensive applications with custom instructions. This paper also provides insights on the effectiveness of adding FPGA resources in general purpose processors in the form of reconfigurable SIMD instructions.
Koch MK, Kelly PHJ, Vincent P, 2021, Identification and classification of off-vertex critical points for contour tree construction on unstructured meshes of hexahedra, IEEE Transactions on Visualization and Computer Graphics, Pages: 1-1, ISSN: 1077-2626
The topology of isosurfaces changes at isovalues of critical points, making such points an important feature when building contour trees or Morse-Smale complexes. Hexahedral elements with linear interpolants can contain additional off-vertex critical points in element bodies and on element faces. Moreover, a point on the face of a hexahedron which is critical in the element-local context is not necessarily critical in the global context. In “Exploring Scalar Fields Using Critical Isovalues”, Weber et al. introduce a method to determine whether critical points on faces are also critical in the global context, based on the gradient of the asymptotic decider in each element that shares the face. However, as defined, the method of Weber et al. contains an error, and can lead to incorrect results. In this work we correct the error.
Murai R, Saeedi S, Kelly P, 2021, BIT-VO: visual odometry at 300 FPS using binary features from the focal plane, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, Publisher: IEEE, Pages: 8579-8586
The Focal-plane Sensor-processor (FPSP) is a next-generation camera technology which enables every pixel on the sensor chip to perform computation in parallel, on the focal plane where the light intensity is captured. SCAMP-5 is a general-purpose FPSP used in this work and it carries out computations in the analog domain before analog-to-digital conversion. By extracting features from the image on the focal plane, the amount of data which is digitised and transferred is reduced. As a consequence, SCAMP-5 offers a high frame rate while maintaining low energy consumption. Here, we present BIT-VO, which is the first 6-Degrees of Freedom visual odometry algorithm which utilises the FPSP. Our entire system operates at 300 FPS in a natural environment, using binary edges and corner features detected by the SCAMP-5.
Kukreja N, Hückelheim J, Louboutin M, et al., 2020, Lossy Checkpoint Compression in Full Waveform Inversion, Geoscientific Model Development, ISSN: 1991-959X
This paper proposes a new method that combines checkpointing methods with error-controlled lossy compression for large-scale high-performance Full-Waveform Inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory. In the Exascale computing era, frequent data transfer (e.g., memory bandwidth, PCIe bandwidth for GPUs, or network) is the performance bottleneck rather than the peak FLOPS of the processing unit. Like many other adjoint-based optimization problems, FWI is costly in terms of the number of floating-point operations, large memory footprint during backpropagation, and data transfer overheads. Past work for adjoint methods has developed checkpointing methods that reduce the peak memory requirements during backpropagation at the cost of additional floating-point computations. Combining this traditional checkpointing with error-controlled lossy compression, we explore the three-way tradeoff between memory, precision, and time to solution. We investigate how approximation errors introduced by lossy compression of the forward solution impact the objective function gradient and final inverted solution. Empirical results from these numerical experiments indicate that high lossy-compression rates (compression factors ranging up to 100) have a relatively minor impact on convergence rates and the quality of the final solution.
Stow E, Murai R, Saeedi S, et al., 2020, Cain: Automatic code generation for simultaneous convolutional kernels on focal-plane sensor-processors, Languages and Compilers for Parallel Computing, Publisher: Springer Verlag, ISSN: 0302-9743
Focal-plane Sensor-processors (FPSPs) are a camera technology that enables low power, high frame rate computation, making the device suitable for edge computation. Unfortunately, the device’s limited instruction set and registers make the development of complex algorithms challenging. In this work, we present Cain – a compiler that targets SCAMP-5, a general-purpose FPSP – which generates SCAMP-5 code from multiple convolutional kernels. As an example, given the convolutional kernels for an MNIST digit recognition neural network, Cain produces code that is half as long, when compared to the other available compilers for SCAMP-5.
Sun T, Mitchell L, Kulkarni K, et al., 2020, A study of vectorization for matrix-free finite element methods, International Journal of High Performance Computing Applications, Vol: 34, Pages: 629-644, ISSN: 1094-3420
Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. Our experiments show that our approaches for cross-element vectorization achieve 30% of theoretical peak performance for many examples of practical significance, and exceed 50% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels.
Carvalho EDC, Clark R, Nicastro A, et al., 2020, Scalable uncertainty for computer vision with functional variational inference, CVPR 2020, Publisher: IEEE, Pages: 12003-12013
As Deep Learning continues to yield successful applications in Computer Vision, the ability to quantify all forms of uncertainty is a paramount requirement for its safe and reliable deployment in the real world. In this work, we leverage the formulation of variational inference in function space, where we associate Gaussian Processes (GPs) to both Bayesian CNN priors and variational family. Since GPs are fully determined by their mean and covariance functions, we are able to obtain predictive uncertainty estimates at the cost of a single forward pass through any chosen CNN architecture and for any supervised learning task. By leveraging the structure of the induced covariance matrices, we propose numerically efficient algorithms which enable fast training in the context of high-dimensional tasks such as depth estimation and semantic segmentation. Additionally, we provide sufficient conditions for constructing regression loss functions whose probabilistic counterparts are compatible with aleatoric uncertainty quantification.
Luporini F, Lange M, Louboutin M, et al., 2020, Architecture and performance of Devito, a system for automated stencil computation, ACM Transactions on Mathematical Software, Vol: 46, Pages: 1-24, ISSN: 0098-3500
Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process, from mathematical equations down to C++ code, is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.
Luporini F, Louboutin M, Lange M, et al., 2020, Architecture and Performance of Devito, a System for Automated Stencil Computation, Publisher: ASSOC COMPUTING MACHINERY
Vespa E, Funk N, Kelly PHJ, et al., 2019, Adaptive-resolution octree-based volumetric SLAM, 7th International Conference on 3D Vision (3DV), Publisher: IEEE COMPUTER SOC, Pages: 654-662, ISSN: 2378-3826
We introduce a novel volumetric SLAM pipeline for the integration and rendering of depth images at an adaptive level of detail. Our core contribution is a fusion algorithm which dynamically selects the appropriate integration scale based on the effective sensor resolution given the distance from the observed scene, addressing aliasing issues, reconstruction quality, and efficiency simultaneously. We implement our approach using an efficient octree structure which supports multi-resolution rendering allowing for online frame-to-model alignment. Our qualitative and quantitative experiments demonstrate significantly improved reconstruction quality and up to six-fold execution time speed-ups compared to single resolution grids.
Bujanca M, Gafton P, Saeedi S, et al., 2019, SLAMBench 3.0: Systematic automated reproducible evaluation of SLAM systems for robot vision challenges and scene understanding, 2019 International Conference on Robotics and Automation (ICRA), Publisher: Institute of Electrical and Electronics Engineers, ISSN: 1050-4729
As the SLAM research area matures and the number of SLAM systems available increases, the need for frameworks that can objectively evaluate them against prior work grows. This new version of SLAMBench moves beyond traditional visual SLAM, and provides new support for scene understanding and non-rigid environments (dynamic SLAM). More concretely for dynamic SLAM, SLAMBench 3.0 includes the first publicly available implementation of DynamicFusion, along with an evaluation infrastructure. In addition, we include two SLAM systems (one dense, one sparse) augmented with convolutional neural networks for scene understanding, together with datasets and appropriate metrics. Through a series of use-cases, we demonstrate the newly incorporated algorithms, visualisation aids and metrics (6 new metrics, 4 new datasets and 5 new algorithms).
Saeedi S, Carvalho EDC, Li W, et al., 2019, Characterizing visual localization and mapping datasets, 2019 International Conference on Robotics and Automation (ICRA), Publisher: Institute of Electrical and Electronics Engineers, ISSN: 1050-4729
Benchmarking mapping and motion estimation algorithms is established practice in robotics and computer vision. As the diversity of datasets increases, in terms of the trajectories, models, and scenes, it becomes a challenge to select datasets for a given benchmarking purpose. Inspired by the Wasserstein distance, this paper addresses this concern by developing novel metrics to evaluate trajectories and the environments without relying on any SLAM or motion estimation algorithm. The metrics, which so far have been missing in the research community, can be applied to the plethora of datasets that exist. Additionally, to improve the robotics SLAM benchmarking, the paper presents a new dataset for visual localization and mapping algorithms. A broad range of real-world trajectories is used in very high-quality scenes and a rendering framework to create a set of synthetic datasets with ground-truth trajectory and dense map which are representative of key SLAM applications such as virtual reality (VR), micro aerial vehicle (MAV) flight, and ground robotics.
Koch MK, Kelly PHJ, Vincent PE, 2019, Towards in-situ vortex identification for peta-scale CFD using contour trees, 8th IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV), Publisher: Institute of Electrical and Electronics Engineers, Pages: 104-105
Turbulent flows exist in many fields of science and occur in a wide range of engineering applications. While in the past broad knowledge has been established regarding the statistical properties of turbulence at a range of Reynolds numbers, there is a lack of understanding of the detailed structure of these flows. Since the physical processes involve a vast number of structures, extremely large data sets are required to fully resolve a flow field in both space and time. To make the analysis of such data sets possible, we propose a framework that uses state-of-the-art contour tree construction algorithms to identify, classify and track vortices in turbulent flow fields produced by large-scale high-fidelity massively-parallel computational fluid dynamics solvers such as PyFR. Since disk capacity and I/O have become a bottleneck for such large-scale simulations, the proposed framework will be applied in-situ, while relevant data is still in device memory.
Luporini F, Lange M, Jacobs CT, et al., 2019, Automated tiling of unstructured mesh computations with application to seismological modeling, ACM Transactions on Mathematical Software, Vol: 45, ISSN: 0098-3500
Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]], hence the name “sparse.” One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation: not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.
Papaphilippou P, Kelly PHJ, Luk W, 2019, Pangloss: a novel Markov chain prefetcher, Publisher: arXiv
We present Pangloss, an efficient high-performance data prefetcher that approximates a Markov chain on delta transitions. With a limited information scope and space/logic complexity, it is able to reconstruct a variety of both simple and complex access patterns. This is achieved by a highly-efficient representation of the Markov chain to provide accurate values for transition probabilities. In addition, we have added a mechanism to reconstruct delta transitions originally obfuscated by the out-of-order execution or page transitions, such as when streaming data from multiple sources. Our single-level (L2) prefetcher achieves a geometric speedup of 1.7% and 3.2% over selected state-of-the-art baselines (KPCP and BOP). When combined with an equivalent for the L1 cache (L1 & L2), the speedups rise to 6.8% and 8.4%, and 40.4% over non-prefetch. In the multi-core evaluation, there seems to be a considerable performance improvement as well.
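To make the phrase "a Markov chain on delta transitions" concrete, here is a minimal software sketch of the general idea. It is not the Pangloss design: the class name `DeltaMarkovPredictor`, the unbounded count tables, and the "pick the most frequent successor" policy are all assumptions made for illustration, and a hardware prefetcher would use a bounded, far more compact representation.

```python
from collections import Counter, defaultdict


class DeltaMarkovPredictor:
    """Toy predictor (hypothetical): counts how often one address delta follows another."""

    def __init__(self):
        # transitions[previous_delta][next_delta] = observed count
        self.transitions = defaultdict(Counter)
        self.last_addr = None
        self.last_delta = None

    def access(self, addr):
        """Record an access and return a predicted next address, or None."""
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if self.last_delta is not None:
                self.transitions[self.last_delta][delta] += 1
            self.last_delta = delta
        self.last_addr = addr

        if self.last_delta is None:
            return None
        candidates = self.transitions.get(self.last_delta)
        if not candidates:
            return None
        # Follow the most frequently observed transition out of the current delta.
        next_delta, _count = candidates.most_common(1)[0]
        return addr + next_delta


predictor = DeltaMarkovPredictor()
# Access pattern with alternating deltas +1, +2, +1, +2, ...
predictions = [predictor.access(a) for a in [0, 1, 3, 4, 6, 7]]
```

In this toy run the first three accesses yield no prediction, because no transition out of the current delta has been seen yet; from the fourth access onward the alternating +1/+2 pattern is followed.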
Debrunner T, Saeedi Gharahbolagh S, Kelly P, 2019, AUKE: Automatic Kernel Code Generation for an analogue SIMD Focal-plane Sensor-Processor Array, ACM Transactions on Architecture and Code Optimization, Vol: 15, ISSN: 1544-3973
Focal-plane Sensor-Processor Arrays (FPSPs) are new imaging devices with parallel Single Instruction Multiple Data (SIMD) computational capabilities built into every pixel. Compared to traditional imaging devices, FPSPs allow for massive pixel-parallel execution of image processing algorithms. This enables the application of certain algorithms at extreme frame rates (>10,000 frames per second). By performing some early-stage processing in-situ, systems incorporating FPSPs can consume less power compared to conventional approaches using standard digital cameras. In this article, we explore code generation for an FPSP whose 256 × 256 processors operate on analogue signal data, leading to further opportunities for power reduction, and to additional code synthesis challenges. While rudimentary image processing algorithms have been demonstrated on FPSPs before, progress with higher-level computer vision algorithms has been sparse due to the unique architecture and limits of the devices. This article presents a code generator for convolution filters for the SCAMP-5 FPSP, with applications in many high-level tasks such as convolutional neural networks, pose estimation, and so on. The SCAMP-5 FPSP has no effective multiply operator. Convolutions have to be implemented through sequences of more primitive operations such as additions, subtractions, and multiplications/divisions by two. We present a code generation algorithm to optimise convolutions by identifying common factors in the different weights and by determining an optimised pattern of pixel-to-pixel data movements to exploit them. We present evaluation in terms of both speed and energy consumption for a suite of well-known convolution filters. Furthermore, an application of the method is shown by the implementation of a Viola-Jones face detection algorithm.
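The constraint described above, convolution without a multiply operator, can be illustrated with a small sketch. It is not the AUKE algorithm and is not SCAMP-5 code: it does no common-factor sharing and no data-movement planning, and the function names and the 1D setting are assumptions chosen for brevity. It only shows that weights of the form n / 2^s can be applied to data using additions, subtractions, halving and doubling.

```python
def scale_by_dyadic(x, numerator, shift):
    """Return x * numerator / 2**shift; x is only ever added, subtracted, halved or doubled."""
    magnitude = abs(numerator)
    term = x
    for _ in range(shift):
        term = term / 2              # division by two
    acc = 0.0
    while magnitude:
        if magnitude & 1:            # inspecting the weight's bits is bookkeeping, not data arithmetic
            acc = acc + term
        term = term + term           # multiplication by two, expressed as an addition
        magnitude >>= 1
    return acc if numerator >= 0 else 0.0 - acc


def convolve_1d(signal, numerators, shift):
    """Valid-mode sliding weighted sum with weights numerators[k] / 2**shift (no kernel flip)."""
    width = len(numerators)
    out = []
    for i in range(len(signal) - width + 1):
        acc = 0.0
        for k, n in enumerate(numerators):
            acc = acc + scale_by_dyadic(signal[i + k], n, shift)
        out.append(acc)
    return out


# Smoothing kernel [1, 2, 1] / 4 applied to a short signal.
smoothed = convolve_1d([4.0, 8.0, 4.0, 0.0], [1, 2, 1], 2)
```

Each weight is applied independently here, so work is repeated across weights; sharing that work is what the common-factor optimisation in the abstract addresses.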
Saeedi Gharahbolagh S, Bodin B, Wagstaff H, et al., 2018, Navigating the landscape for real-time localisation and mapping for robotics, virtual and augmented reality, Proceedings of the IEEE, Vol: 106, Pages: 2020-2039, ISSN: 0018-9219
Visual understanding of 3-D environments in real time, at low power, is a huge computational challenge. Often referred to as simultaneous localization and mapping (SLAM), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are: 1) tools and methodology for systematic quantitative evaluation of SLAM algorithms; 2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives; 3) end-to-end simulation tools to enable optimization of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches; and 4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.
Kelly PHJ, 2018, IEEE cluster 2018 Message from the program chair, IEEE Cluster conference 2018, Pages: XV-XV, ISSN: 1552-5244
Bodin B, Nardi L, Wagstaff H, et al., 2018, Algorithmic Performance-Accuracy Trade-off in 3D Vision Applications, Pages: 123-124
Simultaneous Localisation And Mapping (SLAM) is a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is particularly true when it comes to evaluating the potential trade-offs between computation speed, accuracy, and power consumption. SLAMBench is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. SLAMBench is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption across SLAM systems. In this poster we give an overview of SLAMBench and in particular we show how this framework can be used within Design Space Exploration and large-scale performance evaluation on mobile phones.
Bodin B, Wagstaff H, Saeedi S, et al., 2018, SLAMBench2: multi-objective head-to-head benchmarking for visual SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 3637-3644, ISSN: 1050-4729
SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example, a mobile phone-based AR application has a tight energy budget, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2, OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs across SLAM systems.
Vespa E, Nikolov N, Grimm M, et al., 2018, Efficient octree-based volumetric SLAM supporting signed-distance and occupancy mapping, IEEE Robotics and Automation Letters, Vol: 3, Pages: 1144-1151, ISSN: 2377-3766
We present a dense volumetric simultaneous localisation and mapping (SLAM) framework that uses an octree representation for efficient fusion and rendering of either a truncated signed distance field (TSDF) or an occupancy map. The primary aim of this letter is to use one single representation of the environment that can be used not only for robot pose tracking and high-resolution mapping, but seamlessly for planning. We show that our highly efficient octree representation of space fits SLAM and planning purposes in a real-time control loop. In a comprehensive evaluation, we demonstrate dense SLAM accuracy and runtime performance on-par with flat hashing approaches when using TSDF-based maps, and considerable speed-ups when using occupancy mapping compared to standard occupancy maps frameworks. Our SLAM system can run at 10-40 Hz on a modern quadcore CPU, without the need for massive parallelization on a GPU. We, furthermore, demonstrate a probabilistic occupancy mapping as an alternative to TSDF mapping in dense SLAM and show its direct applicability to online motion planning, using the example of informed rapidly-exploring random trees (RRT*).
Nica A, Vespa E, González de Aledo P, et al., 2018, Investigating automatic vectorization for real-time 3D scene understanding
Simultaneous Localization And Mapping (SLAM) is the problem of building a representation of a geometric space while simultaneously estimating the observer’s location within the space. While this seems to be a chicken-and-egg problem, several algorithms have appeared in the last decades that approximately and iteratively solve this problem. SLAM algorithms are tailored to the available resources, hence aimed at balancing the precision of the map with the constraints that the computational platform imposes and the desire to obtain real-time results. Working with KinectFusion, an established SLAM implementation, we explore in this work the vectorization opportunities present in this scenario, with the goal of using the CPU to its full potential. Using ISPC, an automatic vectorization tool, we produce a partially vectorized version of KinectFusion. Along the way we explore a number of optimization strategies, among which tiling to exploit ray-coherence and outer loop vectorization, obtaining up to 4x speed-up over the baseline on an 8-wide vector machine.
Saeedi Gharahbolagh S, Nardi L, Johns E, et al., 2017, Application-oriented design space exploration for SLAM algorithms, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE
In visual SLAM, there are many software and hardware parameters, such as algorithmic thresholds and GPU frequency, that need to be tuned; however, this tuning should also take into account the structure and motion of the camera. In this paper, we determine the complexity of the structure and motion with a few parameters calculated using information theory. Depending on this complexity and the desired performance metrics, suitable parameters are explored and determined. Additionally, based on the proposed structure and motion parameters, several applications are presented, including a novel active SLAM approach which guides the camera in such a way that the SLAM algorithm achieves the desired performance metrics. Real-world and simulated experimental results demonstrate the effectiveness of the proposed design space and its applications.
Luporini F, Ham DA, Kelly PHJ, 2017, An algorithm for the optimization of finite element integration loops, ACM Transactions on Mathematical Software, Vol: 44, ISSN: 0098-3500
We present an algorithm for the optimization of a class of finite element integration loop nests. This algorithm, which exploits fundamental mathematical properties of finite element operators, is proven to achieve a locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive numerical experiments demonstrate significant performance improvements over the state of the art in finite element code generation in almost all cases. This validates the effectiveness of the algorithm presented here, and illustrates its limitations.
Unat D, Dubey A, Hoefler T, et al., 2017, Trends in Data Locality Abstractions for HPC Systems, IEEE Transactions on Parallel and Distributed Systems, Vol: 28, Pages: 3007-3020, ISSN: 1045-9219
The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
Bolten M, Franchetti F, Kelly PHJ, et al., 2017, Algebraic description and automatic generation of multigrid methods in SPIRAL, Concurrency and Computation: Practice and Experience, Vol: 29, ISSN: 1532-0634
SPIRAL is an autotuning, program generation and code synthesis system that offers a fully automatic generation of highly optimized target codes, customized for the specific execution platform at hand. Initially, SPIRAL was targeted at problem domains in digital signal processing, later also at basic linear algebra. We open SPIRAL up to a new, practically relevant and challenging domain: multigrid solvers. SPIRAL is driven by algebraic transformation rules. We specify a set of such rules for a simple multigrid solver with a Richardson smoother for a discretized square 2D Poisson equation with Dirichlet boundary conditions. We present the target code that SPIRAL generates in static single-assignment form and discuss its performance. While this example required no changes of or extensions to the SPIRAL system, more complex multigrid solvers may require small adaptations.
Nardi L, Bodin B, Saeedi S, et al., 2017, Algorithmic performance-accuracy trade-off in 3D vision applications using hypermapper, IPDPS, Publisher: IEEE
In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work, we showed for the first time that it is possible to map this application to power constrained embedded systems, highlighting that decision choices made at the algorithmic design-level have the most significant impact. As the algorithmic design space is too large to be exhaustively evaluated, we use a previously introduced multi-objective random forest active learning prediction framework dubbed HyperMapper, to find good algorithmic designs. We show that HyperMapper generalizes on a recent cutting edge 3D scene understanding algorithm and on a modern GPU-based computer architecture. HyperMapper is able to beat an expert human hand-tuning the algorithmic parameters of the class of computer vision applications taken under consideration in this paper automatically. In addition, we use crowd-sourcing using a 3D scene understanding Android app to show that the Pareto front obtained on an embedded system can be used to accelerate the same application on all the 83 smart-phones and tablets with speedups ranging from 2x to over 12x.
Mitchell L, Ham DA, McRae ATT, et al., 2017, Firedrake: automating the finite element method by composing abstractions, ACM Transactions on Mathematical Software, Vol: 43, Pages: 1-27, ISSN: 1557-7295
Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centred on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separation of concerns which eases the incorporation of separate contributions from computer scientists, numerical analysts and application specialists. These contributions may add functionality, or improve performance. Firedrake benefits from automatically applying new optimisations. This includes factorising mixed function spaces, transforming and vectorising inner loops, and intrinsically supporting block matrix operations. Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users to implement common operations that fall outside pure variational formulations, such as flux-limiters.