172 results found
Luporini F, Lange M, Louboutin M, et al., 2020, Architecture and performance of Devito, a system for automated stencil computation, ACM Transactions on Mathematical Software, Vol: 46, ISSN: 0098-3500
Stencil computations are a key part of many high-performance computing applications, such as imageprocessing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devitois a framework capable of generating highly-optimized code given symbolic equations expressed in Python,specialized in, but not limited to, affine (stencil) codes. The lowering process—from mathematical equations down to C++ code—is performed by the Devito compiler through a series of intermediate representations.Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling and parallelization. Some of these are obtained through well-established stencil optimizers, integratedin the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performanceoptimizations is demonstrated using operators drawn from seismic imaging applications.
Luporini F, Louboutin M, Lange M, et al., 2020, Architecture and performance of devito, a system for automated stencil computation
© 2020 Owner/Author. Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process - from mathematical equations down to C++ code - is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expressions elimination, tiling, and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the backend of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.
Carvalho EDC, Clark R, Nicastro A, et al., 2020, Scalable uncertainty for computer vision with functional variationalinference, CVPR 2020, Publisher: IEEE
As Deep Learning continues to yield successful applications in ComputerVision, the ability to quantify all forms of uncertainty is a paramountrequirement for its safe and reliable deployment in the real-world. In thiswork, we leverage the formulation of variational inference in function space,where we associate Gaussian Processes (GPs) to both Bayesian CNN priors andvariational family. Since GPs are fully determined by their mean and covariancefunctions, we are able to obtain predictive uncertainty estimates at the costof a single forward pass through any chosen CNN architecture and for anysupervised learning task. By leveraging the structure of the induced covariancematrices, we propose numerically efficient algorithms which enable fasttraining in the context of high-dimensional tasks such as depth estimation andsemantic segmentation. Additionally, we provide sufficient conditions forconstructing regression loss functions whose probabilistic counterparts arecompatible with aleatoric uncertainty quantification.
© 2019 IEEE. We introduce a novel volumetric SLAM pipeline for the integration and rendering of depth images at an adaptive level of detail. Our core contribution is a fusion algorithm which dynamically selects the appropriate integration scale based on the effective sensor resolution given the distance from the observed scene, addressing aliasing issues, reconstruction quality, and efficiency simultaneously. We implement our approach using an efficient octree structure which supports multi-resolution rendering allowing for online frame-to-model alignment. Our qualitative and quantitative experiments demonstrate significantly improved reconstruction quality and up to six-fold execution time speed-ups compared to single resolution grids.
Bujanca M, Gafton P, Saeedi S, et al., 2019, SLAMBench 3.0: Systematic automated reproducible evaluation of SLAM systems for robot vision challenges and scene understanding, 2019 International Conference on Robotics and Automation (ICRA), Publisher: Institute of Electrical and Electronics Engineers, ISSN: 1050-4729
As the SLAM research area matures and the number of SLAM systems available increases, the need for frameworks that can objectively evaluate them against prior work grows. This new version of SLAMBench moves beyond traditional visual SLAM, and provides new support for scene understanding and non-rigid environments (dynamic SLAM). More concretely for dynamic SLAM, SLAMBench 3.0 includes the first publicly available implementation of DynamicFusion, along with an evaluation infrastructure. In addition, we include two SLAM systems (one dense, one sparse) augmented with convolutional neural networks for scene understanding, together with datasets and appropriate metrics. Through a series of use-cases, we demonstrate the newly incorporated algorithms, visulation aids and metrics (6 new metrics, 4 new datasets and 5 new algorithms).
Saeedi S, Carvalho EDC, Li W, et al., 2019, Characterizing visual localization and mapping datasets, 2019 International Conference on Robotics and Automation (ICRA), Publisher: Institute of Electrical and Electronics Engineers, ISSN: 1050-4729
Benchmarking mapping and motion estimation algorithms is established practice in robotics and computer vision. As the diversity of datasets increases, in terms of the trajectories, models, and scenes, it becomes a challenge to select datasets for a given benchmarking purpose. Inspired by the Wasserstein distance, this paper addresses this concern by developing novel metrics to evaluate trajectories and the environments without relying on any SLAM or motion estimation algorithm. The metrics, which so far have been missing in the research community, can be applied to the plethora of datasets that exist. Additionally, to improve the robotics SLAM benchmarking, the paper presents a new dataset for visual localization and mapping algorithms. A broad range of real-world trajectories is used in very high-quality scenes and a rendering framework to create a set of synthetic datasets with ground-truth trajectory and dense map which are representative of key SLAM applications such as virtual reality (VR), micro aerial vehicle (MAV) flight, and ground robotics.
Koch MK, Kelly PHJ, Vincent PE, 2019, Towards in-situ vortex identification for peta-scale CFD using contour trees, 8th IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV), Publisher: Institute of Electrical and Electronics Engineers, Pages: 104-105
Turbulent flows exist in many fields of science and occur in a wide range of engineering applications. While in the past broad knowledge has been established regarding the statistical properties of turbulence at a range of Reynolds numbers, there is a lack of under-standing of the detailed structure of these flows. Since the physical processes involve a vast number of structures, extremely large data sets are required to fully resolve a flow field in both space and time. To make the analysis of such data sets possible, we propose a frame-work that uses state-of-the-art contour tree construction algorithms to identify, classify and track vortices in turbulent flow fields produced by large-scale high-fidelity massively-parallel computational fluid dynamics solvers such as PyFR. Since disk capacity and I/O have become a bottleneck for such large-scale simulations, the proposed framework will be applied in-situ, while relevant data is still in device memory.
Luporini F, Lange M, Jacobs CT, et al., 2019, Automated tiling of unstructured mesh computations with application to seismological modeling, ACM Transactions on Mathematical Software, Vol: 45, ISSN: 0098-3500
Publication rights licensed to ACM. Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]-hence the name “sparse.” One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation-not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.
Papaphilippou P, Kelly PHJ, Luk W, 2019, Pangloss: a novel Markov chain prefetcher., Publisher: arXiv
We present Pangloss, an efficient high-performance data prefetcher that approximates a Markov chain on delta transitions. With a limited information scope and space/logic complexity, it is able to reconstruct a variety of both simple and complex access patterns. This is achieved by a highly-efficient representation of the Markov chain to provide accurate values for transition probabilities. In addition, we have added a mechanism to reconstruct delta transitions originally obfuscated by the out-of-order execution or page transitions, such as when streaming data from multiple sources. Our single-level (L2) prefetcher achieves a geometric speedup of 1.7% and 3.2% over selected state-of-the-art baselines (KPCP and BOP). When combined with an equivalent for the L1 cache (L1 & L2), the speedups rise to 6.8% and 8.4%, and 40.4% over non-prefetch. In the multi-core evaluation, there seems to be a considerable performance improvement as well.
Debrunner T, Saeedi Gharahbolagh S, Kelly P, 2019, AUKE: Automatic Kernel Code Generation for an analogue SIMD Focal-plane Sensor-Processor Array, ACM Transactions on Architecture and Code Optimization, Vol: 15, ISSN: 1544-3973
Focal-plane Sensor-Processor Arrays (FPSPs) are new imaging devices with parallel Single Instruction Multiple Data (SIMD) computational capabilities built into every pixel. Compared to traditional imaging devices, FPSPs allow for massive pixel-parallel execution of image processing algorithms. This enables the application of certain algorithms at extreme frame rates (>10,000 frames per second). By performing some early-stage processing in-situ, systems incorporating FPSPs can consume less power compared to conventional approaches using standard digital cameras. In this article, we explore code generation for an FPSP whose 256 × 256 processors operate on analogue signal data, leading to further opportunities for power reduction—and additional code synthesis challenges.While rudimentary image processing algorithms have been demonstrated on FPSPs before, progress with higher-level computer vision algorithms has been sparse due to the unique architecture and limits of the devices. This article presents a code generator for convolution filters for the SCAMP-5 FPSP, with applications in many high-level tasks such as convolutional neural networks, pose estimation, and so on. The SCAMP-5 FPSP has no effective multiply operator. Convolutions have to be implemented through sequences of more primitive operations such as additions, subtractions, and multiplications/divisions by two. We present a code generation algorithm to optimise convolutions by identifying common factors in the different weights and by determining an optimised pattern of pixel-to-pixel data movements to exploit them. We present evaluation in terms of both speed and energy consumption for a suite of well-known convolution filters. Furthermore, an application of the method is shown by the implementation of a Viola-Jones face detection algorithm.
Saeedi Gharahbolagh S, Bodin B, Wagstaff H, et al., 2018, Navigating the landscape for real-time localisation and mapping for robotics, virtual and augmented reality, Proceedings of the IEEE, Vol: 106, Pages: 2020-2039, ISSN: 0018-9219
Visual understanding of 3-D environments in real time, at low power, is a huge computational challenge. Often referred to as simultaneous localization and mapping (SLAM), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are: 1) tools and methodology for systematic quantitative evaluation of SLAM algorithms; 2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives; 3) end-to-end simulation tools to enable optimization of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches; and 4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.
Kelly PHJ, 2018, IEEE Cluster 2018 Message from the Program Chair, ISSN: 1552-5244
Bodin B, Nardi L, Wagstaff H, et al., 2018, Algorithmic Performance-Accuracy Trade-off in 3D Vision Applications, Pages: 123-124
© 2018 IEEE. Simultaneous Localisation And Mapping (SLAM) is a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is particularly true when it comes to evaluate the potential trade-offs between computation speed, accuracy, and power consumption. SLAMBench is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. SLAMBench is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption across SLAM systems. In this poster we give an overview of SLAMBench and in particular we show how this framework can be used within Design Space Exploration and large-scale performance evaluation on mobile phones.
Bodin B, Wagstaff H, Saeedi S, et al., 2018, SLAMBench2: multi-objective head-to-head benchmarking for visual SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 3637-3644, ISSN: 1050-4729
SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example, a mobile phone-based AR application has a tight energy budget, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework to evaluate existing and future SLAM systems, both open and close source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2, OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly-available software framework which represents a starting point for quantitative, comparable and val-idatable experimental research to investigate trade-offs across SLAM systems.
Vespa E, Nikolov N, Grimm M, et al., 2018, Efficient octree-based volumetric SLAM supporting signed-distance and occupancy mapping, IEEE Robotics and Automation Letters, Vol: 3, Pages: 1144-1151, ISSN: 2377-3766
We present a dense volumetric simultaneous localisation and mapping (SLAM) framework that uses an octree representation for efficient fusion and rendering of either a truncated signed distance field (TSDF) or an occupancy map. The primary aim of this letter is to use one single representation of the environment that can be used not only for robot pose tracking and high-resolution mapping, but seamlessly for planning. We show that our highly efficient octree representation of space fits SLAM and planning purposes in a real-time control loop. In a comprehensive evaluation, we demonstrate dense SLAM accuracy and runtime performance on-par with flat hashing approaches when using TSDF-based maps, and considerable speed-ups when using occupancy mapping compared to standard occupancy maps frameworks. Our SLAM system can run at 10-40 Hz on a modern quadcore CPU, without the need for massive parallelization on a GPU. We, furthermore, demonstrate a probabilistic occupancy mapping as an alternative to TSDF mapping in dense SLAM and show its direct applicability to online motion planning, using the example of informed rapidly-exploring random trees (RRT*).
Nica A, Vespa E, González de Aledo P, et al., 2018, Investigating automatic vectorization for real-time 3D scene understanding
© 2018 Association for Computing Machinery. Simultaneous Localization And Mapping (SLAM) is the problem of building a representation of a geometric space while simultaneously estimating the observer’s location within the space. While this seems to be a chicken-and-egg problem, several algorithms have appeared in the last decades that approximately and iteratively solve this problem. SLAM algorithms are tailored to the available resources, hence aimed at balancing the precision of the map with the constraints that the computational platform imposes and the desire to obtain real-time results. Working with KinectFusion, an established SLAM implementation, we explore in this work the vectorization opportunities present in this scenario, with the goal of using the CPU to its full potential. Using ISPC, an automatic vectorization tool, we produce a partially vectorized version of KinectFusion. Along the way we explore a number of optimization strategies, among which tiling to exploit ray-coherence and outer loop vectorization, obtaining up to 4x speed-up over the baseline on an 8-wide vector machine.
Saeedi Gharahbolagh S, Nardi L, Johns E, et al., 2017, Application-oriented design space exploration for SLAM algorithms, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE
In visual SLAM, there are many software and hardware parameters, such as algorithmic thresholds and GPU frequency, that need to be tuned; however, this tuning should also take into account the structure and motion of the camera. In this paper, we determine the complexity of the structure and motion with a few parameters calculated using information theory. Depending on this complexity and the desired performance metrics, suitable parameters are explored and determined. Additionally, based on the proposed structure and motion parameters, several applications are presented, including a novel active SLAM approach which guides the camera in such a way that the SLAM algorithm achieves the desired performance metrics. Real-world and simulated experimental results demonstrate the effectiveness of the proposed design space and its applications.
Luporini F, Ham DA, Kelly PHJ, 2017, An algorithm for the optimization of finite element integration loops, ACM Transactions on Mathematical Software, Vol: 44, ISSN: 0098-3500
We present an algorithm for the optimization of a class of finite element integration loop nests. This algo-rithm, which exploits fundamental mathematical properties of finite element operators, is proven to achievea locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive nu-merical experiments demonstrate significant performance improvements over the state of the art in finiteelement code generation in almost all cases. This validates the effectiveness of the algorithm presented here,and illustrates its limitations.
Unat D, Dubey A, Hoefler T, et al., 2017, Trends in Data Locality Abstractions for HPC Systems, IEEE Transactions on Parallel and Distributed Systems, Vol: 28, Pages: 3007-3020, ISSN: 1045-9219
The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. them However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
Bolten M, Franchetti F, Kelly PHJ, et al., 2017, Algebraic description and automatic generation of multigrid methods in SPIRAL, Concurrency and Computation: Practice and Experience, Vol: 29, ISSN: 1532-0634
SPIRAL is an autotuning, program generation and code synthesis system that offers a fully automaticgeneration of highly optimized target codes, customized for the specific execution platform at hand. Initially,SPIRAL was targeted at problem domains in digital signal processing, later also at basic linear algebra.We open SPIRAL up to a new, practically relevant and challenging domain: multigrid solvers. SPIRAL isdriven by algebraic transformation rules. We specify a set of such rules for a simple multigrid solver with aRichardson smoother for a discretized square 2D Poisson equation with Dirichlet boundary conditions. Wepresent the target code that SPIRAL generates in static single-assignment form and discuss its performance.While this example required no changes of or extensions to the SPIRAL system, more complex multigridsolvers may require small adaptations.
Nardi L, Bodin B, Saeedi S, et al., 2017, Algorithmic performance-accuracy trade-off in 3D vision applications using hypermapper, IPDPS, Publisher: IEEE
In this paper we investigate an emerging appli-cation, 3D scene understanding, likely to be significant in themobile space in the near future. The goal of this explorationis to reduce execution time while meeting our quality of resultobjectives. In previous work, we showed for the first time thatit is possible to map this application to power constrainedembedded systems, highlighting that decision choices made atthe algorithmic design-level have the most significant impact.As the algorithmic design space is too large to be exhaus-tively evaluated, we use a previously introduced multi-objectiverandom forest active learning prediction framework dubbedHyperMapper, to find good algorithmic designs. We showthat HyperMapper generalizes on a recent cutting edge 3Dscene understanding algorithm and on a modern GPU-basedcomputer architecture. HyperMapper is able to beat an experthuman hand-tuning the algorithmic parameters of the classof computer vision applications taken under consideration inthis paper automatically. In addition, we use crowd-sourcingusing a 3D scene understanding Android app to show that thePareto front obtained on an embedded system can be used toaccelerate the same application on all the 83 smart-phones andtablets with speedups ranging from 2x to over 12x.
Mitchell L, Ham DA, McRae ATT, et al., 2017, Firedrake: automating the finite element method by composing abstractions, ACM Transactions on Mathematical Software, Vol: 43, ISSN: 1557-7295
Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrakeadopts the domain-specific language for the finite element method of the FEniCS project, but with a purePython runtime-only implementation centred on the composition of several existing and new abstractions forparticular aspects of scientific computing. The result is a more complete separation of concerns which easesthe incorporation of separate contributions from computer scientists, numerical analysts and applicationspecialists. These contributions may add functionality, or improve performance.Firedrake benefits from automatically applying new optimisations. This includes factorising mixed functionspaces, transforming and vectorising inner loops, and intrinsically supporting block matrix operations.Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users toimplement common operations that fall outside pure variational formulations, such as flux-limiters.
Bodin B, Nardi L, Kelly PHJ, et al., 2016, Diplomat: Mapping of Multi-kernel Applications Using a Static Dataflow Abstraction, Modelling, Analysis and Simulation of Computer and Telecommunication Systems, Publisher: IEEE, ISSN: 2375-0227
In this paper we propose a novel approach toheterogeneous embedded systems programmability using a taskgraphbased framework called Diplomat. Diplomat is a taskgraphframework that exploits the potential of static dataflowmodeling and analysis to deliver performance estimation andCPU/GPU mapping. An application has to be specified once, andthen the framework can automatically propose good mappings.We evaluate Diplomat with a computer vision application on twoembedded platforms. Using the Diplomat generation we observeda 16% performance improvement on average and up to a 30%improvement over the best existing hand-coded implementation.
Bodin B, Nardi L, Zia MZ, et al., 2016, Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding, International conference on Parallel Architectures and Compilation Techniques, Publisher: IEEE
System designers typically use well-studied benchmarks toevaluate and improve new architectures and compilers. Wedesign tomorrow's systems based on yesterday's applications.In this paper we investigate an emerging application,3D scene understanding, likely to be signi cant in the mobilespace in the near future. Until now, this application couldonly run in real-time on desktop GPUs. In this work, weexamine how it can be mapped to power constrained embeddedsystems. Key to our approach is the idea of incrementalco-design exploration, where optimization choices that concernthe domain layer are incrementally explored togetherwith low-level compiler and architecture choices. The goalof this exploration is to reduce execution time while minimizingpower and meeting our quality of result objective.As the design space is too large to exhaustively evaluate,we use active learning based on a random forest predictorto nd good designs. We show that our approach can, forthe rst time, achieve dense 3D mapping and tracking in thereal-time range within a 1W power budget on a popular embeddeddevice. This is a 4.8x execution time improvementand a 2.8x power reduction compared to the state-of-the-art.
Bercea G, McRae ATT, Ham DA, et al., 2016, A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake, Geoscientific Model Development, Vol: 9, Pages: 3803-3815, ISSN: 1991-9603
We present a generic algorithm for numbering and then efﬁciently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of 3D high aspect ratio domains employed by geophysical ﬁnite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake ﬁnite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10-20 layers as long as the underlying mesh is well-ordered. We characterise the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretisations. On meshes with realistic numbers of layers the performance achieved is between 70% and 90% of a theoretical hardware-speciﬁc limit.
Zia MZ, Nardi L, Jack A, et al., 2016, Comparative design space exploration of dense and semi-dense SLAM, 2016 IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 1292-1299, ISSN: 1050-4729
SLAM has matured significantly over the past few years, and is beginning to appear in serious commercial products. While new SLAM systems are being proposed at every conference, evaluation is often restricted to qualitative visualizations or accuracy estimation against a ground truth. This is due to the lack of benchmarking methodologies which can holistically and quantitatively evaluate these systems. Further investigation at the level of individual kernels and parameter spaces of SLAM pipelines is non-existent, which is absolutely essential for systems research and integration. We extend the recently introduced SLAMBench framework to allow comparing two state-of-the-art SLAM pipelines, namely KinectFusion and LSD-SLAM, along the metrics of accuracy, energy consumption, and processing frame rate on two different hardware platforms, namely a desktop and an embedded device. We also analyze the pipelines at the level of individual kernels and explore their algorithmic and hardware design spaces for the first time, yielding valuable insights.
Wozniak BD, Witherden FD, Russell FP, et al., 2016, GiMMiK - Generating Bespoke Matrix Multiplication Kernels for Accelerators: Application to High-Order Computational Fluid Dynamics, Computer Physics Communications, Vol: 202, Pages: 12-22, ISSN: 0010-4655
Matrix multiplication is a fundamental linear algebra routineubiquitous in all areas of science and engineering. Highly optimised BLAS libraries (cuBLAS and clBLAS on GPUs) are the most popular choices for an implementation of the General Matrix Multiply (GEMM) in software. In this paper we present GiMMiK - a generator of bespoke matrix multiplication kernels for the CUDA and OpenCL platforms. GiMMiK exploits a prior knowledge of the operator matrix to generate highly performant code. The performance of GiMMiK's kernels is particularly apparent in a block-bypanel type of matrix multiplication, where the block matrix is typically small (e.g. dimensions of 96 × 64). Such operations are characteristic to our motivating application in PyFR - an implementation of Flux Reconstruction schemes for high-order fluid flow simulations on mixed unstructured meshes. GiMMiK fully unrolls the matrix-vector product and embeds matrix entries directly in the code to benefit from the use of the constant cache and compiler optimisations. Further, it reduces the number of floating-point operations by removing multiplications by zeros. Together with the ability of our kernels to avoid the poorly optimised cleanup code, executed by library GEMM, we are able to outperform cuBLAS on two NVIDIA GPUs: GTX 780 Ti and Tesla K40c. We observe speedups of our kernels over cuBLAS GEMM of up to 9.98 and 63.30 times for a 294 × 1029 99% sparse PyFR matrix in double precision on the Tesla K40c and GTX 780 Ti correspondingly. In single precision, observed speedups reach 12.20 and 13.07 times for a 4 × 8 50% sparse PyFR matrix on the two aforementioned cards. Using GiMMiK as the matrix multiplication kernel provider allows us to achieve a speedup of up to 1.70 (2.19) for a simulation of an unsteady flow over a cylinder executed with PyFR in double (single) precision on the Tesla K40c. All results were generated with GiMMiK version 1.0.
Rokos G, Gorman G, Kelly PHJ, 2015, A fast and scalable graph coloring algorithm for multi-core and many-core architectures, 21st International Conference on Parallel and Distributed Computing, Publisher: Springer, Pages: 414-425, ISSN: 0302-9743
Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Çatalyürek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically with less thread synchronization and reduced number of conflicts compared to Çatalyürek et al.’s algorithm.We show that the new technique scales better on multicore and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels.
Kelly PHJ, Reguly IZ, Mudalige GR, et al., 2015, Acceleration of a Full-scale Industrial CFD Application with OP2, IEEE Transactions on Parallel and Distributed Systems, Vol: 27, Pages: 1265-1278, ISSN: 1558-2183
Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls Royce plc., capable ofperforming complex simulations over highly detailed unstructured mesh geometries. Hydra presents major challenges in dataorganization and movement that need to be overcome for continued high performance on emerging platforms. We present research inachieving this goal through the OP2 domain-specific high-level framework, demonstrating the viability of such a high-level programmingapproach. OP2 targets the domain of unstructured mesh problems and enables execution on a range of back-end hardware platforms.We chart the conversion of Hydra to OP2, and map out the key difficulties encountered in the process. Specifically we show howdifferent parallel implementations can be achieved with an active library framework, even for a highly complicated industrial applicationand how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly,reducing developer effort and increasing code longevity. Performance results demonstrate that not only the same runtime performanceas that of the hand-tuned original code could be achieved, but it can be significantly improved on conventional processor systems, andmany-core systems. Our results provide evidence of how high-level frameworks such as OP2 enable portability across a wide range ofcontrasting platforms and their significant utility in achieving near-optimal performance without the intervention of the applicationprogrammer.
© 2015, Springer International Publishing Switzerland. Mesh adaptation is a powerful way to minimise the computational cost of mesh based computation. It is particularly successful for multi-scale problems where the required mesh resolution can vary by orders of magnitude across the domain. The end result is local control over solution accuracy and reduced time to solution. In the case of large scale simulations, where the time to solution is unacceptable or the memory requirements exceeds available RAM, mesh based computation is typically parallelised using domain decomposition methods using the Message Passing Interface (MPI). This allows a simulation to run in parallel on a distributed memory computer. While this has been a high successful strategy up until now, the drive towards low power multi- and many-core architectures means that an even higher degree of parallelism is required and the memory hierarchy exploited to maximise memory bandwidth. For this reason application codes are increasingly adopting a hybrid parallel approach whereby decomposition methods, implemented using the Message Passing Interface (MPI), are applied for inter-node parallelisation, while a threaded programming model is used for intra-node parallelisation. Mesh adaptivity has been successfully parallelised using MPI by a number of groups, and can be implemented efficiently with few modifications to the serial code. However, thread-level parallelism is significantly more challenging because each thread modifies the mesh data and therefore must be carefully marshalled to avoid data races while still ensuring enough parallelism is exposed to achieve good parallel efficiency. Here we describe a new thread-parallel algorithm for anisotropic mesh adaptation algorithms. For each mesh optimisation phase (refinement, coarsening, swapping and smoothing) we describe how independent sets of tasks are defined. We show how a deferred updates strategy can be used to update the mesh data structure
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.