Publications
Sasongko MA, Chabbi M, Kelly PHJ, et al., 2023, Precise event sampling on AMD versus Intel: quantitative and qualitative comparison, IEEE Transactions on Parallel and Distributed Systems, Vol: 34, Pages: 1594-1608, ISSN: 1045-9219
Precise event sampling is a profiling feature in commodity processors that can sample hardware events and accurately locate the instructions that trigger the events. This feature has been used in a large number of tools to detect application performance issues. Although precise event sampling is readily supported in modern multicore architectures, vendor support exhibits great differences that affect accuracy, stability, overhead, and functionality. This work presents the most comprehensive study to date benchmarking the event sampling features of Intel PEBS and AMD IBS, and performs in-depth analysis of key differences through a series of microbenchmarks. Our qualitative and quantitative analysis shows that PEBS allows finer-grained and more accurate sampling of hardware events, while IBS offers a richer set of information at each sample, though it suffers from lower accuracy and stability. Moreover, OS signal delivery, a common method used by profiling software, introduces significant time overhead on top of the overhead incurred by the hardware mechanisms in both PEBS and IBS. We also found that both PEBS and IBS exhibit bias when sampling events across different locations in code. Lastly, we demonstrate how our findings from microbenchmarks under different thread counts hold for a full-fledged profiling tool running on state-of-the-art Intel and AMD machines. Overall, our detailed comparisons serve as a reference and provide valuable information for hardware designers and profiling tool developers.
Sasongko MA, Chabbi M, Kelly P, et al., 2023, Precise event sampling-based data locality tools for AMD multicore architectures, Concurrency and Computation: Practice and Experience, Pages: 1-18, ISSN: 1532-0626
We propose ComDetective+, an inter-thread communication analyzer, and ReuseTracker+, a reuse distance analyzer, that leverage the hardware features in AMD processors to support low-overhead profiling. Both tools employ the instruction-based sampling (IBS) facility and debug registers in AMD processors to detect inter-thread communication and data reuse. Different from prior art, ComDetective+ differentiates communication into true and false sharing, and ReuseTracker+ measures reuse distance in private and shared caches, also considering cache line invalidation, with low overhead. Both tools can attribute the communications and reuses to source code lines. To our knowledge, these tools are two of the few profiling tools designed specifically for AMD x86 architectures using IBS. Our tools are timely and relevant considering the rise in the number of AMD processor-based data centers and HPC systems. We perform experiments to evaluate the accuracy and overheads of the proposed tools on an AMD machine with two-socket EPYC 7352 processors. ComDetective+ exhibits high accuracy while introducing 5.14× runtime and 1.4× memory overheads. ReuseTracker+ also displays high accuracy, at 95%, with 11.76× runtime and 1.46× memory overheads. These overheads are much lower than those of existing simulators and code instrumentation-based tools. Lastly, we demonstrate the usage of the tools by having ComDetective+ and ReuseTracker+ facilitate the code refactoring of two data mining benchmarks to improve their performance by up to 29%.
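The core quantity ReuseTracker+ measures is reuse distance: the number of distinct memory locations touched between two consecutive accesses to the same location. As background, here is a minimal Python sketch of that metric on a synthetic address trace; the tool itself samples real executions via IBS and debug registers rather than enumerating every access, and this helper is purely illustrative.

```python
def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct addresses
    touched strictly between two consecutive accesses to the same
    address (-1 marks a first access, i.e. a cold miss)."""
    last_pos = {}                 # address -> index of its previous access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            distances.append(len(set(trace[last_pos[addr] + 1 : i])))
        else:
            distances.append(-1)
        last_pos[addr] = i
    return distances

print(reuse_distances(["a", "b", "c", "a", "b"]))  # [-1, -1, -1, 2, 2]
```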
Koch MK, Kelly PHJ, Vincent P, 2022, Identification and classification of off-vertex critical points for contour tree construction on unstructured meshes of hexahedra, IEEE Transactions on Visualization and Computer Graphics, Vol: 28, Pages: 5178-5180, ISSN: 1077-2626
The topology of isosurfaces changes at isovalues of critical points, making such points an important feature when building contour trees or Morse-Smale complexes. Hexahedral elements with linear interpolants can contain additional off-vertex critical points in element bodies and on element faces. Moreover, a point on the face of a hexahedron which is critical in the element-local context is not necessarily critical in the global context. In “Exploring Scalar Fields Using Critical Isovalues”, Weber et al. introduce a method to determine whether critical points on faces are also critical in the global context, based on the gradient of the asymptotic decider in each element that shares the face. However, as defined, the method of Weber et al. contains an error, and can lead to incorrect results. In this work we correct the error.
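For background, the quantity involved is the saddle of the bilinear interpolant on a quad face; its location follows from setting both partial derivatives to zero, and its value is the classic asymptotic-decider value. The Python sketch below shows only this standard element-local computation, not the paper's global-context test or its correction.

```python
def face_saddle(f00, f10, f01, f11):
    """Interior saddle of the bilinear interpolant on a quad face:
    f(x, y) = f00*(1-x)*(1-y) + f10*x*(1-y) + f01*(1-x)*y + f11*x*y.
    Returns (x, y, value) if the saddle lies strictly inside the face,
    otherwise None."""
    d = f00 - f10 - f01 + f11
    if d == 0.0:                      # interpolant degenerates to a plane
        return None
    x = (f00 - f01) / d               # df/dx = 0
    y = (f00 - f10) / d               # df/dy = 0
    if not (0.0 < x < 1.0 and 0.0 < y < 1.0):
        return None
    value = (f00 * f11 - f10 * f01) / d   # the asymptotic-decider value
    return x, y, value

# Opposite corners high/low: the saddle sits at the face centre.
print(face_saddle(1.0, 0.0, 0.0, 1.0))  # (0.5, 0.5, 0.5)
```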
Stow E, Ahsan A, Li Y, et al., 2022, Compiling CNNs with Cain: focal-plane processing for robot navigation, Autonomous Robots, Vol: 46, Pages: 893-910, ISSN: 0929-5593
Focal-plane Sensor-processors (FPSPs) are a camera technology that enables low-power, high-frame-rate computation in the image sensor itself, making them suitable for edge computation. To fit into the sensor array, FPSPs are highly resource-constrained, with a limited instruction set and few registers, which makes developing complex algorithms difficult. In this work, we present Cain, a compiler for convolutional filters that targets SCAMP-5, a general-purpose FPSP. Cain generates code to evaluate multiple convolutional kernels at the same time. It generates code that avoids the need for hardware multipliers, while orchestrating the exploitation of common sub-terms—leading to a large reduction in instruction count compared to both straightforward and prior optimized approaches. We demonstrate the capability enabled by Cain on SCAMP-5 with robotic navigation for near-sensor high-speed and low-power computation, by using Cain to implement a neural network on the focal plane.
Stow E, Kelly PHJ, 2022, Convolutional kernel function algebra, Frontiers of Computer Science, Vol: 4, Pages: 1-13, ISSN: 2095-2228
Many systems for image manipulation, signal analysis, machine learning, and scientific computing make use of discrete convolutional filters that are known before computation begins. These contexts benefit from common sub-expression elimination to reduce the number of calculations required, both multiplications and additions. We present an algebra for describing convolutional kernels and filters at a sufficient level of abstraction to enable intuitive common sub-expression based optimizations through decomposing filters into smaller, repeated, kernels. This enables the creation of an enormous search space of potential implementations of filters via algebraic manipulation. We demonstrate how integral image and sliding window optimizations can be expressed in the context of common sub-expression elimination as well as show the direct use case for this algebra in massively SIMD multiply-free contexts such as in cellular processor arrays. We then show that this algebra is general enough to express and optimize kernels that use non-standard semi-rings to enable shortest path algorithms.
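To illustrate the kind of decomposition this enables: many filters factor into repeated applications of a tiny kernel, and the repeated kernel's partial results become common sub-expressions shared across outputs. A minimal Python sketch of composing kernels by plain discrete convolution (an illustration, not the paper's algebra):

```python
def convolve1d(a, b):
    """Full 1-D discrete convolution of two kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# A binomial (Gaussian-like) smoother decomposes into repeated
# applications of the tiny kernel [1, 1]: [1,1]*[1,1] = [1,2,1].
k = [1.0]
for _ in range(2):
    k = convolve1d(k, [1.0, 1.0])
print(k)  # [1.0, 2.0, 1.0]
```

Evaluating the filter as repeated applications of the small kernel shares each partial sum across neighbouring outputs, which is exactly the common sub-expression saving the algebra is designed to expose.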
Kukreja N, Huckelheim J, Louboutin M, et al., 2022, Lossy checkpoint compression in full waveform inversion: a case study with ZFPv0.5.5 and the overthrust model, Geoscientific Model Development, Vol: 15, Pages: 3815-3829, ISSN: 1991-959X
This paper proposes a new method that combines checkpointing methods with error-controlled lossy compression for large-scale high-performance full-waveform inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory. In the exascale computing era, frequent data transfer (e.g., memory bandwidth, PCIe bandwidth for GPUs, or network) is the performance bottleneck rather than the peak FLOPS of the processing unit. Like many other adjoint-based optimization problems, FWI is costly in terms of the number of floating-point operations, large memory footprint during backpropagation, and data transfer overheads. Past work for adjoint methods has developed checkpointing methods that reduce the peak memory requirements during backpropagation at the cost of additional floating-point computations. Combining this traditional checkpointing with error-controlled lossy compression, we explore the three-way tradeoff between memory, precision, and time to solution. We investigate how approximation errors introduced by lossy compression of the forward solution impact the objective function gradient and final inverted solution. Empirical results from these numerical experiments indicate that high lossy-compression rates (compression factors ranging up to 100) have a relatively minor impact on convergence rates and the quality of the final solution.
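A minimal Python sketch of the combination, under stated assumptions: zlib stands in for the error-controlled lossy compressor (the paper uses ZFP), and forward_step is a toy placeholder for a wave-propagation time step. During the adjoint pass, the nearest checkpoint is decompressed and the forward state recomputed from it.

```python
import zlib
import numpy as np

def forward_step(u):
    """Toy stand-in for one explicit time step of a wave propagator."""
    return 0.5 * (np.roll(u, 1) + np.roll(u, -1))

def checkpointed_forward(u0, nsteps, every):
    """Run forward, keeping a compressed snapshot every `every` steps."""
    ckpts, u = {}, u0.astype(np.float32)
    for t in range(nsteps):
        if t % every == 0:
            ckpts[t] = zlib.compress(u.tobytes())
        u = forward_step(u)
    return u, ckpts

def recompute(ckpts, t, every, shape):
    """Restore the nearest earlier checkpoint, then recompute to step t,
    as the adjoint (backpropagation) pass would."""
    base = (t // every) * every
    u = np.frombuffer(zlib.decompress(ckpts[base]), dtype=np.float32).reshape(shape)
    for _ in range(t - base):
        u = forward_step(u)
    return u

u0 = np.random.rand(1024)
uT, ckpts = checkpointed_forward(u0, nsteps=100, every=10)
u57 = recompute(ckpts, 57, every=10, shape=u0.shape)
```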
Hsueh H-Y, Toma A-I, Jaafar HA, et al., 2022, Systematic comparison of path planning algorithms using PathBench, Advanced Robotics, Vol: 36, Pages: 566-581, ISSN: 0169-1864
Path planning is an essential component of mobile robotics. Classical path planning algorithms, such as wavefront and rapidly exploring random tree, are used heavily in autonomous robots. With the recent advances in machine learning, development of learning-based path planning algorithms has been experiencing rapid growth. A unified path planning interface that facilitates the development and benchmarking of existing and new algorithms is needed. This paper presents PathBench, a platform for developing, visualizing, training, testing, and benchmarking of existing and future, classical and learning-based path planning algorithms in 2D and 3D grid world environments. Many existing path planning algorithms are supported, e.g. A*, Dijkstra, waypoint planning networks, value iteration networks, and gated path planning networks; integrating new algorithms is easy and clearly specified. The benchmarking ability of PathBench is explored in this paper by comparing algorithms across five different hardware systems and three different map types, including built-in PathBench maps, video game maps, and maps from real-world databases. Metrics, such as path length, success rate, and computational time, were used to evaluate algorithms. Algorithmic analysis was also performed on a real-world robot to demonstrate PathBench's support for the Robot Operating System. PathBench is open source.
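For background, one of the classical planners listed, A*, fits in a few lines. The sketch below (illustrative Python, not PathBench's implementation) runs A* with a Manhattan heuristic on a 4-connected occupancy grid.

```python
import heapq, itertools

def astar(grid, start, goal):
    """A* on a 4-connected occupancy grid (truthy cell = blocked).
    Returns the path as a list of (row, col) cells, or None."""
    rows, cols = len(grid), len(grid[0])
    h = lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1])  # Manhattan
    tie = itertools.count()                                  # heap tie-breaker
    g, parent = {start: 0}, {start: None}
    heap = [(h(start), next(tie), start)]
    closed = set()
    while heap:
        _, _, cur = heapq.heappop(heap)
        if cur in closed:
            continue
        if cur == goal:                                      # reconstruct path
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        closed.add(cur)
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and not grid[nxt[0]][nxt[1]]:
                ng = g[cur] + 1
                if ng < g.get(nxt, float("inf")):
                    g[nxt], parent[nxt] = ng, cur
                    heapq.heappush(heap, (ng + h(nxt), next(tie), nxt))
    return None

grid = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (0, 2)))  # routes around the wall of 1s
```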
Papaphilippou P, Kelly PHJ, Luk W, 2021, Simodense: a RISC-V softcore optimised for exploring custom SIMD instructions, FPL2021. The International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE, Pages: 391-397
Simodense is a high-performance open-source RISC-V (RV32IM) softcore, optimised for exploring custom SIMD instructions. In order to maximise SIMD instruction performance, the design’s memory system is optimised for streaming bandwidth, with features such as very wide blocks for the last-level cache. The approach is demonstrated on example memory-intensive applications with custom instructions. This paper also provides insights on the effectiveness of adding FPGA resources to general-purpose processors in the form of reconfigurable SIMD instructions.
Papaphilippou P, Kelly PHJ, Luk W, 2021, Demonstrating custom SIMD instruction development for a RISC-V softcore, FPL2021. The International Conference on Field-Programmable Logic and Applications (FPL), Publisher: IEEE
This demo elaborates on the programmability aspect of Simodense, a recently released open-source softcore optimised for evaluating custom SIMD instructions. CPUs featuring small reconfigurable areas for implementing custom instructions are an alternative path in computer architecture that can help with the challenges found in today’s FPGAs. By providing RTL-based programmability for implementing custom SIMD instructions, highly-integrated accelerators can be developed, while benefiting from the pre-existing CPU logic, such as the caches and their high throughput to main memory.
Toma A-I, Hsueh H-Y, Jaafar HA, et al., 2021, PathBench: a benchmarking platform for classical and learned path planning algorithms, 2021 18th Conference on Robots and Vision (CRV), Publisher: IEEE, Pages: 79-86
Path planning is a key component in mobile robotics. A wide range of path planning algorithms exist, but few attempts have been made to benchmark the algorithms holistically or unify their interface. Moreover, with the recent advances in deep neural networks, there is an urgent need to facilitate the development and benchmarking of such learning-based planning algorithms. This paper presents PathBench, a platform for developing, visualizing, training, testing, and benchmarking of existing and future, classical and learned 2D and 3D path planning algorithms, while offering support for the Robot Operating System (ROS). Many existing path planning algorithms are supported, e.g. A*, wavefront, rapidly-exploring random tree, value iteration networks, and gated path planning networks, and integrating new algorithms is easy and clearly specified. We demonstrate the benchmarking capability of PathBench by comparing implemented classical and learned algorithms on metrics such as path length, success rate, computational time, and path deviation. These evaluations are done on built-in PathBench maps and external path planning environments from video games and real-world databases. PathBench is open source.
Bisbas G, Luporini F, Louboutin M, et al., 2021, Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources, 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Publisher: IEEE COMPUTER SOC, Pages: 497-506, ISSN: 1530-2075
Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to practical applications' stencils remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid (“off-the-grid”). Our work is motivated by modelling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the gridded wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that this enables substantial performance improvement through temporal blocking over highly-optimized vectorized spatially-blocked code of up to 1.6x.
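To make the optimisation concrete, here is a toy 1-D Python sketch of one temporal-blocking variant (overlapped tiles with redundant halo computation): each tile is widened by a halo of t cells per side, advanced t time steps locally, and only the tile interior is written back, so memory is swept once rather than t times. This illustrates only the basic technique, not Devito's scheme or the paper's handling of sparse off-the-grid operators.

```python
def step(u):
    """One 3-point stencil sweep; boundary values are held fixed."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
    return v

def naive(u, t):
    for _ in range(t):
        u = step(u)
    return u

def temporally_blocked(u, t, tile):
    """Advance t steps tile by tile: a halo of width t per side ensures
    that boundary errors from the local sweep never reach the interior
    cells that are written back."""
    n, out = len(u), u[:]
    for s in range(0, n, tile):
        e = min(s + tile, n)
        lo, hi = max(s - t, 0), min(e + t, n)
        slab = u[lo:hi]
        for _ in range(t):
            slab = step(slab)
        out[s:e] = slab[s - lo : e - lo]
    return out

u = [float(i % 7) for i in range(64)]
assert naive(u, 3) == temporally_blocked(u, 3, tile=16)
```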
Papaphilippou P, Kelly PHJ, Luk W, 2021, Extending the RISC-V ISA for exploring advanced reconfigurable SIMD instructions
This paper presents a novel, non-standard set of vector instruction types for exploring custom SIMD instructions in a softcore. The new types allow simultaneous access to a relatively high number of operands, reducing the instruction count where applicable. Additionally, a high-performance open-source RISC-V (RV32IM) softcore is introduced, optimised for exploring custom SIMD instructions and streaming performance. By providing instruction templates for instruction development in HDL/Verilog, efficient FPGA-based instructions can be developed with a few low-level lines of code. In order to improve custom SIMD instruction performance, the softcore's cache hierarchy is optimised for bandwidth, such as with very wide blocks for the last-level cache. The approach is demonstrated on example memory-intensive applications on an FPGA. Although the exploration is based on the softcore, the goal is to provide a means to experiment with advanced SIMD instructions which could be loaded in future CPUs that feature reconfigurable regions as custom instructions. Finally, we provide some insights on the challenges and effectiveness of such future micro-architectures.
Stow E, Murai R, Saeedi S, et al., 2021, Cain: Automatic code generation for simultaneous convolutional kernels on focal-plane sensor-processors, Languages and Compilers for Parallel Computing (LCPC 2020), Publisher: Springer Verlag, Pages: 181-197, ISSN: 0302-9743
Focal-plane Sensor-processors (FPSPs) are a camera technology that enables low power, high frame rate computation, making the device suitable for edge computation. Unfortunately, the device’s limited instruction set and registers make the development of complex algorithms challenging. In this work, we present Cain – a compiler that targets SCAMP-5, a general-purpose FPSP – which generates SCAMP-5 code from multiple convolutional kernels. As an example, given the convolutional kernels for an MNIST digit recognition neural network, Cain produces code that is half as long, when compared to the other available compilers for SCAMP-5.
Murai R, Saeedi S, Kelly P, 2021, BIT-VO: visual odometry at 300 FPS using binary features from the focal plane, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, Publisher: IEEE, Pages: 8579-8586
Focal-plane Sensor-processor (FPSP) is a next-generation camera technology which enables every pixel on the sensor chip to perform computation in parallel, on the focal plane where the light intensity is captured. SCAMP-5 is a general-purpose FPSP used in this work and it carries out computations in the analog domain before analog-to-digital conversion. By extracting features from the image on the focal plane, the amount of data which is digitised and transferred is reduced. As a consequence, SCAMP-5 offers a high frame rate while maintaining low energy consumption. Here, we present BIT-VO, the first 6-degrees-of-freedom visual odometry algorithm which utilises an FPSP. Our entire system operates at 300 FPS in a natural environment, using binary edge and corner features detected by the SCAMP-5.
Sun T, Mitchell L, Kulkarni K, et al., 2020, A study of vectorization for matrix-free finite element methods, International Journal of High Performance Computing Applications, Vol: 34, Pages: 629-644, ISSN: 1094-3420
Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses difficulties to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. Our experiments show that our approaches for cross-element vectorization achieve 30% of theoretical peak performance for many examples of practical significance, and exceed 50% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels.
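The layout idea behind cross-element vectorization can be shown with a toy batched "local assembly" kernel in numpy: rather than vectorizing inside one element's small kernel, the contraction is batched over the long element axis, which maps naturally onto SIMD lanes. Firedrake achieves this by code transformation and generates C; the shapes below are hypothetical.

```python
import numpy as np

nelem, ndof = 10000, 4                 # many elements, tiny local kernel
A = np.random.rand(nelem, ndof, ndof)  # per-element "assembly" matrices
x = np.random.rand(nelem, ndof)        # per-element coefficient vectors

# Element by element: the vectorizer sees only ndof-sized loops.
y_scalar = np.stack([A[e] @ x[e] for e in range(nelem)])

# Cross-element: one batched contraction over the element axis.
y_vec = np.einsum("eij,ej->ei", A, x)

assert np.allclose(y_scalar, y_vec)
```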
Carvalho EDC, Clark R, Nicastro A, et al., 2020, Scalable uncertainty for computer vision with functional variational inference, CVPR 2020, Publisher: IEEE, Pages: 12003-12013
As Deep Learning continues to yield successful applications in Computer Vision, the ability to quantify all forms of uncertainty is a paramount requirement for its safe and reliable deployment in the real world. In this work, we leverage the formulation of variational inference in function space, where we associate Gaussian Processes (GPs) to both Bayesian CNN priors and variational family. Since GPs are fully determined by their mean and covariance functions, we are able to obtain predictive uncertainty estimates at the cost of a single forward pass through any chosen CNN architecture and for any supervised learning task. By leveraging the structure of the induced covariance matrices, we propose numerically efficient algorithms which enable fast training in the context of high-dimensional tasks such as depth estimation and semantic segmentation. Additionally, we provide sufficient conditions for constructing regression loss functions whose probabilistic counterparts are compatible with aleatoric uncertainty quantification.
Luporini F, Lange M, Louboutin M, et al., 2020, Architecture and performance of Devito, a system for automated stencil computation, ACM Transactions on Mathematical Software, Vol: 46, Pages: 1-24, ISSN: 0098-3500
Stencil computations are a key part of many high-performance computing applications, such as image processing, convolutional neural networks, and finite-difference solvers for partial differential equations. Devito is a framework capable of generating highly-optimized code given symbolic equations expressed in Python, specialized in, but not limited to, affine (stencil) codes. The lowering process—from mathematical equations down to C++ code—is performed by the Devito compiler through a series of intermediate representations. Several performance optimizations are introduced, including advanced common sub-expression elimination, tiling, and parallelization. Some of these are obtained through well-established stencil optimizers, integrated in the back-end of the Devito compiler. The architecture of the Devito compiler, as well as the performance optimizations that are applied when generating code, are presented. The effectiveness of such performance optimizations is demonstrated using operators drawn from seismic imaging applications.
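For a flavour of the input Devito lowers, the sketch below (assuming Devito is installed, and closely following the diffusion examples in Devito's own tutorials) builds an operator from a symbolic equation and runs the generated code; parameter values are arbitrary.

```python
from devito import Grid, TimeFunction, Eq, Operator, solve

grid = Grid(shape=(64, 64))                    # 2D grid, unit extent
u = TimeFunction(name="u", grid=grid, space_order=2)
u.data[0, 32, 32] = 1.0                        # point initial condition

pde = Eq(u.dt, 0.5 * u.laplace)                # symbolic diffusion equation
update = Eq(u.forward, solve(pde, u.forward))  # explicit update stencil

op = Operator(update)                          # lowered to optimized C code
op.apply(time=50, dt=1e-5)
```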
Vespa E, Funk N, Kelly PHJ, et al., 2019, Adaptive-resolution octree-based volumetric SLAM, 7th International Conference on 3D Vision (3DV), Publisher: IEEE COMPUTER SOC, Pages: 654-662, ISSN: 2378-3826
We introduce a novel volumetric SLAM pipeline for the integration and rendering of depth images at an adaptive level of detail. Our core contribution is a fusion algorithm which dynamically selects the appropriate integration scale based on the effective sensor resolution given the distance from the observed scene, addressing aliasing issues, reconstruction quality, and efficiency simultaneously. We implement our approach using an efficient octree structure which supports multi-resolution rendering allowing for online frame-to-model alignment. Our qualitative and quantitative experiments demonstrate significantly improved reconstruction quality and up to six-fold execution time speed-ups compared to single resolution grids.
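A sketch of the scale-selection idea in Python, under simple assumptions that are ours, not the paper's exact rule: a pinhole camera whose pixel footprint at depth d is roughly d divided by the focal length in pixels, and an octree whose level-s voxels are 2^s times the finest voxel size.

```python
import math

def integration_scale(depth_m, focal_px, finest_voxel_m, max_scale):
    """Pick the octree level whose voxel size best matches the sensor's
    effective resolution at this depth."""
    footprint_m = depth_m / focal_px              # ~ one pixel's size at depth d
    ratio = max(footprint_m / finest_voxel_m, 1.0)
    return min(int(round(math.log2(ratio))), max_scale)

# A pixel at 4 m with f = 525 px covers ~7.6 mm; with 5 mm finest voxels
# this integrates at scale 1 (10 mm voxels).
print(integration_scale(4.0, 525.0, 0.005, max_scale=5))
```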
Bujanca M, Gafton P, Saeedi S, et al., 2019, SLAMBench 3.0: Systematic automated reproducible evaluation of SLAM systems for robot vision challenges and scene understanding, 2019 International Conference on Robotics and Automation (ICRA), Publisher: Institute of Electrical and Electronics Engineers, ISSN: 1050-4729
As the SLAM research area matures and the number of SLAM systems available increases, the need for frameworks that can objectively evaluate them against prior work grows. This new version of SLAMBench moves beyond traditional visual SLAM, and provides new support for scene understanding and non-rigid environments (dynamic SLAM). More concretely for dynamic SLAM, SLAMBench 3.0 includes the first publicly available implementation of DynamicFusion, along with an evaluation infrastructure. In addition, we include two SLAM systems (one dense, one sparse) augmented with convolutional neural networks for scene understanding, together with datasets and appropriate metrics. Through a series of use-cases, we demonstrate the newly incorporated algorithms, visualization aids and metrics (6 new metrics, 4 new datasets and 5 new algorithms).
Saeedi S, Carvalho EDC, Li W, et al., 2019, Characterizing visual localization and mapping datasets, 2019 International Conference on Robotics and Automation (ICRA), Publisher: Institute of Electrical and Electronics Engineers, ISSN: 1050-4729
Benchmarking mapping and motion estimation algorithms is established practice in robotics and computer vision. As the diversity of datasets increases, in terms of the trajectories, models, and scenes, it becomes a challenge to select datasets for a given benchmarking purpose. Inspired by the Wasserstein distance, this paper addresses this concern by developing novel metrics to evaluate trajectories and environments without relying on any SLAM or motion estimation algorithm. The metrics, which so far have been missing in the research community, can be applied to the plethora of datasets that exist. Additionally, to improve robotics SLAM benchmarking, the paper presents a new dataset for visual localization and mapping algorithms. A broad range of real-world trajectories is combined with very high-quality scenes and a rendering framework to create a set of synthetic datasets with ground-truth trajectories and dense maps, representative of key SLAM applications such as virtual reality (VR), micro aerial vehicle (MAV) flight, and ground robotics.
Koch MK, Kelly PHJ, Vincent PE, 2019, Towards in-situ vortex identification for peta-scale CFD using contour trees, 8th IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV), Publisher: Institute of Electrical and Electronics Engineers, Pages: 104-105
Turbulent flows exist in many fields of science and occur in a wide range of engineering applications. While in the past broad knowledge has been established regarding the statistical properties of turbulence at a range of Reynolds numbers, there is a lack of understanding of the detailed structure of these flows. Since the physical processes involve a vast number of structures, extremely large data sets are required to fully resolve a flow field in both space and time. To make the analysis of such data sets possible, we propose a framework that uses state-of-the-art contour tree construction algorithms to identify, classify and track vortices in turbulent flow fields produced by large-scale high-fidelity massively-parallel computational fluid dynamics solvers such as PyFR. Since disk capacity and I/O have become a bottleneck for such large-scale simulations, the proposed framework will be applied in-situ, while relevant data is still in device memory.
Luporini F, Lange M, Jacobs CT, et al., 2019, Automated tiling of unstructured mesh computations with application to seismological modeling, ACM Transactions on Mathematical Software, Vol: 45, ISSN: 0098-3500
Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]], hence the name “sparse.” One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation: not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.
Papaphilippou P, Kelly PHJ, Luk W, 2019, Pangloss: a novel Markov chain prefetcher, Publisher: arXiv
We present Pangloss, an efficient high-performance data prefetcher that approximates a Markov chain on delta transitions. With a limited information scope and space/logic complexity, it is able to reconstruct a variety of both simple and complex access patterns. This is achieved by a highly-efficient representation of the Markov chain that provides accurate values for transition probabilities. In addition, we have added a mechanism to reconstruct delta transitions originally obfuscated by out-of-order execution or page transitions, such as when streaming data from multiple sources. Our single-level (L2) prefetcher achieves geometric-mean speedups of 1.7% and 3.2% over selected state-of-the-art baselines (KPCP and BOP). When combined with an equivalent for the L1 cache (L1 & L2), the speedups rise to 6.8% and 8.4%, and to 40.4% over no prefetching. In the multi-core evaluation, there appears to be a considerable performance improvement as well.
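The underlying idea, stripped of Pangloss's compact probability representation and its recovery of obfuscated deltas, is a first-order Markov chain over address deltas; an illustrative Python sketch:

```python
from collections import defaultdict

class DeltaMarkovPrefetcher:
    """Toy next-delta predictor: counts transitions between successive
    address deltas and prefetches the most frequent successor."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.last_addr = None
        self.last_delta = None

    def access(self, addr):
        """Record an access; return a predicted prefetch address or None."""
        prediction = None
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if self.last_delta is not None:
                self.counts[self.last_delta][delta] += 1    # learn transition
            successors = self.counts.get(delta)
            if successors:
                best = max(successors, key=successors.get)  # likeliest next delta
                prediction = addr + best
            self.last_delta = delta
        self.last_addr = addr
        return prediction

pf = DeltaMarkovPrefetcher()
for a in [0, 1, 3, 4, 6, 7, 9, 10]:       # repeating +1, +2 delta pattern
    print(a, "->", pf.access(a))          # correct predictions from the 4th access
```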
Debrunner T, Saeedi Gharahbolagh S, Kelly P, 2019, AUKE: Automatic Kernel Code Generation for an analogue SIMD Focal-plane Sensor-Processor Array, ACM Transactions on Architecture and Code Optimization, Vol: 15, ISSN: 1544-3973
Focal-plane Sensor-Processor Arrays (FPSPs) are new imaging devices with parallel Single Instruction Multiple Data (SIMD) computational capabilities built into every pixel. Compared to traditional imaging devices, FPSPs allow for massive pixel-parallel execution of image processing algorithms. This enables the application of certain algorithms at extreme frame rates (>10,000 frames per second). By performing some early-stage processing in-situ, systems incorporating FPSPs can consume less power compared to conventional approaches using standard digital cameras. In this article, we explore code generation for an FPSP whose 256 × 256 processors operate on analogue signal data, leading to further opportunities for power reduction—and additional code synthesis challenges. While rudimentary image processing algorithms have been demonstrated on FPSPs before, progress with higher-level computer vision algorithms has been sparse due to the unique architecture and limits of the devices. This article presents a code generator for convolution filters for the SCAMP-5 FPSP, with applications in many high-level tasks such as convolutional neural networks, pose estimation, and so on. The SCAMP-5 FPSP has no effective multiply operator. Convolutions have to be implemented through sequences of more primitive operations such as additions, subtractions, and multiplications/divisions by two. We present a code generation algorithm to optimise convolutions by identifying common factors in the different weights and by determining an optimised pattern of pixel-to-pixel data movements to exploit them. We present an evaluation in terms of both speed and energy consumption for a suite of well-known convolution filters. Furthermore, an application of the method is shown by the implementation of a Viola-Jones face detection algorithm.
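The constraint can be illustrated concisely: with no multiplier, a weight is approximated as a sum of powers of two and applied with only halving and addition, the operation mix AUKE schedules (together with pixel-to-pixel moves, which this Python sketch omits).

```python
def scale_by_halves(x, powers):
    """Multiply x by sum(2**-p for p in powers) using only repeated
    halving and addition, the operations available in place of a
    hardware multiplier."""
    total = 0.0
    for p in powers:
        v = x
        for _ in range(p):
            v = v / 2            # one 'divide by two' primitive
        total += v
    return total

# weight 0.625 = 1/2 + 1/8, i.e. powers {1, 3}
assert scale_by_halves(8.0, [1, 3]) == 8.0 * 0.625
```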
Saeedi Gharahbolagh S, Bodin B, Wagstaff H, et al., 2018, Navigating the landscape for real-time localisation and mapping for robotics, virtual and augmented reality, Proceedings of the IEEE, Vol: 106, Pages: 2020-2039, ISSN: 0018-9219
Visual understanding of 3-D environments in real time, at low power, is a huge computational challenge. Often referred to as simultaneous localization and mapping (SLAM), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm, hardware, and compilation pathway to meet their performance, accuracy, and energy consumption goals. The major contributions we present are: 1) tools and methodology for systematic quantitative evaluation of SLAM algorithms; 2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives; 3) end-to-end simulation tools to enable optimization of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches; and 4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.
Kelly PHJ, 2018, IEEE Cluster 2018: Message from the Program Chair, IEEE Cluster Conference 2018, Pages: XV-XV, ISSN: 1552-5244
Bodin B, Nardi L, Wagstaff H, et al., 2018, Algorithmic Performance-Accuracy Trade-off in 3D Vision Applications, Pages: 123-124
Simultaneous Localisation And Mapping (SLAM) is a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is particularly true when it comes to evaluating the potential trade-offs between computation speed, accuracy, and power consumption. SLAMBench is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. SLAMBench is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption across SLAM systems. In this poster we give an overview of SLAMBench and in particular we show how this framework can be used within Design Space Exploration and large-scale performance evaluation on mobile phones.
Bodin B, Wagstaff H, Saeedi S, et al., 2018, SLAMBench2: multi-objective head-to-head benchmarking for visual SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 3637-3644, ISSN: 1050-4729
SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example, a mobile phone-based AR application has a tight energy budget, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2, OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs across SLAM systems.
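Among the performance metrics such a framework must standardise is trajectory accuracy. A minimal Python stand-in for absolute trajectory error is below (RMSE over positions after a crude translation-only alignment; SLAMBench2 itself performs proper trajectory alignment, and this helper is purely illustrative).

```python
import numpy as np

def ate_rmse(estimated, ground_truth):
    """Absolute trajectory error: RMSE between estimated and ground-truth
    positions after removing the mean offset (translation-only alignment)."""
    est, gt = np.asarray(estimated, float), np.asarray(ground_truth, float)
    est = est - est.mean(axis=0) + gt.mean(axis=0)
    return float(np.sqrt(((est - gt) ** 2).sum(axis=1).mean()))

gt  = [[0, 0], [1, 0], [2, 0]]
est = [[0.1, 0], [1.1, 0], [2.1, 0]]
print(ate_rmse(est, gt))   # ~0.0: the constant offset is aligned away
```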