Bodin B, Nardi L, Zia MZ, et al., 2016, Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding, International Conference on Parallel Architectures and Compilation Techniques (PACT)
System designers typically use well-studied benchmarks to evaluate and improve new architectures and compilers. We design tomorrow's systems based on yesterday's applications. In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. Until now, this application could only run in real-time on desktop GPUs. In this work, we examine how it can be mapped to power-constrained embedded systems. Key to our approach is the idea of incremental co-design exploration, where optimization choices that concern the domain layer are incrementally explored together with low-level compiler and architecture choices. The goal of this exploration is to reduce execution time while minimizing power and meeting our quality-of-result objective. As the design space is too large to exhaustively evaluate, we use active learning based on a random forest predictor to find good designs. We show that our approach can, for the first time, achieve dense 3D mapping and tracking in the real-time range within a 1W power budget on a popular embedded device. This is a 4.8x execution time improvement and a 2.8x power reduction compared to the state-of-the-art.
Nardi L, Bodin B, Saeedi S, et al., Algorithmic performance-accuracy trade-off in 3D vision applications using hypermapper, IPDPS, Publisher: IEEE
In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work, we showed for the first time that it is possible to map this application to power-constrained embedded systems, highlighting that decision choices made at the algorithmic design level have the most significant impact. As the algorithmic design space is too large to be exhaustively evaluated, we use a previously introduced multi-objective random forest active learning prediction framework, dubbed HyperMapper, to find good algorithmic designs. We show that HyperMapper generalizes to a recent cutting-edge 3D scene understanding algorithm and to a modern GPU-based computer architecture. HyperMapper automatically beats an expert human hand-tuning the algorithmic parameters of the class of computer vision applications considered in this paper. In addition, we use crowd-sourcing via a 3D scene understanding Android app to show that the Pareto front obtained on an embedded system can be used to accelerate the same application on all 83 of the smart-phones and tablets tested, with speedups ranging from 2x to over 12x.
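The Pareto front mentioned in the abstract is just the set of non-dominated design points. A minimal, self-contained sketch of that filtering step (the runtime/power numbers below are hypothetical, not data from the paper):

```python
def pareto_front(points):
    """Return the (runtime, power) points not dominated by any other point.

    Both objectives are minimised; q dominates p if q is no worse in both
    and strictly better in at least one.  Assumes distinct points.
    """
    front = []
    for p in points:
        if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points):
            front.append(p)
    return sorted(front)

# Hypothetical (runtime_ms, power_w) design points, for illustration only.
designs = [(120, 1.0), (80, 1.4), (80, 1.2), (200, 0.8), (150, 0.8), (60, 2.5)]
print(pareto_front(designs))  # → [(60, 2.5), (80, 1.2), (120, 1.0), (150, 0.8)]
```

Every point kept offers a trade-off no other design strictly improves on; this is the curve a tuner such as HyperMapper reports to the user.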
Saeedi Gharahbolagh S, Nardi L, Johns E, et al., Application-oriented Design Space Exploration for SLAM Algorithms, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE
In visual SLAM, there are many software and hardware parameters, such as algorithmic thresholds and GPU frequency, that need to be tuned; however, this tuning should also take into account the structure and motion of the camera. In this paper, we determine the complexity of the structure and motion with a few parameters calculated using information theory. Depending on this complexity and the desired performance metrics, suitable parameters are explored and determined. Additionally, based on the proposed structure and motion parameters, several applications are presented, including a novel active SLAM approach which guides the camera in such a way that the SLAM algorithm achieves the desired performance metrics. Real-world and simulated experimental results demonstrate the effectiveness of the proposed design space and its applications.
Bolten M, Franchetti F, Kelly PHJ, et al., 2017, Algebraic description and automatic generation of multigrid methods in SPIRAL, Concurrency Computation, Pages: e4105-e4105, ISSN: 1532-0626
© 2017 John Wiley & Sons, Ltd. SPIRAL is an autotuning, program generation, and code synthesis system that offers a fully automatic generation of highly optimized target codes, customized for the specific execution platform at hand. Initially, SPIRAL was targeted at problem domains in digital signal processing, later also at basic linear algebra. We open SPIRAL up to a new, practically relevant and challenging domain: multigrid solvers. SPIRAL is driven by algebraic transformation rules. We specify a set of such rules for a simple multigrid solver with a Richardson smoother for a discretized square 2D Poisson equation with Dirichlet boundary conditions. We present the target code that SPIRAL generates in static single-assignment form and discuss its performance. While this example required no changes of or extensions to the SPIRAL system, more complex multigrid solvers may require small adaptations.
Luporini F, Ham DA, Kelly PHJ, et al., 2017, An Algorithm for the Optimization of Finite Element Integration Loops, ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, Vol: 44, Pages: 1-26, ISSN: 0098-3500
© 2017 ACM. We present an algorithm for the optimization of a class of finite-element integration loop nests. This algorithm, which exploits fundamental mathematical properties of finite-element operators, is proven to achieve a locally optimal operation count. In specified circumstances the optimum achieved is global. Extensive numerical experiments demonstrate significant performance improvements over the state of the art in finite-element code generation in almost all cases. This validates the effectiveness of the algorithm presented here and illustrates its limitations.
Rathgeber F, Ham DA, Mitchell L, et al., 2017, Firedrake: Automating the Finite Element Method by Composing Abstractions, ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, Vol: 43, Pages: 1-27, ISSN: 0098-3500
Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centered on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separation of concerns that eases the incorporation of separate contributions from computer scientists, numerical analysts, and application specialists. These contributions may add functionality or improve performance. Firedrake benefits from automatically applying new optimizations. This includes factorizing mixed function spaces, transforming and vectorizing inner loops, and intrinsically supporting block matrix operations. Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users to implement common operations that fall outside of pure variational formulations, such as flux limiters.
Bercea G-T, McRae ATT, Ham DA, et al., 2016, A structure-exploiting numbering algorithm for finite elements on extruded meshes, and its performance evaluation in Firedrake, GEOSCIENTIFIC MODEL DEVELOPMENT, Vol: 9, Pages: 3803-3815, ISSN: 1991-959X
We present a generic algorithm for numbering and then efficiently iterating over the data values attached to an extruded mesh. An extruded mesh is formed by replicating an existing mesh, assumed to be unstructured, to form layers of prismatic cells. Applications of extruded meshes include, but are not limited to, the representation of three-dimensional high aspect ratio domains employed by geophysical finite element simulations. These meshes are structured in the extruded direction. The algorithm presented here exploits this structure to avoid the performance penalty traditionally associated with unstructured meshes. We evaluate the implementation of this algorithm in the Firedrake finite element system on a range of low compute intensity operations which constitute worst cases for data layout performance exploration. The experiments show that having structure along the extruded direction enables the cost of the indirect data accesses to be amortized after 10-20 layers as long as the underlying mesh is well ordered. We characterize the resulting spatial and temporal reuse in a representative set of both continuous-Galerkin and discontinuous-Galerkin discretizations. On meshes with realistic numbers of layers the performance achieved is between 70 and 90% of a theoretical hardware-specific limit.
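The performance argument above rests on paying the unstructured-mesh indirection once per column and then walking the extruded layers at a constant stride. A toy sketch of that numbering idea (hypothetical layout, not Firedrake's actual one):

```python
# Sketch of the extruded-mesh numbering idea: the data for layer L of
# column c lives at base[c] + L * stride, so the indirect lookup through
# the unstructured base mesh is paid once per column, not once per cell.

def column_sum(values, base, stride, num_layers, column):
    start = base[column]                # single indirect access per column
    total = 0.0
    for layer in range(num_layers):     # direct, constant-stride walk
        total += values[start + layer * stride]
    return total

num_columns, num_layers, stride = 3, 4, 1
# Toy layout: columns stored contiguously, column c starts at c * num_layers.
base = [c * num_layers for c in range(num_columns)]
values = [float(i) for i in range(num_columns * num_layers)]
print(column_sum(values, base, stride, num_layers, column=1))  # 4+5+6+7 → 22.0
```

The longer the column (more layers), the more the cost of the one indirect lookup is amortized, which is the 10-20 layer break-even the experiments report.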
Bodin B, Nardi L, Kelly PHJ, et al., 2016, Diplomat: Mapping of Multi-kernel Applications Using a Static Dataflow Abstraction, 24th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Publisher: IEEE, Pages: 241-250, ISSN: 1526-7539
© 2016 IEEE. In this paper we propose a novel approach to heterogeneous embedded systems programmability using a task-graph based framework called Diplomat. Diplomat exploits the potential of static dataflow modeling and analysis to deliver performance estimation and CPU/GPU mapping. An application has to be specified only once, and the framework can then automatically propose good mappings. We evaluate Diplomat with a computer vision application on two embedded platforms. Using Diplomat's generated mappings, we observed a 16% performance improvement on average, and up to a 30% improvement, over the best existing hand-coded implementation.
Reguly IZ, Mudalige GR, Bertolli C, et al., 2016, Acceleration of a Full-Scale Industrial CFD Application with OP2, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, Vol: 27, Pages: 1265-1278, ISSN: 1045-9219
© 1990-2012 IEEE. Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls Royce plc., capable of performing complex simulations over highly detailed unstructured mesh geometries. Hydra presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging platforms. We present research in achieving this goal through the OP2 domain-specific high-level framework, demonstrating the viability of such a high-level programming approach. OP2 targets the domain of unstructured mesh problems and enables execution on a range of back-end hardware platforms. We chart the conversion of Hydra to OP2, and map out the key difficulties encountered in the process. Specifically, we show how different parallel implementations can be achieved with an active library framework, even for a highly complicated industrial application, and how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly, reducing developer effort and increasing code longevity. Performance results demonstrate not only that the same runtime performance as the hand-tuned original code can be achieved, but that it can be significantly improved upon on both conventional processor systems and many-core systems. Our results provide evidence of how high-level frameworks such as OP2 enable portability across a wide range of contrasting platforms, and of their significant utility in achieving high performance without the intervention of the application programmer.
Wozniak BD, Witherden FD, Russell FP, et al., 2016, GiMMiK-Generating bespoke matrix multiplication kernels for accelerators: Application to high-order Computational Fluid Dynamics, COMPUTER PHYSICS COMMUNICATIONS, Vol: 202, Pages: 12-22, ISSN: 0010-4655
© 2016 The Authors. Matrix multiplication is a fundamental linear algebra routine ubiquitous in all areas of science and engineering. Highly optimised BLAS libraries (cuBLAS and clBLAS on GPUs) are the most popular choices for an implementation of the General Matrix Multiply (GEMM) in software. In this paper we present GiMMiK - a generator of bespoke matrix multiplication kernels for the CUDA and OpenCL platforms. GiMMiK exploits a prior knowledge of the operator matrix to generate highly performant code. The performance of GiMMiK's kernels is particularly apparent in a block-by-panel type of matrix multiplication, where the block matrix is typically small (e.g. dimensions of 96×64). Such operations are characteristic of our motivating application in PyFR - an implementation of Flux Reconstruction schemes for high-order fluid flow simulations on mixed unstructured meshes. GiMMiK fully unrolls the matrix-vector product and embeds matrix entries directly in the code to benefit from the use of the constant cache and compiler optimisations. Further, it reduces the number of floating-point operations by removing multiplications by zeros. Together with the ability of our kernels to avoid the poorly optimised cleanup code, executed by library GEMM, we are able to outperform cuBLAS on two NVIDIA GPUs: GTX 780 Ti and Tesla K40c. We observe speedups of our kernels over cuBLAS GEMM of up to 9.98 and 63.30 times for a 294×1029 99% sparse PyFR matrix in double precision on the Tesla K40c and GTX 780 Ti respectively. In single precision, observed speedups reach 12.20 and 13.07 times for a 4×8 50% sparse PyFR matrix on the two aforementioned cards. Using GiMMiK as the matrix multiplication kernel provider allows us to achieve a speedup of up to 1.70 (2.19) for a simulation of an unsteady flow over a cylinder executed with PyFR in double (single) precision on the Tesla K40c. All results were generated with GiMMiK version 1.0.
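The core GiMMiK trick described above — embed the constant operator matrix in generated source, fully unroll, and drop multiplications by zero — can be sketched in a few lines. Here the generator emits Python rather than CUDA/OpenCL, purely for illustration:

```python
def generate_unrolled_matvec(a):
    """Emit source for y = a @ x with the constant matrix `a` embedded,
    fully unrolled, and with multiplications by zero removed -- a toy
    version of the GiMMiK approach, generating Python instead of CUDA."""
    lines = ["def matvec(x):", "    y = [0.0] * %d" % len(a)]
    for i, row in enumerate(a):
        terms = ["%r * x[%d]" % (v, j) for j, v in enumerate(row) if v != 0]
        if terms:
            lines.append("    y[%d] = %s" % (i, " + ".join(terms)))
    lines.append("    return y")
    return "\n".join(lines)

# Hypothetical sparse operator matrix; column 1 is all zeros, so no
# generated statement ever touches x[1].
a = [[2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0],
     [1.0, 0.0, 4.0]]
namespace = {}
exec(generate_unrolled_matvec(a), namespace)
print(namespace["matvec"]([1.0, 5.0, 2.0]))  # → [2.0, 6.0, 9.0]
```

In the real system the emitted kernel is CUDA/OpenCL C and the embedded constants land in the GPU constant cache; the zero-skipping shown here is what cuts the floating-point operation count for sparse PyFR operators.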
Zia MZ, Nardi L, Jack A, et al., 2016, Comparative Design Space Exploration of Dense and Semi-Dense SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 1292-1299, ISSN: 1050-4729
© 2016 IEEE. SLAM has matured significantly over the past few years, and is beginning to appear in serious commercial products. While new SLAM systems are being proposed at every conference, evaluation is often restricted to qualitative visualizations or accuracy estimation against a ground truth. This is due to the lack of benchmarking methodologies which can holistically and quantitatively evaluate these systems. Further investigation at the level of individual kernels and parameter spaces of SLAM pipelines is non-existent, which is absolutely essential for systems research and integration. We extend the recently introduced SLAMBench framework to allow comparing two state-of-the-art SLAM pipelines, namely KinectFusion and LSD-SLAM, along the metrics of accuracy, energy consumption, and processing frame rate on two different hardware platforms, namely a desktop and an embedded device. We also analyze the pipelines at the level of individual kernels and explore their algorithmic and hardware design spaces for the first time, yielding valuable insights.
Nardi L, Bodin B, Zia MZ, et al., 2015, Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE COMPUTER SOC, Pages: 5783-5790, ISSN: 1050-4729
© 2015 IEEE. Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it difficult for robotics and vision researchers to implement their algorithms in a performance-portable way. In this paper we introduce SLAMBench, a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs in performance, accuracy and energy consumption of a dense RGB-D SLAM system. SLAMBench provides a KinectFusion implementation in C++, OpenMP, OpenCL and CUDA, and harnesses the ICL-NUIM dataset of synthetic RGB-D sequences with trajectory and scene ground truth for reliable accuracy comparison of different implementation and algorithms. We present an analysis and breakdown of the constituent algorithmic elements of KinectFusion, and experimentally investigate their execution time on a variety of multicore and GPU-accelerated platforms. For a popular embedded platform, we also present an analysis of energy efficiency for different configuration alternatives.
Popovici DT, Russell FP, Wilkinson K, et al., 2015, Generating Optimized Fourier Interpolation Routines for Density Functional Theory using SPIRAL, 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Publisher: IEEE, Pages: 743-752, ISSN: 1530-2075
© 2015 IEEE. Upsampling of a multi-dimensional data-set is an operation with wide application in image processing and quantum mechanical calculations using density functional theory. For small upsampling factors, as seen in the quantum chemistry code ONETEP, a time-shift based implementation that shifts samples by a fraction of the original grid spacing to fill in the intermediate values using a frequency domain Fourier property can be a good choice. Readily available highly optimized multidimensional FFT implementations are leveraged at the expense of extra passes through the entire working set. In this paper we present an optimized variant of the time-shift based upsampling. Since ONETEP handles threading, we address the memory hierarchy and SIMD vectorization, and focus on problem dimensions relevant for ONETEP. We present a formalization of this operation within the SPIRAL framework and demonstrate auto-generated and auto-tuned interpolation libraries. We compare the performance of our generated code against the previous best implementations using highly optimized FFT libraries (FFTW and MKL). We demonstrate speed-ups in isolation averaging 3x and within ONETEP of up to 15%.
Rokos G, Gorman G, Kelly PHJ, et al., 2015, A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures, 21st International Conference on Parallel and Distributed Computing (Euro-Par), Publisher: SPRINGER-VERLAG BERLIN, Pages: 414-425, ISSN: 0302-9743
© Springer-Verlag Berlin Heidelberg 2015. Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application, whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Çatalyürek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically, with less thread synchronization and a reduced number of conflicts compared to Çatalyürek et al.'s algorithm. We show that the new technique scales better on multicore and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels.
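The speculate-then-resolve structure described above can be sketched sequentially: every queued vertex greedily takes the smallest colour consistent with what it currently sees of its neighbours, then a detection pass re-queues conflicting vertices. This is a simplified sketch — the published algorithms run the assignment pass in parallel (which is what creates conflicts) and re-colour only one endpoint of each conflicting edge:

```python
def speculative_coloring(adj):
    """Speculate-and-resolve greedy colouring (sequential sketch).

    In the parallel setting, the racy first round can give two adjacent
    vertices the same colour; the detection pass finds such conflicts and
    the loop repeats until none remain.  Run sequentially, the first round
    already yields a proper colouring."""
    color = {v: None for v in adj}
    worklist = list(adj)
    while worklist:
        for v in worklist:                      # speculative assignment
            used = {color[u] for u in adj[v]}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        worklist = [v for v in worklist         # conflict detection
                    if any(color[u] == color[v] for u in adj[v])]
    return color

# Tiny undirected graph: triangle 0-1-2 plus pendant vertex 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
colors = speculative_coloring(adj)
assert all(colors[u] != colors[v] for u in adj for v in adj[u])
print(colors)  # → {0: 0, 1: 1, 2: 2, 3: 0}
```

The improvement the paper reports comes from running these rounds with fewer synchronisation points, not from changing this basic speculate/detect shape.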
Russell FP, Wilkinson KA, Kelly PHJ, et al., 2015, Optimised three-dimensional Fourier interpolation: An analysis of techniques and application to a linear-scaling density functional theory code, COMPUTER PHYSICS COMMUNICATIONS, Vol: 187, Pages: 8-19, ISSN: 0010-4655
© 2014 The Authors. The Fourier interpolation of 3D data-sets is a performance critical operation in many fields, including certain forms of image processing and density functional theory (DFT) quantum chemistry codes based on plane wave basis sets, to which this paper is targeted. In this paper we describe three different algorithms for performing this operation built from standard discrete Fourier transform operations, and derive theoretical operation counts. The algorithms compared consist of the most straightforward implementation and two that exploit techniques such as phase-shifts and knowledge of zero padding to reduce computational cost. Through a library implementation (tintl) we explore the performance characteristics of these algorithms and the performance impact of different implementation choices on actual hardware. We present comparisons within the linear-scaling DFT code ONETEP where we replace the existing interpolation implementation with our library implementation configured to choose the most efficient algorithm. Within the ONETEP Fourier interpolation stages, we demonstrate speed-ups of over 1.55×.
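The zero-padding technique the paper builds on is easiest to see in one dimension: transform, insert zeros between the low- and high-frequency halves of the spectrum, and inverse-transform. A dependency-free sketch using a toy O(n²) DFT (real codes, including tintl, use optimized FFTs):

```python
import cmath

def dft(x, sign):
    """Unnormalised DFT, sign=-1 forward, sign=+1 inverse (toy O(n^2))."""
    n = len(x)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t, v in enumerate(x))
            for k in range(n)]

def fourier_interpolate(x, m):
    """Upsample len(x) -> m points by zero-padding the spectrum: keep the
    low-frequency bins, insert zeros in the middle, inverse-transform and
    renormalise by the original length."""
    n = len(x)
    spec = dft(x, -1)
    padded = spec[:n // 2] + [0j] * (m - n) + spec[n // 2:]
    return [v.real / n for v in dft(padded, +1)]

n, m = 8, 16
coarse = [cmath.cos(2 * cmath.pi * t / n).real for t in range(n)]
fine = fourier_interpolate(coarse, m)
# The fine grid passes through the original samples at even indices.
print(max(abs(fine[2 * t] - coarse[t]) for t in range(n)))
```

The phase-shift variant compared in the paper reaches the same result by shifting samples by fractions of the grid spacing in the frequency domain, trading extra small transforms for better use of existing FFT libraries.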
Collingbourne P, Cadar C, Kelly PHJ, et al., 2014, Symbolic Crosschecking of Data-Parallel Floating-Point Code, IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, Vol: 40, Pages: 710-737, ISSN: 0098-5589
We present a symbolic execution-based technique for cross-checking programs accelerated using SIMD or OpenCL against an unaccelerated version, as well as a technique for detecting data races in OpenCL programs. Our techniques are implemented in KLEE-CL, a tool based on the symbolic execution engine KLEE that supports symbolic reasoning on the equivalence between expressions involving both integer and floating-point operations. While the current generation of constraint solvers provide effective support for integer arithmetic, the situation is different for floating-point arithmetic, due to the complexity inherent in such computations. The key insight behind our approach is that floating-point values are only reliably equal if they are essentially built by the same operations. This allows us to use an algorithm based on symbolic expression matching augmented with canonicalisation rules to determine path equivalence. Under symbolic execution, we have to verify equivalence along every feasible control-flow path. We reduce the branching factor of this process by aggressively merging conditionals, if-converting branches into select operations via an aggressive phi-node folding transformation. To support the Intel Streaming SIMD Extension (SSE) instruction set, we lower SSE instructions to equivalent generic vector operations, which in turn are interpreted in terms of primitive integer and floating-point operations. To support OpenCL programs, we symbolically model the OpenCL environment using an OpenCL runtime library targeted to symbolic execution. We detect data races by keeping track of all memory accesses using a memory log, and reporting a race whenever we detect that two accesses conflict. By representing the memory log symbolically, we are also able to detect races associated with symbolically-indexed accesses of memory objects. We used KLEE-CL to prove the bounded equivalence between scalar and data-parallel versions of floating-point programs and find a number of …
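The key insight above — floating-point values are only reliably equal if built by the same operations — can be illustrated with a miniature canonicaliser: operands of commutative operators are sorted so that reordered expressions match, but no algebraic rewriting is attempted. This is a toy sketch, not KLEE-CL's implementation:

```python
def canon(expr):
    """Canonicalise an expression tree (nested tuples, leaves are names):
    sort the operands of commutative operators so syntactically shuffled
    but identically-built expressions compare equal.  Deliberately does
    no algebra -- distributing a product changes the FP operations."""
    if isinstance(expr, tuple):
        op, *args = expr
        args = [canon(a) for a in args]
        if op in ("+", "*"):            # commutative: operand order irrelevant
            args.sort(key=repr)
        return (op, *args)
    return expr

scalar = ("*", ("+", "a", "b"), "c")    # (a + b) * c
simd   = ("*", "c", ("+", "b", "a"))    # c * (b + a), e.g. after vectorisation
print(canon(scalar) == canon(simd))     # True: same operations, reordered

distributed = ("+", ("*", "a", "c"), ("*", "b", "c"))  # a*c + b*c
print(canon(scalar) == canon(distributed))  # False: needs algebra, not matching
```

Rejecting the distributed form is the point, not a limitation: in floating-point arithmetic `(a + b) * c` and `a*c + b*c` genuinely may differ, so matching-with-canonicalisation is sound where solver-level FP reasoning is impractical.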
Konstantinidis A, Kelly PHJ, Ramanujam J, et al., 2014, Parametric GPU Code Generation for Affine Loop Programs, 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC), Publisher: SPRINGER-VERLAG BERLIN, Pages: 136-151, ISSN: 0302-9743
© Springer International Publishing Switzerland 2014. Partitioning a parallel computation into finite-sized chunks for effective mapping onto a parallel machine is a critical concern for source-to-source compilation. In the context of OpenCL and CUDA, this translates to the definition of a uniform hyper-rectangular partitioning of the parallel execution space where each partition is subject to a fine-grained distribution of resources that has a direct yet hard to estimate impact on performance. This paper develops the first compilation scheme for generating parametrically tiled codes for affine loop programs on GPUs, which facilitates run-time exploration of partitioning parameters as a fast and portable way of finding the ones that yield maximum performance. Our approach is based on a parametric tiling scheme for producing wavefronts of parallel rectangular partitions of parametric size and a novel runtime system that manages wavefront execution and local memory usage dynamically through an inspector-executor mechanism. An experimental evaluation demonstrates the effectiveness of our approach for wavefront as well as rectangularly-parallel partitionings.
Luporini F, Varbanescu AL, Rathgeber F, et al., 2014, Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly, ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, Vol: 11, Pages: 1-25, ISSN: 1544-3566
© 2014 ACM. We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Their performance optimization is indeed a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of successful transformations. Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Eventually, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvement is achieved. The generality of the approach and the applicability of the proposed code transformations to other domains is also discussed.
Salas-Moreno RF, Glocker B, Kelly PHJ, et al., 2014, [DEMO] Dense Planar SLAM, IEEE International Symposium on Mixed and Augmented Reality (ISMAR) - Science and Technology, Publisher: IEEE, Pages: 367-368, ISSN: 1554-7868
Salas-Moreno RF, Glocker B, Kelly PHJ, et al., 2014, Dense Planar SLAM, IEEE International Symposium on Mixed and Augmented Reality (ISMAR) - Science and Technology, Publisher: IEEE, Pages: 157-164, ISSN: 1554-7868
© 2014 IEEE. Using higher-level entities during mapping has the potential to improve camera localisation performance and give substantial perception capabilities to real-time 3D SLAM systems. We present an efficient new real-time approach which densely maps an environment using bounded planes and surfels extracted from depth images (like those produced by RGB-D sensors or dense multi-view stereo reconstruction). Our method offers the every-pixel descriptive power of the latest dense SLAM approaches, but takes advantage directly of the planarity of many parts of real-world scenes via a data-driven process to directly regularize planar regions and represent their accurate extent efficiently using an occupancy approach with on-line compression. Large areas can be mapped efficiently and with useful semantic planar structure which enables intuitive and useful AR applications such as using any wall or other planar surface in a scene to display a user's content.
Strout MM, Luporini F, Krieger CD, et al., 2014, Generalizing Run-time Tiling with the Loop Chain Abstraction, IEEE 28th International Parallel & Distributed Processing Symposium (IPDPS), Publisher: IEEE, Pages: 1136-1145, ISSN: 1530-2075
Many scientific applications are organized in a data parallel way: as sequences of parallel and/or reduction loops. This exposes parallelism well, but does not convert data reuse between loops into data locality. This paper focuses on this issue in parallel loops whose loop-to-loop dependence structure is data-dependent due to indirect references such as A[B[i]]. Such references are a common occurrence in sparse matrix computations, molecular dynamics simulations, and unstructured-mesh computational fluid dynamics (CFD). Previously, sparse tiling approaches were developed for individual benchmarks to group iterations across such loops to improve data locality. These approaches were shown to benefit applications such as moldyn, Gauss-Seidel, and the sparse matrix powers kernel; however, the run-time routines for performing sparse tiling were hand-coded per application. In this paper, we present a generalized full sparse tiling algorithm that uses the newly developed loop chain abstraction as input, improves inter-loop data locality, and creates a task graph to expose shared-memory parallelism at runtime. We evaluate the overhead and performance impact of the generalized full sparse tiling algorithm on two codes: a sparse Jacobi iterative solver and the Airfoil CFD benchmark. © 2014 IEEE.
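The shape of an inspector/executor sparse tiling can be sketched for the simplest case: two loops linked by an indirect reference y[B[j]]. This is heavily simplified and sequential (the paper's algorithm handles whole loop chains and builds a parallel task graph); the kernels and names here are hypothetical:

```python
def sparse_tile(n_iter1, deps, width):
    """Inspector: partition loop 1 into tiles of `width` iterations and
    attach each loop-2 iteration to the earliest tile after which all of
    its indirect dependencies are available."""
    tiles = [{"loop1": list(range(s, min(s + width, n_iter1))), "loop2": []}
             for s in range(0, n_iter1, width)]
    for j, b in enumerate(deps):
        tiles[max(b) // width]["loop2"].append(j)
    return tiles

# Hypothetical loop chain: loop 1 produces y, loop 2 reads it indirectly.
x = [float(i) for i in range(8)]
B = [[0, 1], [6, 2], [3, 3], [7, 5]]           # indirect references y[B[j][k]]

y = [0.0] * 8
z = [0.0] * 4
for tile in sparse_tile(8, B, width=4):        # executor: run tile-by-tile
    for i in tile["loop1"]:
        y[i] = 2.0 * x[i]                      # loop 1 body
    for j in tile["loop2"]:
        z[j] = y[B[j][0]] + y[B[j][1]]         # loop 2 body, runs while y is hot
print(z)  # → [2.0, 16.0, 12.0, 24.0]
```

The inspector runs once at runtime (its cost is the overhead the paper measures); the executor then interleaves pieces of the two loops so y is consumed shortly after it is produced, instead of after a full sweep.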
Bertolli C, Betts A, Loriant N, et al., 2013, Compiler optimizations for industrial unstructured mesh CFD applications on GPUs, International Workshop on Languages and Compilers for Parallel Computing (LCPC), Publisher: Springer, Pages: 112-126, ISSN: 0302-9743
Graphical Processing Units (GPUs) have shown acceleration factors over multicores for structured mesh-based Computational Fluid Dynamics (CFD). However, the value remains unclear for dynamic and irregular applications. Our motivating example is HYDRA, an unstructured mesh application used in production at Rolls-Royce for the simulation of turbomachinery components of jet engines. We describe three techniques for GPU optimization of unstructured mesh applications: a technique able to split a highly complex loop into simpler loops, a kernel specific alternative code synthesis, and configuration parameter tuning. Using these optimizations systematically on HYDRA improves the GPU performance relative to the multicore CPU. We show how these optimizations can be automated in a compiler, through user annotations. Performance analysis of a large number of complex loops enables us to study the relationship between optimizations and resource requirements of loops, in terms of registers and shared memory, which directly affect the loop performance. © Springer-Verlag Berlin Heidelberg 2013.
Architectural design, particularly in large-scale master planning projects, has yet to fully undergo the computational revolution experienced by other design-led industries such as automotive and aerospace. These industries use computational frameworks to undertake automated design analysis and design space exploration. Within the Architectural, Engineering and Construction (AEC) industries, however, no such computational platforms exist. This precludes the rapid analysis needed for the quantitative design iteration which is required for sustainable design. This is a current computing frontier. This paper considers the computational solutions to the challenges preventing such advances, to improve architectural design performance for a more sustainable future. We present a practical discussion of the computational challenges and opportunities in this industry and present a computational framework, "HierSynth", with a data model designed for the needs of this industry. We report the results and lessons learned from applying this framework to a major commercial urban master planning project. The framework was used to automate and augment existing practice and to undertake previously infeasible, designer-led, design space exploration. During the case study an order of magnitude more analysis cycles were undertaken than literature suggests is normal, each occurring in hours not days.
Birch D, Liang H, Ko J, et al., 2013, Multidisciplinary Engineering Models: Methodology and Case Study in Spreadsheet Analytics, European Spreadsheet Risks Interest Group 14th Annual Conference (EuSpRIG 2013), Publisher: EuSpRIG, Pages: 1-12
Chong N, Donaldson AF, Kelly PHJ, et al., 2013, Barrier Invariants: A Shared State Abstraction for the Analysis of Data-Dependent GPU Kernels, 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications (OOPSLA'13), Publisher: ASSOC COMPUTING MACHINERY, Pages: 605-621, ISSN: 0362-1340
Data-dependent GPU kernels, whose data or control flow depend on the input of the program, are difficult to verify because they require reasoning about shared state manipulated by many parallel threads. Existing verification techniques for GPU kernels achieve soundness and scalability by using a two-thread reduction and making the contents of the shared state nondeterministic each time threads synchronise at a barrier, to account for all possible thread interactions. This coarse abstraction prohibits verification of data-dependent kernels. We present barrier invariants, a novel abstraction technique which allows key properties about the shared state of a kernel to be preserved across barriers during formal reasoning. We have integrated barrier invariants with the GPUVerify tool, and present a detailed case study showing how they can be used to verify three prefix sum algorithms, allowing efficient modular verification of a stream compaction kernel, a key building block for GPU programming. This analysis goes significantly beyond what is possible using existing verification techniques for GPU kernels.
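To make the barrier-invariant idea concrete, the sketch below models a Hillis–Steele inclusive prefix sum sequentially in Python, with each round standing in for the code between two GPU barriers. The assertion after each round is the kind of shared-state property a barrier invariant lets a verifier carry across the barrier instead of treating the array as arbitrary. This is an illustrative sketch, not GPUVerify's actual specification language.

```python
def inclusive_scan_with_invariants(xs):
    """Hillis-Steele inclusive scan; asserts a 'barrier invariant'
    on the shared array after every round."""
    x = list(xs)
    n = len(x)
    offset = 1
    while offset < n:
        # compute phase: thread i adds the element `offset` to its left
        # (double-buffered here by building a fresh list)
        x = [x[i] + (x[i - offset] if i >= offset else 0) for i in range(n)]
        # barrier invariant: after this round, x[i] holds the sum of the
        # last 2*offset input elements ending at position i
        for i in range(n):
            lo = max(0, i - 2 * offset + 1)
            assert x[i] == sum(xs[lo:i + 1])
        offset *= 2
    return x
```

Because the invariant pins down the shared array's contents at every barrier, a per-thread (two-thread-reduced) analysis can still conclude that the final array is the full prefix sum.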
Grosser T, Cohen A, Kelly PHJ, et al., 2013, Split tiling for GPUs: Automatic parallelization using trapezoidal tiles, 6th Workshop on General Purpose Processor Using Graphics Processing Units, Publisher: ACM Press, Pages: 24-31
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations like loop skewing that inhibit inter-tile parallelism. One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to calculate index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs.
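The trapezoidal decomposition can be illustrated on a 1-D 3-point stencil. The Python sketch below (illustrative sizes and stencil, not from the paper; it assumes tile width `W >= 2*T`) runs phase 1, where each tile independently computes a shrinking "upright" trapezoid, then phase 2, where the inverted trapezoids between tiles are filled in. Each phase is embarrassingly parallel across tiles or tile boundaries, which is the point of split tiling.

```python
def step(vals, t, i, n):
    """One stencil update at time t, point i (fixed boundaries)."""
    if i == 0 or i == n - 1:
        vals[t][i] = vals[t - 1][i]
    else:
        vals[t][i] = vals[t - 1][i - 1] + vals[t - 1][i] + vals[t - 1][i + 1]

def split_tiled(initial, T, W):
    """T time steps via split tiling with tile width W (W >= 2*T)."""
    n = len(initial)
    vals = [[None] * n for _ in range(T + 1)]
    vals[0] = list(initial)
    # phase 1: upright trapezoids, one per tile, no inter-tile deps
    for s in range(0, n, W):
        e = s + W
        for k in range(1, T + 1):
            lo = s if s == 0 else s + k     # domain edges do not shrink
            hi = e if e == n else e - k
            for i in range(lo, hi):
                step(vals, k, i, n)
    # phase 2: inverted trapezoids, one per interior tile boundary
    for b in range(W, n, W):
        for k in range(1, T + 1):
            for i in range(b - k, b + k):
                step(vals, k, i, n)
    return vals[T]

def naive(initial, T):
    """Reference: T full sweeps of the same stencil."""
    n = len(initial)
    cur = list(initial)
    for _ in range(T):
        cur = [cur[i] if i in (0, n - 1)
               else cur[i - 1] + cur[i] + cur[i + 1] for i in range(n)]
    return cur
```

On a GPU, phase 1 maps to one kernel launch with one thread block per tile and phase 2 to a second launch per gap, replacing per-time-step global synchronisation.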
Krieger CD, Strout MM, Olschanowsky C, et al., 2013, Loop chaining: A programming abstraction for balancing locality and parallelism, Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013, Pages: 375-384
There is a significant, established code base in the scientific computing community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependencies that, as part of the abstraction, are exposed, thereby avoiding inter-procedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or run-time system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark; OP2, a domain-specific library used in full applications with unstructured grids; and Chombo, a domain-specific library used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop-chain-enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores.
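The kind of cross-loop scheduling a chain enables can be sketched in Python. Below, two loops that share array `b` are executed tile-by-tile rather than loop-by-loop, so each tile of `b` is still cache-resident when the second loop consumes it; the second loop reads a 3-point neighbourhood, so each tile of loop 1 is grown by a one-element halo. All names and the halo-recomputation scheme are illustrative assumptions, not the paper's API or its full sparse tiling transformation.

```python
def chained_execution(a, tile):
    """Run two data-sharing loops tile-by-tile for locality."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for s in range(0, n, tile):
        e = min(s + tile, n)
        # loop 1 on this tile, grown by the halo loop 2 needs
        for i in range(max(0, s - 1), min(n, e + 1)):
            b[i] = 2.0 * a[i]
        # loop 2 on this tile, while b's tile is hot in cache
        for i in range(s, e):
            lo, hi = max(0, i - 1), min(n - 1, i + 1)
            c[i] = b[lo] + b[i] + b[hi]
    return c

def unchained_execution(a):
    """Reference: each loop sweeps the whole array before the next."""
    n = len(a)
    b = [2.0 * a[i] for i in range(n)]
    return [b[max(0, i - 1)] + b[i] + b[min(n - 1, i + 1)]
            for i in range(n)]
```

The chain abstraction exposes exactly the dependency used here (loop 2's iteration `i` needs loop 1's iterations `i-1..i+1`), which is what lets a compiler or runtime derive such a tile-wise schedule without inter-procedural analysis.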
Markall GR, Rathgeber F, Mitchell L, et al., 2013, Performance-portable finite element assembly using PyOP2 and FEniCS, International Supercomputing Conference (ISC), Publisher: Springer, Pages: 279-289, ISSN: 0302-9743
We describe a toolchain that provides a fully automated compilation pathway from a finite element domain-specific language to low-level code for multicore and GPGPU platforms. We demonstrate that the generated code exceeds the performance of the best available alternatives, without requiring manual tuning or modification of the generated code. The toolchain can easily be integrated with existing finite element solvers, providing a means to add performance portable methods without having to rebuild an entire complex implementation from scratch.