Imperial College London

Professor Andrew Davison

Faculty of Engineering, Department of Computing

Professor of Robot Vision
 
 
 

Contact

 

+44 (0)20 7594 8316 · a.davison

 
 

Assistant

 

Mrs Marina Hall +44 (0)20 7594 8259

 

Location

 

303, William Penney Laboratory, South Kensington Campus



Publications


124 results found

Matas J, James S, Davison A, 2018, Sim-to-real reinforcement learning for deformable object manipulation, Conference on Robot Learning 2018, Publisher: PMLR, Pages: 734-743

We have seen much recent progress in rigid object manipulation, but interaction with deformable objects has notably lagged behind. Due to the large configuration space of deformable objects, solutions using traditional modelling approaches require significant engineering work. Perhaps then, bypassing the need for explicit modelling and instead learning the control in an end-to-end manner serves as a better approach? Despite the growing interest in the use of end-to-end robot learning approaches, only a small amount of work has focused on their applicability to deformable object manipulation. Moreover, due to the large amount of data needed to learn these end-to-end solutions, an emerging trend is to learn control policies in simulation and then transfer them over to the real world. To date, no work has explored whether it is possible to learn and transfer deformable object policies. We believe that if sim-to-real methods are to be employed further, then it should be possible to learn to interact with a wide variety of objects, and not only rigid objects. In this work, we use a combination of state-of-the-art deep reinforcement learning algorithms to solve the problem of manipulating deformable objects (specifically cloth). We evaluate our approach on three tasks: folding a towel up to a mark, folding a face towel diagonally, and draping a piece of cloth over a hanger. Our agents are fully trained in simulation with domain randomisation, and then successfully deployed in the real world without having seen any real deformable objects.
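
The domain randomisation mentioned above amounts to resampling nuisance parameters of the simulator for every training episode, so the learned policy cannot overfit to any single rendering or physics setting. A minimal sketch (the parameter names and ranges are illustrative, not the paper's actual simulator settings):

```python
import random

def randomise_sim(rng):
    """Sample one randomised simulator configuration (hypothetical
    parameter names; the real system randomises cloth physics,
    textures, lighting and camera pose)."""
    return {
        "cloth_stiffness": rng.uniform(0.2, 2.0),
        "cloth_mass":      rng.uniform(0.05, 0.5),
        "light_intensity": rng.uniform(0.5, 1.5),
        "camera_jitter":   [rng.gauss(0.0, 0.01) for _ in range(3)],
        "texture_id":      rng.randrange(1000),
    }

rng = random.Random(0)
# one fresh configuration per training episode
episodes = [randomise_sim(rng) for _ in range(3)]
```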

Conference paper

James S, Bloesch M, Davison A, 2018, Task-embedded control networks for few-shot imitation learning, Conference on Robot Learning, Publisher: PMLR, Pages: 783-795

Much like humans, robots should have the ability to leverage knowledge from previously learned tasks in order to learn new tasks quickly in new and unfamiliar environments. Despite this, most robot learning approaches have focused on learning a single task, from scratch, with a limited notion of generalisation, and no way of leveraging the knowledge to learn other tasks more efficiently. One possible solution is meta-learning, but many of the related approaches are limited in their ability to scale to a large number of tasks and to learn further tasks without forgetting previously learned ones. With this in mind, we introduce Task-Embedded Control Networks, which employ ideas from metric learning in order to create a task embedding that can be used by a robot to learn new tasks from one or more demonstrations. In the area of visually-guided manipulation, we present simulation results in which we surpass the performance of a state-of-the-art method when using only visual information from each demonstration. Additionally, we demonstrate that our approach can also be used in conjunction with domain randomisation to train our few-shot learning ability in simulation and then deploy in the real world without any additional training. Once deployed, the robot can learn new tasks from a single real-world demonstration.
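
The metric-learning idea behind the task embedding can be sketched as mapping a demonstration to a unit vector and matching it against known task embeddings by cosine similarity. This toy version replaces the learned network with simple feature averaging (all names and vectors here are illustrative):

```python
import math

def embed(demo):
    """Toy stand-in for the learned task-embedding network: average
    the demonstration's feature vectors and L2-normalise."""
    dim = len(demo[0])
    mean = [sum(f[i] for f in demo) / len(demo) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

def closest_task(query_demo, task_embeddings):
    """Metric-learning lookup: choose the task whose stored embedding
    has the highest cosine similarity (dot product of unit vectors)."""
    q = embed(query_demo)
    return max(task_embeddings,
               key=lambda name: sum(a * b for a, b in zip(q, task_embeddings[name])))

tasks = {"push": [1.0, 0.0], "stack": [0.0, 1.0]}
# a one-shot demonstration whose features resemble "push"
guess = closest_task([[0.9, 0.1], [1.1, -0.1]], tasks)
```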

Conference paper

McCormac J, Clark R, Bloesch M, Davison A, Leutenegger S et al., 2018, Fusion++: Volumetric object-level SLAM, International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), Publisher: IEEE, Pages: 32-41, ISSN: 2378-3826

We propose an online object-level SLAM system which builds a persistent and accurate 3D graph map of arbitrary reconstructed objects. As an RGB-D camera browses a cluttered indoor scene, Mask-RCNN instance segmentations are used to initialise compact per-object Truncated Signed Distance Function (TSDF) reconstructions with object size-dependent resolutions and a novel 3D foreground mask. Reconstructed objects are stored in an optimisable 6DoF pose graph which is our only persistent map representation. Objects are incrementally refined via depth fusion, and are used for tracking, relocalisation and loop closure detection. Loop closures cause adjustments in the relative pose estimates of object instances, but no intra-object warping. Each object also carries semantic information which is refined over time and an existence probability to account for spurious instance predictions. We demonstrate our approach on a hand-held RGB-D sequence from a cluttered office scene with a large number and variety of object instances, highlighting how the system closes loops and makes good use of existing objects on repeated loops. We quantitatively evaluate the trajectory error of our system against a baseline approach on the RGB-D SLAM benchmark, and qualitatively compare reconstruction quality of discovered objects on the YCB video dataset. Performance evaluation shows our approach is highly memory efficient and runs online at 4-8Hz (excluding relocalisation) despite not being optimised at the software level.
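
The incremental depth-fusion step into each per-object TSDF follows the standard confidence-weighted running average over truncated signed distances; a generic sketch of that voxel update (not Fusion++'s exact implementation):

```python
def fuse_tsdf(voxel, new_sdf, new_weight=1.0, max_weight=64.0):
    """Weighted running average used in TSDF fusion: each voxel stores
    a truncated signed distance d and a confidence weight w, and each
    new depth-derived measurement is blended in proportionally."""
    d, w = voxel
    w_new = min(w + new_weight, max_weight)   # cap to stay responsive
    d_new = (d * w + new_sdf * new_weight) / (w + new_weight)
    return (d_new, w_new)

voxel = (0.0, 0.0)                            # empty voxel: d=0, w=0
for measurement in [0.04, 0.02, 0.03]:        # SDF samples from 3 frames
    voxel = fuse_tsdf(voxel, measurement)
```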

Conference paper

Clark R, Bloesch M, Czarnowski J, Leutenegger S, Davison AJ et al., 2018, Learning to solve nonlinear least squares for monocular stereo, 15th European Conference on Computer Vision, Publisher: Springer Nature Switzerland AG 2018, Pages: 291-306, ISSN: 0302-9743

Sum-of-squares objective functions are very popular in computer vision algorithms. However, these objective functions are not always easy to optimize. The underlying assumptions made by solvers are often not satisfied and many problems are inherently ill-posed. In this paper, we propose a neural nonlinear least squares optimization algorithm which learns to effectively optimize these cost functions even in the presence of adversities. Unlike traditional approaches, the proposed solver requires no hand-crafted regularizers or priors as these are implicitly learned from the data. We apply our method to the problem of motion stereo, i.e. jointly estimating the motion and scene geometry from pairs of images of a monocular sequence. We show that our learned optimizer is able to efficiently and effectively solve this challenging optimization problem.
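
The learned solver keeps the structure of classical nonlinear least squares but predicts the update step with a network instead of computing it analytically. For reference, the classical Gauss-Newton iteration it builds on, shown for a toy scalar problem:

```python
def gauss_newton_1d(residuals, jacobians, x0, iters=10):
    """Plain Gauss-Newton on the cost 0.5 * sum_i r_i(x)^2 for a scalar
    parameter x. The paper's method replaces this hand-derived step
    with one predicted by a network; the loop structure is the same."""
    x = x0
    for _ in range(iters):
        r = residuals(x)
        J = jacobians(x)
        JtJ = sum(j * j for j in J)                 # 1x1 normal equations
        Jtr = sum(j * ri for j, ri in zip(J, r))
        if JtJ == 0.0:
            break
        x -= Jtr / JtJ                              # Gauss-Newton step
    return x

# toy problem: fit x so that x^2 matches noisy observations y_i
ys = [4.01, 3.98, 4.02]
res = lambda x: [x * x - y for y in ys]
jac = lambda x: [2.0 * x for _ in ys]
x_hat = gauss_newton_1d(res, jac, x0=1.0)
```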

Conference paper

Saeedi Gharahbolagh S, Bodin B, Wagstaff H, Nisbet A, Nardi L, Mawer J, Melot N, Palomar O, Vespa E, Gorgovan C, Webb A, Clarkson J, Tomusk E, Debrunner T, Kaszyk K, Gonzalez P, Rodchenko A, Riley G, Kotselidis C, Franke B, O'Boyle M, Davison A, Kelly P, Lujan M, Furber S et al., 2018, Navigating the landscape for real-time localisation and mapping for robotics, virtual and augmented reality, Proceedings of the IEEE, ISSN: 0018-9219

Visual understanding of 3-D environments in real time, at low power, is a huge computational challenge. Often referred to as simultaneous localization and mapping (SLAM), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are: 1) tools and methodology for systematic quantitative evaluation of SLAM algorithms; 2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives; 3) end-to-end simulation tools to enable optimization of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches; and 4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.

Journal article

Bodin B, Wagstaff H, Saeedi S, Nardi L, Vespa E, Mawer J, Nisbet A, Lujan M, Furber S, Davison AJ, Kelly PHJ, O'Boyle MFP et al., 2018, SLAMBench2: multi-objective head-to-head benchmarking for visual SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 3637-3644, ISSN: 1050-4729

SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example, a mobile phone-based AR application has a tight energy budget, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2 and OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs across SLAM systems.

Conference paper

Bloesch M, Czarnowski J, Clark R, Leutenegger S, Davison AJ et al., CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM, IEEE Computer Vision and Pattern Recognition 2018, Publisher: IEEE

The representation of geometry in real-time 3D perception systems continues to be a critical research issue. Dense maps capture complete surface shape and can be augmented with semantic labels, but their high dimensionality makes them computationally costly to store and process, and unsuitable for rigorous probabilistic inference. Sparse feature-based representations avoid these problems, but capture only partial scene information and are mainly useful for localisation only. We present a new compact but dense representation of scene geometry which is conditioned on the intensity data from a single image and generated from a code consisting of a small number of parameters. We are inspired by work both on learned depth from images, and auto-encoders. Our approach is suitable for use in a keyframe-based monocular dense SLAM system: while each keyframe with a code can produce a depth map, the code can be optimised efficiently, jointly with pose variables and together with the codes of overlapping keyframes, to attain global consistency. Conditioning the depth map on the image allows the code to only represent aspects of the local geometry which cannot directly be predicted from the image. We explain how to learn our code representation, and demonstrate its advantageous properties in monocular SLAM.
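
The key idea, optimising only a small code vector from which a full depth map is decoded, can be illustrated with a linear "decoder" in place of the learned network (the prior, basis and observations below are all toy stand-ins, and the full system optimises codes jointly with camera poses):

```python
def decode_depth(prior, basis, code):
    """Depth conditioned on the image: a per-keyframe prior (predicted
    from intensity in CodeSLAM) plus a low-dimensional code pushed
    through a linear stand-in for the learned decoder."""
    return [p + sum(b[k] * code[k] for k in range(len(code)))
            for p, b in zip(prior, basis)]

def refine_code(prior, basis, observed, code, lr=0.1, iters=200):
    """Gradient descent on the depth residual with respect to the code
    only; the dense depth map is never a free variable itself."""
    for _ in range(iters):
        depth = decode_depth(prior, basis, code)
        for k in range(len(code)):
            grad = sum(2.0 * (d - o) * b[k]
                       for d, o, b in zip(depth, observed, basis))
            code[k] -= lr * grad
    return code

prior    = [1.0, 1.0, 1.0, 1.0]                      # image-based prediction
basis    = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
observed = [1.5, 1.5, 0.8, 0.8]                      # depth observations
code = refine_code(prior, basis, observed, [0.0, 0.0])
```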

Conference paper

Czarnowski J, Leutenegger S, Davison AJ, 2018, Semantic Texture for Robust Dense Tracking, 16th IEEE International Conference on Computer Vision (ICCV), Publisher: IEEE, Pages: 851-859, ISSN: 2473-9936

Conference paper

James S, Davison A, Johns E, 2017, Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task, Conference on Robot Learning, Publisher: PMLR, Pages: 334-343

End-to-end control for robot manipulation and grasping is emerging as an attractive alternative to traditional pipelined approaches. However, end-to-end methods tend to either be slow to train, exhibit little or no generalisability, or lack the ability to accomplish long-horizon or multi-stage tasks. In this paper, we show how two simple techniques can lead to end-to-end (image to velocity) execution of a multi-stage task, which is analogous to a simple tidying routine, without having seen a single real image. This involves locating, reaching for, and grasping a cube, then locating a basket and dropping the cube inside. To achieve this, robot trajectories are computed in a simulator, to collect a series of control velocities which accomplish the task. Then, a CNN is trained to map observed images to velocities, using domain randomisation to enable generalisation to real world images. Results show that we are able to successfully accomplish the task in the real world with the ability to generalise to novel environments, including those with dynamic lighting conditions, distractor objects, and moving objects, including the basket itself. We believe our approach to be simple, highly scalable, and capable of learning long-horizon tasks that have until now not been shown with the state-of-the-art in end-to-end robot control.

Conference paper

McCormac J, Handa A, Leutenegger S, Davison AJ et al., SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation?, International Conference on Computer Vision 2017, Publisher: IEEE

Conference paper

Lukierski R, Leutenegger S, Davison AJ, 2017, Room layout estimation from rapid omnidirectional exploration, IEEE International Conference on Robotics and Automation (ICRA), 2017, Publisher: IEEE

A new generation of practical, low-cost indoor robots is now using wide-angle cameras to aid navigation, but usually this is limited to position estimation via sparse feature-based SLAM. Such robots usually have little global sense of the dimensions, demarcation or identities of the rooms they are in, information which would be very useful to enable behaviour with much more high level intelligence. In this paper we show that we can augment an omni-directional SLAM pipeline with straightforward dense stereo estimation and simple and robust room model fitting to obtain rapid and reliable estimation of the global shape of typical rooms from short robot motions. We have tested our method extensively in real homes, offices and on synthetic data. We also give examples of how our method can extend to making composite maps of larger rooms, and detecting room transitions.

Conference paper

McCormac J, Handa A, Davison AJ, Leutenegger S et al., 2017, SemanticFusion: dense 3D semantic mapping with convolutional neural networks, IEEE International Conference on Robotics and Automation (ICRA), 2017, Publisher: IEEE

Ever more robust, accurate and detailed mapping using visual sensing has proven to be an enabling factor for mobile robots across a wide variety of applications. For the next level of robot intelligence and intuitive user interaction, maps need to extend beyond geometry and appearance — they need to contain semantics. We address this challenge by combining Convolutional Neural Networks (CNNs) and a state-of-the-art dense Simultaneous Localization and Mapping (SLAM) system, ElasticFusion, which provides long-term dense correspondences between frames of indoor RGB-D video even during loopy scanning trajectories. These correspondences allow the CNN's semantic predictions from multiple view points to be probabilistically fused into a map. This not only produces a useful semantic 3D map, but we also show on the NYUv2 dataset that fusing multiple predictions leads to an improvement even in the 2D semantic labelling over baseline single frame predictions. We also show that for a smaller reconstruction dataset with larger variation in prediction viewpoint, the improvement over single frame segmentation increases. Our system is efficient enough to allow real-time interactive use at frame-rates of ≈25Hz.
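
The probabilistic fusion of CNN predictions from multiple viewpoints can be sketched as a recursive Bayesian update per surface element: multiply the stored class distribution by each new prediction and renormalise (a simplified version of the paper's scheme):

```python
def fuse_labels(map_probs, cnn_probs):
    """One Bayesian update step for a surfel's class distribution:
    elementwise product with the new CNN prediction, renormalised."""
    fused = [m * c for m, c in zip(map_probs, cnn_probs)]
    total = sum(fused)
    return [f / total for f in fused]

# three classes; two noisy CNN predictions of the same surfel, seen
# from different viewpoints via the dense correspondences
surfel = [1 / 3, 1 / 3, 1 / 3]                      # uniform prior
for prediction in [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]:
    surfel = fuse_labels(surfel, prediction)
```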

Conference paper

Canelhas DR, Schaffernicht E, Stoyanov T, Lilienthal AJ, Davison AJ et al., 2017, Compressed voxel-based mapping using unsupervised learning, Robotics, Vol: 6, ISSN: 2218-6581

In order to deal with the scaling problem of volumetric map representations, we propose spatially local methods for high-ratio compression of 3D maps, represented as truncated signed distance fields. We show that these compressed maps can be used as meaningful descriptors for selective decompression in scenarios relevant to robotic applications. As compression methods, we compare using PCA-derived low-dimensional bases to nonlinear auto-encoder networks. Selecting two application-oriented performance metrics, we evaluate the impact of different compression rates on reconstruction fidelity as well as to the task of map-aided ego-motion estimation. It is demonstrated that lossily reconstructed distance fields used as cost functions for ego-motion estimation can outperform the original maps in challenging scenarios from standard RGB-D (color plus depth) data sets due to the rejection of high-frequency noise content.
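
The compression scheme amounts to projecting each flattened TSDF block onto a small basis and storing only the coefficients; reconstruction is a lossy linear combination. A sketch with a hand-picked orthonormal basis (the paper instead derives bases with PCA or learns them with an auto-encoder):

```python
def compress(block, basis):
    """Project a flattened TSDF block onto a small fixed basis; only
    the resulting coefficients need to be stored."""
    return [sum(b_i * x_i for b_i, x_i in zip(b, block)) for b in basis]

def decompress(coeffs, basis):
    """Lossy reconstruction as a linear combination of basis vectors."""
    n = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(n)]

# a 4-voxel 'block' and an orthonormal 2-vector basis (illustrative)
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]]
block = [0.9, 1.1, -1.0, -1.0]
coeffs = compress(block, basis)                 # 4 floats -> 2 floats
recon = decompress(coeffs, basis)               # denoised reconstruction
```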

Journal article

Nardi L, Bodin B, Saeedi S, Vespa E, Davison AJ, Kelly P et al., Algorithmic performance-accuracy trade-off in 3D vision applications using HyperMapper, IPDPS, Publisher: IEEE

In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work, we showed for the first time that it is possible to map this application to power-constrained embedded systems, highlighting that decision choices made at the algorithmic design-level have the most significant impact. As the algorithmic design space is too large to be exhaustively evaluated, we use a previously introduced multi-objective random forest active learning prediction framework dubbed HyperMapper, to find good algorithmic designs. We show that HyperMapper generalizes on a recent cutting edge 3D scene understanding algorithm and on a modern GPU-based computer architecture. HyperMapper is able to beat an expert human hand-tuning the algorithmic parameters of the class of computer vision applications taken under consideration in this paper automatically. In addition, we use crowd-sourcing using a 3D scene understanding Android app to show that the Pareto front obtained on an embedded system can be used to accelerate the same application on all the 83 smart-phones and tablets with speedups ranging from 2x to over 12x.

Conference paper

Platinsky L, Davison AJ, Leutenegger S, Monocular visual odometry: sparse joint optimisation or dense alternation?, IEEE International Conference on Robotics and Automation (ICRA), 2017, Publisher: IEEE

Real-time monocular SLAM is increasingly mature and entering commercial products. However, there is a divide between two techniques providing similar performance. Despite the rise of 'dense' and 'semi-dense' methods which use large proportions of the pixels in a video stream to estimate motion and structure via alternating estimation, they have not eradicated feature-based methods which use a significantly smaller amount of image information from keypoints and retain a more rigorous joint estimation framework. Dense methods provide more complete scene information, but in this paper we focus on how the amount of information and different optimisation methods affect the accuracy of local motion estimation (monocular visual odometry). This topic becomes particularly relevant after the recent results from a direct sparse system. We propose a new method for fairly comparing the accuracy of SLAM frontends in a common setting. We suggest computational cost models for an overall comparison which indicates that there is relative parity between the approaches at the settings allowed by current serial processors when evaluated under equal conditions.

Conference paper

Saeedi Gharahbolagh S, Nardi L, Johns E, Bodin B, Kelly PHJ, Davison AJ et al., Application-oriented Design Space Exploration for SLAM Algorithms, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE

In visual SLAM, there are many software and hardware parameters, such as algorithmic thresholds and GPU frequency, that need to be tuned; however, this tuning should also take into account the structure and motion of the camera. In this paper, we determine the complexity of the structure and motion with a few parameters calculated using information theory. Depending on this complexity and the desired performance metrics, suitable parameters are explored and determined. Additionally, based on the proposed structure and motion parameters, several applications are presented, including a novel active SLAM approach which guides the camera in such a way that the SLAM algorithm achieves the desired performance metrics. Real-world and simulated experimental results demonstrate the effectiveness of the proposed design space and its applications.

Conference paper

Zienkiewicz J, Tsiotsios C, Davison AJ, Leutenegger S et al., 2016, Monocular, Real-Time Surface Reconstruction using Dynamic Level of Detail, International Conference on 3D Vision, Publisher: IEEE

We present a scalable, real-time capable method for robust surface reconstruction that explicitly handles multiple scales. As a monocular camera browses a scene, our algorithm processes images as they arrive and incrementally builds a detailed surface model. While most of the existing reconstruction approaches rely on volumetric or point-cloud representations of the environment, we perform depth-map and colour fusion directly into a multi-resolution triangular mesh that can be adaptively tessellated using the concept of Dynamic Level of Detail. Our method relies on least-squares optimisation, which enables a probabilistically sound and principled formulation of the fusion algorithm. We demonstrate that our method is capable of obtaining high quality, close-up reconstruction, as well as capturing overall scene geometry, while being memory and computationally efficient.

Conference paper

Johns E, Leutenegger S, Davison AJ, 2016, Pairwise Decomposition of Image Sequences for Active Multi-View Recognition, Computer Vision and Pattern Recognition, Publisher: Computer Vision Foundation (CVF), ISSN: 1063-6919

A multi-view image sequence provides a much richer capacity for object recognition than from a single image. However, most existing solutions to multi-view recognition typically adopt hand-crafted, model-based geometric methods, which do not readily embrace recent trends in deep learning. We propose to bring Convolutional Neural Networks to generic multi-view recognition, by decomposing an image sequence into a set of image pairs, classifying each pair independently, and then learning an object classifier by weighting the contribution of each pair. This allows for recognition over arbitrary camera trajectories, without requiring explicit training over the potentially infinite number of camera paths and lengths. Building these pairwise relationships then naturally extends to the next-best-view problem in an active recognition framework. To achieve this, we train a second Convolutional Neural Network to map directly from an observed image to next viewpoint. Finally, we incorporate this into a trajectory optimisation task, whereby the best recognition confidence is sought for a given trajectory length. We present state-of-the-art results in both guided and unguided multi-view recognition on the ModelNet dataset, and show how our method can be used with depth images, greyscale images, or both.
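
The pairwise decomposition reduces sequence classification to a weighted combination of per-pair class scores; a minimal sketch with fixed weights standing in for the learned per-pair weighting:

```python
def fuse_pairwise(pair_scores, pair_weights):
    """Combine per-pair class scores with per-pair weights (learned in
    the paper, fixed numbers here) into one object-level prediction."""
    n_classes = len(pair_scores[0])
    totals = [0.0] * n_classes
    for scores, w in zip(pair_scores, pair_weights):
        for c in range(n_classes):
            totals[c] += w * scores[c]
    z = sum(totals)
    return [t / z for t in totals]

# three image pairs from one camera trajectory, scored over two classes
pair_scores  = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
pair_weights = [1.0, 1.0, 0.5]          # e.g. down-weight an ambiguous pair
fused = fuse_pairwise(pair_scores, pair_weights)
label = max(range(len(fused)), key=fused.__getitem__)
```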

Conference paper

Bardow P, Davison AJ, Leutenegger S, 2016, Simultaneous Optical Flow and Intensity Estimation from an Event Camera, Computer Vision and Pattern Recognition 2016, Publisher: Computer Vision Foundation (CVF), ISSN: 1063-6919

Event cameras are bio-inspired vision sensors which mimic retinas to measure per-pixel intensity change rather than outputting an actual intensity image. This proposed paradigm shift away from traditional frame cameras offers significant potential advantages: namely avoiding high data rates, dynamic range limitations and motion blur. Unfortunately, however, established computer vision algorithms may not at all be applied directly to event cameras. Methods proposed so far to reconstruct images, estimate optical flow, track a camera and reconstruct a scene come with severe restrictions on the environment or on the motion of the camera, e.g. allowing only rotation. Here, we propose, to the best of our knowledge, the first algorithm to simultaneously recover the motion field and brightness image, while the camera undergoes a generic motion through any scene. Our approach employs minimisation of a cost function that contains the asynchronous event data as well as spatial and temporal regularisation within a sliding window time interval. Our implementation relies on GPU optimisation and runs in near real-time. In a series of examples, we demonstrate the successful operation of our framework, including in situations where conventional cameras suffer from dynamic range limitations and motion blur.

Conference paper

Zienkiewicz J, Davison AJ, Leutenegger S, 2016, Real-Time Height Map Fusion using Differentiable Rendering, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, Publisher: IEEE, ISSN: 2153-0866

We present a robust real-time method which performs dense reconstruction of high quality height maps from monocular video. By representing the height map as a triangular mesh, and using an efficient differentiable rendering approach, our method enables rigorous incremental probabilistic fusion of standard locally estimated depth and colour into an immediately usable dense model. We present results for the application of free space and obstacle mapping by a low-cost robot, showing that detailed maps suitable for autonomous navigation can be obtained using only a single forward-looking camera.

Conference paper

Whelan T, Salas Moreno R, Leutenegger S, Davison A, Glocker B et al., 2016, Modelling a Three-Dimensional Space, WO2016189274

Patent

Johns E, Leutenegger S, Davison AJ, 2016, Deep Learning a Grasp Function for Grasping Under Gripper Pose Uncertainty, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, Publisher: IEEE, ISSN: 2153-0866

This paper presents a new method for parallel-jaw grasping of isolated objects from depth images, under large gripper pose uncertainty. Whilst most approaches aim to predict the single best grasp pose from an image, our method first predicts a score for every possible grasp pose, which we denote the grasp function. With this, it is possible to achieve grasping robust to the gripper's pose uncertainty, by smoothing the grasp function with the pose uncertainty function. Therefore, if the single best pose is adjacent to a region of poor grasp quality, that pose will no longer be chosen, and instead a pose will be chosen which is surrounded by a region of high grasp quality. To learn this function, we train a Convolutional Neural Network which takes as input a single depth image of an object, and outputs a score for each grasp pose across the image. Training data for this is generated by use of physics simulation and depth image simulation with 3D object meshes, to enable acquisition of sufficient data without requiring exhaustive real-world experiments. We evaluate with both synthetic and real experiments, and show that the learned grasp score is more robust to gripper pose uncertainty than when this uncertainty is not accounted for.
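
Smoothing the grasp function with the pose uncertainty function is a convolution: a narrow high-scoring pose surrounded by poor poses loses to a slightly lower but broad plateau. A one-dimensional sketch with a Gaussian uncertainty model (the real grasp function is defined over a 2D image of poses):

```python
import math

def smooth_grasp_function(scores, sigma=1.0):
    """Convolve per-pose grasp scores with a Gaussian model of gripper
    pose uncertainty; the robust grasp is the argmax of the smoothed
    function rather than of the raw one."""
    n = len(scores)
    smoothed = []
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            w = math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
            num += w * scores[j]
            den += w
        smoothed.append(num / den)        # normalised to handle borders
    return smoothed

# a narrow high peak (pose 1) next to poor poses, and a broad plateau
raw = [0.1, 0.95, 0.1, 0.6, 0.65, 0.6, 0.1]
best_raw = max(range(len(raw)), key=raw.__getitem__)
smooth = smooth_grasp_function(raw, sigma=1.0)
best_robust = max(range(len(smooth)), key=smooth.__getitem__)
```

Under this model the raw maximum is the brittle pose 1, while the smoothed maximum moves to the centre of the plateau at pose 4.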

Conference paper

Handa A, Bloesch M, Patraucean V, Stent S, McCormac J, Davison A et al., 2016, gvnn: neural network library for geometric computer vision, 14th European Conference on Computer Vision (ECCV), Publisher: Springer Verlag, Pages: 67-82, ISSN: 0302-9743

We introduce gvnn, a neural network library in Torch aimed towards bridging the gap between classic geometric computer vision and deep learning. Inspired by the recent success of Spatial Transformer Networks, we propose several new layers which are often used as parametric transformations on the data in geometric computer vision. These layers can be inserted within a neural network much in the spirit of the original spatial transformers and allow backpropagation to enable end-to-end learning of a network involving any domain knowledge in geometric computer vision. This opens up applications in learning invariance to 3D geometric transformation for place recognition, end-to-end visual odometry, depth estimation and unsupervised learning through warping with a parametric transformation for image reconstruction error.

Conference paper

Tsiotsios C, Davison AJ, Kim T-K, 2016, Near-lighting Photometric Stereo for unknown scene distance and medium attenuation, Image and Vision Computing, Vol: 57, Pages: 44-57, ISSN: 0262-8856

Journal article

Whelan T, Salas-Moreno RF, Glocker B, Davison AJ, Leutenegger S et al., 2016, ElasticFusion: real-time dense SLAM and light source estimation, International Journal of Robotics Research, Vol: 35, Pages: 1697-1716, ISSN: 1741-3176

We present a novel approach to real-time dense visual SLAM. Our system is capable of capturing comprehensive dense globally consistent surfel-based maps of room scale environments and beyond, explored using an RGB-D camera in an incremental online fashion, without pose graph optimisation or any post-processing steps. This is accomplished by using dense frame-to-model camera tracking and windowed surfel-based fusion coupled with frequent model refinement through non-rigid surface deformations. Our approach applies local model-to-model surface loop closure optimisations as often as possible to stay close to the mode of the map distribution, while utilising global loop closure to recover from arbitrary drift and maintain global consistency. In the spirit of improving map quality as well as tracking accuracy and robustness, we furthermore explore a novel approach to real-time discrete light source detection. This technique is capable of detecting numerous light sources in indoor environments in real-time as a user handheld camera explores the scene. Absolutely no prior information about the scene or number of light sources is required. By making a small set of simple assumptions about the appearance properties of the scene our method can incrementally estimate both the quantity and location of multiple light sources in the environment in an online fashion. Our results demonstrate that our technique functions well in many different environments and lighting configurations. We show that this enables (a) more realistic augmented reality (AR) rendering; (b) a richer understanding of the scene beyond pure geometry; and (c) more accurate and robust photometric tracking.
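
The surfel-based fusion at the heart of such systems updates each surface element by a confidence-weighted average of its attributes with the new measurement; a simplified sketch of one such update (field names are illustrative):

```python
def fuse_surfel(surfel, measurement, new_weight=1.0):
    """Confidence-weighted average update of a surfel's position,
    normal and colour: a simplified version of surfel-based fusion."""
    w = surfel["weight"]
    w_total = w + new_weight
    for key in ("position", "normal", "colour"):
        surfel[key] = [(w * a + new_weight * b) / w_total
                       for a, b in zip(surfel[key], measurement[key])]
    surfel["weight"] = w_total            # confidence grows with views
    return surfel

s = {"position": [0.0, 0.0, 1.0], "normal": [0.0, 0.0, 1.0],
     "colour": [100.0, 100.0, 100.0], "weight": 1.0}
m = {"position": [0.0, 0.0, 1.2], "normal": [0.0, 0.0, 1.0],
     "colour": [120.0, 100.0, 100.0]}
s = fuse_surfel(s, m)
```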

Journal article

Kim H, Leutenegger S, Davison AJ, Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera, ECCV 2016 - European Conference on Computer Vision, Publisher: Springer, ISSN: 0302-9743

We propose a method which can perform real-time 3D reconstruction from a single hand-held event camera with no additional sensing, and works in unstructured scenes of which it has no prior knowledge. It is based on three decoupled probabilistic filters, each estimating 6-DoF camera motion, scene logarithmic (log) intensity gradient and scene inverse depth relative to a keyframe, and we build a real-time graph of these to track and model over an extended local workspace. We also upgrade the gradient estimate for each keyframe into an intensity image, allowing us to recover a real-time video-like intensity sequence with spatial and temporal super-resolution from the low bit-rate input event stream. To the best of our knowledge, this is the first algorithm provably able to track a general 6D motion along with reconstruction of arbitrary structure including its intensity and the reconstruction of grayscale video that exclusively relies on event camera data.

Conference paper

Zia MZ, Nardi L, Jack A, Vespa E, Bodin B, Kelly PHJ, Davison AJ et al., 2016, Comparative design space exploration of dense and semi-dense SLAM, 2016 IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 1292-1299, ISSN: 1050-4729

SLAM has matured significantly over the past few years, and is beginning to appear in serious commercial products. While new SLAM systems are being proposed at every conference, evaluation is often restricted to qualitative visualizations or accuracy estimation against a ground truth. This is due to the lack of benchmarking methodologies which can holistically and quantitatively evaluate these systems. Further investigation at the level of individual kernels and parameter spaces of SLAM pipelines is non-existent, which is absolutely essential for systems research and integration. We extend the recently introduced SLAMBench framework to allow comparing two state-of-the-art SLAM pipelines, namely KinectFusion and LSD-SLAM, along the metrics of accuracy, energy consumption, and processing frame rate on two different hardware platforms, namely a desktop and an embedded device. We also analyze the pipelines at the level of individual kernels and explore their algorithmic and hardware design spaces for the first time, yielding valuable insights.

Conference paper

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.
