James S, Ma Z, Arrojo DR, et al., 2020, RLBench: The robot learning benchmark & learning environment, IEEE Robotics and Automation Letters, Vol: 5, Pages: 3019-3026, ISSN: 2377-3766
We present a challenging new benchmark and learning-environment for robot learning: RLBench. The benchmark features 100 completely unique, hand-designed tasks, ranging in difficulty from simple target reaching and door opening to longer multi-stage tasks, such as opening an oven and placing a tray in it. We provide an array of both proprioceptive observations and visual observations, which include rgb, depth, and segmentation masks from an over-the-shoulder stereo camera and an eye-in-hand monocular camera. Uniquely, each task comes with an infinite supply of demos through the use of motion planners operating on a series of waypoints given during task creation time; enabling an exciting flurry of demonstration-based learning possibilities. RLBench has been designed with scalability in mind; new tasks, along with their motion-planned demos, can be easily created and then verified by a series of tools, allowing users to submit their own tasks to the RLBench task repository. This large-scale benchmark aims to accelerate progress in a number of vision-guided manipulation research areas, including: reinforcement learning, imitation learning, multi-task learning, geometric computer vision, and in particular, few-shot learning. With the benchmark's breadth of tasks and demonstrations, we propose the first large-scale few-shot challenge in robotics. We hope that the scale and diversity of RLBench offers unparalleled research opportunities in the robot learning community and beyond. Benchmarking code and videos can be found at https://sites.google.com/view/rlbench .
Bonardi A, James S, Davison AJ, 2020, Learning one-shot imitation from humans without humans, IEEE Robotics and Automation Letters, Vol: 5, Pages: 3533-3539, ISSN: 2377-3766
Humans can naturally learn to execute a new task by seeing it performed by other individuals once, and then reproduce it in a variety of configurations. Endowing robots with this ability of imitating humans from third person is a very immediate and natural way of teaching new tasks. Only recently, through meta-learning, there have been successful attempts at one-shot imitation learning from humans; however, these approaches require a lot of human resources to collect the data in the real world to train the robot. But is there a way to remove the need for real world human demonstrations during training? We show that with Task-Embedded Control Networks, we can infer control policies by embedding human demonstrations that can condition a control policy and achieve one-shot imitation learning. Importantly, we do not use a real human arm to supply demonstrations during training, but instead leverage domain randomisation in an application that has not been seen before: sim-to-real transfer on humans. Upon evaluating our approach on pushing and placing tasks in both simulation and in the real world, we show that in comparison to a system that was trained on real-world data we are able to achieve similar results by utilising only simulation data. Videos can be found here: https://sites.google.com/view/tecnets-humans .
Estimating motion and surrounding geometry of a moving camera remains a challenging inference problem. From an information theoretic point of view, estimates should get better as more information is included, such as is done in dense SLAM, but this is strongly dependent on the validity of the underlying models. In the present paper, we use triangular meshes as both compact and dense geometry representation. To allow for simple and fast usage, we propose a view-based formulation for which we predict the in-plane vertex coordinates directly from images and then employ the remaining vertex depth components as free variables. Flexible and continuous integration of information is achieved through the use of a residual based inference technique. This so-called factor graph encodes all information as mapping from free variables to residuals, the squared sum of which is minimised during inference. We propose the use of different types of learnable residuals, which are trained end-to-end to increase their suitability as information bearing models and to enable accurate and reliable estimation. Detailed evaluation of all components is provided on both synthetic and real data which confirms the practicability of the presented approach.
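The residual-based inference described in this abstract (free variables mapped to residuals whose squared sum is minimised) can be illustrated with a toy Gauss-Newton loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the single depth variable, the two hand-written residuals (a measurement factor and a prior factor), their numeric values and all function names are hypothetical.

```python
# Toy factor graph: one free variable (a depth d) and two residual factors.
# All names and numbers are illustrative assumptions, not taken from the paper.

def residuals(d):
    """Return the residuals r_i(d) for the current estimate d."""
    r_meas = (d - 2.0) / 0.1   # measurement factor: observed depth 2.0, sigma 0.1
    r_prior = (d - 3.0) / 1.0  # prior factor: expected depth 3.0, sigma 1.0
    return [r_meas, r_prior]

def jacobian(d):
    """Derivatives dr_i/dd (constant here because both residuals are linear)."""
    return [1.0 / 0.1, 1.0 / 1.0]

def gauss_newton(d0, iterations=5):
    """Minimise sum_i r_i(d)^2 with scalar Gauss-Newton steps."""
    d = d0
    for _ in range(iterations):
        r = residuals(d)
        J = jacobian(d)
        JtJ = sum(j * j for j in J)                 # normal-equation matrix (1x1)
        Jtr = sum(j * ri for j, ri in zip(J, r))    # gradient term
        d -= Jtr / JtJ                              # Gauss-Newton update
    return d

estimate = gauss_newton(d0=0.0)
cost = sum(ri * ri for ri in residuals(estimate))
```

Because the tighter measurement factor carries more weight, the estimate lands close to 2.0 rather than midway between the two factors. In the paper's setting the residuals are learned and the variables are vertex depths, which this sketch does not attempt to model.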
Czarnowski J, Laidlow T, Clark R, et al., 2020, DeepFactors: Real-time probabilistic dense monocular SLAM, IEEE Robotics and Automation Letters, Vol: 5, Pages: 721-728, ISSN: 2377-3766
The ability to estimate rich geometry and camera motion from monocular imagery is fundamental to future interactive robotics and augmented reality applications. Different approaches have been proposed that vary in scene geometry representation (sparse landmarks, dense maps), the consistency metric used for optimising the multi-view problem, and the use of learned priors. We present a SLAM system that unifies these methods in a probabilistic framework while still maintaining real-time performance. This is achieved through the use of a learned compact depth map representation and reformulating three different types of errors: photometric, reprojection and geometric, which we make use of within standard factor graph software. We evaluate our system on trajectory estimation and depth reconstruction on real-world sequences and present various examples of estimated dense geometry.
Johns E, Liu S, Davison A, 2020, End-to-end multi-task learning with attention, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, Publisher: IEEE
We propose a novel multi-task learning architecture, which allows learning of task-specific feature-level attention. Our design, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with a soft-attention module for each task. These modules allow for learning of task-specific features from the global features, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be trained end-to-end and can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. We evaluate our approach on a variety of datasets, across both image-to-image predictions and image classification tasks. We show that our architecture is state-of-the-art in multi-task learning compared to existing methods, and is also less sensitive to various weighting schemes in the multi-task loss function. Code is available at https://github.com/lorenmt/mtan.
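The idea of task-specific soft attention over a shared feature pool can be sketched in a few lines. This is only an illustration under stated assumptions: in the architecture described above the masks are produced by learned attention modules, whereas here the mask logits are fixed hypothetical numbers, and the task names and feature values are invented for the example.

```python
import math

def sigmoid(x):
    """Squash a logit into (0, 1) so it can act as a soft mask value."""
    return 1.0 / (1.0 + math.exp(-x))

def task_features(shared, mask_logits):
    """Apply a soft attention mask element-wise to the shared features."""
    return [f * sigmoid(m) for f, m in zip(shared, mask_logits)]

# Shared (global) features, common to every task. Values are made up.
shared = [0.5, -1.2, 2.0, 0.3]

# One set of mask logits per task (hypothetical; normally learned).
mask_logits = {
    "segmentation": [4.0, -4.0, 0.0, 4.0],
    "depth": [-4.0, 4.0, 4.0, 0.0],
}

# Each task selects its own view of the same shared features.
per_task = {task: task_features(shared, logits)
            for task, logits in mask_logits.items()}
```

Each task keeps the features its mask emphasises and attenuates the rest, while the underlying feature pool stays shared across tasks.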
Liu S, Davison A, Johns E, 2019, Self-supervised generalisation with meta auxiliary learning, 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Publisher: Neural Information Processing Systems Foundation, Inc.
Learning with auxiliary tasks can improve the ability of a primary task to generalise. However, this comes at the cost of manually labelling auxiliary data. We propose a new method which automatically learns appropriate labels for an auxiliary task, such that any supervised learning task can be improved without requiring access to any further data. The approach is to train two neural networks: a label-generation network to predict the auxiliary labels, and a multi-task network to train the primary task alongside the auxiliary task. The loss for the label-generation network incorporates the loss of the multi-task network, and so this interaction between the two networks can be seen as a form of meta learning with a double gradient. We show that our proposed method, Meta AuXiliary Learning (MAXL), outperforms single-task learning on 7 image datasets, without requiring any additional data. We also show that MAXL outperforms several other baselines for generating auxiliary labels, and is even competitive when compared with human-defined auxiliary labels. The self-supervised nature of our method leads to a promising new direction towards automated generalisation. Source code can be found at https://github.com/lorenmt/maxl.
Bujanca M, Gafton P, Saeedi S, et al., 2019, SLAMBench 3.0: Systematic automated reproducible evaluation of SLAM systems for robot vision challenges and scene understanding, 2019 International Conference on Robotics and Automation (ICRA), Publisher: Institute of Electrical and Electronics Engineers, ISSN: 1050-4729
As the SLAM research area matures and the number of SLAM systems available increases, the need for frameworks that can objectively evaluate them against prior work grows. This new version of SLAMBench moves beyond traditional visual SLAM, and provides new support for scene understanding and non-rigid environments (dynamic SLAM). More concretely for dynamic SLAM, SLAMBench 3.0 includes the first publicly available implementation of DynamicFusion, along with an evaluation infrastructure. In addition, we include two SLAM systems (one dense, one sparse) augmented with convolutional neural networks for scene understanding, together with datasets and appropriate metrics. Through a series of use-cases, we demonstrate the newly incorporated algorithms, visualisation aids and metrics (6 new metrics, 4 new datasets and 5 new algorithms).
Saeedi S, Carvalho EDC, Li W, et al., 2019, Characterizing visual localization and mapping datasets, 2019 International Conference on Robotics and Automation (ICRA), Publisher: Institute of Electrical and Electronics Engineers, ISSN: 1050-4729
Benchmarking mapping and motion estimation algorithms is established practice in robotics and computer vision. As the diversity of datasets increases, in terms of the trajectories, models, and scenes, it becomes a challenge to select datasets for a given benchmarking purpose. Inspired by the Wasserstein distance, this paper addresses this concern by developing novel metrics to evaluate trajectories and the environments without relying on any SLAM or motion estimation algorithm. The metrics, which so far have been missing in the research community, can be applied to the plethora of datasets that exist. Additionally, to improve the robotics SLAM benchmarking, the paper presents a new dataset for visual localization and mapping algorithms. A broad range of real-world trajectories is used in very high-quality scenes and a rendering framework to create a set of synthetic datasets with ground-truth trajectory and dense map which are representative of key SLAM applications such as virtual reality (VR), micro aerial vehicle (MAV) flight, and ground robotics.
Zhi S, Bloesch M, Leutenegger S, et al., 2019, Scenecode: Monocular dense semantic reconstruction using learned encoded scene representations
Systems which incrementally create 3D semantic maps from image sequences must store and update representations of both geometry and semantic entities. However, while there has been much work on the correct formulation for geometrical estimation, state-of-the-art systems usually rely on simple semantic representations which store and update independent label estimates for each surface element (depth pixels, surfels, or voxels). Spatial correlation is discarded, and fused label maps are incoherent and noisy. We introduce a new compact and optimisable semantic representation by training a variational auto-encoder that is conditioned on a colour image. Using this learned latent space, we can tackle semantic label fusion by jointly optimising the low-dimensional codes associated with each of a set of overlapping images, producing consistent fused label maps which preserve spatial correlation. We also show how this approach can be used within a monocular keyframe based semantic mapping system where a similar code approach is used for geometry. The probabilistic formulation allows a flexible formulation where we can jointly estimate motion, geometry and semantics in a unified optimisation.
Xu B, Li W, Tzoumanikas D, et al., MID-fusion: octree-based object-level multi-instance dynamic SLAM, ICRA 2019 - IEEE International Conference on Robotics and Automation, Publisher: IEEE
We propose a new multi-instance dynamic RGB-D SLAM system using an object-level octree-based volumetric representation. It can provide robust camera tracking in dynamic environments and at the same time, continuously estimate geometric, semantic, and motion properties for arbitrary objects in the scene. For each incoming frame, we perform instance segmentation to detect objects and refine mask boundaries using geometric and motion information. Meanwhile, we estimate the pose of each existing moving object using an object-oriented tracking method and robustly track the camera pose against the static scene. Based on the estimated camera pose and object poses, we associate segmented masks with existing models and incrementally fuse corresponding colour, depth, semantic, and foreground object probabilities into each object model. In contrast to existing approaches, our system is the first system to generate an object-level dynamic volumetric map from a single RGB-D camera, which can be used directly for robotic tasks. Our method can run at 2-3 Hz on a CPU, excluding the instance segmentation part. We demonstrate its effectiveness by quantitatively and qualitatively testing it on both synthetic and real-world sequences.
Saeedi Gharahbolagh S, Bodin B, Wagstaff H, et al., 2018, Navigating the landscape for real-time localisation and mapping for robotics, virtual and augmented reality, Proceedings of the IEEE, Vol: 106, Pages: 2020-2039, ISSN: 0018-9219
Visual understanding of 3-D environments in real time, at low power, is a huge computational challenge. Often referred to as simultaneous localization and mapping (SLAM), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are: 1) tools and methodology for systematic quantitative evaluation of SLAM algorithms; 2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives; 3) end-to-end simulation tools to enable optimization of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches; and 4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.
Matas J, James S, Davison A, 2018, Sim-to-real reinforcement learning for deformable object manipulation, Conference on Robot Learning 2018, Publisher: PMLR, Pages: 734-743
We have seen much recent progress in rigid object manipulation, but interaction with deformable objects has notably lagged behind. Due to the large configuration space of deformable objects, solutions using traditional modelling approaches require significant engineering work. Perhaps then, bypassing the need for explicit modelling and instead learning the control in an end-to-end manner serves as a better approach? Despite the growing interest in the use of end-to-end robot learning approaches, only a small amount of work has focused on their applicability to deformable object manipulation. Moreover, due to the large amount of data needed to learn these end-to-end solutions, an emerging trend is to learn control policies in simulation and then transfer them over to the real world. To date, no work has explored whether it is possible to learn and transfer deformable object policies. We believe that if sim-to-real methods are to be employed further, then it should be possible to learn to interact with a wide variety of objects, and not only rigid objects. In this work, we use a combination of state-of-the-art deep reinforcement learning algorithms to solve the problem of manipulating deformable objects (specifically cloth). We evaluate our approach on three tasks: folding a towel up to a mark, folding a face towel diagonally, and draping a piece of cloth over a hanger. Our agents are fully trained in simulation with domain randomisation, and then successfully deployed in the real world without having seen any real deformable objects.
James S, Bloesch M, Davison A, 2018, Task-embedded control networks for few-shot imitation learning, Conference on Robot Learning, Publisher: PMLR, Pages: 783-795
Much like humans, robots should have the ability to leverage knowledge from previously learned tasks in order to learn new tasks quickly in new and unfamiliar environments. Despite this, most robot learning approaches have focused on learning a single task, from scratch, with a limited notion of generalisation, and no way of leveraging the knowledge to learn other tasks more efficiently. One possible solution is meta-learning, but many of the related approaches are limited in their ability to scale to a large number of tasks and to learn further tasks without forgetting previously learned ones. With this in mind, we introduce Task-Embedded Control Networks, which employ ideas from metric learning in order to create a task embedding that can be used by a robot to learn new tasks from one or more demonstrations. In the area of visually-guided manipulation, we present simulation results in which we surpass the performance of a state-of-the-art method when using only visual information from each demonstration. Additionally, we demonstrate that our approach can also be used in conjunction with domain randomisation to train our few-shot learning ability in simulation and then deploy in the real world without any additional training. Once deployed, the robot can learn new tasks from a single real-world demonstration.
McCormac J, Clark R, Bloesch M, et al., 2018, Fusion++: Volumetric object-level SLAM, 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), International Conference on, Publisher: IEEE, Pages: 32-41, ISSN: 2378-3826
We propose an online object-level SLAM system which builds a persistent and accurate 3D graph map of arbitrary reconstructed objects. As an RGB-D camera browses a cluttered indoor scene, Mask-RCNN instance segmentations are used to initialise compact per-object Truncated Signed Distance Function (TSDF) reconstructions with object size-dependent resolutions and a novel 3D foreground mask. Reconstructed objects are stored in an optimisable 6DoF pose graph which is our only persistent map representation. Objects are incrementally refined via depth fusion, and are used for tracking, relocalisation and loop closure detection. Loop closures cause adjustments in the relative pose estimates of object instances, but no intra-object warping. Each object also carries semantic information which is refined over time and an existence probability to account for spurious instance predictions. We demonstrate our approach on a hand-held RGB-D sequence from a cluttered office scene with a large number and variety of object instances, highlighting how the system closes loops and makes good use of existing objects on repeated loops. We quantitatively evaluate the trajectory error of our system against a baseline approach on the RGB-D SLAM benchmark, and qualitatively compare reconstruction quality of discovered objects on the YCB video dataset. Performance evaluation shows our approach is highly memory efficient and runs online at 4-8Hz (excluding relocalisation) despite not being optimised at the software level.
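Incremental refinement of a TSDF by depth fusion is commonly done with a per-voxel weighted running average. The snippet below is a minimal sketch of that standard update for a single voxel. It is an assumption that the system described above uses this exact rule, and the function name, truncation distance and weight cap are hypothetical values chosen for illustration.

```python
def fuse_tsdf(tsdf, weight, new_sdf, truncation=0.05, max_weight=100.0):
    """Fuse one new signed-distance observation into a voxel.

    tsdf, weight: current voxel state (weight 0.0 means unobserved).
    new_sdf: signed distance to the surface measured from the new depth frame.
    Returns the updated (tsdf, weight) pair.
    """
    # Truncate the observation to the band around the surface.
    clamped = max(-truncation, min(truncation, new_sdf))
    # Weighted running average, each observation contributing weight 1.
    fused = (tsdf * weight + clamped) / (weight + 1.0)
    # Cap the weight so that old data cannot dominate indefinitely.
    return fused, min(weight + 1.0, max_weight)

# Fuse three observations into an initially empty voxel.
voxel = (0.0, 0.0)
for observation in [0.02, 0.04, 1.0]:
    voxel = fuse_tsdf(voxel[0], voxel[1], observation)
```

The third observation lies far outside the truncation band, so it is clamped to 0.05 before being averaged in, which limits the influence of distant or spurious measurements.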
Clark R, Bloesch M, Czarnowski J, et al., 2018, Learning to solve nonlinear least squares for monocular stereo, 15th European Conference on Computer Vision, Publisher: Springer Nature Switzerland AG 2018, Pages: 291-306, ISSN: 0302-9743
Sum-of-squares objective functions are very popular in computer vision algorithms. However, these objective functions are not always easy to optimize. The underlying assumptions made by solvers are often not satisfied and many problems are inherently ill-posed. In this paper, we propose a neural nonlinear least squares optimization algorithm which learns to effectively optimize these cost functions even in the presence of adversities. Unlike traditional approaches, the proposed solver requires no hand-crafted regularizers or priors as these are implicitly learned from the data. We apply our method to the problem of motion stereo, i.e. jointly estimating the motion and scene geometry from pairs of images of a monocular sequence. We show that our learned optimizer is able to efficiently and effectively solve this challenging optimization problem.
Clark R, Bloesch M, Czarnowski J, et al., LS-Net: Learning to Solve Nonlinear Least Squares for Monocular Stereo, European Conference on Computer Vision
Bodin B, Wagstaff H, Saeedi S, et al., 2018, SLAMBench2: multi-objective head-to-head benchmarking for visual SLAM, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE, Pages: 3637-3644, ISSN: 1050-4729
SLAM is becoming a key component of robotics and augmented reality (AR) systems. While a large number of SLAM algorithms have been presented, there has been little effort to unify the interface of such algorithms, or to perform a holistic comparison of their capabilities. This is a problem since different SLAM applications can have different functional and non-functional requirements. For example, a mobile phone-based AR application has a tight energy budget, while a UAV navigation system usually requires high accuracy. SLAMBench2 is a benchmarking framework to evaluate existing and future SLAM systems, both open and closed source, over an extensible list of datasets, while using a comparable and clearly specified list of performance metrics. A wide variety of existing SLAM algorithms and datasets is supported, e.g. ElasticFusion, InfiniTAM, ORB-SLAM2, OKVIS, and integrating new ones is straightforward and clearly specified by the framework. SLAMBench2 is a publicly-available software framework which represents a starting point for quantitative, comparable and validatable experimental research to investigate trade-offs across SLAM systems.
Bloesch M, Czarnowski J, Clark R, et al., CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM, IEEE Computer Vision and Pattern Recognition 2018, Publisher: IEEE
The representation of geometry in real-time 3D perception systems continues to be a critical research issue. Dense maps capture complete surface shape and can be augmented with semantic labels, but their high dimensionality makes them computationally costly to store and process, and unsuitable for rigorous probabilistic inference. Sparse feature-based representations avoid these problems, but capture only partial scene information and are mainly useful for localisation only. We present a new compact but dense representation of scene geometry which is conditioned on the intensity data from a single image and generated from a code consisting of a small number of parameters. We are inspired by work both on learned depth from images, and auto-encoders. Our approach is suitable for use in a keyframe-based monocular dense SLAM system: While each keyframe with a code can produce a depth map, the code can be optimised efficiently jointly with pose variables and together with the codes of overlapping keyframes to attain global consistency. Conditioning the depth map on the image allows the code to only represent aspects of the local geometry which cannot directly be predicted from the image. We explain how to learn our code representation, and demonstrate its advantageous properties in monocular SLAM.
Czarnowski J, Leutenegger S, Davison AJ, 2018, Semantic Texture for Robust Dense Tracking, 16th IEEE International Conference on Computer Vision (ICCV), Publisher: IEEE, Pages: 851-859, ISSN: 2473-9936
McCormac J, Handa A, Leutenegger S, et al., 2017, SceneNet RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation?, International Conference on Computer Vision 2017, Publisher: IEEE, ISSN: 2380-7504
We introduce SceneNet RGB-D, a dataset providing pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation, and object detection. It also provides perfect camera poses and depth data, allowing investigation into geometric computer vision problems such as optical flow, camera pose estimation, and 3D scene labelling tasks. Random sampling permits virtually unlimited scene configurations, and here we provide 5M rendered RGB-D images from 16K randomly generated 3D trajectories in synthetic layouts, with random but physically simulated object configurations. We compare the semantic segmentation performance of network weights produced from pretraining on RGB images from our dataset against generic VGG-16 ImageNet weights. After fine-tuning on the SUN RGB-D and NYUv2 real-world datasets we find in both cases that the synthetically pre-trained network outperforms the VGG-16 weights. When synthetic pre-training includes a depth channel (something ImageNet cannot natively provide) the performance is greater still. This suggests that large-scale high-quality synthetic RGB datasets with task-specific labels can be more useful for pretraining than real-world generic pre-training such as ImageNet. We host the dataset at http://robotvault.bitbucket.io/scenenet-rgbd.html.
James S, Davison A, Johns E, 2017, Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task, Conference on Robot Learning, Publisher: PMLR, Pages: 334-343
End-to-end control for robot manipulation and grasping is emerging as an attractive alternative to traditional pipelined approaches. However, end-to-end methods tend to either be slow to train, exhibit little or no generalisability, or lack the ability to accomplish long-horizon or multi-stage tasks. In this paper, we show how two simple techniques can lead to end-to-end (image to velocity) execution of a multi-stage task, which is analogous to a simple tidying routine, without having seen a single real image. This involves locating, reaching for, and grasping a cube, then locating a basket and dropping the cube inside. To achieve this, robot trajectories are computed in a simulator, to collect a series of control velocities which accomplish the task. Then, a CNN is trained to map observed images to velocities, using domain randomisation to enable generalisation to real world images. Results show that we are able to successfully accomplish the task in the real world with the ability to generalise to novel environments, including those with dynamic lighting conditions, distractor objects, and moving objects, including the basket itself. We believe our approach to be simple, highly scalable, and capable of learning long-horizon tasks that have until now not been shown with the state-of-the-art in end-to-end robot control.
McCormac J, Handa A, Davison AJ, et al., 2017, SemanticFusion: dense 3D semantic mapping with convolutional neural networks, IEEE International Conference on Robotics and Automation (ICRA), 2017, Publisher: IEEE
Ever more robust, accurate and detailed mapping using visual sensing has proven to be an enabling factor for mobile robots across a wide variety of applications. For the next level of robot intelligence and intuitive user interaction, maps need to extend beyond geometry and appearance — they need to contain semantics. We address this challenge by combining Convolutional Neural Networks (CNNs) and a state-of-the-art dense Simultaneous Localization and Mapping (SLAM) system, ElasticFusion, which provides long-term dense correspondences between frames of indoor RGB-D video even during loopy scanning trajectories. These correspondences allow the CNN's semantic predictions from multiple view points to be probabilistically fused into a map. This not only produces a useful semantic 3D map, but we also show on the NYUv2 dataset that fusing multiple predictions leads to an improvement even in the 2D semantic labelling over baseline single frame predictions. We also show that for a smaller reconstruction dataset with larger variation in prediction viewpoint, the improvement over single frame segmentation increases. Our system is efficient enough to allow real-time interactive use at frame-rates of ≈25Hz.
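Probabilistic fusion of per-view class predictions can be sketched as a recursive Bayesian update: multiply the stored class distribution of a map element by each new prediction and renormalise. This is a minimal sketch assuming independent per-frame predictions. The function name, the three-class setup and the probabilities are hypothetical, and the abstract does not spell out the exact update rule.

```python
def fuse_labels(belief, prediction):
    """Update a per-element class distribution with one new CNN prediction.

    belief, prediction: lists of class probabilities of equal length.
    Returns the normalised element-wise product.
    """
    unnormalised = [b * p for b, p in zip(belief, prediction)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Start from a uniform belief over three hypothetical classes.
belief = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

# Predictions for the same surface element seen from three viewpoints.
views = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
for prediction in views:
    belief = fuse_labels(belief, prediction)
```

After three moderately confident views the fused belief in the first class is higher than in any single prediction, which mirrors the reported gain of multi-view fusion over single-frame labelling.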
Saeedi Gharahbolagh S, Nardi L, Johns E, et al., 2017, Application-oriented design space exploration for SLAM algorithms, IEEE International Conference on Robotics and Automation (ICRA), Publisher: IEEE
In visual SLAM, there are many software and hardware parameters, such as algorithmic thresholds and GPU frequency, that need to be tuned; however, this tuning should also take into account the structure and motion of the camera. In this paper, we determine the complexity of the structure and motion with a few parameters calculated using information theory. Depending on this complexity and the desired performance metrics, suitable parameters are explored and determined. Additionally, based on the proposed structure and motion parameters, several applications are presented, including a novel active SLAM approach which guides the camera in such a way that the SLAM algorithm achieves the desired performance metrics. Real-world and simulated experimental results demonstrate the effectiveness of the proposed design space and its applications.
Lukierski R, Leutenegger S, Davison AJ, 2017, Room layout estimation from rapid omnidirectional exploration, IEEE International Conference on Robotics and Automation (ICRA), 2017, Publisher: IEEE
A new generation of practical, low-cost indoor robots is now using wide-angle cameras to aid navigation, but usually this is limited to position estimation via sparse feature-based SLAM. Such robots usually have little global sense of the dimensions, demarcation or identities of the rooms they are in, information which would be very useful to enable behaviour with much more high level intelligence. In this paper we show that we can augment an omni-directional SLAM pipeline with straightforward dense stereo estimation and simple and robust room model fitting to obtain rapid and reliable estimation of the global shape of typical rooms from short robot motions. We have tested our method extensively in real homes, offices and on synthetic data. We also give examples of how our method can extend to making composite maps of larger rooms, and detecting room transitions.
Platinsky L, Davison AJ, Leutenegger S, 2017, Monocular visual odometry: sparse joint optimisation or dense alternation?, IEEE International Conference on Robotics and Automation (ICRA), 2017, Publisher: IEEE, Pages: 5126-5133
Real-time monocular SLAM is increasingly mature and entering commercial products. However, there is a divide between two techniques providing similar performance. Despite the rise of 'dense' and 'semi-dense' methods which use large proportions of the pixels in a video stream to estimate motion and structure via alternating estimation, they have not eradicated feature-based methods which use a significantly smaller amount of image information from keypoints and retain a more rigorous joint estimation framework. Dense methods provide more complete scene information, but in this paper we focus on how the amount of information and different optimisation methods affect the accuracy of local motion estimation (monocular visual odometry). This topic becomes particularly relevant after the recent results from a direct sparse system. We propose a new method for fairly comparing the accuracy of SLAM frontends in a common setting. We suggest computational cost models for an overall comparison which indicates that there is relative parity between the approaches at the settings allowed by current serial processors when evaluated under equal conditions.
McCormac J, Handa A, Davison A, et al., 2017, SemanticFusion: Dense 3D semantic mapping with convolutional neural networks
Ever more robust, accurate and detailed mapping using visual sensing has proven to be an enabling factor for mobile robots across a wide variety of applications. For the next level of robot intelligence and intuitive user interaction, maps need to extend beyond geometry and appearance - they need to contain semantics. We address this challenge by combining Convolutional Neural Networks (CNNs) and a state-of-the-art dense Simultaneous Localization and Mapping (SLAM) system, ElasticFusion, which provides long-term dense correspondences between frames of indoor RGB-D video even during loopy scanning trajectories. These correspondences allow the CNN's semantic predictions from multiple viewpoints to be probabilistically fused into a map. This not only produces a useful semantic 3D map, but we also show on the NYUv2 dataset that fusing multiple predictions leads to an improvement even in the 2D semantic labelling over baseline single frame predictions. We also show that for a smaller reconstruction dataset with larger variation in prediction viewpoint, the improvement over single frame segmentation increases. Our system is efficient enough to allow real-time interactive use at frame-rates of ≈25Hz.
Canelhas DR, Schaffernicht E, Stoyanov T, et al., 2017, Compressed voxel-based mapping using unsupervised learning, Robotics, Vol: 6, ISSN: 2218-6581
In order to deal with the scaling problem of volumetric map representations, we propose spatially local methods for high-ratio compression of 3D maps, represented as truncated signed distance fields. We show that these compressed maps can be used as meaningful descriptors for selective decompression in scenarios relevant to robotic applications. As compression methods, we compare using PCA-derived low-dimensional bases to nonlinear auto-encoder networks. Selecting two application-oriented performance metrics, we evaluate the impact of different compression rates on reconstruction fidelity as well as to the task of map-aided ego-motion estimation. It is demonstrated that lossily reconstructed distance fields used as cost functions for ego-motion estimation can outperform the original maps in challenging scenarios from standard RGB-D (color plus depth) data sets due to the rejection of high-frequency noise content.
Tsiotsios A, Davison A, Kim T, Near-lighting Photometric Stereo for unknown scene distance and medium attenuation, Image and Vision Computing, ISSN: 0262-8856
Nardi L, Bodin B, Saeedi S, et al., Algorithmic performance-accuracy trade-off in 3D vision applications using hypermapper, IPDPS, Publisher: IEEE
In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. The goal of this exploration is to reduce execution time while meeting our quality of result objectives. In previous work, we showed for the first time that it is possible to map this application to power constrained embedded systems, highlighting that decision choices made at the algorithmic design-level have the most significant impact. As the algorithmic design space is too large to be exhaustively evaluated, we use a previously introduced multi-objective random forest active learning prediction framework dubbed HyperMapper, to find good algorithmic designs. We show that HyperMapper generalizes on a recent cutting edge 3D scene understanding algorithm and on a modern GPU-based computer architecture. HyperMapper is able to beat an expert human hand-tuning the algorithmic parameters of the class of computer vision applications taken under consideration in this paper automatically. In addition, we use crowd-sourcing using a 3D scene understanding Android app to show that the Pareto front obtained on an embedded system can be used to accelerate the same application on all the 83 smart-phones and tablets with speedups ranging from 2x to over 12x.
Zienkiewicz J, Tsiotsios C, Davison AJ, et al., 2016, Monocular, Real-Time Surface Reconstruction using Dynamic Level of Detail, International Conference on 3DVision, Publisher: IEEE
We present a scalable, real-time capable method for robust surface reconstruction that explicitly handles multiple scales. As a monocular camera browses a scene, our algorithm processes images as they arrive and incrementally builds a detailed surface model. While most of the existing reconstruction approaches rely on volumetric or point-cloud representations of the environment, we perform depth-map and colour fusion directly into a multi-resolution triangular mesh that can be adaptively tessellated using the concept of Dynamic Level of Detail. Our method relies on least-squares optimisation, which enables a probabilistically sound and principled formulation of the fusion algorithm. We demonstrate that our method is capable of obtaining high quality, close-up reconstruction, as well as capturing overall scene geometry, while being memory and computationally efficient.