Imperial College London

Dr Ronald Clark

Faculty of Engineering, Department of Computing

Imperial College Research Fellow

Contact

ronald.clark

Location

Huxley Building, South Kensington Campus

Publications


Lin S, Clark R, 2020, LaDDer: latent data distribution modelling with a generative prior, British Machine Vision Conference (BMVC), Publisher: British Machine Vision Association

In this paper, we show that the performance of a learnt generative model is closely related to the model's ability to accurately represent the inferred latent data distribution, i.e. its topology and structural properties. We propose LaDDer to achieve accurate modelling of the latent data distribution in a variational autoencoder framework and to facilitate better representation learning. The central idea of LaDDer is a meta-embedding concept, which uses multiple VAE models to learn an embedding of the embeddings, forming a ladder of encodings. We use a non-parametric mixture as the hyper-prior for the innermost VAE and learn all the parameters in a unified variational framework. Through extensive experiments, we show that our LaDDer model is able to accurately estimate complex latent distributions and improves representation quality. We also propose a novel latent space interpolation method that utilises the derived data distribution. The code and demos are available at https://github.com/lin-shuyu/ladder-latent-data-distribution-modelling.

Conference paper
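
To make the meta-embedding idea concrete, here is a minimal sketch, assuming a two-level ladder of plain Gaussian VAEs with arbitrary layer sizes; it is not the authors' implementation (their non-parametric mixture hyper-prior and unified joint training are omitted, and the `detach` below splits the training that the paper unifies):

```python
# Illustrative two-level "ladder" of VAEs: the inner VAE learns an
# embedding of the outer VAE's latent codes. All sizes are arbitrary.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim, z_dim, h_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterise
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        mse = (recon - x).pow(2).sum(-1).mean()
        return z, mse + kl

outer, inner = VAE(x_dim=784, z_dim=32), VAE(x_dim=32, z_dim=8)
x = torch.randn(16, 784)            # stand-in data batch
z1, loss1 = outer(x)                # embed the data
_, loss2 = inner(z1.detach())       # embed the embeddings (sketch-only split)
loss = loss1 + loss2
```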

Carvalho EDC, Clark R, Nicastro A, Kelly PHJ et al., 2020, Scalable uncertainty for computer vision with functional variational inference, CVPR 2020, Publisher: IEEE, Pages: 12003-12013

As deep learning continues to yield successful applications in computer vision, the ability to quantify all forms of uncertainty is a paramount requirement for its safe and reliable deployment in the real world. In this work, we leverage the formulation of variational inference in function space, where we associate Gaussian processes (GPs) with both the Bayesian CNN prior and the variational family. Since GPs are fully determined by their mean and covariance functions, we are able to obtain predictive uncertainty estimates at the cost of a single forward pass through any chosen CNN architecture and for any supervised learning task. By leveraging the structure of the induced covariance matrices, we propose numerically efficient algorithms which enable fast training in the context of high-dimensional tasks such as depth estimation and semantic segmentation. Additionally, we provide sufficient conditions for constructing regression loss functions whose probabilistic counterparts are compatible with aleatoric uncertainty quantification.

Conference paper
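
As a hedged illustration of obtaining predictive uncertainty from a single forward pass, the sketch below attaches a head that emits a mean and a low-rank covariance factor to an arbitrary CNN; the head design, `MeanCovHead`, and all dimensions are invented stand-ins for the paper's GP-based construction:

```python
# Minimal sketch of single-forward-pass predictive uncertainty: one head
# emits a mean and a low-rank covariance factor, so the predictive variance
# comes from the same pass as the prediction itself.
import torch
import torch.nn as nn

class MeanCovHead(nn.Module):
    def __init__(self, feat_dim, out_dim, rank=4):
        super().__init__()
        self.mean = nn.Linear(feat_dim, out_dim)
        self.factor = nn.Linear(feat_dim, out_dim * rank)  # L in Sigma ~ L L^T
        self.out_dim, self.rank = out_dim, rank

    def forward(self, feats):
        mu = self.mean(feats)
        L = self.factor(feats).view(-1, self.out_dim, self.rank)
        var = (L ** 2).sum(-1)          # diagonal of L L^T
        return mu, var

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = MeanCovHead(feat_dim=16, out_dim=1)
mu, var = head(backbone(torch.randn(8, 3, 64, 64)))  # one pass: mean + variance
```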

Bloesch M, Laidlow T, Clark R, Leutenegger S, Davison A et al., 2020, Learning meshes for dense visual SLAM, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Publisher: IEEE

Estimating the motion and surrounding geometry of a moving camera remains a challenging inference problem. From an information-theoretic point of view, estimates should improve as more information is included, as is done in dense SLAM, but this is strongly dependent on the validity of the underlying models. In the present paper, we use triangular meshes as a compact yet dense geometry representation. To allow for simple and fast usage, we propose a view-based formulation in which we predict the in-plane vertex coordinates directly from images and then employ the remaining vertex depth components as free variables. Flexible and continuous integration of information is achieved through the use of a residual-based inference technique. This so-called factor graph encodes all information as mappings from free variables to residuals, the squared sum of which is minimised during inference. We propose the use of different types of learnable residuals, which are trained end-to-end to increase their suitability as information-bearing models and to enable accurate and reliable estimation. Detailed evaluation of all components is provided on both synthetic and real data, which confirms the practicability of the presented approach.

Conference paper
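
The residual-based inference described above can be pictured with a toy Gauss-Newton loop: everything is expressed as a mapping from free variables (vertex depths here) to residuals whose squared sum is minimised. The residual terms and the numeric Jacobian below are illustrative placeholders, not the paper's learned residuals:

```python
# Toy residual-based inference in the factor-graph spirit: free variables
# (vertex depths) map to residuals, and Gauss-Newton minimises their
# squared sum. Residual definitions are illustrative only.
import numpy as np

def residuals(depths, measured):
    smooth = np.diff(depths)          # smoothness factors between vertices
    data = depths - measured          # data factors from (fake) measurements
    return np.concatenate([data, 0.5 * smooth])

def gauss_newton(depths, measured, iters=10, eps=1e-6):
    for _ in range(iters):
        r = residuals(depths, measured)
        # numeric Jacobian of the residuals w.r.t. the free depth variables
        J = np.stack([(residuals(depths + eps * e, measured) - r) / eps
                      for e in np.eye(len(depths))], axis=1)
        depths = depths - np.linalg.solve(J.T @ J + 1e-8 * np.eye(len(depths)),
                                          J.T @ r)
    return depths

measured = np.array([1.0, 1.2, 3.0, 1.1])   # one noisy outlier-ish depth
print(gauss_newton(measured.copy(), measured))
```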

Yang B, Wang J, Clark R, Wang S, Markham A, Hu Q, Trigoni N et al., 2019, Learning object bounding boxes for 3D instance segmentation on point clouds, 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Publisher: Neural Information Processing Systems Foundation, Inc.

We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. Our method, called 3D-BoNet, follows the simple design philosophy of per-point multilayer perceptrons (MLPs). The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance. It consists of a backbone network followed by two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Moreover, it is remarkably computationally efficient as, unlike existing approaches, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. Extensive experiments show that our approach surpasses existing work on both the ScanNet and S3DIS datasets while being approximately 10× more computationally efficient. Comprehensive ablation studies demonstrate the effectiveness of our design.

Conference paper
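
A schematic of the described layout, with invented dimensions and no losses or box-association step: a shared per-point MLP backbone, a max-pooled global feature, and two parallel branches for bounding-box regression and per-point mask prediction:

```python
# Schematic 3D-BoNet-style layout: shared per-point MLP backbone feeding
# a box-regression branch (from a global feature) and a per-point mask
# branch. All dimensions are made up for illustration.
import torch
import torch.nn as nn

class BoNetSketch(nn.Module):
    def __init__(self, n_instances=8, feat=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(3, feat), nn.ReLU(),
                                      nn.Linear(feat, feat), nn.ReLU())
        self.box_branch = nn.Linear(feat, n_instances * 6)   # (min_xyz, max_xyz)
        self.mask_branch = nn.Linear(feat, n_instances)      # per-point scores

    def forward(self, pts):                      # pts: (B, N, 3)
        f = self.backbone(pts)                   # per-point features (B, N, F)
        g = f.max(dim=1).values                  # global feature via max-pool
        boxes = self.box_branch(g).view(-1, 8, 6)       # one box per instance
        masks = torch.sigmoid(self.mask_branch(f))      # (B, N, n_instances)
        return boxes, masks

boxes, masks = BoNetSketch()(torch.randn(2, 1024, 3))
```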

Wen H, Clark R, Wang S, Lu X, Du B, Hu W, Trigoni N et al., 2019, Efficient indoor positioning with visual experiences via lifelong learning, IEEE Transactions on Mobile Computing, Vol: 18, Pages: 814-829, ISSN: 1536-1233

Positioning with visual sensors in indoor environments has many advantages: it doesn't require infrastructure or accurate maps, and is more robust and accurate than other modalities such as WiFi. However, one of the biggest hurdles that prevents its practical application on mobile devices is the time-consuming visual processing pipeline. To overcome this problem, this paper proposes a novel lifelong learning approach to enable efficient and real-time visual positioning. We exploit the fact that when following a previous visual experience multiple times, one can gradually discover clues on how to traverse it with much less effort, e.g. which parts of the scene are more informative, and what kind of visual elements to expect. Such second-order information is recorded as parameters, which provide key insights into the context and empower our system to dynamically optimise itself to stay localised with minimum cost. We implement the proposed approach on an array of mobile and wearable devices, and evaluate its performance in two indoor settings. Experimental results show our approach can reduce the visual processing time by up to two orders of magnitude, while achieving sub-metre positioning accuracy.

Journal article

Nicastro A, Clark R, Leutenegger S, 2019, X-Section: cross-section prediction for enhanced RGBD fusion

Detailed 3D reconstruction is an important challenge with applications to robotics and augmented and virtual reality, and it has seen impressive progress throughout the past years. Advancements were driven by the availability of depth (RGB-D) cameras, as well as increased compute power, e.g. in the form of GPUs, but also thanks to the inclusion of machine learning in the process. Here, we propose X-Section, an RGB-D 3D reconstruction approach that leverages deep learning to make object-level predictions about thicknesses that can be readily integrated into a volumetric multi-view fusion process, where we propose an extension to the popular KinectFusion approach. In essence, our method allows us to complete shape in general indoor scenes behind what is sensed by the RGB-D camera, which may be crucial, e.g. for robotic manipulation tasks or efficient scene exploration. Predicting object thicknesses rather than volumes allows us to work with comparably high spatial resolution without exploding memory and training data requirements for the employed convolutional neural networks. In a series of qualitative and quantitative evaluations, we demonstrate how we accurately predict object thickness and reconstruct general 3D scenes containing multiple objects.

Working paper
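
One way to picture folding a predicted thickness into a KinectFusion-style volumetric update, sketched along a single ray; the sign convention, truncation value, and treatment of space behind the object are illustrative simplifications, not the paper's exact formulation:

```python
# Illustrative per-ray TSDF update with a predicted object thickness t:
# voxels between the measured depth d and d + t count as inside the object,
# while space in front of (and here, behind) the object stays free.
import numpy as np

def tsdf_along_ray(z, d, t, trunc=0.05):
    # signed distance to the object interval [d, d + t] along the ray:
    # positive in free space, negative inside the object
    dist = np.maximum(d - z, z - (d + t))
    return np.clip(dist, -trunc, trunc)

def fuse(tsdf, weight, new_tsdf, new_weight=1.0):
    # standard weighted running-average TSDF fusion
    fused = (weight * tsdf + new_weight * new_tsdf) / (weight + new_weight)
    return fused, weight + new_weight

z = np.linspace(0.0, 1.0, 11)        # voxel centres sampled along one camera ray
vol, w = np.zeros_like(z), np.zeros_like(z)
vol, w = fuse(vol, w, tsdf_along_ray(z, d=0.4, t=0.2))
print(vol.round(3))
```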

Bloesch M, Czarnowski J, Clark R, Leutenegger S, Davison AJ et al., 2018, CodeSLAM - Learning a compact, optimisable representation for dense visual SLAM, IEEE Computer Vision and Pattern Recognition 2018, Publisher: IEEE, Pages: 2560-2568

The representation of geometry in real-time 3D perception systems continues to be a critical research issue. Dense maps capture complete surface shape and can be augmented with semantic labels, but their high dimensionality makes them computationally costly to store and process, and unsuitable for rigorous probabilistic inference. Sparse feature-based representations avoid these problems, but capture only partial scene information and are mainly useful for localisation. We present a new compact but dense representation of scene geometry which is conditioned on the intensity data from a single image and generated from a code consisting of a small number of parameters. We are inspired by work both on learned depth from images and on auto-encoders. Our approach is suitable for use in a keyframe-based monocular dense SLAM system: while each keyframe with a code can produce a depth map, the code can be optimised efficiently jointly with pose variables, and together with the codes of overlapping keyframes, to attain global consistency. Conditioning the depth map on the image allows the code to represent only those aspects of the local geometry which cannot be predicted directly from the image. We explain how to learn our code representation, and demonstrate its advantageous properties in monocular SLAM.

Conference paper
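
A minimal sketch of the core idea, assuming a toy decoder and a stand-in reconstruction target: depth is decoded from a small code conditioned on the image, and the code, not the dense depth map, is the variable being optimised (in the full system it would be optimised jointly with poses):

```python
# Sketch of the code-as-free-variable idea: a compact code, conditioned on
# the image, decodes to a dense depth map; only the code is optimised.
# The decoder architecture and loss are stand-ins, not the paper's network.
import torch
import torch.nn as nn

class DepthDecoder(nn.Module):
    def __init__(self, code_dim=32):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.code_proj = nn.Linear(code_dim, 8)
        self.out = nn.Conv2d(8, 1, 3, padding=1)

    def forward(self, image, code):
        f = self.img_enc(image)                         # image-conditioned features
        f = f + self.code_proj(code)[:, :, None, None]  # inject the code
        return self.out(f)                              # dense depth map

decoder = DepthDecoder()
image = torch.randn(1, 1, 32, 32)
target = torch.rand(1, 1, 32, 32)              # stand-in reconstruction target
code = torch.zeros(1, 32, requires_grad=True)  # compact free variable
opt = torch.optim.Adam([code], lr=0.1)
for _ in range(20):                            # optimise the code, not the depth
    opt.zero_grad()
    loss = (decoder(image, code) - target).pow(2).mean()
    loss.backward()
    opt.step()
```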

McCormac J, Clark R, Bloesch M, Davison A, Leutenegger S et al., 2018, Fusion++: Volumetric object-level SLAM, International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), Publisher: IEEE, Pages: 32-41, ISSN: 2378-3826

We propose an online object-level SLAM system which builds a persistent and accurate 3D graph map of arbitrary reconstructed objects. As an RGB-D camera browses a cluttered indoor scene, Mask-RCNN instance segmentations are used to initialise compact per-object Truncated Signed Distance Function (TSDF) reconstructions with object size-dependent resolutions and a novel 3D foreground mask. Reconstructed objects are stored in an optimisable 6DoF pose graph which is our only persistent map representation. Objects are incrementally refined via depth fusion, and are used for tracking, relocalisation and loop closure detection. Loop closures cause adjustments in the relative pose estimates of object instances, but no intra-object warping. Each object also carries semantic information which is refined over time and an existence probability to account for spurious instance predictions. We demonstrate our approach on a hand-held RGB-D sequence from a cluttered office scene with a large number and variety of object instances, highlighting how the system closes loops and makes good use of existing objects on repeated loops. We quantitatively evaluate the trajectory error of our system against a baseline approach on the RGB-D SLAM benchmark, and qualitatively compare reconstruction quality of discovered objects on the YCB video dataset. Performance evaluation shows our approach is highly memory efficient and runs online at 4-8Hz (excluding relocalisation) despite not being optimised at the software level.

Conference paper
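
A schematic, with invented structures, of the object-level map the abstract describes: each instance owns a small TSDF volume, a 6DoF pose, a semantic label distribution, and an existence probability; the recursive existence update is a simple stand-in for the paper's filtering:

```python
# Schematic object-level map: one entry per reconstructed object, each with
# its own TSDF volume, pose, semantics and existence probability.
# All structures and the update rule are illustrative placeholders.
import numpy as np

class ObjectInstance:
    def __init__(self, pose, resolution):
        self.pose = pose                          # 4x4 world-from-object transform
        self.tsdf = np.ones((resolution,) * 3)    # size-dependent voxel grid
        self.weights = np.zeros_like(self.tsdf)
        self.label_probs = {}                     # refined semantic distribution
        self.existence = 0.5                      # prob. the detection is real

    def update_existence(self, detected, alpha=0.1):
        # simple recursive filter toward 1 on re-detection, toward 0 on misses
        self.existence += alpha * ((1.0 if detected else 0.0) - self.existence)

objects = {}                                      # id -> ObjectInstance: the map
objects[0] = ObjectInstance(np.eye(4), resolution=32)
objects[0].update_existence(detected=True)
```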

Clark R, Bloesch M, Czarnowski J, Leutenegger S, Davison AJ et al., 2018, Learning to solve nonlinear least squares for monocular stereo, 15th European Conference on Computer Vision, Publisher: Springer, Pages: 291-306, ISSN: 0302-9743

Sum-of-squares objective functions are very popular in computer vision algorithms. However, these objective functions are not always easy to optimize. The underlying assumptions made by solvers are often not satisfied, and many problems are inherently ill-posed. In this paper, we propose a neural nonlinear least squares optimization algorithm which learns to effectively optimize these cost functions even in the presence of such adversities. Unlike traditional approaches, the proposed solver requires no hand-crafted regularizers or priors, as these are implicitly learned from the data. We apply our method to the problem of motion stereo, i.e. jointly estimating the motion and scene geometry from pairs of images of a monocular sequence. We show that our learned optimizer is able to efficiently and effectively solve this challenging optimization problem.

Conference paper
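
The learned-optimizer idea can be sketched as a recurrent network that maps the current residual and gradient to a parameter update and is trained end-to-end through the unrolled iterations. The toy problem, the inputs fed to the recurrent cell, and the sizes below are illustrative choices, not the paper's architecture:

```python
# Sketch of a learned least-squares updater: an LSTM proposes the parameter
# update from the current residual and gradient, replacing a hand-tuned
# Gauss-Newton/Levenberg step. Problem and dimensions are toy choices.
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    def __init__(self, n_params, hidden=32):
        super().__init__()
        self.rnn = nn.LSTMCell(2 * n_params, hidden)
        self.step = nn.Linear(hidden, n_params)

    def forward(self, params, residual_fn, iters=5):
        h = c = torch.zeros(1, self.rnn.hidden_size)
        for _ in range(iters):
            r = residual_fn(params)
            loss = (r ** 2).sum()
            grad, = torch.autograd.grad(loss, params, create_graph=True)
            h, c = self.rnn(torch.cat([r, grad]).unsqueeze(0), (h, c))
            params = params + self.step(h).squeeze(0)   # learned update step
        return params

# toy problem: drive the parameters to a target via an identity residual
target = torch.tensor([1.0, -2.0])
residual_fn = lambda p: p - target
params = torch.zeros(2, requires_grad=True)
final = LearnedOptimizer(n_params=2)(params, residual_fn)
outer_loss = (residual_fn(final) ** 2).sum()   # train the optimizer end-to-end
outer_loss.backward()
```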

Li W, Saeedi Gharahbolagh S, McCormac J, Clark R, Tzoumanikas D, Ye Q, Tang R, Leutenegger S et al., 2018, InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset, British Machine Vision Conference (BMVC), Publisher: BMVC

Datasets have gained an enormous amount of popularity in the computer vision community, from training and evaluation of deep learning-based methods to benchmarking simultaneous localization and mapping (SLAM). Without a doubt, synthetic imagery bears a vast potential due to scalability in terms of the amounts of data obtainable without tedious manual ground truth annotations or measurements. Here, we present a dataset with the aim of providing a higher degree of photo-realism, larger scale and more variability, as well as serving a wider range of purposes compared to existing datasets. Our dataset leverages the availability of millions of professional interior designs and millions of production-level furniture and object assets, all coming with fine geometric details and high-resolution textures. We render high-resolution and high frame-rate video sequences following realistic trajectories while supporting various camera types as well as providing inertial measurements. Together with the release of the dataset, we will make the executable program of our interactive simulator software as well as our renderer available at https://interiornetdataset.github.io. To showcase the usability and uniqueness of our dataset, we show benchmarking results of both sparse and dense SLAM algorithms.

Conference paper

Wang S, Clark R, Wen H, Trigoni N et al., 2018, End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks, International Journal of Robotics Research, Vol: 37, Pages: 513-542, ISSN: 0278-3649

This paper studies visual odometry (VO) from the perspective of deep learning. After tremendous efforts in the robotics and computer vision communities over the past few decades, state-of-the-art VO algorithms have demonstrated incredible performance. However, since the VO problem is typically formulated as a pure geometric problem, one of the key features still missing from current VO systems is the capability to automatically gain knowledge and improve performance through learning. In this paper, we investigate whether deep neural networks can be effective and beneficial to the VO problem. An end-to-end, sequence-to-sequence probabilistic visual odometry (ESP-VO) framework is proposed for the monocular VO based on deep recurrent convolutional neural networks. It is trained and deployed in an end-to-end manner, that is, directly inferring poses and uncertainties from a sequence of raw images (video) without adopting any modules from the conventional VO pipeline. It can not only automatically learn effective feature representation encapsulating geometric information through convolutional neural networks, but also implicitly model sequential dynamics and relation for VO using deep recurrent neural networks. Uncertainty is also derived along with the VO estimation without introducing much extra computation. Extensive experiments on several datasets representing driving, flying and walking scenarios show competitive performance of the proposed ESP-VO to the state-of-the-art methods, demonstrating a promising potential of the deep learning technique for VO and verifying that it can be a viable complement to current VO systems.

Journal article
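
A minimal sketch of the sequence-to-sequence formulation with invented layer sizes: a shared convolutional encoder per stacked frame pair, a recurrent core over the sequence, and a head emitting both a 6-DoF pose and a log-variance, so uncertainty comes from the same forward pass; the Gaussian negative log-likelihood at the end shows how the two outputs train together:

```python
# Sketch of a sequence-to-sequence VO model: conv encoder per frame pair,
# LSTM over the sequence, and a head for pose plus per-pose uncertainty.
# Layer sizes and the stacked-pair input format are invented for the sketch.
import torch
import torch.nn as nn

class SeqVO(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(6, 16, 7, stride=2, padding=3),
                                 nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.LSTM(16, hidden, batch_first=True)
        self.pose = nn.Linear(hidden, 6)       # translation + rotation
        self.logvar = nn.Linear(hidden, 6)     # per-dimension uncertainty

    def forward(self, pairs):                  # (B, T, 6, H, W) stacked frame pairs
        B, T = pairs.shape[:2]
        f = self.cnn(pairs.flatten(0, 1)).view(B, T, -1)
        h, _ = self.rnn(f)
        return self.pose(h), self.logvar(h)

model = SeqVO()
pose, logvar = model(torch.randn(2, 5, 6, 64, 64))
gt = torch.randn(2, 5, 6)
# Gaussian NLL: the uncertainty head is trained by the same loss as the pose
nll = 0.5 * ((pose - gt) ** 2 / logvar.exp() + logvar).mean()
```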

Yang B, Wen H, Wang S, Clark R, Markham A, Trigoni N et al., 2017, 3D object reconstruction from a single depth view with adversarial learning, 16th IEEE International Conference on Computer Vision (ICCV), Publisher: IEEE, Pages: 679-688, ISSN: 2473-9936

In this paper, we propose a novel 3D-RecGAN approach, which reconstructs the complete 3D structure of a given object from a single arbitrary depth view using generative adversarial networks. Unlike the existing work which typically requires multiple views of the same object or class labels to recover the full 3D geometry, the proposed 3D-RecGAN only takes the voxel grid representation of a depth view of the object as input, and is able to generate the complete 3D occupancy grid by filling in the occluded/missing regions. The key idea is to combine the generative capabilities of autoencoders and the conditional Generative Adversarial Networks (GAN) framework, to infer accurate and fine-grained 3D structures of objects in high-dimensional voxel space. Extensive experiments on large synthetic datasets show that the proposed 3D-RecGAN significantly outperforms the state of the art in single view 3D object reconstruction, and is able to reconstruct unseen types of objects. Our code and data are available at: https://github.com/Yang7879/3D-RecGAN.

Conference paper
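
A schematic conditional-GAN setup in the spirit of the abstract, at a reduced 32^3 resolution with placeholder layers: an encoder-decoder generator completes a partial occupancy grid, and a discriminator judges completed grids, combining adversarial and reconstruction terms:

```python
# Schematic conditional GAN for voxel completion: the generator maps a
# partial occupancy grid to a full one; the discriminator scores completed
# grids. Resolutions and layer sizes are placeholders, not the paper's.
import torch
import torch.nn as nn

gen = nn.Sequential(
    nn.Conv3d(1, 8, 4, stride=2, padding=1), nn.ReLU(),              # 32 -> 16
    nn.Conv3d(8, 16, 4, stride=2, padding=1), nn.ReLU(),             # 16 -> 8
    nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),    # 8 -> 16
    nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1), nn.Sigmoid())  # 16 -> 32

disc = nn.Sequential(
    nn.Conv3d(1, 8, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv3d(8, 1, 4, stride=2, padding=1), nn.Flatten(),
    nn.Linear(8 * 8 * 8, 1))                   # real/fake logit

partial = torch.rand(2, 1, 32, 32, 32)         # voxelised single depth view
full_gt = torch.rand(2, 1, 32, 32, 32)         # complete occupancy ground truth
fake = gen(partial)
bce = nn.BCEWithLogitsLoss()
g_loss = bce(disc(fake), torch.ones(2, 1)) + \
         nn.functional.binary_cross_entropy(fake, full_gt)  # adversarial + recon
```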

Clark R, Wang S, Markham A, Trigoni N, Wen H et al., 2017, VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video-Clip Relocalization, IEEE International Conference on Computer Vision and Pattern Recognition

Conference paper

Clark R, Wang S, Wen H, Markham A, Trigoni N et al., 2016, VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem, Thirty-First AAAI Conference on Artificial Intelligence

Conference paper


