Imperial College London
Faculty of Engineering, Department of Computing
Senior Lecturer
+44 (0)20 7594 7123 | s.leutenegger
360, ACE Extension, South Kensington Campus


BibTeX format

@inproceedings{McCormac:2017:10.1109/ICCV.2017.292,
author = {McCormac and Handa, A and Leutenegger, S and Davison, AJ},
doi = {10.1109/ICCV.2017.292},
publisher = {IEEE},
title = {SceneNet RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation?},
year = {2017}
}

RIS format (EndNote, RefMan)

TY - CONF
AB - We introduce SceneNet RGB-D, a dataset providing pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation, and object detection. It also provides perfect camera poses and depth data, allowing investigation into geometric computer vision problems such as optical flow, camera pose estimation, and 3D scene labelling tasks. Random sampling permits virtually unlimited scene configurations, and here we provide 5M rendered RGB-D images from 16K randomly generated 3D trajectories in synthetic layouts, with random but physically simulated object configurations. We compare the semantic segmentation performance of network weights produced from pre-training on RGB images from our dataset against generic VGG-16 ImageNet weights. After fine-tuning on the SUN RGB-D and NYUv2 real-world datasets we find in both cases that the synthetically pre-trained network outperforms the VGG-16 weights. When synthetic pre-training includes a depth channel (something ImageNet cannot natively provide) the performance is greater still. This suggests that large-scale high-quality synthetic RGB datasets with task-specific labels can be more useful for pre-training than real-world generic pre-training such as ImageNet. We host the dataset at http://robotvault.
AU - McCormac
AU - Handa, A
AU - Leutenegger, S
AU - Davison, AJ
DO - 10.1109/ICCV.2017.292
PY - 2017///
SN - 2380-7504
TI - SceneNet RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation?
ER -