AutoRF: Learning 3D Object Radiance Fields from Single View Observations

Abstract

We introduce AutoRF – a new approach for learning neural 3D object representations where each object in the training set is observed by only a single view.

This setting is in stark contrast to the majority of existing works that leverage multiple views of the same object, employ explicit priors during training, or require pixel-perfect annotations. To address this challenging setting, we propose to learn a normalized, object-centric representation whose embedding describes and disentangles shape, appearance, and pose.
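One way to read "normalized, object-centric" is that each object lives in its own canonical frame, so pose is factored out before shape and appearance are decoded. The sketch below assumes pose is given as a 3D bounding box (center, yaw about the vertical y-axis, and extents). The function name `world_to_object` and the axis convention are hypothetical and are not taken from the AutoRF code.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def world_to_object(point: Sequence[float], center: Sequence[float],
                    yaw: float, size: Sequence[float]) -> Vec3:
    """Map a world-space point into a normalized object frame.

    Hypothetical convention: the object is a 3D box with center (x, y, z),
    a yaw angle about the vertical y-axis and extents (w, h, l).
    Points inside the box land in [-1, 1]^3.
    """
    # Translate so the box center sits at the origin.
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    dz = point[2] - center[2]
    # Undo the box rotation: apply R_y(-yaw) in the x-z plane.
    c, s = math.cos(-yaw), math.sin(-yaw)
    x = c * dx + s * dz
    z = -s * dx + c * dz
    # Divide by half the extents so the box maps onto [-1, 1]^3.
    return (x / (size[0] / 2), dy / (size[1] / 2), z / (size[2] / 2))
```

Because shape and appearance are decoded in this canonical frame, changing only `center` or `yaw` moves or turns the object without touching its codes.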

Each encoding provides well-generalizable, compact information about the object of interest, which is decoded in a single shot into a new target view, thus enabling novel view synthesis.
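Decoding a code into a target view relies on volume rendering the radiance field along each camera ray. The sketch below uses the standard NeRF quadrature, where the pixel color is the sum over samples of T_i (1 - exp(-sigma_i delta_i)) c_i and T_i is the transmittance accumulated before sample i. `toy_field` is a hypothetical stand-in for the learned decoder (a sphere whose radius is read from the shape code and whose color is the appearance code); all names and sampling parameters are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple

RGB = Tuple[float, float, float]


def toy_field(p: Sequence[float], shape_code: Sequence[float],
              appearance_code: Sequence[float]) -> Tuple[float, RGB]:
    """Hypothetical decoder: opaque sphere, radius from the shape code,
    constant color from the appearance code."""
    radius = shape_code[0]
    dist = math.sqrt(sum(c * c for c in p))
    density = 50.0 if dist < radius else 0.0
    return density, (appearance_code[0], appearance_code[1], appearance_code[2])


def render_ray(origin, direction, field, shape_code, appearance_code,
               near=0.0, far=4.0, n_samples=64):
    """Composite color along one ray with the NeRF quadrature rule."""
    delta = (far - near) / n_samples
    transmittance = 1.0          # fraction of light not yet absorbed
    color = [0.0, 0.0, 0.0]
    for i in range(n_samples):
        t = near + (i + 0.5) * delta  # midpoint of the i-th interval
        p = [o + t * d for o, d in zip(origin, direction)]
        sigma, rgb = field(p, shape_code, appearance_code)
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        weight = transmittance * alpha
        for k in range(3):
            color[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    # Return the pixel color and the accumulated opacity (1 - final transmittance).
    return tuple(color), 1.0 - transmittance
```

A ray through the sphere returns roughly the appearance color with opacity near 1, while a ray that misses it returns black with zero opacity.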

We further improve the reconstruction quality by optimizing shape and appearance codes at test time by fitting the representation tightly to the input image.
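A minimal sketch of test-time fitting, assuming a frozen decoder and a mean-squared photometric loss: only the latent code is updated by gradient descent, never the decoder weights. The decoder here is a hypothetical linear map so the gradient can be written by hand; the actual model decodes through a neural radiance field, and the names `decode`, `fit_code`, `lr`, and `steps` are illustrative.

```python
from typing import List, Sequence


def decode(weights: Sequence[Sequence[float]], code: Sequence[float]) -> List[float]:
    """Hypothetical frozen decoder: each pixel is a linear function of the code."""
    return [sum(w * z for w, z in zip(row, code)) for row in weights]


def photometric_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between rendered and observed pixels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def fit_code(weights, target, code, lr=0.1, steps=500):
    """Gradient descent on the latent code only; decoder weights stay fixed."""
    code = list(code)
    n = len(target)
    for _ in range(steps):
        residual = [p - t for p, t in zip(decode(weights, code), target)]
        # d(loss)/d(code_k) = (2 / n) * sum_j residual_j * W[j][k]
        grad = [2.0 / n * sum(r * row[k] for r, row in zip(residual, weights))
                for k in range(len(code))]
        code = [z - lr * g for z, g in zip(code, grad)]
    return code
```

Starting from the code an encoder would predict, a few optimization steps pull the rendering closer to the observed image than the single-shot prediction alone.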

In a series of experiments, we show that our method generalizes well to unseen objects, even across different datasets of challenging real-world street scenes such as nuScenes, KITTI, and Mapillary Metropolis.


Scene editing

AutoRF naturally disentangles object shape, appearance, and pose. This makes it possible to control each property individually while freely moving the camera, leading to the first-ever implicitly reverse-parked car.
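Because the three factors are separate, an edit can replace one of them and leave the others untouched. The sketch below assumes a hypothetical `ObjectLatent` container holding a shape code, an appearance code, and a box pose (center and yaw); reverse parking is modeled as adding pi to the yaw. None of these names come from the AutoRF code.

```python
import math
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ObjectLatent:
    """Hypothetical container for one object's disentangled factors."""
    shape_code: Tuple[float, ...]
    appearance_code: Tuple[float, ...]
    center: Tuple[float, float, float]
    yaw: float  # rotation about the vertical axis, in radians


def swap_appearance(target: ObjectLatent, source: ObjectLatent) -> ObjectLatent:
    """Give `target` the appearance of `source`; shape and pose are unchanged."""
    return replace(target, appearance_code=source.appearance_code)


def reverse_park(obj: ObjectLatent) -> ObjectLatent:
    """Turn the object around in place: add pi to yaw, wrapped to [-pi, pi)."""
    flipped = obj.yaw + math.pi
    wrapped = (flipped + math.pi) % (2 * math.pi) - math.pi
    return replace(obj, yaw=wrapped)
```

Each edit returns a new latent, so the edited object can be re-rendered from any camera while the original stays available.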

Unseen datasets

Even on unseen datasets with very different camera properties, lighting conditions, and scene compositions, AutoRF can synthesize reasonable scene representations.

KITTI

Mapillary Metropolis

BibTeX

@inproceedings{mueller2022autorf,
  author    = {M{\"{u}}ller, Norman and Simonelli, Andrea and Porzi, Lorenzo and Bul{\`{o}}, Samuel Rota and Nie{\ss}ner, Matthias and Kontschieder, Peter},
  title     = {AutoRF: Learning 3D Object Radiance Fields from Single View Observations},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022}}