DiffRF: Rendering-guided 3D Radiance Field Diffusion

CVPR 2023 Highlight

1Technical University of Munich, 2Meta Reality Labs
(Work was done during Norman’s and Yawar’s internships at Meta Reality Labs Zurich as well as at TUM.)
(Version update: the initial version contained a data mapping error that caused all methods to be trained and evaluated on a subset of the data.)

DiffRF is a denoising diffusion probabilistic model directly operating on 3D radiance fields and trained with an additional volumetric rendering loss. This enables learning strong radiance priors with high rendering quality and accurate geometry.

Abstract

We introduce DiffRF, a novel approach for 3D radiance field synthesis based on denoising diffusion probabilistic models. While existing diffusion-based methods operate on images, latent codes, or point cloud data, we are the first to directly generate volumetric radiance fields. To this end, we propose a 3D denoising model which directly operates on an explicit voxel grid representation. However, as radiance fields generated from a set of posed images can be ambiguous and contain artifacts, obtaining ground truth radiance field samples is non-trivial. We address this challenge by pairing the denoising formulation with a rendering loss, enabling our model to learn a deviated prior that favours good image quality instead of trying to replicate fitting errors like floating artifacts. In contrast to 2D diffusion models, our model learns multi-view consistent priors, enabling free-view synthesis and accurate shape generation. Compared to 3D GANs, our diffusion-based approach naturally enables conditional generation like masked completion or single-view 3D synthesis at inference time.
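To make the training objective concrete, the following minimal PyTorch-style sketch shows one training step that pairs the standard noise-prediction loss with a volumetric rendering loss on the predicted clean field. The names `denoiser` and `render_fn`, the loss weight, and the tensor shapes are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a DiffRF-style training step on explicit voxel radiance
# fields: a standard DDPM noise-prediction loss plus a volumetric rendering
# loss on the predicted clean field. `denoiser` and `render_fn` are assumed
# callables standing in for the actual model and renderer.
import torch
import torch.nn.functional as F


def training_step(denoiser, render_fn, voxel_field, images, cameras,
                  alphas_cumprod, w_render=0.1):
    """voxel_field: (B, C, D, H, W) radiance field fitted from posed images."""
    B = voxel_field.shape[0]
    T = alphas_cumprod.shape[0]

    # Sample a diffusion timestep and corrupt the field with Gaussian noise.
    t = torch.randint(0, T, (B,), device=voxel_field.device)
    noise = torch.randn_like(voxel_field)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1, 1)
    noisy_field = a_bar.sqrt() * voxel_field + (1 - a_bar).sqrt() * noise

    # Predict the noise and recover an estimate of the clean field x0.
    pred_noise = denoiser(noisy_field, t)
    x0_pred = (noisy_field - (1 - a_bar).sqrt() * pred_noise) / a_bar.sqrt()

    # Denoising loss plus a rendering loss on the x0 estimate, which favours
    # good image quality over replicating fitting artifacts in the targets.
    loss_ddpm = F.mse_loss(pred_noise, noise)
    loss_render = F.mse_loss(render_fn(x0_pred, cameras), images)
    return loss_ddpm + w_render * loss_render
```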

Video

Radiance Field Synthesis

Our 3D denoising diffusion probabilistic model learns to synthesize diverse radiance fields that enable high-quality rendering with accurate geometry.

Unconditional synthesis results on PhotoShape Chairs

Unconditional synthesis results on ABO Tables
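The unconditional samples above are produced by ancestral DDPM sampling over the voxel grid. Below is a minimal sketch of such a sampling loop, assuming an ε-prediction denoiser and a linear noise schedule; the grid resolution, feature channels, and schedule values are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of unconditional ancestral DDPM sampling over a voxel radiance field
# (epsilon-prediction parameterization, linear beta schedule). Grid shape and
# schedule values are illustrative, not the paper's exact settings.
import torch


@torch.no_grad()
def sample_field(denoiser, shape=(1, 4, 32, 32, 32), T=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)  # start from pure noise
    for t in reversed(range(T)):
        a_t, a_bar = alphas[t], alphas_cumprod[t]
        eps = denoiser(x, torch.full((shape[0],), t, device=device))
        # Posterior mean of x_{t-1} given the predicted noise.
        mean = (x - (1 - a_t) / (1 - a_bar).sqrt() * eps) / a_t.sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x  # denoised voxel grid (density + color features), ready to render
```

Given a trained denoiser, `sample_field(denoiser)` yields a voxel radiance field that can then be rendered from arbitrary viewpoints.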

3D Masked Completion

DiffRF naturally enables 3D masked completion: Given a 3D mask (of arbitrary shape), the goal is to synthesize a completion of the masked region that harmonizes with the non-masked area. We observe that our model produces completions that are diverse yet consistent with the unmasked region.
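One common way to realize such masked completion with a diffusion model is to re-inject a correspondingly noised copy of the known field into the unmasked region at every reverse step, so that only the masked voxels are actually synthesized. The sketch below follows this image-inpainting-style recipe under those assumptions; the exact conditioning scheme used by DiffRF may differ in detail.

```python
# Sketch of 3D masked completion via constrained sampling: at every reverse
# step the unmasked voxels are overwritten with a correspondingly noised copy
# of the known field, so only the masked region is synthesized.
import torch


@torch.no_grad()
def masked_completion(denoiser, known_field, mask, T=1000):
    """mask: 1 where voxels should be synthesized, 0 where they are kept."""
    device = known_field.device
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(known_field)
    for t in reversed(range(T)):
        a_t, a_bar = alphas[t], alphas_cumprod[t]
        eps = denoiser(x, torch.full((x.shape[0],), t, device=device))
        mean = (x - (1 - a_t) / (1 - a_bar).sqrt() * eps) / a_t.sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean

        # Re-inject the known (unmasked) content at the matching noise level.
        if t > 0:
            a_bar_prev = alphas_cumprod[t - 1]
            known_noisy = (a_bar_prev.sqrt() * known_field
                           + (1 - a_bar_prev).sqrt() * torch.randn_like(known_field))
        else:
            known_noisy = known_field
        x = mask * x + (1 - mask) * known_noisy
    return x
```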

Image-to-Volume Synthesis

Given a foreground-segmented image posed relative to the 3D bounding box, we can guide the sampling process by simultaneously minimizing the photometric rendering error. This leads to plausible radiance field proposals.
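A simple way to implement such guidance, sketched below, is to render the current clean-field estimate from the given camera at every reverse step and shift the sample against the gradient of the photometric error. The renderer interface (`render_fn`) and the guidance weight are illustrative assumptions.

```python
# Sketch of one rendering-guided reverse step for image-to-volume synthesis:
# the current clean-field estimate is rendered from the given camera and the
# sample is shifted against the gradient of the photometric error.
# `render_fn` and `guidance_scale` are illustrative assumptions.
import torch


def guided_step(denoiser, render_fn, x, t, alphas, alphas_cumprod, betas,
                target_image, camera, guidance_scale=1.0):
    a_t, a_bar = alphas[t], alphas_cumprod[t]
    t_batch = torch.full((x.shape[0],), t, device=x.device)

    with torch.enable_grad():
        x_in = x.detach().requires_grad_(True)
        eps = denoiser(x_in, t_batch)
        x0_pred = (x_in - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        photo_err = ((render_fn(x0_pred, camera) - target_image) ** 2).mean()
        grad = torch.autograd.grad(photo_err, x_in)[0]

    # Standard ancestral step, shifted against the photometric gradient.
    mean = (x - (1 - a_t) / (1 - a_bar).sqrt() * eps.detach()) / a_t.sqrt()
    mean = mean - guidance_scale * grad
    if t > 0:
        return mean + betas[t].sqrt() * torch.randn_like(x)
    return mean
```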

Asset Generation for Scenes

We see future applications of our radiance field diffusion method in the generation of scene assets where the accurately synthesized geometry can enable physics-based interaction.

Related Links

For some more 3D diffusion-based work, please also check out

DreamFusion: Text-to-3D using 2D Diffusion performs text-guided NeRF generation via 2D diffusion. It proposes Score Distillation Sampling to optimize samples through a diffusion prior, an idea that could potentially also be applied to modalities other than text.

LION: Latent Point Diffusion Models for 3D Shape Generation introduces a hierarchical approach to learning high-quality point cloud synthesis that can be augmented with modern surface reconstruction techniques to generate smooth 3D meshes.

Video Diffusion Models treats the third dimension as time and proposes a natural extension of image architectures to the task of video diffusion. It introduces a novel conditioning technique for long and high-resolution videos and achieves state-of-the-art results on unconditional video generation.

BibTeX

@inproceedings{muller2023diffrf,
  title={Diffrf: Rendering-guided 3d radiance field diffusion},
  author={M{\"u}ller, Norman and Siddiqui, Yawar and Porzi, Lorenzo and Bulo, Samuel Rota and Kontschieder, Peter and Nie{\ss}ner, Matthias},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={4328--4338},
  year={2023}
}