An apple in a basket.
Bacon cooking on a frying pan.
A blue cube on a red cube.
A bunny on a living room sofa.
A teacup on a saucer.
A hamburger on a plate on a stool.
A lamp on a night stand.
Frying pan on a stove.
A chair next to a table.
Roasted chicken on a plain plate on a wooden table.
Campfire next to a tent.
A TV on a table console.
With the advent of diffusion-based generative models and their ability to generate text-conditioned images, content generation has received a massive invigoration. Recently, these models have been shown to provide useful guidance for the generation of 3D graphics assets. However, existing work in text-conditioned 3D generation faces fundamental constraints: (i) the inability to generate detailed, multi-object scenes, (ii) the inability to textually control multi-object configurations, and (iii) the inability to ensure physically realistic scene composition. In this work, we propose CG3D, a method for compositionally generating scalable 3D assets that resolves these constraints. We find that explicit Gaussian radiance fields, parameterized to allow for compositions of objects, possess the capability to enable semantically and physically consistent scenes. By utilizing a guidance framework built around this explicit representation, we show state-of-the-art results, capable of even exceeding the guiding diffusion model in terms of object combinations and physics accuracy.
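To make the compositional parameterization concrete, below is a minimal sketch, not the paper's implementation, of how per-object Gaussian fields might be composed into a shared world frame through a learnable per-object rotation, translation, and scale. All names here (`ObjectGaussians`, `compose_scene`, the transform parameters) are hypothetical illustrations of the idea, under the assumption that each object keeps its own Gaussian set in a local frame.

```python
import torch
import torch.nn as nn

class ObjectGaussians(nn.Module):
    """Hypothetical container for one object's explicit Gaussian radiance field."""
    def __init__(self, num_gaussians: int):
        super().__init__()
        # Per-Gaussian parameters, expressed in the object's local frame.
        self.means = nn.Parameter(torch.randn(num_gaussians, 3) * 0.1)
        self.log_scales = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.quats = nn.Parameter(torch.tensor([[1.0, 0.0, 0.0, 0.0]]).repeat(num_gaussians, 1))
        self.colors = nn.Parameter(torch.rand(num_gaussians, 3))
        self.opacities = nn.Parameter(torch.zeros(num_gaussians, 1))
        # Per-object rigid transform + scale placing the object in the world frame.
        self.obj_rotation = nn.Parameter(torch.zeros(3))      # axis-angle
        self.obj_translation = nn.Parameter(torch.zeros(3))
        self.obj_scale = nn.Parameter(torch.ones(1))

def axis_angle_to_matrix(v: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix (differentiable)."""
    theta = v.norm().clamp(min=1e-8)
    k = v / theta
    zero = torch.zeros((), dtype=v.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def compose_scene(objects: list[ObjectGaussians]) -> torch.Tensor:
    """Map every object's Gaussian means into the shared world frame."""
    world_means = []
    for obj in objects:
        R = axis_angle_to_matrix(obj.obj_rotation)
        world_means.append(obj.obj_scale * obj.means @ R.T + obj.obj_translation)
    return torch.cat(world_means, dim=0)  # hand off to a splatting renderer
```

Because the object-to-world transforms are ordinary learnable parameters, guidance losses can optimize object placement jointly with appearance, which is what makes textually controlled multi-object configurations tractable.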
The explicit nature of the compositional radiance field enables scene editing.
We show a scene of "A table with a banana and plant in front of a couch that is next to a lamp on a stool",
which we first edit to remove the lamp on the stool,
then move the potted plant from the table to the stool,
and lastly, edit the couch to become "a red leather couch".
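Because each object owns its own Gaussian set and world-frame transform, the edits above reduce to simple operations on the scene's object list. The following sketch is a hypothetical illustration building on the `ObjectGaussians` container from the earlier sketch; the positions and names are assumptions, and the restyling loop is only indicated.

```python
import torch

# Build a scene as a name -> ObjectGaussians mapping (container from the sketch above).
names = ["table", "banana", "plant", "couch", "lamp", "stool"]
scene = {name: ObjectGaussians(num_gaussians=20_000) for name in names}

# Edit 1: remove the lamp by deleting its Gaussian set from the composition.
del scene["lamp"]

# Edit 2: move the potted plant onto the stool by updating only its per-object
# translation; the object's own Gaussians are left untouched.
stool_top = torch.tensor([0.8, 0.0, 0.45])  # assumed world-frame target point
with torch.no_grad():
    scene["plant"].obj_translation.copy_(stool_top)

# Edit 3: restyle the couch ("a red leather couch") by re-optimizing only that
# object's Gaussians against the new prompt while every other object stays
# frozen (the score-distillation guidance loop itself is omitted here).
for obj_name, obj in scene.items():
    obj.requires_grad_(obj_name == "couch")
```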
Gaussian radiance field generation for text-to-3D necessitates frequent duplication of Gaussians to enable expressivity.
However, this results in a large number of Gaussians, many of which are redundant.
We propose a distillation method to reduce the number of Gaussians while maintaining the quality of the generated scene.
This enables better scalability for compositions with many objects.
To the right, we show an example of a generated 3D pear that has been distilled (for visualization, we also include a rendering with the Gaussians scaled down to 1%).
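One plausible way to realize such a distillation, which may differ from the paper's exact procedure, is to fit a smaller "student" Gaussian set to renderings of the dense "teacher" from random viewpoints. In the sketch below, `render` and `sample_camera` are assumed hooks into a differentiable splatting renderer, and the opacity-based initialization is an illustrative heuristic, not a claim about the method.

```python
import torch

def distill(teacher, num_student_gaussians: int, steps: int = 2000, lr: float = 1e-2):
    """Fit a compact student Gaussian set to match the teacher's renderings.

    `teacher` is an ObjectGaussians instance (see the earlier sketch); `render`
    and `sample_camera` are assumed hooks into a differentiable splatting renderer.
    """
    student = ObjectGaussians(num_student_gaussians)
    # Heuristic initialization: seed the student from the teacher's most opaque Gaussians.
    with torch.no_grad():
        top = teacher.opacities.squeeze(-1).topk(num_student_gaussians).indices
        student.means.copy_(teacher.means[top])
        student.colors.copy_(teacher.colors[top])

    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        cam = sample_camera()                  # random viewpoint (assumed helper)
        with torch.no_grad():
            target = render(teacher, cam)      # the dense teacher's view is the target
        pred = render(student, cam)
        loss = (pred - target).abs().mean()    # simple L1 photometric loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```

Matching renderings rather than individual Gaussians avoids any correspondence problem between the two sets, so the student is free to cover the object with far fewer, better-placed primitives.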
@article{vilesov23cg3d,
  author  = {Vilesov, Alexander and Chari, Pradyumna and Kadambi, Achuta},
  title   = {CG3D: Compositional Generation for Text-to-3D via Gaussian Splatting},
  journal = {arXiv preprint},
  year    = {2023},
}