JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation


Abstract

Score Distillation Sampling (SDS) by well-trained 2D diffusion models has shown great promise in text-to-3D generation. However, this paradigm distills view-agnostic 2D image distributions into the rendering distribution of 3D representation for each view independently, overlooking the coherence across views and yielding 3D inconsistency in generations.
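
For reference, the per-view SDS update described above is usually written in the form introduced with DreamFusion. The symbols are the conventional ones and are not defined on this page: θ denotes the parameters of the 3D representation, x = g(θ, c) is the image rendered from camera c, ε_φ is the noise predictor of the 2D diffusion model conditioned on the prompt y, and w(t) is a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\,\epsilon,\,c}\!\left[
      w(t)\,\bigl(\epsilon_\phi(x_t;\,y,\,t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta, c),\quad x_t = \alpha_t x + \sigma_t \epsilon .
```

Each camera c contributes its own term and nothing couples the terms to one another, which is the per-view independence the abstract points to.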

In this work, we propose Joint Score Distillation (JSD), a new paradigm that ensures coherent 3D generations. Specifically, we model the joint rendering distribution, which introduces an energy function to capture the coherence among rendered views. We then derive the joint score distillation on multiple rendered views, as opposed to a single view in SDS. In addition, we propose an efficient yet effective binary classification model as an energy function, along with other universal view-aware models, demonstrating compatibility with JSD.
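
The idea of adding an energy term to per-view score distillation can be illustrated with a small NumPy sketch. This is not the paper's implementation: `toy_energy` is a hypothetical stand-in that simply penalizes deviation from the mean view (a real energy must allow views from different cameras to differ), the sign convention (residual plus weighted energy gradient) is an assumption, and all function names and the `energy_weight` parameter are made up for illustration.

```python
import numpy as np

def sds_residual(eps_pred, eps):
    """Per-view SDS residual: predicted noise minus injected noise."""
    return eps_pred - eps

def toy_energy(views):
    """Hypothetical coherence energy: 0.5 * sum of squared deviations from the mean view."""
    mean = views.mean(axis=0, keepdims=True)
    return 0.5 * np.sum((views - mean) ** 2)

def toy_energy_grad(views):
    """Analytic gradient of toy_energy with respect to every view."""
    mean = views.mean(axis=0, keepdims=True)
    return views - mean

def joint_residual(views, eps_pred, eps, energy_weight=1.0):
    """Assumed joint update: per-view residual plus the gradient of the
    coherence energy, so that descending along it also lowers the energy."""
    return sds_residual(eps_pred, eps) + energy_weight * toy_energy_grad(views)

# Toy usage: 4 rendered views of shape (3, 8, 8) with random stand-in noise.
rng = np.random.default_rng(0)
views = rng.normal(size=(4, 3, 8, 8))
eps = rng.normal(size=views.shape)        # injected noise
eps_pred = rng.normal(size=views.shape)   # stand-in for the diffusion model output
update = joint_residual(views, eps_pred, eps, energy_weight=0.5)
print(update.shape)
```

When the views already agree under the energy, its gradient vanishes and the update falls back to the plain per-view residual; otherwise every view receives a correction that depends on all the other views, which is what distinguishes a joint update from independent ones.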

Empirically, JSD significantly mitigates the 3D inconsistency problem in SDS by a 70% drop in Janus rate, while maintaining text congruence. Moreover, we introduce the Geometry Fading scheme and Classifier-Free Guidance (CFG) Switching strategy to enhance generative details. Our framework, JointDreamer, establishes a new benchmark in text-to-3D generation, achieving outstanding results with an 88.5% CLIP R-Precision and 27.7% CLIP Score. These metrics demonstrate exceptional text congruence, as well as remarkable geometric consistency and texture fidelity.
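
As a rough illustration of what a retrieval metric such as CLIP R-Precision measures, the sketch below computes top-1 retrieval accuracy from a similarity matrix between renders and candidate prompts. The exact evaluation protocol (CLIP variant, candidate prompt pool, number of views per object) is not stated on this page, so the function name and the similarity values are assumptions made up for the example.

```python
import numpy as np

def r_precision_top1(similarity):
    """Fraction of renders whose most similar prompt is the correct one.

    similarity[i, j] is the score between render i and prompt j;
    prompt i is assumed to be the ground-truth prompt for render i.
    """
    similarity = np.asarray(similarity, dtype=float)
    retrieved = similarity.argmax(axis=1)
    return float(np.mean(retrieved == np.arange(similarity.shape[0])))

# Made-up similarities for 3 renders and 3 prompts.
toy_similarity = np.array([
    [0.31, 0.22, 0.18],   # render 0 retrieves prompt 0 (correct)
    [0.20, 0.29, 0.25],   # render 1 retrieves prompt 1 (correct)
    [0.27, 0.19, 0.24],   # render 2 retrieves prompt 0 (wrong)
])
score = r_precision_top1(toy_similarity)
print(f"{score:.3f}")
```

In this toy case two of the three renders retrieve their own prompt, so the score is 2/3.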

[Figure: architecture]

Example generated objects

JointDreamer generates objects with consistent geometry and texture.

A DSLR photo of a pink Spiderman dancing ballet, Marvel character HD, highly detailed 3D model
Woodies talking with each other, Toy Story, Anime style, more details, 8K, HD
A wide angle zoomed out DSLR photo of a red dragon dressed in a tuxedo and playing chess, 8K, HD, photorealistic
A panda rowing a boat in a pond, 8K, HD, photorealistic
a confused beagle sitting at a desk working on homework
A DSLR photo of Queen Elizabeth riding a motorcycle, 8K, HD, photorealistic

Comparison Results

We collected 14 prompts from different sources to compare against other text-to-3D methods. A fixed default configuration in threestudio is used for all prompts, without hyper-parameter tuning.

Methods compared: Dreamfusion-IF, Magic3D-IF-SD, ProlificDreamer, MVDream, Ours

a DSLR photo of a squirrel playing guitar

Corgi riding a rocket

A zoomed out DSLR photo of a hippo biting through a watermelon, 8K, HD, photorealistic

a DSLR photo of a fox working on a jigsaw puzzle, 8K, HD, photorealistic

a white cat curled up on a wooden chair

a dog is sleeping on a pile of pillows

A DSLR photo of Kungfu panda eating a dumpling, movie style, 8K, HD, photorealistic

A pink Spiderman dancing ballet, Marvel character HD, highly detailed 3D model

a rabbit cutting grass with a lawnmower

a wide angle zoomed out DSLR photo of a skiing penguin wearing a puffy jacket

a zoomed out DSLR photo of a bear playing electric bass


Universal View-aware Models

JSD can incorporate various view-aware models to capture inter-view coherence.

Variants compared: Baseline (SDS), Binary Classification Model, Image-to-Image Translation Model, Multi-view Generation Model (JointDreamer)

A panda rowing a boat in a pond.

a blue jay standing on a large basket of rainbow macarons.
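
One generic way a binary classifier can act as an energy function is through its negative log-probability. The sketch below assumes a hypothetical classifier that outputs a logit for "these rendered views are coherent"; how the paper's classifier is built, what inputs it takes, and how it is trained are not described on this page, so the function names and setup are illustrative only.

```python
import numpy as np

def classifier_energy(logit):
    """Energy = -log(sigmoid(logit)), computed stably as softplus(-logit)."""
    return np.logaddexp(0.0, -logit)

def classifier_energy_grad(logit):
    """Derivative of the energy with respect to the logit: -(1 - sigmoid(logit))."""
    return -np.exp(-np.logaddexp(0.0, logit))

# Confident "coherent" predictions give low energy and a small gradient;
# confident "incoherent" predictions give high energy and a gradient near -1.
logits = np.array([-4.0, 0.0, 4.0])
print(classifier_energy(logits))
print(classifier_energy_grad(logits))
```

In a full pipeline this gradient would be chained through the classifier back to the rendered views, so that views the classifier judges incoherent are pushed toward configurations it accepts.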