Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Zhao, Zibo; Liu, Wen; Chen, Xin; Zeng, Xianfang; Wang, Rui; Cheng, Pei; Fu, Bin; Chen, Tao; Yu, Gang; Gao, Shenghua

Scopus: eid_2-s2.0-85188850274

Citations:
- Scopus: 0
Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Title	Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation
Authors	Zhao, Zibo; Liu, Wen; Chen, Xin; Zeng, Xianfang; Wang, Rui; Cheng, Pei; Fu, Bin; Chen, Tao; Yu, Gang; Gao, Shenghua
Issue Date	2023
Citation	Advances in Neural Information Processing Systems, 2023, v. 36
Abstract	We present a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing results inconsistent with the conditions, because 3D shapes have an additional dimension whose distribution significantly differs from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former model encodes the 3D shapes into the shape latent space aligned to the image and text and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via the transformer-based decoder. The latter model learns a probabilistic mapping function from the image or text space to the latent shape space. Extensive experiments demonstrate that our proposed approach can generate higher-quality and more diverse 3D shapes that better semantically conform to the visual or textual conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.
Persistent Identifier	http://hdl.handle.net/10722/345378
ISSN	1049-5258 (2020 SCImago Journal Rankings: 1.399)

DC Field	Value	Language
dc.contributor.author	Zhao, Zibo	-
dc.contributor.author	Liu, Wen	-
dc.contributor.author	Chen, Xin	-
dc.contributor.author	Zeng, Xianfang	-
dc.contributor.author	Wang, Rui	-
dc.contributor.author	Cheng, Pei	-
dc.contributor.author	Fu, Bin	-
dc.contributor.author	Chen, Tao	-
dc.contributor.author	Yu, Gang	-
dc.contributor.author	Gao, Shenghua	-
dc.date.accessioned	2024-08-15T09:26:58Z	-
dc.date.available	2024-08-15T09:26:58Z	-
dc.date.issued	2023	-
dc.identifier.citation	Advances in Neural Information Processing Systems, 2023, v. 36	-
dc.identifier.issn	1049-5258	-
dc.identifier.uri	http://hdl.handle.net/10722/345378	-
dc.description.abstract	We present a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing results inconsistent with the conditions, because 3D shapes have an additional dimension whose distribution significantly differs from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former model encodes the 3D shapes into the shape latent space aligned to the image and text and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via the transformer-based decoder. The latter model learns a probabilistic mapping function from the image or text space to the latent shape space. Extensive experiments demonstrate that our proposed approach can generate higher-quality and more diverse 3D shapes that better semantically conform to the visual or textual conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.	-
dc.language	eng	-
dc.relation.ispartof	Advances in Neural Information Processing Systems	-
dc.title	Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation	-
dc.type	Conference_Paper	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.scopus	eid_2-s2.0-85188850274	-
dc.identifier.volume	36	-
