
Postgraduate thesis: 3D neural implicit human modeling from image and text

Title: 3D neural implicit human modeling from image and text
Authors: Cao, Yukang (操雨康)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Cao, Y. [操雨康]. (2024). 3D neural implicit human modeling from image and text. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: 3D human modeling holds significant importance and finds widespread applications in diverse domains such as virtual and augmented reality. However, existing algorithms struggle to produce the fine face and clothing topology that meets current standards of ultra-high-definition rendering. Additionally, these algorithms rely heavily on hard-to-acquire 3D datasets, and no prior work has explored eliminating this need by generating 3D human models directly from text. This thesis tackles three challenges of 3D neural implicit human modeling from image and text: single-view face-enhanced 3D human reconstruction, uncalibrated multi-view 3D human reconstruction with complex poses, and controllable 3D human generation from text.

The first part of this thesis aims to enhance face quality in single-view 3D human reconstruction. Existing methods often fall short in capturing fine face details in both geometry and texture. We propose a jointly-aligned implicit face function that combines the merits of the implicit function and the 3D face prior to achieve high-quality face geometry, and we further introduce a coarse-to-fine architecture to produce high-fidelity texture for the reconstructed face model. Our approach is flexible, as it can be seamlessly integrated with any body reconstruction pipeline to yield a complete 3D full-body model.

In the second part of this thesis, a novel self-evolved signed distance field module is proposed for uncalibrated multi-view 3D human reconstruction. Our framework employs the parametric SMPL-X model as the 3D prior and learns to deform the signed distance field derived from SMPL-X, so that detailed geometry reflecting the actual clothed human can be encoded for better reconstruction. Existing approaches require complex camera calibration and treat features from different viewpoints equally, which greatly hinders their performance and real-world applicability. In contrast, we introduce a simple yet effective self-calibration technique based on the SMPL-X model and an occlusion-aware feature fusion strategy that aggregates the most useful features from different views.

In contrast to the significant reliance on 3D training datasets in the previous two parts, the third part of this thesis considers the task of generating 3D human models from text alone in a self-optimization manner. Specifically, we employ a trainable NeRF to predict density and color for 3D points, and pre-trained text-to-image diffusion models to provide 2D self-supervision. We first leverage the SMPL model to provide shape and pose guidance for the generation, and then introduce a dual-observation-space design that jointly optimizes a canonical space and a posed space linked by a learnable deformation field. We also jointly optimize losses computed from the full body and from the zoomed-in 3D head to alleviate the multi-face "Janus" problem. We show that our method can generate 3D human models with user-controllable shapes and poses from textual descriptions alone, greatly expanding the potential of future 3D human modeling.
Degree: Doctor of Philosophy
Subject: Computer animation; Computer simulation; Human figure in art; Three-dimensional display systems
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/345441
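The three parts summarized in the abstract share one recurring pattern: a neural implicit function queried at 3D points, conditioned on pixel-aligned image features together with a parametric 3D prior (a face prior in the first part, SMPL-X in the second, SMPL in the third). The following is a minimal, schematic PyTorch sketch of that general pattern only, not the thesis implementation; the class name, layer sizes, and the random stand-in inputs are assumptions made here for illustration.

# Minimal sketch (assumptions, not the thesis code): an implicit function that
# maps pixel-aligned image features + point depth + the signed distance to a
# parametric body prior (e.g. SMPL-X) to an occupancy value per query point.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PriorGuidedImplicitFunction(nn.Module):
    """MLP: (pixel-aligned feature, depth, prior SDF) -> occupancy in [0, 1]."""

    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        # +1 for the point's depth in camera space, +1 for the prior signed distance
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_map, points_2d, depth, prior_sdf):
        # feat_map:  (B, C, H, W) feature map from a 2D image encoder
        # points_2d: (B, N, 2) query points projected to normalized image coords in [-1, 1]
        # depth:     (B, N, 1) camera-space depth of each query point
        # prior_sdf: (B, N, 1) signed distance of each point to the prior body surface
        grid = points_2d.unsqueeze(2)                               # (B, N, 1, 2)
        feat = F.grid_sample(feat_map, grid, align_corners=True)    # (B, C, N, 1)
        feat = feat.squeeze(-1).permute(0, 2, 1)                    # (B, N, C)
        x = torch.cat([feat, depth, prior_sdf], dim=-1)             # (B, N, C + 2)
        return torch.sigmoid(self.mlp(x))                           # (B, N, 1)


if __name__ == "__main__":
    B, N = 1, 1024
    feat_map = torch.randn(B, 256, 128, 128)   # stand-in for encoder features
    pts_2d = torch.rand(B, N, 2) * 2 - 1       # random projected query points
    depth = torch.randn(B, N, 1)
    prior_sdf = torch.randn(B, N, 1)           # stand-in for SMPL-X signed distances
    occ = PriorGuidedImplicitFunction()(feat_map, pts_2d, depth, prior_sdf)
    print(occ.shape)                           # torch.Size([1, 1024, 1])

In the multi-view setting of the second part, the single feature map would be replaced by features fused across (self-calibrated) views, and the network would typically predict a residual over the prior signed distance field rather than occupancy directly.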

 

DC Field: Value
dc.contributor.author: Cao, Yukang
dc.contributor.author: 操雨康
dc.date.accessioned: 2024-08-26T08:59:50Z
dc.date.available: 2024-08-26T08:59:50Z
dc.date.issued: 2024
dc.identifier.citation: Cao, Y. [操雨康]. (2024). 3D neural implicit human modeling from image and text. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/345441
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer animation
dc.subject.lcsh: Computer simulation
dc.subject.lcsh: Human figure in art
dc.subject.lcsh: Three-dimensional display systems
dc.title: 3D neural implicit human modeling from image and text
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044843669603414
