
Postgraduate thesis: 3D neural implicit human modeling from image and text

Title: 3D neural implicit human modeling from image and text
Authors: Cao, Yukang (操雨康)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Cao, Y. [操雨康]. (2024). 3D neural implicit human modeling from image and text. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: 3D human modeling holds significant importance and finds widespread applications in diverse domains such as virtual and augmented reality. However, existing algorithms struggle to produce the fine face and clothing topology that meets current standards of ultra-high-definition rendering. Additionally, these algorithms rely heavily on hard-to-acquire 3D datasets, and no prior work has explored eliminating this need by generating 3D human models directly from text. This thesis tackles three challenges of 3D neural implicit human modeling from image and text: single-view face-enhanced 3D human reconstruction, uncalibrated multi-view 3D human reconstruction with complex poses, and controllable 3D human generation from text.

The first part of this thesis aims to enhance face quality in single-view 3D human reconstruction. Existing methods often fall short in capturing fine face details in both geometry and texture. We propose a jointly-aligned implicit face function that combines the merits of the implicit function and the 3D face prior to achieve high-quality face geometry, and we further introduce a coarse-to-fine architecture to produce high-fidelity texture for the reconstructed face model. Our approach is flexible, as it can be seamlessly integrated with any body reconstruction pipeline to yield a complete 3D full-body model.

In the second part of this thesis, a novel self-evolved signed distance field module is proposed for uncalibrated multi-view 3D human reconstruction. Our framework employs the parametric SMPL-X model as the 3D prior and learns to deform the signed distance field derived from SMPL-X, so that detailed geometry reflecting the actual clothed human can be encoded for better reconstruction. Existing approaches require complex camera calibration and treat features from different viewpoints equally, which greatly hinders their performance and real-world applicability. In contrast, we introduce a simple yet effective self-calibration technique based on the SMPL-X model and an occlusion-aware feature fusion strategy that aggregates the most useful features from different views.

In contrast to the significant reliance on 3D training datasets in the previous two parts, the third part of this thesis considers the task of generating 3D human models from text alone in a self-optimization manner. Specifically, we employ a trainable NeRF to predict density and color for 3D points, and pre-trained text-to-image diffusion models to provide 2D self-supervision. We first leverage the SMPL model to provide shape and pose guidance for the generation, and then introduce a dual-observation-space design that jointly optimizes a canonical space and a posed space linked by a learnable deformation field. We also jointly optimize losses computed from the full body and from the zoomed-in 3D head to alleviate the multi-face "Janus" problem. We show that our method can generate 3D human models with user-controllable shapes and poses from textual descriptions alone, greatly expanding the potential of future 3D human modeling.
Degree: Doctor of Philosophy
Subject: Computer animation; Computer simulation; Human figure in art; Three-dimensional display systems
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/345441
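The three parts summarized in the abstract share one recurring pattern: a neural implicit function queried at 3D points, conditioned on pixel-aligned image features together with a parametric 3D prior (a face prior in the first part, SMPL-X in the second, SMPL in the third). The following is a minimal, schematic PyTorch sketch of that general pattern only, not the thesis implementation; the class name, layer sizes, and the random stand-in inputs are assumptions made here for illustration.

# Minimal sketch (assumptions, not the thesis code): an implicit function that
# maps pixel-aligned image features + point depth + the signed distance to a
# parametric body prior (e.g. SMPL-X) to an occupancy value per query point.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PriorGuidedImplicitFunction(nn.Module):
    """MLP: (pixel-aligned feature, depth, prior SDF) -> occupancy in [0, 1]."""

    def __init__(self, feat_dim: int = 256, hidden: int = 128):
        super().__init__()
        # +1 for the point's depth in camera space, +1 for the prior signed distance
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_map, points_2d, depth, prior_sdf):
        # feat_map:  (B, C, H, W) feature map from a 2D image encoder
        # points_2d: (B, N, 2) query points projected to normalized image coords in [-1, 1]
        # depth:     (B, N, 1) camera-space depth of each query point
        # prior_sdf: (B, N, 1) signed distance of each point to the prior body surface
        grid = points_2d.unsqueeze(2)                               # (B, N, 1, 2)
        feat = F.grid_sample(feat_map, grid, align_corners=True)    # (B, C, N, 1)
        feat = feat.squeeze(-1).permute(0, 2, 1)                    # (B, N, C)
        x = torch.cat([feat, depth, prior_sdf], dim=-1)             # (B, N, C + 2)
        return torch.sigmoid(self.mlp(x))                           # (B, N, 1)


if __name__ == "__main__":
    B, N = 1, 1024
    feat_map = torch.randn(B, 256, 128, 128)   # stand-in for encoder features
    pts_2d = torch.rand(B, N, 2) * 2 - 1       # random projected query points
    depth = torch.randn(B, N, 1)
    prior_sdf = torch.randn(B, N, 1)           # stand-in for SMPL-X signed distances
    occ = PriorGuidedImplicitFunction()(feat_map, pts_2d, depth, prior_sdf)
    print(occ.shape)                           # torch.Size([1, 1024, 1])

In the multi-view setting of the second part, the single feature map would be replaced by features fused across (self-calibrated) views, and the network would typically predict a residual over the prior signed distance field rather than occupancy directly.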

 

DC Field: Value
dc.contributor.author: Cao, Yukang
dc.contributor.author: 操雨康
dc.date.accessioned: 2024-08-26T08:59:50Z
dc.date.available: 2024-08-26T08:59:50Z
dc.date.issued: 2024
dc.identifier.citation: Cao, Y. [操雨康]. (2024). 3D neural implicit human modeling from image and text. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/345441
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer animation
dc.subject.lcsh: Computer simulation
dc.subject.lcsh: Human figure in art
dc.subject.lcsh: Three-dimensional display systems
dc.title: 3D neural implicit human modeling from image and text
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044843669603414
