Appears in Collections: postgraduate thesis: Towards generalizable understanding of object 6D pose and human hand action
Field | Value |
---|---|
Title | Towards generalizable understanding of object 6D pose and human hand action |
Authors | Wen, Yilin (温依林) |
Advisors | Komura, T; Wang, WP |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Wen, Y. [温依林]. (2024). Towards generalizable understanding of object 6D pose and human hand action. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Learning to understand from visual observations plays a fundamental role in computational intelligence. In particular, understanding object 6D pose and human hand action addresses the critical point of enabling computational agents to perceive their physical surroundings and interpret the intentions of human subjects. Moreover, a natural desire is to learn models that robustly generalize to various real-world scenarios, which further facilitates efficient deployment. In this thesis, we study 6D object pose estimation and human hand action modeling respectively, aiming at generalizable understanding. In the first half, we address 6D object pose estimation for rigid objects from RGB images. To cope with the inherent challenges of pose ambiguity due to object symmetry and the scarcity of real labeled training data, we build on a well-known autoencoding framework to learn implicit encodings of object orientations by training on synthetic images; we further improve this implicit orientation learning in its original instance-level setting and extend it to address the scalability challenge. At the instance level, we leverage sharp edge features of images and impose a geometric prior during training, thereby mitigating the sim-to-real domain gap and enhancing the regularity of the latent space in capturing the geometry of the rotation space. We then proceed to learn a scalable network capable of handling multiple training objects and generalizing to novel ones. This is achieved by disentangling object shape and pose in the implicit learning; meanwhile, we re-entangle the shape and canonical rotations to capture the inconsistent latent pose spaces caused by different object symmetries. In the second half, we study dynamic human hand action and solve related tasks, proposing unified frameworks that faithfully capture the semantic dependency and temporal granularity of hand pose and action. We first model the recognition side with a network hierarchy of two cascaded transformer encoders. The two encoders work on short and long time spans respectively to achieve robust hand pose estimation and accurate action recognition from egocentric RGB videos, with their cascade capturing the semantic correlation. Subsequently, we extend the design to jointly capture both the recognition and future prediction sides by building the network hierarchy on a generative transformer VAE architecture. In this way, our network design leverages the short-term hand motion and long-term action regularity shared between observed and future timestamps, further enabling realistic motion prediction and robust performance generalization for hand action modeling. Extensive experiments demonstrate the enhanced capability and generalizability enabled by our technical designs. |
Degree | Doctor of Philosophy |
Subject | Gesture recognition (Computer science); Hand - Movements; Computational intelligence |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/343766 |
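The first half of the abstract describes learning implicit encodings of object orientations with an autoencoder trained on synthetic renderings. The following is a minimal PyTorch sketch of that general idea, not code from the thesis: a convolutional autoencoder whose latent code is matched against a codebook of encoded template views with known rotations. The network sizes, the 128x128 crop resolution, and the cosine-similarity codebook lookup are illustrative assumptions.

```python
# Illustrative sketch (not from the thesis): implicit orientation encoding with a
# convolutional autoencoder trained on synthetic renderings, plus nearest-neighbour
# codebook lookup for rotation retrieval. Sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrientationAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # Encoder: 128x128 RGB crop -> latent orientation code.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64x64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # 8x8
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),
        )
        # Decoder: latent code -> reconstructed clean rendering of the object.
        self.decoder_fc = nn.Linear(latent_dim, 256 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def encode(self, x):
        return F.normalize(self.encoder(x), dim=-1)  # unit-norm latent code

    def forward(self, x):
        z = self.encode(x)
        h = self.decoder_fc(z).view(-1, 256, 8, 8)
        return self.decoder(h), z

def train_step(model, optimizer, augmented, clean):
    """One step: reconstruct the clean synthetic view from an augmented input."""
    recon, _ = model(augmented)
    loss = F.mse_loss(recon, clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def build_codebook(model, template_views):
    """Encode renderings at known rotations; returns an (N, latent_dim) codebook."""
    return model.encode(template_views)

@torch.no_grad()
def retrieve_rotation(model, crop, codebook, rotations):
    """Cosine-similarity lookup of a query crop against the codebook."""
    z = model.encode(crop)                 # (1, latent_dim)
    sims = z @ codebook.t()                # (1, N)
    return rotations[sims.argmax(dim=-1)]  # best-matching known rotation
```

A pipeline in this style needs only synthetic training images, and symmetry-induced pose ambiguity is tolerated because views of a symmetric object map to nearby latent codes rather than to a single explicit rotation.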
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Komura, T | - |
dc.contributor.advisor | Wang, WP | - |
dc.contributor.author | Wen, Yilin | - |
dc.contributor.author | 温依林 | - |
dc.date.accessioned | 2024-06-06T01:04:49Z | - |
dc.date.available | 2024-06-06T01:04:49Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Wen, Y. [温依林]. (2024). Towards generalizable understanding of object 6D pose and human hand action. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/343766 | - |
dc.description.abstract | Learning to understand from visual observations plays a fundamental role in computational intelligence. In particular, understanding object 6D pose and human hand action addresses the critical point of enabling computational agents to perceive their physical surroundings and interpret the intentions of human subjects. Moreover, a natural desire is to learn models that robustly generalize to various real-world scenarios, which further facilitates efficient deployment. In this thesis, we study 6D object pose estimation and human hand action modeling respectively, aiming at generalizable understanding. In the first half, we address 6D object pose estimation for rigid objects from RGB images. To cope with the inherent challenges of pose ambiguity due to object symmetry and the scarcity of real labeled training data, we build on a well-known autoencoding framework to learn implicit encodings of object orientations by training on synthetic images; we further improve this implicit orientation learning in its original instance-level setting and extend it to address the scalability challenge. At the instance level, we leverage sharp edge features of images and impose a geometric prior during training, thereby mitigating the sim-to-real domain gap and enhancing the regularity of the latent space in capturing the geometry of the rotation space. We then proceed to learn a scalable network capable of handling multiple training objects and generalizing to novel ones. This is achieved by disentangling object shape and pose in the implicit learning; meanwhile, we re-entangle the shape and canonical rotations to capture the inconsistent latent pose spaces caused by different object symmetries. In the second half, we study dynamic human hand action and solve related tasks, proposing unified frameworks that faithfully capture the semantic dependency and temporal granularity of hand pose and action. We first model the recognition side with a network hierarchy of two cascaded transformer encoders. The two encoders work on short and long time spans respectively to achieve robust hand pose estimation and accurate action recognition from egocentric RGB videos, with their cascade capturing the semantic correlation. Subsequently, we extend the design to jointly capture both the recognition and future prediction sides by building the network hierarchy on a generative transformer VAE architecture. In this way, our network design leverages the short-term hand motion and long-term action regularity shared between observed and future timestamps, further enabling realistic motion prediction and robust performance generalization for hand action modeling. Extensive experiments demonstrate the enhanced capability and generalizability enabled by our technical designs. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Gesture recognition (Computer science) | - |
dc.subject.lcsh | Hand - Movements | - |
dc.subject.lcsh | Computational intelligence | - |
dc.title | Towards generalizable understanding of object 6D pose and human hand action | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044809205903414 | - |
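For the second half of the thesis, the abstract describes a hierarchy of two cascaded transformer encoders operating on short and long time spans for hand pose estimation and action recognition from egocentric video. Below is a minimal PyTorch sketch of such a cascade, not the thesis architecture: per-frame backbone features are refined over short windows to predict per-frame hand pose, and a learned action token attends over the whole clip to classify the action. The window length, feature dimension, joint count (21), and action count are assumptions.

```python
# Illustrative sketch (not from the thesis): two cascaded transformer encoders, one
# over short temporal windows for per-frame hand pose, one over the full clip for
# action recognition. Dimensions, joint count, and class count are assumptions.
import torch
import torch.nn as nn

class CascadedHandActionNet(nn.Module):
    def __init__(self, feat_dim=512, num_joints=21, num_actions=45,
                 window=8, depth=4, heads=8):
        super().__init__()
        self.window = window
        short_layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        long_layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.short_encoder = nn.TransformerEncoder(short_layer, depth)  # short spans
        self.long_encoder = nn.TransformerEncoder(long_layer, depth)    # whole clip
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)  # per-frame 3D joints
        self.action_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, frame_feats):
        """frame_feats: (B, T, feat_dim) per-frame features from an image backbone."""
        B, T, C = frame_feats.shape
        # Short-span encoder: split the clip into non-overlapping windows so each
        # frame attends only to its local temporal neighbourhood.
        w = self.window
        assert T % w == 0, "pad the clip so its length is a multiple of the window"
        short_in = frame_feats.reshape(B * T // w, w, C)
        short_out = self.short_encoder(short_in).reshape(B, T, C)
        poses = self.pose_head(short_out).reshape(B, T, -1, 3)
        # Long-span encoder: a learned action token attends over all refined frame
        # tokens, letting action recognition exploit the pose-level evidence.
        long_in = torch.cat([self.action_token.expand(B, 1, C), short_out], dim=1)
        long_out = self.long_encoder(long_in)
        action_logits = self.action_head(long_out[:, 0])
        return poses, action_logits

# Example usage with random tensors standing in for backbone features.
model = CascadedHandActionNet()
feats = torch.randn(2, 32, 512)    # batch of 2 clips, 32 frames each
poses, logits = model(feats)
print(poses.shape, logits.shape)   # (2, 32, 21, 3) (2, 45)
```

The abstract further extends this kind of hierarchy to a generative transformer VAE for joint recognition and future motion prediction; that generative side is not covered by the sketch above.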