Postgraduate thesis: Towards generalizable understanding of object 6D pose and human hand action

Title: Towards generalizable understanding of object 6D pose and human hand action
Authors: Wen, Yilin (温依林)
Advisors: Komura, T; Wang, WP
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Wen, Y. [温依林]. (2024). Towards generalizable understanding of object 6D pose and human hand action. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Learning to understand from visual observations plays a fundamental role in computational intelligence. In particular, understanding object 6D pose and human hand action is critical for enabling computational agents to perceive their surroundings and interpret the intentions of human subjects. Moreover, it is naturally desirable to learn models that generalize robustly to diverse real-world scenarios, which in turn facilitates efficient deployment. In this thesis, we study 6D object pose estimation and human hand action modeling, aiming at generalizable understanding. In the first half, we address 6D object pose estimation for rigid objects from RGB images. To cope with the inherent challenges of pose ambiguity due to object symmetry and the scarcity of real labeled training data, we build on a well-known autoencoding framework that learns implicit encodings of object orientations by training on synthetic images; we improve this implicit orientation learning in its original instance-level setting and extend it to address the scalability challenge. At the instance level, we leverage sharp edge features of images and impose a geometric prior in training, thereby mitigating the sim-to-real domain gap and enhancing the regularity of the latent space in capturing the geometry of the rotation space. We then proceed to learn a scalable network capable of handling multiple training objects and generalizing to novel ones. This is achieved by disentangling object shape and pose in the implicit learning; meanwhile, we re-entangle the shape and canonical rotations to capture the inconsistent latent pose spaces caused by different object symmetries. In the second half, we study dynamic human hand action and solve related tasks, proposing unified frameworks that faithfully capture the semantic dependency and temporal granularity of hand pose and action. We first model the recognition side with a network hierarchy of two cascaded transformer encoders. The two encoders respectively operate on short and long time spans to achieve robust hand pose estimation and accurate action recognition from egocentric RGB videos, with their cascade capturing the semantic correlation. Subsequently, we extend the hierarchy to jointly capture both the recognition and future prediction sides by building it on a generative transformer VAE architecture. In this way, our network design leverages the short-term hand motion and long-term action regularity shared between observed and future time steps, enabling realistic motion prediction and robust generalization for hand action modeling. Extensive experiments demonstrate the enhanced capability and generalizability enabled by our technical designs.
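
To make the first half of the abstract concrete: the "well-known autoencoding framework" refers to AAE-style implicit orientation learning, where a codebook of latent codes, one per sampled rotation, is built from synthetic renderings and queried by nearest neighbor at test time. The following Python sketch is a minimal, hypothetical illustration of that codebook scheme, not the thesis implementation; the random-projection "encoder", the rendering stub, and all dimensions are stand-ins.

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in "encoder": a fixed random projection in place of a CNN
    # trained with a reconstruction loss on synthetic renderings.
    PROJ = rng.standard_normal((128, 64 * 64))

    def encode(image_crop):
        # Map an image crop to a unit-norm latent code.
        z = PROJ @ image_crop.reshape(-1)
        return z / np.linalg.norm(z)

    def build_codebook(render_fn, rotations):
        # Offline: render the object under each sampled rotation and
        # store one latent code per rotation, so orientation is encoded
        # implicitly rather than regressed directly.
        return np.stack([encode(render_fn(R)) for R in rotations])

    def estimate_orientation(image_crop, codebook, rotations):
        # Test time: the nearest codebook entry by cosine similarity
        # gives the rotation estimate; ambiguous views of a symmetric
        # object land on similar codes, which sidesteps pose ambiguity.
        return rotations[int(np.argmax(codebook @ encode(image_crop)))]

    # Toy usage with random images standing in for real renderings.
    rotations = [np.eye(3) for _ in range(10)]
    codebook = build_codebook(lambda R: rng.random((64, 64)), rotations)
    R_hat = estimate_orientation(rng.random((64, 64)), codebook, rotations)

Because the estimate is a lookup rather than a regression, symmetric views that are visually indistinguishable simply share codebook neighbors, which is why this family of methods tolerates pose ambiguity without symmetry annotations.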
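For the second half, the recognition hierarchy cascades two transformer encoders over different time spans: a short-span encoder for per-frame hand pose and a long-span encoder, fed by the first, for the action label. Below is a minimal PyTorch sketch of such a cascade; the module name, feature dimensions, window size, and heads are illustrative assumptions, not the thesis code.

    import torch
    import torch.nn as nn

    class CascadedHandNet(nn.Module):
        # Hypothetical two-level hierarchy: the short-span encoder
        # attends within local windows for per-frame hand pose; the
        # long-span encoder attends over the whole clip for the action.
        def __init__(self, feat_dim=256, n_joints=21, n_actions=45, win=8):
            super().__init__()
            self.win = win
            layer = nn.TransformerEncoderLayer(feat_dim, nhead=8,
                                               batch_first=True)
            # nn.TransformerEncoder deep-copies the layer, so the two
            # encoders below do not share parameters.
            self.short_enc = nn.TransformerEncoder(layer, num_layers=2)
            self.long_enc = nn.TransformerEncoder(layer, num_layers=2)
            self.pose_head = nn.Linear(feat_dim, n_joints * 3)
            self.action_head = nn.Linear(feat_dim, n_actions)

        def forward(self, frame_feats):
            # frame_feats: (B, T, feat_dim) per-frame backbone features,
            # with T assumed divisible by the window size.
            B, T, C = frame_feats.shape
            local = frame_feats.reshape(B * (T // self.win), self.win, C)
            local = self.short_enc(local).reshape(B, T, C)  # short span
            poses = self.pose_head(local)          # (B, T, n_joints*3)
            clip = self.long_enc(local).mean(dim=1)         # long span
            return poses, self.action_head(clip)   # pose + action logits

    net = CascadedHandNet()
    poses, action_logits = net(torch.randn(2, 32, 256))

The cascade is the point: the long-span encoder consumes the short-span encoder's pose-aware features rather than raw frames, which is one plausible way to realize the semantic correlation between pose and action that the abstract describes.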
Degree: Doctor of Philosophy
Subject: Gesture recognition (Computer science); Hand - Movements; Computational intelligence
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/343766

DC Field: Value
dc.contributor.advisor: Komura, T
dc.contributor.advisor: Wang, WP
dc.contributor.author: Wen, Yilin
dc.contributor.author: 温依林
dc.date.accessioned: 2024-06-06T01:04:49Z
dc.date.available: 2024-06-06T01:04:49Z
dc.date.issued: 2024
dc.identifier.citation: Wen, Y. [温依林]. (2024). Towards generalizable understanding of object 6D pose and human hand action. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/343766
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Gesture recognition (Computer science)
dc.subject.lcsh: Hand - Movements
dc.subject.lcsh: Computational intelligence
dc.title: Towards generalizable understanding of object 6D pose and human hand action
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044809205903414
