Appears in Collections: postgraduate thesis: Towards generalizable understanding of object 6D pose and human hand action
Field | Value |
---|---|
Title | Towards generalizable understanding of object 6D pose and human hand action |
Authors | Wen, Yilin (温依林) |
Advisors | Komura, T; Wang, WP |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Wen, Y. [温依林]. (2024). Towards generalizable understanding of object 6D pose and human hand action. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Learning to understand from visual observations plays a fundamental role in computational intelligence. In particular, understanding object 6D pose and human hand action addresses the critical point of enabling computational agents to perceive their physical surroundings and interpret the intentions of human subjects. Moreover, a natural desire is to learn models that robustly generalize to various real-world scenarios, which further facilitates efficient deployment. In this thesis, we study 6D object pose estimation and human hand action modeling respectively, aiming at generalizable understanding. In the first half, we address 6D object pose estimation for rigid objects from RGB images. To cope with the inherent challenges of pose ambiguity due to object symmetry and the scarcity of real labeled training data, we build on a well-known autoencoding framework to learn implicit encodings of object orientations by training on synthetic images; we further improve this implicit orientation learning in its original instance-level setting and extend it to address the scalability challenge. At the instance level, we leverage sharp edge features of images and impose a geometric prior during training, thereby mitigating the sim-to-real domain gap and enhancing the regularity of the latent space in capturing the geometry of the rotation space. We then proceed to learn a scalable network capable of handling multiple training objects and generalizing to novel ones. This is achieved by disentangling object shape and pose in the implicit learning; meanwhile, we re-entangle the shape and canonical rotations to capture the inconsistent latent pose spaces caused by different object symmetries. In the second half, we study dynamic human hand action and solve related tasks, proposing unified frameworks that faithfully capture the semantic dependency and temporal granularity of hand pose and action. We first model the recognition side with a network hierarchy of two cascaded transformer encoders. The two encoders work on short and long time spans respectively to achieve robust hand pose estimation and accurate action recognition from egocentric RGB videos, with their cascade capturing the semantic correlation. Subsequently, we extend the design to jointly capture both the recognition and future prediction sides by building the network hierarchy on a generative transformer VAE architecture. In this way, our network design leverages the short-term hand motion and long-term action regularity shared between observed and future timestamps, further enabling realistic motion prediction and robust performance generalization for hand action modeling. Extensive experiments demonstrate the enhanced capability and generalizability enabled by our technical designs. |
Degree | Doctor of Philosophy |
Subject | Gesture recognition (Computer science); Hand - Movements; Computational intelligence |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/343766 |
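The first half of the abstract describes learning implicit encodings of object orientations with an autoencoder trained on synthetic renderings. The following is a minimal PyTorch sketch of that general idea, not code from the thesis: a convolutional autoencoder whose latent code is matched against a codebook of encoded template views with known rotations. The network sizes, the 128x128 crop resolution, and the cosine-similarity codebook lookup are illustrative assumptions.

```python
# Illustrative sketch (not from the thesis): implicit orientation encoding with a
# convolutional autoencoder trained on synthetic renderings, plus nearest-neighbour
# codebook lookup for rotation retrieval. Sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrientationAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # Encoder: 128x128 RGB crop -> latent orientation code.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64x64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # 8x8
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),
        )
        # Decoder: latent code -> reconstructed clean rendering of the object.
        self.decoder_fc = nn.Linear(latent_dim, 256 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def encode(self, x):
        return F.normalize(self.encoder(x), dim=-1)  # unit-norm latent code

    def forward(self, x):
        z = self.encode(x)
        h = self.decoder_fc(z).view(-1, 256, 8, 8)
        return self.decoder(h), z

def train_step(model, optimizer, augmented, clean):
    """One step: reconstruct the clean synthetic view from an augmented input."""
    recon, _ = model(augmented)
    loss = F.mse_loss(recon, clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def build_codebook(model, template_views):
    """Encode renderings at known rotations; returns an (N, latent_dim) codebook."""
    return model.encode(template_views)

@torch.no_grad()
def retrieve_rotation(model, crop, codebook, rotations):
    """Cosine-similarity lookup of a query crop against the codebook."""
    z = model.encode(crop)                 # (1, latent_dim)
    sims = z @ codebook.t()                # (1, N)
    return rotations[sims.argmax(dim=-1)]  # best-matching known rotation
```

A pipeline in this style needs only synthetic training images, and symmetry-induced pose ambiguity is tolerated because views of a symmetric object map to nearby latent codes rather than to a single explicit rotation.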
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Komura, T | - |
dc.contributor.advisor | Wang, WP | - |
dc.contributor.author | Wen, Yilin | - |
dc.contributor.author | 温依林 | - |
dc.date.accessioned | 2024-06-06T01:04:49Z | - |
dc.date.available | 2024-06-06T01:04:49Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Wen, Y. [温依林]. (2024). Towards generalizable understanding of object 6D pose and human hand action. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/343766 | - |
dc.description.abstract | Learning to understand from visual observations plays a fundamental role in computational intelligence. In particular, understanding object 6D pose and human hand action addresses the critical point of enabling computational agents to perceive their physical surroundings and interpret the intentions of human subjects. Moreover, a natural desire is to learn models that robustly generalize to various real-world scenarios, which further facilitates efficient deployment. In this thesis, we study 6D object pose estimation and human hand action modeling respectively, aiming at generalizable understanding. In the first half, we address 6D object pose estimation for rigid objects from RGB images. To cope with the inherent challenges of pose ambiguity due to object symmetry and the scarcity of real labeled training data, we build on a well-known autoencoding framework to learn implicit encodings of object orientations by training on synthetic images; we further improve this implicit orientation learning in its original instance-level setting and extend it to address the scalability challenge. At the instance level, we leverage sharp edge features of images and impose a geometric prior during training, thereby mitigating the sim-to-real domain gap and enhancing the regularity of the latent space in capturing the geometry of the rotation space. We then proceed to learn a scalable network capable of handling multiple training objects and generalizing to novel ones. This is achieved by disentangling object shape and pose in the implicit learning; meanwhile, we re-entangle the shape and canonical rotations to capture the inconsistent latent pose spaces caused by different object symmetries. In the second half, we study dynamic human hand action and solve related tasks, proposing unified frameworks that faithfully capture the semantic dependency and temporal granularity of hand pose and action. We first model the recognition side with a network hierarchy of two cascaded transformer encoders. The two encoders work on short and long time spans respectively to achieve robust hand pose estimation and accurate action recognition from egocentric RGB videos, with their cascade capturing the semantic correlation. Subsequently, we extend the design to jointly capture both the recognition and future prediction sides by building the network hierarchy on a generative transformer VAE architecture. In this way, our network design leverages the short-term hand motion and long-term action regularity shared between observed and future timestamps, further enabling realistic motion prediction and robust performance generalization for hand action modeling. Extensive experiments demonstrate the enhanced capability and generalizability enabled by our technical designs. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Gesture recognition (Computer science) | - |
dc.subject.lcsh | Hand - Movements | - |
dc.subject.lcsh | Computational intelligence | - |
dc.title | Towards generalizable understanding of object 6D pose and human hand action | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044809205903414 | - |
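For the second half of the thesis, the abstract describes a hierarchy of two cascaded transformer encoders operating on short and long time spans for hand pose estimation and action recognition from egocentric video. Below is a minimal PyTorch sketch of such a cascade, not the thesis architecture: per-frame backbone features are refined over short windows to predict per-frame hand pose, and a learned action token attends over the whole clip to classify the action. The window length, feature dimension, joint count (21), and action count are assumptions.

```python
# Illustrative sketch (not from the thesis): two cascaded transformer encoders, one
# over short temporal windows for per-frame hand pose, one over the full clip for
# action recognition. Dimensions, joint count, and class count are assumptions.
import torch
import torch.nn as nn

class CascadedHandActionNet(nn.Module):
    def __init__(self, feat_dim=512, num_joints=21, num_actions=45,
                 window=8, depth=4, heads=8):
        super().__init__()
        self.window = window
        short_layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        long_layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.short_encoder = nn.TransformerEncoder(short_layer, depth)  # short spans
        self.long_encoder = nn.TransformerEncoder(long_layer, depth)    # whole clip
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)  # per-frame 3D joints
        self.action_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, frame_feats):
        """frame_feats: (B, T, feat_dim) per-frame features from an image backbone."""
        B, T, C = frame_feats.shape
        # Short-span encoder: split the clip into non-overlapping windows so each
        # frame attends only to its local temporal neighbourhood.
        w = self.window
        assert T % w == 0, "pad the clip so its length is a multiple of the window"
        short_in = frame_feats.reshape(B * T // w, w, C)
        short_out = self.short_encoder(short_in).reshape(B, T, C)
        poses = self.pose_head(short_out).reshape(B, T, -1, 3)
        # Long-span encoder: a learned action token attends over all refined frame
        # tokens, letting action recognition exploit the pose-level evidence.
        long_in = torch.cat([self.action_token.expand(B, 1, C), short_out], dim=1)
        long_out = self.long_encoder(long_in)
        action_logits = self.action_head(long_out[:, 0])
        return poses, action_logits

# Example usage with random tensors standing in for backbone features.
model = CascadedHandActionNet()
feats = torch.randn(2, 32, 512)    # batch of 2 clips, 32 frames each
poses, logits = model(feats)
print(poses.shape, logits.shape)   # (2, 32, 21, 3) (2, 45)
```

The abstract further extends this kind of hierarchy to a generative transformer VAE for joint recognition and future motion prediction; that generative side is not covered by the sketch above.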