Links for fulltext (may require subscription):
- Publisher website (DOI): 10.1109/CVPR52729.2023.02035
- Web of Science: WOS:001062531305056
Citations:
- Web of Science: 0
Conference Paper: Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos
| Title | Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos |
|---|---|
| Authors | Wen, Yilin; Pan, Hao; Yang, Lei; Pan, Jia; Komura, Taku; Wang, Wenping |
| Issue Date | 18-Jun-2023 |
| Abstract | Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework to exploit temporal information for robust estimation. Noticing the different temporal granularity of and the semantic correlation between hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation, and the latter aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices. |
| Persistent Identifier | http://hdl.handle.net/10722/333846 |
| ISI Accession Number ID | WOS:001062531305056 |
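
The abstract above describes a hierarchy of two cascaded transformer encoders: a short-term encoder for per-frame 3D hand pose, followed by a clip-level encoder that aggregates pose-aware features to recognize the action. The following is a minimal PyTorch-style sketch of that cascaded design; all module names, layer counts, dimensions, window lengths, and class counts are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a two-level cascaded temporal transformer (assumed PyTorch).
# Hyperparameters below are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CascadedTemporalTransformer(nn.Module):
    def __init__(self, feat_dim=512, num_joints=21, num_actions=45,
                 short_window=16, num_heads=8):
        super().__init__()
        # First encoder: short-term temporal cues for per-frame hand pose.
        pose_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.pose_encoder = nn.TransformerEncoder(pose_layer, num_layers=2)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)
        # Second encoder: aggregates pose-aware per-frame features over the
        # whole clip to recognize the action.
        action_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.action_encoder = nn.TransformerEncoder(action_layer, num_layers=2)
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.short_window = short_window

    def forward(self, frame_feats):
        # frame_feats: (batch, T, feat_dim) per-frame backbone features.
        B, T, C = frame_feats.shape
        w = self.short_window
        # Pose encoder attends within short non-overlapping windows.
        pad = (-T) % w
        x = F.pad(frame_feats, (0, 0, 0, pad))                 # (B, T+pad, C)
        x = x.reshape(B * ((T + pad) // w), w, C)              # (B*n_win, w, C)
        pose_feats = self.pose_encoder(x)
        pose_feats = pose_feats.reshape(B, T + pad, C)[:, :T]  # (B, T, C)
        pose = self.pose_head(pose_feats).view(B, T, -1, 3)    # 3D joints per frame
        # Action encoder attends over the full clip of pose-aware features.
        clip_feats = self.action_encoder(pose_feats)
        action_logits = self.action_head(clip_feats.mean(dim=1))
        return pose, action_logits


# Example usage with hypothetical shapes: a 64-frame clip of 512-D features.
model = CascadedTemporalTransformer()
feats = torch.randn(2, 64, 512)
pose, action = model(feats)   # pose: (2, 64, 21, 3); action: (2, 45)
```

The key point of the cascade is the mismatch in temporal granularity: the pose encoder only needs a short window around each frame, while the action encoder consumes the resulting per-frame features over a much longer span.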
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Wen, Yilin | - |
| dc.contributor.author | Pan, Hao | - |
| dc.contributor.author | Yang, Lei | - |
| dc.contributor.author | Pan, Jia | - |
| dc.contributor.author | Komura, Taku | - |
| dc.contributor.author | Wang, Wenping | - |
| dc.date.accessioned | 2023-10-06T08:39:34Z | - |
| dc.date.available | 2023-10-06T08:39:34Z | - |
| dc.date.issued | 2023-06-18 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/333846 | - |
| dc.description.abstract | Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework to exploit temporal information for robust estimation. Noticing the different temporal granularity of and the semantic correlation between hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation, and the latter aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices. | - |
| dc.language | eng | - |
| dc.relation.ispartof | The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) (18/06/2023-22/06/2023, Vancouver) | - |
| dc.title | Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos | - |
| dc.type | Conference_Paper | - |
| dc.identifier.doi | 10.1109/CVPR52729.2023.02035 | - |
| dc.identifier.isi | WOS:001062531305056 | - |
