Conference Paper: MixSynthFormer: A Transformer Encoder-like Structure with Mixed Synthetic Self-attention for Efficient Human Pose Estimation

Title: MixSynthFormer: A Transformer Encoder-like Structure with Mixed Synthetic Self-attention for Efficient Human Pose Estimation
Authors: Sun, Yuran; Dougherty, Alan William; Zhang, Zhuoying; Choi, Yi King; Wu, Chuan
Issue Date: 2-Oct-2023
Publisher: IEEE
Abstract

Human pose estimation in videos has wide-ranging practical applications across various fields, many of which require fast inference on resource-scarce devices, necessitating the development of efficient and accurate algorithms. Previous works have demonstrated the feasibility of exploiting motion continuity to conduct pose estimation using sparsely sampled frames with transformer-based models. However, these methods only consider the temporal relation while neglecting spatial attention, and the complexity of dot product self-attention calculations in transformers is quadratically proportional to the embedding size. To address these limitations, we propose MixSynthFormer, a transformer encoder-like model with MLP-based mixed synthetic attention. By mixing synthesized spatial and temporal attentions, our model incorporates inter-joint and inter-frame importance and can accurately estimate human poses in an entire video sequence from sparsely sampled frames. Additionally, the flexible design of our model makes it versatile for other motion synthesis tasks. Our extensive experiments on 2D/3D pose estimation, body mesh recovery, and motion prediction validate the effectiveness and efficiency of MixSynthFormer.
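To make the idea of "MLP-based mixed synthetic attention" concrete, the following is a minimal PyTorch-style sketch. It is not the authors' implementation: the module name, the mixing weight alpha, and the use of mean-pooling before the synthesizing MLPs are all illustrative assumptions. It only shows the general pattern the abstract describes: attention maps over frames (temporal) and over joints (spatial) are produced by small MLPs rather than dot products, then mixed and applied to the token features of the sparsely sampled frames.

```python
# Hypothetical sketch of MLP-based mixed synthetic attention.
# Names (MixedSyntheticAttention, alpha, hidden sizes) are assumptions,
# not the paper's actual architecture.
import torch
import torch.nn as nn


class MixedSyntheticAttention(nn.Module):
    """Synthesizes temporal and spatial attention maps with small MLPs
    (no dot-product attention) and mixes them with a learnable weight."""

    def __init__(self, num_frames: int, num_joints: int, dim: int, hidden: int = 64):
        super().__init__()
        # MLP producing a (num_frames x num_frames) temporal attention map
        self.temporal_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, num_frames)
        )
        # MLP producing a (num_joints x num_joints) spatial attention map
        self.spatial_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, num_joints)
        )
        self.alpha = nn.Parameter(torch.tensor(0.5))  # mixing weight (assumed)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim) features of the sampled frames
        temp_attn = self.temporal_mlp(x.mean(dim=2)).softmax(dim=-1)   # (b, t, t)
        temp_out = torch.einsum("bts,bsjd->btjd", temp_attn, x)        # attend over frames

        spat_attn = self.spatial_mlp(x.mean(dim=1)).softmax(dim=-1)    # (b, j, j)
        spat_out = torch.einsum("bjk,btkd->btjd", spat_attn, x)        # attend over joints

        # Mix the two synthesized attentions and project back to dim
        mixed = self.alpha * temp_out + (1 - self.alpha) * spat_out
        return self.proj(mixed)


if __name__ == "__main__":
    x = torch.randn(2, 8, 17, 32)  # 2 clips, 8 sampled frames, 17 joints, 32-dim tokens
    out = MixedSyntheticAttention(num_frames=8, num_joints=17, dim=32)(x)
    print(out.shape)  # torch.Size([2, 8, 17, 32])
```

Because the attention maps come from per-token MLPs rather than query-key dot products, their cost scales with the (fixed) number of frames and joints instead of quadratically with the embedding size, which is the efficiency argument made in the abstract.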


Persistent Identifier: http://hdl.handle.net/10722/337947

 

DC Field | Value | Language
dc.contributor.author | Sun, Yuran | -
dc.contributor.author | Dougherty, Alan William | -
dc.contributor.author | Zhang, Zhuoying | -
dc.contributor.author | Choi, Yi King | -
dc.contributor.author | Wu, Chuan | -
dc.date.accessioned | 2024-03-11T10:25:07Z | -
dc.date.available | 2024-03-11T10:25:07Z | -
dc.date.issued | 2023-10-02 | -
dc.identifier.uri | http://hdl.handle.net/10722/337947 | -
dc.description.abstract | Human pose estimation in videos has wide-ranging practical applications across various fields, many of which require fast inference on resource-scarce devices, necessitating the development of efficient and accurate algorithms. Previous works have demonstrated the feasibility of exploiting motion continuity to conduct pose estimation using sparsely sampled frames with transformer-based models. However, these methods only consider the temporal relation while neglecting spatial attention, and the complexity of dot product self-attention calculations in transformers is quadratically proportional to the embedding size. To address these limitations, we propose MixSynthFormer, a transformer encoder-like model with MLP-based mixed synthetic attention. By mixing synthesized spatial and temporal attentions, our model incorporates inter-joint and inter-frame importance and can accurately estimate human poses in an entire video sequence from sparsely sampled frames. Additionally, the flexible design of our model makes it versatile for other motion synthesis tasks. Our extensive experiments on 2D/3D pose estimation, body mesh recovery, and motion prediction validate the effectiveness and efficiency of MixSynthFormer. | -
dc.language | eng | -
dc.publisher | IEEE | -
dc.relation.ispartof | IEEE International Conference on Computer Vision 2023 (02/10/2023-06/10/2023, Paris) | -
dc.title | MixSynthFormer: A Transformer Encoder-like Structure with Mixed Synthetic Self-attention for Efficient Human Pose Estimation | -
dc.type | Conference_Paper | -
