Article: BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer
Title | BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer |
---|---|
Authors | Pang, Kunkun; Qin, Dafei; Fan, Yingruo; Habekost, Julian; Shiratori, Takaaki; Yamagishi, Junichi; Komura, Taku |
Keywords | deep learning; motion generation; transformer |
Issue Date | 26-Jul-2023 |
Publisher | Association for Computing Machinery (ACM) |
Citation | ACM Transactions on Graphics, 2023, v. 42, n. 4 |
Abstract | Automatic gesture synthesis from speech is a topic that has attracted researchers for applications in remote communication, video games and Metaverse. Learning the mapping between speech and 3D full-body gestures is difficult due to the stochastic nature of the problem and the lack of a rich cross-modal dataset that is needed for training. In this paper, we propose a novel transformer-based framework for automatic 3D body gesture synthesis from speech. To learn the stochastic nature of the body gesture during speech, we propose a variational transformer to effectively model a probabilistic distribution over gestures, which can produce diverse gestures during inference. Furthermore, we introduce a mode positional embedding layer to capture the different motion speeds in different speaking modes. To cope with the scarcity of data, we design an intra-modal pre-training scheme that can learn the complex mapping between the speech and the 3D gesture from a limited amount of data. Our system is trained with either the Trinity speech-gesture dataset or the Talking With Hands 16.2M dataset. The results show that our system can produce more realistic, appropriate, and diverse body gestures compared to existing state-of-the-art approaches. |
Persistent Identifier | http://hdl.handle.net/10722/331610 |
ISSN | 0730-0301 (2023 Impact Factor: 7.8; 2023 SCImago Journal Rankings: 7.766) |
ISI Accession Number ID | WOS:001044671300009 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Pang, Kunkun | - |
dc.contributor.author | Qin, Dafei | - |
dc.contributor.author | Fan, Yingruo | - |
dc.contributor.author | Habekost, Julian | - |
dc.contributor.author | Shiratori, Takaaki | - |
dc.contributor.author | Yamagishi, Junichi | - |
dc.contributor.author | Komura, Taku | - |
dc.date.accessioned | 2023-09-21T06:57:21Z | - |
dc.date.available | 2023-09-21T06:57:21Z | - |
dc.date.issued | 2023-07-26 | - |
dc.identifier.citation | ACM Transactions on Graphics, 2023, v. 42, n. 4 | - |
dc.identifier.issn | 0730-0301 | - |
dc.identifier.uri | http://hdl.handle.net/10722/331610 | - |
dc.description.abstract | Automatic gesture synthesis from speech is a topic that has attracted researchers for applications in remote communication, video games and Metaverse. Learning the mapping between speech and 3D full-body gestures is difficult due to the stochastic nature of the problem and the lack of a rich cross-modal dataset that is needed for training. In this paper, we propose a novel transformer-based framework for automatic 3D body gesture synthesis from speech. To learn the stochastic nature of the body gesture during speech, we propose a variational transformer to effectively model a probabilistic distribution over gestures, which can produce diverse gestures during inference. Furthermore, we introduce a mode positional embedding layer to capture the different motion speeds in different speaking modes. To cope with the scarcity of data, we design an intra-modal pre-training scheme that can learn the complex mapping between the speech and the 3D gesture from a limited amount of data. Our system is trained with either the Trinity speech-gesture dataset or the Talking With Hands 16.2M dataset. The results show that our system can produce more realistic, appropriate, and diverse body gestures compared to existing state-of-the-art approaches. | -
dc.language | eng | - |
dc.publisher | Association for Computing Machinery (ACM) | - |
dc.relation.ispartof | ACM Transactions on Graphics | - |
dc.subject | deep learning | - |
dc.subject | motion generation | - |
dc.subject | transformer | - |
dc.title | BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer | - |
dc.type | Article | - |
dc.identifier.doi | 10.1145/3592456 | - |
dc.identifier.scopus | eid_2-s2.0-85166345445 | - |
dc.identifier.volume | 42 | - |
dc.identifier.issue | 4 | - |
dc.identifier.eissn | 1557-7368 | - |
dc.identifier.isi | WOS:001044671300009 | - |
dc.identifier.issnl | 0730-0301 | - |
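
The abstract describes two architectural ideas: a variational transformer that models a probabilistic distribution over gestures conditioned on speech, and a mode positional embedding layer that accounts for different motion speeds in different speaking modes. The sketch below is a minimal, hypothetical PyTorch illustration of how those two pieces could be wired together; all module names, feature dimensions, and layer counts are assumptions for illustration and do not reproduce the authors' implementation or released code.

```python
# Illustrative sketch only: assumed PyTorch wiring of a variational gesture
# transformer with a per-mode positional embedding. Names and sizes are hypothetical.
import torch
import torch.nn as nn

class ModePositionalEmbedding(nn.Module):
    """Learned positional embedding selected per speaking mode, so each mode
    can encode a different motion speed (assumed design, per the abstract)."""
    def __init__(self, num_modes: int, max_len: int, dim: int):
        super().__init__()
        self.table = nn.Parameter(torch.zeros(num_modes, max_len, dim))

    def forward(self, x: torch.Tensor, mode: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); mode: (batch,) integer speaking-mode labels
        pos = self.table[mode, : x.size(1)]           # (batch, time, dim)
        return x + pos

class VariationalGestureTransformer(nn.Module):
    """Minimal variational transformer: encoded speech conditions a Gaussian
    latent; a transformer decoder maps the sampled latent to gesture frames."""
    def __init__(self, speech_dim=128, pose_dim=165, dim=256, num_modes=2, max_len=240):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, dim)
        self.mode_pos = ModePositionalEmbedding(num_modes, max_len, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.query = nn.Parameter(torch.zeros(1, max_len, dim))
        self.to_pose = nn.Linear(dim, pose_dim)

    def forward(self, speech: torch.Tensor, mode: torch.Tensor):
        # speech: (batch, time, speech_dim); mode: (batch,)
        h = self.mode_pos(self.speech_proj(speech), mode)
        memory = self.encoder(h)
        mu, logvar = self.to_mu(memory), self.to_logvar(memory)
        # Reparameterisation trick: sampling here yields diverse gestures at inference.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        q = self.query[:, : speech.size(1)].expand(speech.size(0), -1, -1)
        out = self.decoder(q, z)                       # decode gestures from the latent
        return self.to_pose(out), mu, logvar           # poses plus terms for a KL loss

# Usage example: random features stand in for real speech input.
model = VariationalGestureTransformer()
speech = torch.randn(2, 120, 128)     # 2 clips, 120 frames of speech features
mode = torch.tensor([0, 1])           # per-clip speaking-mode labels
poses, mu, logvar = model(speech, mode)
print(poses.shape)                    # torch.Size([2, 120, 165])
```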