Conference Paper: Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Title: Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
Authors: Zhu, L; Liu, X; Liu, X; Qian, R; Liu, Z; Yu, L
Issue Date: 22-Aug-2023
Abstract

Animating virtual avatars to make co-speech gestures facilitates various applications in human-machine interaction. The existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from notorious mode collapse and unstable training, thus making it difficult to learn accurate audio-gesture joint distributions. In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation. Specifically, we first establish the diffusion-conditional generation process on clips of skeleton sequences and audio to enable the whole framework. Then, a novel Diffusion Audio-Gesture Transformer is devised to better attend to the information from multiple modalities and model the long-term temporal dependency. Moreover, to eliminate temporal inconsistency, we propose an effective Diffusion Gesture Stabilizer with an annealed noise sampling strategy. Benefiting from the architectural advantages of diffusion models, we further incorporate implicit classifier-free guidance to trade off between diversity and gesture quality. Extensive experiments demonstrate that DiffGesture achieves state-of-the-art performance, which renders coherent gestures with better mode coverage and stronger audio correlations. Code is available at https://github.com/Advocate99/DiffGesture.
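
The abstract describes a conditional diffusion pipeline: a transformer denoiser attends jointly to noisy pose-sequence clips and audio features, and implicit classifier-free guidance blends conditional and unconditional noise predictions at sampling time to trade diversity against audio fidelity. The sketch below illustrates only that guidance step under stated assumptions; it is not the released DiffGesture code, and every module name, tensor shape, and hyperparameter (GestureDenoiser, pose_dim=36, guidance_scale=1.5) is an illustrative placeholder. See the linked repository for the actual implementation.

```python
# Minimal sketch (not the authors' code): classifier-free guidance for an
# audio-conditioned diffusion denoiser over pose sequences. All names,
# shapes, and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class GestureDenoiser(nn.Module):
    """Toy stand-in for an audio-gesture transformer denoiser: predicts the
    noise added to a pose clip, given the diffusion timestep and
    (optionally dropped) audio features."""

    def __init__(self, pose_dim=36, audio_dim=64, hidden=128, max_steps=1000):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.time_embed = nn.Embedding(max_steps, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, noisy_poses, t, audio_feat=None):
        # noisy_poses: (B, T, pose_dim); t: (B,); audio_feat: (B, T, audio_dim) or None
        h = self.pose_proj(noisy_poses) + self.time_embed(t)[:, None, :]
        if audio_feat is not None:            # conditional branch: add audio tokens
            h = h + self.audio_proj(audio_feat)
        return self.out(self.backbone(h))     # predicted noise, (B, T, pose_dim)


@torch.no_grad()
def guided_noise(model, x_t, t, audio_feat, guidance_scale=1.5):
    """Classifier-free guidance: combine conditional and unconditional
    noise estimates; larger guidance_scale favors audio correlation over
    sample diversity."""
    eps_cond = model(x_t, t, audio_feat)      # audio-conditioned prediction
    eps_uncond = model(x_t, t, None)          # unconditional prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


# Usage example with random tensors (batch of 2, 34-frame clips).
model = GestureDenoiser()
x_t = torch.randn(2, 34, 36)                  # noisy pose clip at step t
audio = torch.randn(2, 34, 64)                # per-frame audio features
t = torch.full((2,), 500, dtype=torch.long)   # current diffusion timestep
eps = guided_noise(model, x_t, t, audio)
print(eps.shape)                              # torch.Size([2, 34, 36])
```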


Persistent Identifier: http://hdl.handle.net/10722/333824

 

DC Field: Value
dc.contributor.author: Zhu, L
dc.contributor.author: Liu, X
dc.contributor.author: Liu, X
dc.contributor.author: Qian, R
dc.contributor.author: Liu, Z
dc.contributor.author: Yu, L
dc.date.accessioned: 2023-10-06T08:39:23Z
dc.date.available: 2023-10-06T08:39:23Z
dc.date.issued: 2023-08-22
dc.identifier.uri: http://hdl.handle.net/10722/333824
dc.language: eng
dc.relation.ispartof: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (17/06/2023-24/06/2023, Vancouver, BC, Canada)
dc.title: Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
dc.type: Conference_Paper
dc.identifier.doi: 10.1109/CVPR52729.2023.01016
dc.identifier.spage: 10544
dc.identifier.epage: 10553
