Article: LEO: Generative Latent Image Animator for Human Video Synthesis

Title: LEO: Generative Latent Image Animator for Human Video Synthesis
Authors: Wang, Yaohui; Ma, Xin; Chen, Xinyuan; Chen, Cunjian; Dantcheva, Antitza; Dai, Bo; Qiao, Yu
Keywords: Deep generative models; Diffusion models; Human analysis; Video generation
Issue Date: 2024
Citation: International Journal of Computer Vision, 2024
Abstract: Spatio-temporal coherency is a major challenge in synthesizing high quality videos, particularly in synthesizing human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, a separation of motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break the spatio-temporal coherency. Motivated by this, we here propose LEO, a novel framework for human video synthesis, placing emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps, and synthesizes video frames in a warp-and-inpaint manner. LMDM learns to capture motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the datasets TaichiHD, FaceForensics and CelebV-HQ. In addition, the effective disentanglement of appearance and motion in LEO allows for two additional tasks, namely infinite-length human video synthesis, as well as content-preserving video editing. Project page: https://wyhsirius.github.io/LEO-project/.
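The warp-and-inpaint generation described in the abstract can be illustrated with a minimal sketch: a starting frame is backward-warped by each flow map in a synthesized sequence, and pixels left without a valid source (occlusions, out-of-frame lookups) are then inpainted. The NumPy functions below (`warp`, `inpaint`, `synthesize`) are hypothetical stand-ins for LEO's learned flow-based animator and inpainting network, not the paper's implementation:

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp a grayscale frame (H, W) by a dense flow map (H, W, 2).

    Each target pixel (y, x) samples the source pixel at
    (y + flow[y, x, 0], x + flow[y, x, 1]) with nearest-neighbour lookup.
    Out-of-bounds samples are marked as holes (NaN).
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.rint(ys + flow[..., 0]).astype(int)
    src_x = np.rint(xs + flow[..., 1]).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.full_like(frame, np.nan, dtype=float)
    out[valid] = frame[src_y[valid], src_x[valid]]
    return out

def inpaint(frame):
    """Fill holes (NaN) with the mean of the visible pixels --
    a toy stand-in for a learned inpainting network."""
    holes = np.isnan(frame)
    filled = frame.copy()
    filled[holes] = np.nanmean(frame)  # mean ignores the NaN holes
    return filled

def synthesize(first_frame, flows):
    """Warp-and-inpaint the same source frame once per flow map,
    yielding one output frame per motion step."""
    return [inpaint(warp(first_frame, f)) for f in flows]
```

Here `inpaint` simply fills holes with the mean of the visible pixels; in LEO the flow maps themselves are decoded from motion codes sampled by the LMDM, and the inpainting is performed by the learned image animator.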
Persistent Identifier: http://hdl.handle.net/10722/352476
ISSN: 0920-5691
2023 Impact Factor: 11.6
2023 SCImago Journal Rankings: 6.668

 

DC Field: Value
dc.contributor.author: Wang, Yaohui
dc.contributor.author: Ma, Xin
dc.contributor.author: Chen, Xinyuan
dc.contributor.author: Chen, Cunjian
dc.contributor.author: Dantcheva, Antitza
dc.contributor.author: Dai, Bo
dc.contributor.author: Qiao, Yu
dc.date.accessioned: 2024-12-16T03:59:18Z
dc.date.available: 2024-12-16T03:59:18Z
dc.date.issued: 2024
dc.identifier.citation: International Journal of Computer Vision, 2024
dc.identifier.issn: 0920-5691
dc.identifier.uri: http://hdl.handle.net/10722/352476
dc.description.abstract: (full abstract as given above)
dc.language: eng
dc.relation.ispartof: International Journal of Computer Vision
dc.subject: Deep generative models
dc.subject: Diffusion models
dc.subject: Human analysis
dc.subject: Video generation
dc.title: LEO: Generative Latent Image Animator for Human Video Synthesis
dc.type: Article
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1007/s11263-024-02231-3
dc.identifier.scopus: eid_2-s2.0-85204903721
dc.identifier.eissn: 1573-1405
