Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing

Tong, Haonan; Li, Haopeng; Du, Hongyang; Yang, Zhaohui; Yin, Changchuan; Niyato, Dusit

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1109/LWC.2024.3488859
Scopus: eid_2-s2.0-85208406401
WOS: WOS:001395714200025

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Electrical & Electronic Engineering: Journal/Magazine Articles

Article: Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing

Title	Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing
Authors	Tong, Haonan; Li, Haopeng; Du, Hongyang; Yang, Zhaohui; Yin, Changchuan; Niyato, Dusit
Keywords	generative adversarial network; Multimodal semantic communication; video generation
Issue Date	2024
Citation	IEEE Wireless Communications Letters, 2024. DOI: http://dx.doi.org/10.1109/LWC.2024.3488859
Abstract	This paper studies an efficient multimodal data communication scheme for video conferencing. In the considered system, a speaker gives a talk to an audience, and both talking-head video and audio are transmitted. Because the speaker does not frequently change posture, while high-fidelity transmission of audio (speech and music) is required, much of the video data is redundant and can be removed by generating the video from the audio. To this end, we propose a wave-to-video (Wav2Vid) system, an efficient video transmission framework that reduces transmitted data by generating talking-head video from audio. In particular, full-duration audio and short-duration video data are synchronously transmitted through a wireless channel, with neural networks (NNs) extracting and encoding audio and video semantics. The receiver then combines the decoded audio and video data and uses a generative adversarial network (GAN) based model to generate the lip-movement video of the speaker. Simulation results show that the proposed Wav2Vid system can reduce the amount of transmitted data by up to 83% while maintaining the perceptual quality of the generated conferencing video.
Persistent Identifier	http://hdl.handle.net/10722/353228
ISSN	2162-2337
2023 Impact Factor	4.6
2023 SCImago Journal Rankings	2.872
ISI Accession Number ID	WOS:001395714200025
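
The abstract's core idea, sending full-duration audio but only a short stretch of video, can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not the authors' method: the function names, bitrates, and durations are hypothetical values chosen for illustration, and it ignores the semantic encoding, channel effects, and GAN-based generation described in the paper. It only shows why the saving grows as the transmitted video clip gets shorter relative to the talk.

```python
def transmitted_bits(audio_kbps, video_kbps, talk_s, video_s):
    """Bits sent when audio covers the whole talk and video only video_s seconds.

    All parameters are illustrative assumptions, not values from the paper.
    """
    return 1000 * (audio_kbps * talk_s + video_kbps * video_s)


def reduction(audio_kbps, video_kbps, talk_s, video_s):
    """Fraction of data saved relative to sending full-duration audio and video."""
    baseline = transmitted_bits(audio_kbps, video_kbps, talk_s, talk_s)
    reduced = transmitted_bits(audio_kbps, video_kbps, talk_s, video_s)
    return 1 - reduced / baseline


if __name__ == "__main__":
    # Hypothetical setting: 64 kbps audio, 500 kbps video, 60 s talk, 5 s of video sent.
    saving = reduction(audio_kbps=64, video_kbps=500, talk_s=60, video_s=5)
    print(f"fraction of data saved: {saving:.3f}")
```

With these assumed numbers the audio stream sets a floor on what must be sent, so the saving can never reach 100% however short the video clip is.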

DC Field	Value	Language
dc.contributor.author	Tong, Haonan	-
dc.contributor.author	Li, Haopeng	-
dc.contributor.author	Du, Hongyang	-
dc.contributor.author	Yang, Zhaohui	-
dc.contributor.author	Yin, Changchuan	-
dc.contributor.author	Niyato, Dusit	-
dc.date.accessioned	2025-01-13T03:02:44Z	-
dc.date.available	2025-01-13T03:02:44Z	-
dc.date.issued	2024	-
dc.identifier.citation	IEEE Wireless Communications Letters, 2024	-
dc.identifier.issn	2162-2337	-
dc.identifier.uri	http://hdl.handle.net/10722/353228	-
dc.description.abstract	This paper studies an efficient multimodal data communication scheme for video conferencing. In the considered system, a speaker gives a talk to an audience, and both talking-head video and audio are transmitted. Because the speaker does not frequently change posture, while high-fidelity transmission of audio (speech and music) is required, much of the video data is redundant and can be removed by generating the video from the audio. To this end, we propose a wave-to-video (Wav2Vid) system, an efficient video transmission framework that reduces transmitted data by generating talking-head video from audio. In particular, full-duration audio and short-duration video data are synchronously transmitted through a wireless channel, with neural networks (NNs) extracting and encoding audio and video semantics. The receiver then combines the decoded audio and video data and uses a generative adversarial network (GAN) based model to generate the lip-movement video of the speaker. Simulation results show that the proposed Wav2Vid system can reduce the amount of transmitted data by up to 83% while maintaining the perceptual quality of the generated conferencing video.	-
dc.language	eng	-
dc.relation.ispartof	IEEE Wireless Communications Letters	-
dc.subject	generative adversarial network	-
dc.subject	Multimodal semantic communication	-
dc.subject	video generation	-
dc.title	Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing	-
dc.type	Article	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1109/LWC.2024.3488859	-
dc.identifier.scopus	eid_2-s2.0-85208406401	-
dc.identifier.eissn	2162-2345	-
dc.identifier.isi	WOS:001395714200025	-
