
Conference Paper: Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Title: Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
Authors: Zeng, Ziyun; Ge, Yuying; Liu, Xihui; Chen, Bin; Luo, Ping; Xia, Shu-Tao; Ge, Yixiao
Issue Date: 17-Jun-2023
Abstract

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that it is due to the fact that they only capture pixel-level knowledge rather than spatiotemporal semantics, which hinders further progress in video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., leveraging the natural transcribed speech knowledge to provide noisy but useful semantics over time. Our method enforces the vision model to contextualize what is happening over time so that it can re-organize the narrative transcripts, and can seamlessly apply to large-scale uncurated video data in the real world. Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing. The code is available at https://github.com/TencentARC/TVTS.
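
The abstract's central mechanism, a head that re-orders shuffled ASR transcript segments by cross-attending to video representations, can be sketched as follows. This is a minimal illustration under assumptions made here, not the authors' implementation: the class name TranscriptSortingHead, the single cross-attention layer, the feature dimensions, and the toy random inputs are all illustrative choices; see https://github.com/TencentARC/TVTS for the released code.

# Minimal sketch of a TVTS-style transcript-sorting pretext head (assumed
# shapes and architecture; not taken from the paper's released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TranscriptSortingHead(nn.Module):
    """Predicts the original order of K shuffled transcript segments
    by letting their embeddings attend to video features."""

    def __init__(self, dim: int = 512, num_segments: int = 4, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Each segment is classified into one of `num_segments` positions.
        self.position_classifier = nn.Linear(dim, num_segments)

    def forward(self, text_emb: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:    (B, K, D) embeddings of shuffled transcript segments
        # video_feats: (B, T, D) spatiotemporal tokens from the vision encoder
        attended, _ = self.cross_attn(query=text_emb, key=video_feats, value=video_feats)
        attended = self.norm(attended + text_emb)
        return self.position_classifier(attended)  # (B, K, K) position logits


if __name__ == "__main__":
    B, K, T, D = 2, 4, 16, 512
    head = TranscriptSortingHead(dim=D, num_segments=K)

    # Toy stand-ins for real encoder outputs.
    video_feats = torch.randn(B, T, D)        # video tokens over time
    text_emb_ordered = torch.randn(B, K, D)   # transcript segments in true order

    # Shuffle segments and keep their original positions as labels.
    perm = torch.stack([torch.randperm(K) for _ in range(B)])  # (B, K)
    text_emb_shuffled = torch.gather(
        text_emb_ordered, 1, perm.unsqueeze(-1).expand(-1, -1, D)
    )

    logits = head(text_emb_shuffled, video_feats)              # (B, K, K)
    loss = F.cross_entropy(logits.reshape(B * K, K), perm.reshape(B * K))
    print("sorting loss:", loss.item())

For the linear-probing evaluation mentioned in the abstract, the pre-trained vision encoder would be kept frozen and only a linear classifier trained on top of its features.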


Persistent Identifier: http://hdl.handle.net/10722/337771

 

DC Field | Value | Language
dc.contributor.author | Zeng, Ziyun | -
dc.contributor.author | Ge, Yuying | -
dc.contributor.author | Liu, Xihui | -
dc.contributor.author | Chen, Bin | -
dc.contributor.author | Luo, Ping | -
dc.contributor.author | Xia, Shu-Tao | -
dc.contributor.author | Ge, Yixiao | -
dc.date.accessioned | 2024-03-11T10:23:46Z | -
dc.date.available | 2024-03-11T10:23:46Z | -
dc.date.issued | 2023-06-17 | -
dc.identifier.uri | http://hdl.handle.net/10722/337771 | -
dc.description.abstract | Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that it is due to the fact that they only capture pixel-level knowledge rather than spatiotemporal semantics, which hinders further progress in video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., leveraging the natural transcribed speech knowledge to provide noisy but useful semantics over time. Our method enforces the vision model to contextualize what is happening over time so that it can re-organize the narrative transcripts, and can seamlessly apply to large-scale uncurated video data in the real world. Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing. The code is available at https://github.com/TencentARC/TVTS. | -
dc.language | eng | -
dc.relation.ispartof | 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (17/06/2023-24/06/2023, Vancouver, BC, Canada) | -
dc.title | Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | -
dc.type | Conference_Paper | -
