
Conference Paper: Bridging video-text retrieval with multiple choice questions

Title: Bridging video-text retrieval with multiple choice questions
Authors: Ge, Y; Ge, Y; Liu, D; Li, D; Shan, Y; Qie, X; Luo, P
Issue Date: 2022
Publisher: IEEE
Citation: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Hybrid), New Orleans, United States, June 21, 2022, p. 16167-16176
Abstract: Pre-training a model to learn transferable video-text representations for retrieval has attracted considerable attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to let videos interact with texts, but this is inefficient since every text-video pair must be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high retrieval efficiency via a novel pretext task, dubbed Multiple Choice Questions (MCQ), in which a parametric module, BridgeFormer, is trained to answer 'questions' constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the semantic associations between local video-text features can be properly established. BridgeFormer can be removed for downstream retrieval, yielding an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets under different experimental setups (i.e., zero-shot and fine-tuning), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly surpasses its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
Description: Oral presentation
Persistent Identifier: http://hdl.handle.net/10722/315552
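The efficiency argument in the abstract is that two separate encoders let all video embeddings be pre-computed once, so text-to-video retrieval reduces to similarity lookups instead of one joint forward pass per text-video pair. Below is a minimal sketch of that dual-encoder retrieval step, not the paper's actual model: the toy vectors and the `l2_normalize` and `dot` helpers are hypothetical stand-ins for learned video/text encoder outputs.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products equal cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical pre-computed embeddings (stand-ins for the outputs of the
# paper's separate video and text encoders; real embeddings are learned).
video_embs = [l2_normalize(v) for v in [
    [0.9, 0.1, 0.0],   # video 0
    [0.1, 0.8, 0.2],   # video 1
    [0.0, 0.2, 0.9],   # video 2
]]
text_emb = l2_normalize([1.0, 0.0, 0.1])  # query caption embedding

# Text-to-video retrieval: one dot product against each pre-computed
# video embedding, with no joint forward pass per text-video pair.
scores = [dot(text_emb, v) for v in video_embs]
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # prints 0: video 0 is most similar to the query
```

A joint encoder would instead have to re-encode every (text, video) pair at query time, which is why the abstract calls that line of work inefficient for retrieval.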

 

DC Field | Value | Language
dc.contributor.author | Ge, Y | -
dc.contributor.author | Ge, Y | -
dc.contributor.author | Liu, D | -
dc.contributor.author | Li, D | -
dc.contributor.author | Shan, Y | -
dc.contributor.author | Qie, X | -
dc.contributor.author | Luo, P | -
dc.date.accessioned | 2022-08-19T09:00:01Z | -
dc.date.available | 2022-08-19T09:00:01Z | -
dc.date.issued | 2022 | -
dc.identifier.citation | IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Hybrid), New Orleans, United States, June 21, 2022, p. 16167-16176 | -
dc.identifier.uri | http://hdl.handle.net/10722/315552 | -
dc.description | Oral presentation | -
dc.description.abstract | Pre-training a model to learn transferable video-text representations for retrieval has attracted considerable attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to let videos interact with texts, but this is inefficient since every text-video pair must be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high retrieval efficiency via a novel pretext task, dubbed Multiple Choice Questions (MCQ), in which a parametric module, BridgeFormer, is trained to answer 'questions' constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the semantic associations between local video-text features can be properly established. BridgeFormer can be removed for downstream retrieval, yielding an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets under different experimental setups (i.e., zero-shot and fine-tuning), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly surpasses its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation. | -
dc.language | eng | -
dc.publisher | IEEE. | -
dc.relation.ispartof | Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022 | -
dc.rights | Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Copyright © IEEE. | -
dc.title | Bridging video-text retrieval with multiple choice questions | -
dc.type | Conference_Paper | -
dc.identifier.email | Luo, P: pluo@hku.hk | -
dc.identifier.authority | Luo, P=rp02575 | -
dc.identifier.hkuros | 335566 | -
dc.identifier.spage | 16167 | -
dc.identifier.epage | 16176 | -
dc.publisher.place | United States | -
