Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension

Chen, Z; Wang, P; Ma, L; Wong, KKY; Wu, Q

Publisher Website: 10.1109/CVPR42600.2020.01010
Scopus: eid_2-s2.0-85094165012
Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension

Title	Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension
Authors	Chen, Z; Wang, P; Ma, L; Wong, KKY; Wu, Q
Issue Date	2020
Publisher	IEEE Computer Society. The Proceedings' web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000147
Citation	IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, USA, 14-19 June 2020, p. 10086-10095. DOI: http://dx.doi.org/10.1109/CVPR42600.2020.01010
Abstract	Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression. It requires joint reasoning over the textual and visual domains to solve the problem. Some popular referring expression datasets, however, fail to provide an ideal test bed for evaluating the reasoning ability of the models, mainly because 1) their expressions typically describe only some simple distinctive properties of the object and 2) their images contain limited distracting information. To bridge the gap, we propose a new dataset for visual reasoning in context of referring expression comprehension with two main features. First, we design a novel expression engine rendering various reasoning logics that can be flexibly combined with rich visual properties to generate expressions with varying compositionality. Second, to better exploit the full reasoning chain embodied in an expression, we propose a new test setting by adding additional distracting images containing objects sharing similar properties with the referent, thus minimising the success rate of reasoning-free cross-domain alignment. We evaluate several state-of-the-art REF models, but find none of them can achieve promising performance. A proposed modular hard mining strategy performs the best but still leaves substantial room for improvement. We hope this new dataset and task can serve as a benchmark for deeper visual reasoning analysis and foster the research on referring expression comprehension.
Description	Session: Poster 3.1 - Recognition (Detection, Categorization); Video Analysis and Understanding; Vision + Language. Poster no. 39; Paper ID: 206. CVPR 2020 was held virtually due to COVID-19.
Persistent Identifier	http://hdl.handle.net/10722/281712
ISSN	1063-6919 (2023 SCImago Journal Rankings: 10.331)

DC Field	Value	Language
dc.contributor.author	Chen, Z	-
dc.contributor.author	Wang, P	-
dc.contributor.author	Ma, L	-
dc.contributor.author	Wong, KKY	-
dc.contributor.author	Wu, Q	-
dc.date.accessioned	2020-03-22T04:18:37Z	-
dc.date.available	2020-03-22T04:18:37Z	-
dc.date.issued	2020	-
dc.identifier.citation	IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, USA, 14-19 June 2020, p. 10086-10095	-
dc.identifier.issn	1063-6919	-
dc.identifier.uri	http://hdl.handle.net/10722/281712	-
dc.description	Session: Poster 3.1 — Recognition (Detection, Categorization); Video Analysis and Understanding; Vision + Language - Poster no. 39 ; Paper ID: 206	-
dc.description	CVPR 2020 was held virtually due to COVID-19	-
dc.description.abstract	Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression. It requires joint reasoning over the textual and visual domains to solve the problem. Some popular referring expression datasets, however, fail to provide an ideal test bed for evaluating the reasoning ability of the models, mainly because 1) their expressions typically describe only some simple distinctive properties of the object and 2) their images contain limited distracting information. To bridge the gap, we propose a new dataset for visual reasoning in context of referring expression comprehension with two main features. First, we design a novel expression engine rendering various reasoning logics that can be flexibly combined with rich visual properties to generate expressions with varying compositionality. Second, to better exploit the full reasoning chain embodied in an expression, we propose a new test setting by adding additional distracting images containing objects sharing similar properties with the referent, thus minimising the success rate of reasoning-free cross-domain alignment. We evaluate several state-of-the-art REF models, but find none of them can achieve promising performance. A proposed modular hard mining strategy performs the best but still leaves substantial room for improvement. We hope this new dataset and task can serve as a benchmark for deeper visual reasoning analysis and foster the research on referring expression comprehension.	-
dc.language	eng	-
dc.publisher	IEEE Computer Society. The Proceedings' web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000147	-
dc.relation.ispartof	IEEE International Conference on Computer Vision and Pattern Recognition	-
dc.rights	IEEE Conference on Computer Vision and Pattern Recognition. Proceedings. Copyright © IEEE Computer Society.	-
dc.rights	©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.	-
dc.title	Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension	-
dc.type	Conference_Paper	-
dc.identifier.email	Wong, KKY: kykwong@cs.hku.hk	-
dc.identifier.authority	Wong, KKY=rp01393	-
dc.description.nature	postprint	-
dc.identifier.doi	10.1109/CVPR42600.2020.01010	-
dc.identifier.scopus	eid_2-s2.0-85094165012	-
dc.identifier.hkuros	309424	-
dc.identifier.hkuros	310869	-
dc.identifier.spage	10086	-
dc.identifier.epage	10095	-
dc.publisher.place	United States	-
dc.identifier.issnl	1063-6919	-
