Conference Paper: Improving referring expression grounding with cross-modal attention-guided erasing

Title: Improving referring expression grounding with cross-modal attention-guided erasing
Authors: Liu, Xihui; Wang, Zihao; Shao, Jing; Wang, Xiaogang; Li, Hongsheng
Keywords: Categorization; Recognition: Detection; Retrieval; Vision + Language
Issue Date: 2019
Citation: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019, v. 2019-June, p. 1950-1959
Abstract: Referring expression grounding aims at locating certain objects or persons in an image with a referring expression, where the key challenge is to comprehend and align various types of information from visual and textual domain, such as visual attributes, location and interactions with surrounding regions. Although the attention mechanism has been successfully applied for cross-modal alignments, previous attention models focus on only the most dominant features of both modalities, and neglect the fact that there could be multiple comprehensive textual-visual correspondences between images and referring expressions. To tackle this issue, we design a novel cross-modal attention-guided erasing approach, where we discard the most dominant information from either textual or visual domains to generate difficult training samples online, and to drive the model to discover complementary textual-visual correspondences. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance on three referring expression grounding datasets.
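
The erasing idea described in the abstract can be illustrated with a short sketch. The snippet below is not the authors' released implementation; it is a minimal illustration, assuming generic (batch, n, d) feature tensors and a hypothetical helper name `attention_guided_erase`, of how attention weights could be used to mask out the most-attended word or image region and so produce a harder training sample online.

```python
import torch

def attention_guided_erase(features, attention, erase_prob=0.5):
    """Illustrative sketch of attention-guided erasing (not the authors' code).

    features:  (batch, n, d) word embeddings or region features
    attention: (batch, n)    attention weights over the n tokens/regions
    Returns a copy of `features` where, for a random subset of samples,
    the single most-attended token/region is zeroed out, producing a
    harder training sample that forces the model to rely on the
    remaining, complementary evidence.
    """
    erased = features.clone()
    # Index of the most dominant token/region for each sample.
    top_idx = attention.argmax(dim=1)                      # (batch,)
    # Erase only with some probability so the original samples are still seen.
    do_erase = torch.rand(features.size(0)) < erase_prob   # (batch,)
    for b in torch.nonzero(do_erase, as_tuple=False).flatten():
        erased[b, top_idx[b]] = 0.0                        # zero out the dominant cue
    return erased

# Hypothetical usage during training: erase on either the textual or the
# visual side, then recompute grounding scores on the harder sample.
words = torch.randn(4, 12, 300)                 # (batch, words, embedding dim)
w_attn = torch.softmax(torch.randn(4, 12), dim=1)
hard_words = attention_guided_erase(words, w_attn)
```

Erasing either side (words of the expression or regions of the image) follows the same pattern; the erased samples are mixed into training so the model cannot rely solely on the single most dominant textual-visual correspondence.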
Persistent Identifier: http://hdl.handle.net/10722/316529
ISSN: 1063-6919
2020 SCImago Journal Rankings: 4.658
ISI Accession Number ID: WOS:000529484002012

DC Field: Value
dc.contributor.author: Liu, Xihui
dc.contributor.author: Wang, Zihao
dc.contributor.author: Shao, Jing
dc.contributor.author: Wang, Xiaogang
dc.contributor.author: Li, Hongsheng
dc.date.accessioned: 2022-09-14T11:40:41Z
dc.date.available: 2022-09-14T11:40:41Z
dc.date.issued: 2019
dc.identifier.citation: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019, v. 2019-June, p. 1950-1959
dc.identifier.issn: 1063-6919
dc.identifier.uri: http://hdl.handle.net/10722/316529
dc.description.abstract: Referring expression grounding aims at locating certain objects or persons in an image with a referring expression, where the key challenge is to comprehend and align various types of information from visual and textual domain, such as visual attributes, location and interactions with surrounding regions. Although the attention mechanism has been successfully applied for cross-modal alignments, previous attention models focus on only the most dominant features of both modalities, and neglect the fact that there could be multiple comprehensive textual-visual correspondences between images and referring expressions. To tackle this issue, we design a novel cross-modal attention-guided erasing approach, where we discard the most dominant information from either textual or visual domains to generate difficult training samples online, and to drive the model to discover complementary textual-visual correspondences. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance on three referring expression grounding datasets.
dc.language: eng
dc.relation.ispartof: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
dc.subject: Categorization
dc.subject: Recognition: Detection
dc.subject: Retrieval
dc.subject: Vision + Language
dc.title: Improving referring expression grounding with cross-modal attention-guided erasing
dc.type: Conference_Paper
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1109/CVPR.2019.00205
dc.identifier.scopus: eid_2-s2.0-85074842634
dc.identifier.volume: 2019-June
dc.identifier.spage: 1950
dc.identifier.epage: 1959
dc.identifier.isi: WOS:000529484002012