Article: Relation constraint self-attention for image captioning

Title: Relation constraint self-attention for image captioning
Authors: Ji, Junzhong; Wang, Mingzhan; Zhang, Xiaodan; Lei, Minglong; Qu, Liangqiong
Keywords: Image captioning; Relation constraint self-attention; Scene graph; Transformer
Issue Date: 2022
Citation: Neurocomputing, 2022, v. 501, p. 778-789
Abstract: The self-attention based Transformer has been successfully introduced into the encoder-decoder framework of image captioning, where it excels at modeling the inner relations of the inputs, i.e., image regions or semantic words. However, the relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attentions. Meanwhile, prior relations, e.g., the visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. As a result, the relation learning of self-attention in image captioning is biased, which dilutes the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model that enhances the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors, and builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information into self-attention in the encoder, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of self-attention in the decoder with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of the generated semantic words. Intuitively, RCSA-E endows the model with the ability to decide which regions to omit and which to focus on based on visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of the proposed RCSA.
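
The two constraint mechanisms described in the abstract can be pictured with a short sketch. The code below is a minimal, hypothetical illustration only, not the authors' implementation: it assumes PyTorch-style tensors, a precomputed binary scene-graph relation matrix standing in for the visual relation constraint of RCSA-E, and generic relation embeddings standing in for the semantic relation information of RCSA-D. All function names, shapes, and hyperparameters are invented for illustration.

# Hypothetical sketch of the two ideas in the abstract, not the paper's code:
# RCSA-E masks encoder self-attention with a scene-graph relation matrix;
# RCSA-D appends semantic relation embeddings to the decoder keys and values.
import torch
import torch.nn.functional as F


def rcsa_e_attention(q, k, v, relation_mask):
    """Encoder self-attention over image regions, constrained by a visual relation mask.

    q, k, v:        (batch, regions, dim) region features
    relation_mask:  (batch, regions, regions); 1 where the scene graph links
                    two regions, 0 otherwise (assumed precomputed).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5              # (B, R, R)
    scores = scores.masked_fill(relation_mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)                          # unrelated regions get zero weight
    return attn @ v


def rcsa_d_attention(q, k, v, rel_k, rel_v):
    """Decoder self-attention whose keys/values are extended with semantic
    relation embeddings (rel_k, rel_v), so word generation can also attend
    to prior semantic relations."""
    k_ext = torch.cat([k, rel_k], dim=1)                      # (B, T + S, D)
    v_ext = torch.cat([v, rel_v], dim=1)
    d = q.size(-1)
    scores = q @ k_ext.transpose(-2, -1) / d ** 0.5           # (B, T, T + S)
    attn = F.softmax(scores, dim=-1)
    return attn @ v_ext


if __name__ == "__main__":
    B, R, T, S, D = 2, 36, 10, 5, 64
    regions = torch.randn(B, R, D)
    mask = (torch.rand(B, R, R) > 0.7).float()
    mask = (mask + torch.eye(R)).clamp(max=1)                 # keep self-links so no row is fully masked
    enc = rcsa_e_attention(regions, regions, regions, mask)
    words = torch.randn(B, T, D)
    rel = torch.randn(B, S, D)
    dec = rcsa_d_attention(words, words, words, rel, rel)
    print(enc.shape, dec.shape)                               # (2, 36, 64) and (2, 10, 64)

In this toy version the sparsity of RCSA-E comes entirely from the hard mask; the paper's actual constraint factors and how they are learned are described in the article itself.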
Persistent Identifier: http://hdl.handle.net/10722/325566
ISSN: 0925-2312
2023 Impact Factor: 5.5
2023 SCImago Journal Rankings: 1.815
ISI Accession Number ID: WOS:000829601500010

 

DC Field: Value
dc.contributor.author: Ji, Junzhong
dc.contributor.author: Wang, Mingzhan
dc.contributor.author: Zhang, Xiaodan
dc.contributor.author: Lei, Minglong
dc.contributor.author: Qu, Liangqiong
dc.date.accessioned: 2023-02-27T07:34:21Z
dc.date.available: 2023-02-27T07:34:21Z
dc.date.issued: 2022
dc.identifier.citation: Neurocomputing, 2022, v. 501, p. 778-789
dc.identifier.issn: 0925-2312
dc.identifier.uri: http://hdl.handle.net/10722/325566
dc.description.abstract: The self-attention based Transformer has been successfully introduced into the encoder-decoder framework of image captioning, where it excels at modeling the inner relations of the inputs, i.e., image regions or semantic words. However, the relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attentions. Meanwhile, prior relations, e.g., the visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. As a result, the relation learning of self-attention in image captioning is biased, which dilutes the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model that enhances the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors, and builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information into self-attention in the encoder, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of self-attention in the decoder with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of the generated semantic words. Intuitively, RCSA-E endows the model with the ability to decide which regions to omit and which to focus on based on visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of the proposed RCSA.
dc.language: eng
dc.relation.ispartof: Neurocomputing
dc.subject: Image captioning
dc.subject: Relation constraint self-attention
dc.subject: Scene graph
dc.subject: Transformer
dc.title: Relation constraint self-attention for image captioning
dc.type: Article
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1016/j.neucom.2022.06.062
dc.identifier.scopus: eid_2-s2.0-85133231302
dc.identifier.volume: 501
dc.identifier.spage: 778
dc.identifier.epage: 789
dc.identifier.eissn: 1872-8286
dc.identifier.isi: WOS:000829601500010
