Improve Image Captioning by Modeling Dynamic Scene Graph Extension
Minghao Geng, Qingjie Zhao
Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022-06-27
DOI: 10.1145/3512527.3531401 (https://doi.org/10.1145/3512527.3531401)
Citations: 0
Abstract
Recently, scene graph generation methods have been used in image captioning to encode objects and their relationships within the encoder-decoder framework, where the decoder selects a subset of the graph nodes as input for word inference. However, current methods attend to the scene graph using ambiguous language information alone, neglecting the strong connections between scene graph nodes. In this paper, we propose a Scene Graph Extension (SGE) architecture that models dynamic scene graph extension using the partially generated sentence. Our model first uses the generated words and the previous attention results over scene graph nodes to build a partial scene graph. We then choose the objects or relationships that are closely connected to this partial graph to infer the next word. SGE is appealing in that it can be plugged into any scene-graph-based image captioning method. We conduct extensive experiments on the MSCOCO dataset. The results show that the proposed SGE significantly outperforms the baselines, achieving state-of-the-art performance under most metrics.
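The abstract does not give the scoring rule SGE uses, so the following is only a minimal sketch of the general idea under stated assumptions: nodes that are linked to already-attended nodes (the "partial scene graph") receive a boost on top of a language-based score. The function name `extend_attention`, the additive combination, the `weight` parameter, and the toy graph are all hypothetical and are not taken from the paper.

```python
import math


def extend_attention(adjacency, prev_attention, language_scores, weight=1.0):
    """Re-weight scene graph nodes by their connection to attended nodes.

    adjacency[i][j] is 1 if nodes i and j are linked in the scene graph.
    prev_attention[j] is the attention already placed on node j, used here
    as a stand-in for the partial scene graph.
    language_scores[i] is a relevance score derived from the decoder state.
    The additive rule below is an illustrative assumption, not the paper's.
    """
    n = len(adjacency)
    # How strongly each candidate node is connected to the partial graph.
    connection = [
        sum(adjacency[i][j] * prev_attention[j] for j in range(n))
        for i in range(n)
    ]
    logits = [language_scores[i] + weight * connection[i] for i in range(n)]
    # Numerically stable softmax over the combined scores.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy graph: man(0) - riding(1) - horse(2), plus an unconnected tree(3).
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
prev_attention = [1.0, 0.0, 0.0, 0.0]   # "man" has already been described
language_scores = [0.0, 0.0, 0.0, 0.0]  # language alone is uninformative here
attention = extend_attention(adjacency, prev_attention, language_scores)
```

In this toy case the language scores are tied, so the graph connection alone decides: the "riding" node, which is linked to the already-described "man", receives the largest weight, while the unconnected "tree" node receives no boost.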