Application of Dual Attention Mechanism in Chinese Image Captioning

Yong Zhang, Jing Zhang
DOI: 10.4236/jilsa.2020.121002
Journal: Journal of Intelligent Learning Systems and Applications
Published: 2020-01-15 (Journal Article) · Citations: 1

Abstract

Objective: Chinese image captioning combines two directions, computer vision and natural language processing, and is a typical multimodal, cross-domain problem for artificial-intelligence algorithms. A Chinese image captioning model must output, for each given test image, a Chinese sentence that conforms to natural-language usage and conveys the important information in the image, covering the main people, scenes, actions, and other content. Because most current open-source datasets are in English, research on image captioning has focused mainly on English. Chinese descriptions are usually more flexible in syntax and word choice, which makes the algorithms harder to implement; as a result, relatively little work has studied image captioning, and Chinese captioning in particular.

Methods: This study builds an image caption generation model on the Flickr8k-cn and Flickr30k-cn datasets, whose captions consist mainly of Chinese sentences. At each time step of the decoding process, the model can decide whether to rely more on the image or on the textual context, capturing the more important information in the image and improving the richness and accuracy of the generated Chinese captions. The method consists of an encoder and a decoder: the encoder is based on a convolutional neural network, and the decoder is based on a long short-term memory (LSTM) network, together forming a multimodal caption generation network.

Results: Experiments on the Flickr8k-cn and Flickr30k-cn Chinese datasets show that the proposed method outperforms existing Chinese caption generation models.

Conclusion: The proposed method is effective, improving substantially on the baseline model, and its performance also surpasses existing Chinese caption generation models. In future work, more visual prior information, such as action categories and relationships between objects, will be incorporated into the model to further improve caption quality and achieve the effect of "writing what the picture shows".
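The decoding step described in the abstract, where the model decides at each time step whether to rely more on the image or on the language context, resembles adaptive attention with a "sentinel" vector that competes with the image regions for attention weight. The paper's own implementation is not given here, so the following is only a minimal, dependency-free sketch of one such decoding step; the names (`dual_attention_step`, `sentinel`, `score`) are illustrative assumptions, not the authors' code.

```python
import math

def softmax(xs):
    # numerically stable softmax over a plain list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dual_attention_step(feats, h, sentinel, score):
    """One decoding step of a dual-attention sketch.

    feats    -- list of K image-region feature vectors (from the CNN encoder)
    h        -- current decoder (LSTM) hidden state
    sentinel -- vector standing in for the language context (in adaptive
                attention it is derived from the LSTM memory; here it is
                simply passed in, an assumption of this sketch)
    score    -- relevance function score(query, key) -> scalar
    """
    # The K image regions and the sentinel compete in one softmax; the
    # weight assigned to the sentinel is how much this step relies on
    # language context rather than on the image.
    keys = feats + [sentinel]
    alphas = softmax([score(h, k) for k in keys])
    beta = alphas[-1]                      # reliance on the language context
    dim = len(sentinel)
    ctx = [beta * sentinel[i]
           + sum(a * f[i] for a, f in zip(alphas[:-1], feats))
           for i in range(dim)]
    return ctx, alphas, beta
```

With a dot-product `score`, a decoder would mix the returned context vector `ctx` with the word embedding at each step; a `beta` near 1 means the step is driven by language context (e.g. when emitting function words), while a small `beta` means the step attends to the image.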