SCT: Summary Caption Technique for Retrieving Relevant Images in Alignment with Multimodal Abstractive Summary

IF 1.8 · CAS Zone 4 (Computer Science) · Q3 (Computer Science, Artificial Intelligence) · ACM Transactions on Asian and Low-Resource Language Information Processing · Pub Date: 2024-02-17 · DOI: 10.1145/3645029
Shaik Rafi, Ranjita Das
{"title":"SCT:根据多模态摘要检索相关图像的摘要标题技术","authors":"Shaik Rafi, Ranjita Das","doi":"10.1145/3645029","DOIUrl":null,"url":null,"abstract":"<p>This work proposes an efficient Summary Caption Technique(SCT) which considers the multimodal summary and image captions as input to retrieve the correspondence images from the captions that are highly influential to the multimodal summary. Matching a multimodal summary with an appropriate image is a challenging task in computer vision (CV) and natural language processing (NLP) fields. Merging of these fields are tedious, though the research community has steadily focused on the cross-modal retrieval. These issues include the visual question-answering, matching queries with the images, and semantic relationship matching between two modalities for retrieving the corresponding image. Relevant works consider in questions to match the relationship of visual information, object detection and to match the text with visual information, and employing structural-level representation to align the images with the text. However, these techniques are primarily focused on retrieving the images to text or for the image captioning. But less effort has been spent on retrieving relevant images for the multimodal summary. Hence, our proposed technique extracts and merge features in Hybrid Image Text(HIT) layer and captions in the semantic embeddings with word2vec where the contextual features and semantic relationships are compared and matched with each vector between the modalities, with cosine semantic similarity. In cross-modal retrieval, we achieve top five related images and align the relevant images to the multimodal summary that achieves the highest cosine score among the retrieved images. The model has been trained with seq-to-seq modal with 100 epochs, besides reducing the information loss by the sparse categorical cross entropy. Further, experimenting with the multimodal summarization with multimodal output dataset (MSMO), in cross-modal retrieval, helps to evaluate the quality of image alignment with an image-precision metric that demonstrate the best results.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2024-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SCT:Summary Caption Technique for Retrieving Relevant Images in Alignment with Multimodal Abstractive Summary\",\"authors\":\"Shaik Rafi, Ranjita Das\",\"doi\":\"10.1145/3645029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>This work proposes an efficient Summary Caption Technique(SCT) which considers the multimodal summary and image captions as input to retrieve the correspondence images from the captions that are highly influential to the multimodal summary. Matching a multimodal summary with an appropriate image is a challenging task in computer vision (CV) and natural language processing (NLP) fields. Merging of these fields are tedious, though the research community has steadily focused on the cross-modal retrieval. These issues include the visual question-answering, matching queries with the images, and semantic relationship matching between two modalities for retrieving the corresponding image. 
Relevant works consider in questions to match the relationship of visual information, object detection and to match the text with visual information, and employing structural-level representation to align the images with the text. However, these techniques are primarily focused on retrieving the images to text or for the image captioning. But less effort has been spent on retrieving relevant images for the multimodal summary. Hence, our proposed technique extracts and merge features in Hybrid Image Text(HIT) layer and captions in the semantic embeddings with word2vec where the contextual features and semantic relationships are compared and matched with each vector between the modalities, with cosine semantic similarity. In cross-modal retrieval, we achieve top five related images and align the relevant images to the multimodal summary that achieves the highest cosine score among the retrieved images. The model has been trained with seq-to-seq modal with 100 epochs, besides reducing the information loss by the sparse categorical cross entropy. Further, experimenting with the multimodal summarization with multimodal output dataset (MSMO), in cross-modal retrieval, helps to evaluate the quality of image alignment with an image-precision metric that demonstrate the best results.</p>\",\"PeriodicalId\":54312,\"journal\":{\"name\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-02-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3645029\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3645029","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


This work proposes an efficient Summary Caption Technique (SCT) that takes a multimodal summary and a set of image captions as input and retrieves, from those captions, the images most relevant to the summary. Matching a multimodal summary with an appropriate image is a challenging task at the intersection of computer vision (CV) and natural language processing (NLP). Bridging the two fields is tedious, even though the research community has steadily focused on cross-modal retrieval; related problems include visual question answering, matching queries with images, and matching semantic relationships between two modalities to retrieve the corresponding image. Prior work matches questions against relationships in the visual information, uses object detection to match text with visual content, and employs structural-level representations to align images with text. However, these techniques mainly target image-to-text retrieval or image captioning, and far less effort has gone into retrieving relevant images for a multimodal summary. The proposed technique therefore extracts and merges features in a Hybrid Image Text (HIT) layer and embeds captions with word2vec, so that contextual features and semantic relationships can be compared and matched vector by vector across modalities using cosine semantic similarity. In cross-modal retrieval, the top five related images are returned, and the image with the highest cosine score among them is aligned with the multimodal summary. The model is trained as a sequence-to-sequence model for 100 epochs, with sparse categorical cross-entropy used to reduce information loss. Experiments on the Multimodal Summarization with Multimodal Output (MSMO) dataset evaluate the quality of the image alignment with an image-precision metric and demonstrate the best results.
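To make the retrieval step concrete, the sketch below illustrates the kind of cosine-similarity matching the abstract describes: the multimodal summary and each image caption are embedded with pre-trained word2vec vectors (here, simple mean pooling over token vectors), captions are ranked by cosine similarity against the summary, and the top five images are returned, with the highest-scoring one aligned to the summary. The function names, the mean pooling, and the dict-based inputs are illustrative assumptions, not the authors' implementation; in particular, the HIT-layer fusion of visual features and the seq-to-seq training with sparse categorical cross-entropy are omitted.

```python
# Minimal sketch of cosine-similarity caption retrieval (assumptions noted above).
# word_vectors is assumed to be a plain dict {token: np.ndarray}, e.g. exported
# from a pre-trained word2vec model; it is not an API from the paper.
import numpy as np

def embed(text, word_vectors, dim=300):
    """Mean-pool the word2vec vectors of tokens that have an embedding."""
    vecs = [word_vectors[t] for t in text.lower().split() if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    """Cosine similarity between two vectors (0 if either is all-zero)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def retrieve_top_images(summary, captions, word_vectors, k=5):
    """Score every image caption against the multimodal summary and return
    the top-k (image_id, cosine score) pairs, best first."""
    s_vec = embed(summary, word_vectors)
    scores = {img_id: cosine(s_vec, embed(cap, word_vectors))
              for img_id, cap in captions.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Usage: the first-ranked image is the one aligned with the summary.
# top5 = retrieve_top_images(summary_text,
#                            {"img_001.jpg": "a caption", ...}, word_vectors)
# aligned_image = top5[0][0] if top5 else None
```

In this simplified form, an image-precision-style evaluation would simply check whether the first-ranked image matches a reference image for the summary; the paper's metric is described only at the level of the abstract, so the exact definition is not reproduced here.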

Source journal: ACM Transactions on Asian and Low-Resource Language Information Processing
CiteScore: 3.60
Self-citation rate: 15.00%
Articles published: 241
About the journal: The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
- Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
- Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
- Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
- Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
- Machine Translation involving Asian or low-resource languages.
- Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
- Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
- Speech processing: including text-to-speech synthesis and automatic speech recognition.
- Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
- Cross-lingual information processing involving Asian or low-resource languages.
Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.