Differently processed modality and appropriate model selection lead to richer representation of the multimodal input

Saroj Kumar Panda, Tausif Diwan, Omprakash G. Kakde
{"title":"不同的处理模式和适当的模型选择可带来更丰富的多模式输入表征","authors":"Saroj Kumar Panda, Tausif Diwan, Omprakash G. Kakde","doi":"10.1007/s41870-024-02113-4","DOIUrl":null,"url":null,"abstract":"<p>We aim to effectively solve and improvise the Meta Meme Challenge for the binary classification of hateful memes detection on a multimodal dataset launched by Meta. This problem has its challenges in terms of individual modality processing and its impact on the final classification of hateful memes. We focus on feature-level fusion methodologies in proposing the solutions for hateful memes detection in comparison with the decision-level fusion as feature-level fusion generates richer features’ representation for further processing. Appropriate model selection in multimodal data processing plays an important role in the downstream tasks. Moreover, inherent negativity associated with the visual modality may not be detected completely through the visual processing models, necessitating the differently processed visual data through some other techniques. Specifically, we propose two feature-level fusion-based methodologies for the aforesaid classification problem, employing VisualBERT for the effective representation of textual and visual modality. Additionally, we employ image captioning generating the textual captions from the visual modality of the multimodal input which is further fused with the actual text associated with the input through the Tensor Fusion Networks. Our proposed model considerably outperforms the state of the arts on accuracy and AuROC performance metrics.</p>","PeriodicalId":14138,"journal":{"name":"International Journal of Information Technology","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Differently processed modality and appropriate model selection lead to richer representation of the multimodal input\",\"authors\":\"Saroj Kumar Panda, Tausif Diwan, Omprakash G. Kakde\",\"doi\":\"10.1007/s41870-024-02113-4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>We aim to effectively solve and improvise the Meta Meme Challenge for the binary classification of hateful memes detection on a multimodal dataset launched by Meta. This problem has its challenges in terms of individual modality processing and its impact on the final classification of hateful memes. We focus on feature-level fusion methodologies in proposing the solutions for hateful memes detection in comparison with the decision-level fusion as feature-level fusion generates richer features’ representation for further processing. Appropriate model selection in multimodal data processing plays an important role in the downstream tasks. Moreover, inherent negativity associated with the visual modality may not be detected completely through the visual processing models, necessitating the differently processed visual data through some other techniques. Specifically, we propose two feature-level fusion-based methodologies for the aforesaid classification problem, employing VisualBERT for the effective representation of textual and visual modality. Additionally, we employ image captioning generating the textual captions from the visual modality of the multimodal input which is further fused with the actual text associated with the input through the Tensor Fusion Networks. 
Our proposed model considerably outperforms the state of the arts on accuracy and AuROC performance metrics.</p>\",\"PeriodicalId\":14138,\"journal\":{\"name\":\"International Journal of Information Technology\",\"volume\":\"15 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Information Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s41870-024-02113-4\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Information Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s41870-024-02113-4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

We aim to effectively solve and improve upon the Meta Meme Challenge for the binary classification of hateful memes on a multimodal dataset released by Meta. The problem is challenging both in how each individual modality is processed and in how that processing affects the final classification of hateful memes. We focus on feature-level fusion methodologies for hateful meme detection, rather than decision-level fusion, because feature-level fusion generates a richer feature representation for further processing. Appropriate model selection in multimodal data processing also plays an important role in downstream tasks. Moreover, the negativity inherent in the visual modality may not be detected completely by visual processing models alone, necessitating differently processed visual data obtained through other techniques. Specifically, we propose two feature-level fusion-based methodologies for this classification problem, employing VisualBERT for the effective representation of the textual and visual modalities. Additionally, we employ image captioning to generate textual captions from the visual modality of the multimodal input, which are further fused with the actual text associated with the input through a Tensor Fusion Network. Our proposed model considerably outperforms the state of the art on the accuracy and AuROC performance metrics.
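
To make the feature-level versus decision-level distinction concrete, the following is a minimal PyTorch sketch rather than the authors' implementation: the class names, encoder output sizes, and placeholder features are illustrative assumptions. Feature-level fusion combines the modality features into one joint vector before the classifier, while decision-level fusion classifies each modality separately and merges only the scores.

```python
# Illustrative sketch only: dimensions and class names are assumptions, not the paper's code.
import torch
import torch.nn as nn

TEXT_DIM, VISUAL_DIM = 768, 2048  # assumed sizes of text and visual encoder outputs


class FeatureLevelFusion(nn.Module):
    """Fuse modality features first, then classify the joint representation."""

    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(TEXT_DIM + VISUAL_DIM, 512), nn.ReLU(), nn.Linear(512, 1)
        )

    def forward(self, text_feat, visual_feat):
        joint = torch.cat([text_feat, visual_feat], dim=-1)  # joint feature vector
        return self.classifier(joint)


class DecisionLevelFusion(nn.Module):
    """Classify each modality separately, then average the per-modality logits."""

    def __init__(self):
        super().__init__()
        self.text_head = nn.Linear(TEXT_DIM, 1)
        self.visual_head = nn.Linear(VISUAL_DIM, 1)

    def forward(self, text_feat, visual_feat):
        return 0.5 * (self.text_head(text_feat) + self.visual_head(visual_feat))


# Hypothetical usage with placeholder features standing in for pooled encoder outputs.
text_feat, visual_feat = torch.randn(4, TEXT_DIM), torch.randn(4, VISUAL_DIM)
print(FeatureLevelFusion()(text_feat, visual_feat).shape)   # torch.Size([4, 1])
print(DecisionLevelFusion()(text_feat, visual_feat).shape)  # torch.Size([4, 1])
```

Because the joint vector is formed before classification, the classifier can learn cross-modal interactions that a late average of per-modality scores cannot express, which is the sense in which feature-level fusion yields a richer representation.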

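The caption-plus-text fusion step can be pictured with the outer-product scheme used by Tensor Fusion Networks. The sketch below is a hypothetical illustration under assumed embedding sizes; CaptionTextTensorFusion and its dimensions are not from the paper, and the two embeddings could come from any sentence encoder applied to the meme text and to a caption produced by an image-captioning model.

```python
# Illustrative sketch of outer-product (Tensor Fusion Network style) fusion; all
# names and sizes are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class CaptionTextTensorFusion(nn.Module):
    """Fuse a meme-text embedding with a generated-caption embedding via an
    outer product, then classify the flattened fused tensor."""

    def __init__(self, text_dim: int = 768, caption_dim: int = 768, hidden: int = 128):
        super().__init__()
        # Project each modality to a small vector first; otherwise the fused
        # outer-product tensor becomes prohibitively large.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.caption_proj = nn.Linear(caption_dim, hidden)
        fused_dim = (hidden + 1) * (hidden + 1)  # +1 for the appended constant
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, text_emb: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
        b = text_emb.size(0)
        ones = torch.ones(b, 1, device=text_emb.device)
        # Append a constant 1 so the outer product retains the unimodal terms.
        t = torch.cat([self.text_proj(text_emb), ones], dim=1)       # (b, h+1)
        c = torch.cat([self.caption_proj(caption_emb), ones], dim=1)  # (b, h+1)
        fused = torch.bmm(t.unsqueeze(2), c.unsqueeze(1)).flatten(1)  # (b, (h+1)^2)
        return self.classifier(fused)  # hateful-meme logit


# Hypothetical usage with random placeholders for the two sentence embeddings.
model = CaptionTextTensorFusion()
logit = model(torch.randn(4, 768), torch.randn(4, 768))
print(logit.shape)  # torch.Size([4, 1])
```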
