Audio-Visual Event Localization using Multi-task Hybrid Attention Networks for Smart Healthcare Systems

IF 3.9 | Region 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | ACM Transactions on Internet Technology | Pub Date: 2024-03-16 | DOI: 10.1145/3653018
Han Liang, Jincai Chen, Fazlullah Khan, Gautam Srivastava, Jiangfeng Zeng
{"title":"Audio-Visual Event Localization using Multi-task Hybrid Attention Networks for Smart Healthcare Systems","authors":"Han Liang, Jincai Chen, Fazlullah Khan, Gautam Srivastava, Jiangfeng Zeng","doi":"10.1145/3653018","DOIUrl":null,"url":null,"abstract":"<p>Human perception heavily relies on two primary senses: vision and hearing, which are closely inter-connected and capable of complementing each other. Consequently, various multimodal learning tasks have emerged, with audio-visual event localization (AVEL) being a prominent example. AVEL is a popular task within the realm of multimodal learning, with the primary objective of identifying the presence of events within each video segment and predicting their respective categories. This task holds significant utility in domains such as healthcare monitoring and surveillance, among others. Generally speaking, audio-visual co-learning offers a more comprehensive information landscape compared to single-modal learning, as it allows for a more holistic perception of ambient information, aligning with real-world applications. Nevertheless, the inherent heterogeneity of audio and visual data can introduce challenges related to event semantics inconsistency, potentially leading to incorrect predictions. To track these challenges, we propose a multi-task hybrid attention network (MHAN) to acquire high-quality representation for multimodal data. Specifically, our network incorporates hybrid attention of uni- and parallel cross-modal (HAUC) modules, which consists of a uni-modal attention block and a parallel cross-modal attention block, leveraging multimodal complementary and hidden information for better representation. Furthermore, we advocate for the use of a uni-modal visual task as auxiliary supervision to enhance the performance of multimodal tasks employing a multi-task learning strategy. Our proposed model has been proven to outperform the state-of-the-art results based on extensive experiments conducted on the AVE dataset.</p>","PeriodicalId":50911,"journal":{"name":"ACM Transactions on Internet Technology","volume":null,"pages":null},"PeriodicalIF":3.9000,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Internet Technology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3653018","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Human perception relies heavily on two primary senses, vision and hearing, which are closely interconnected and complement each other. Consequently, various multimodal learning tasks have emerged, with audio-visual event localization (AVEL) being a prominent example. AVEL is a popular task within multimodal learning whose primary objective is to identify the presence of events within each video segment and predict their respective categories. The task has significant utility in domains such as healthcare monitoring and surveillance, among others. Generally speaking, audio-visual co-learning offers a more comprehensive information landscape than single-modal learning, as it allows a more holistic perception of ambient information, in line with real-world applications. Nevertheless, the inherent heterogeneity of audio and visual data can introduce challenges related to event-semantics inconsistency, potentially leading to incorrect predictions. To tackle these challenges, we propose a multi-task hybrid attention network (MHAN) that acquires high-quality representations of multimodal data. Specifically, our network incorporates hybrid attention of uni- and parallel cross-modal (HAUC) modules, each consisting of a uni-modal attention block and a parallel cross-modal attention block, leveraging complementary and hidden multimodal information for better representations. Furthermore, we advocate the use of a uni-modal visual task as auxiliary supervision to enhance the performance of the multimodal task through a multi-task learning strategy. Extensive experiments on the AVE dataset show that the proposed model outperforms state-of-the-art results.
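As a rough illustration of the hybrid attention design described in the abstract, the following is a minimal PyTorch-style sketch of a uni-modal self-attention block and a parallel cross-modal attention block. The class names, feature dimensions, residual/normalization choices, and the example usage are assumptions made for illustration; they are not taken from the authors' implementation, which is not provided on this page.

# Illustrative sketch only (not the authors' code): a uni-modal attention block
# applied to each modality, and a parallel cross-modal block in which audio
# attends to visual features while visual attends to audio features.
import torch
import torch.nn as nn


class UniModalAttention(nn.Module):
    """Self-attention over the temporal segments of a single modality."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, segments, dim)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)  # residual connection


class ParallelCrossModalAttention(nn.Module):
    """Audio attends to visual features and vice versa, in parallel."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Queries come from one modality; keys and values from the other.
        a_out, _ = self.a2v(audio, visual, visual)
        v_out, _ = self.v2a(visual, audio, audio)
        return self.norm_a(audio + a_out), self.norm_v(visual + v_out)


if __name__ == "__main__":
    # Hypothetical shapes: 2 clips, 10 one-second segments, 256-d features.
    audio = torch.randn(2, 10, 256)
    visual = torch.randn(2, 10, 256)
    uni = UniModalAttention(256)
    cross = ParallelCrossModalAttention(256)
    a, v = cross(uni(audio), uni(visual))
    print(a.shape, v.shape)  # torch.Size([2, 10, 256]) for each modality

In a full AVEL pipeline, the refined audio and visual features would then feed segment-level event classifiers, with the uni-modal visual task supplying auxiliary supervision under a multi-task loss, as the abstract describes; those components are omitted here.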

Source Journal

ACM Transactions on Internet Technology (Engineering & Technology - Computer Science: Software Engineering)

CiteScore: 10.30
Self-citation rate: 1.90%
Articles published: 137
Review time: >12 weeks
Journal Description: ACM Transactions on Internet Technology (TOIT) brings together many computing disciplines, including computer software engineering, computer programming languages, middleware, database management, security, knowledge discovery and data mining, networking and distributed systems, communications, performance and scalability, etc. TOIT will cover the results and roles of the individual disciplines and the relationships among them.
Latest Articles in This Journal

Towards a Sustainable Blockchain: A Peer-to-Peer Federated Learning based Approach
Navigating the Metaverse: A Comprehensive Analysis of Consumer Electronics Prospects and Challenges
A Novel Point Cloud Registration Method for Multimedia Communication in Automated Driving Metaverse
Interpersonal Communication Interconnection in Media Convergence Metaverse
Using Reinforcement Learning and Error Models for Drone Precision Landing