Audio-Visual Event Localization using Multi-task Hybrid Attention Networks for Smart Healthcare Systems

IF 3.9 | Region 3 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | ACM Transactions on Internet Technology | Pub Date: 2024-03-16 | DOI: 10.1145/3653018
Han Liang, Jincai Chen, Fazlullah Khan, Gautam Srivastava, Jiangfeng Zeng
{"title":"Audio-Visual Event Localization using Multi-task Hybrid Attention Networks for Smart Healthcare Systems","authors":"Han Liang, Jincai Chen, Fazlullah Khan, Gautam Srivastava, Jiangfeng Zeng","doi":"10.1145/3653018","DOIUrl":null,"url":null,"abstract":"<p>Human perception heavily relies on two primary senses: vision and hearing, which are closely inter-connected and capable of complementing each other. Consequently, various multimodal learning tasks have emerged, with audio-visual event localization (AVEL) being a prominent example. AVEL is a popular task within the realm of multimodal learning, with the primary objective of identifying the presence of events within each video segment and predicting their respective categories. This task holds significant utility in domains such as healthcare monitoring and surveillance, among others. Generally speaking, audio-visual co-learning offers a more comprehensive information landscape compared to single-modal learning, as it allows for a more holistic perception of ambient information, aligning with real-world applications. Nevertheless, the inherent heterogeneity of audio and visual data can introduce challenges related to event semantics inconsistency, potentially leading to incorrect predictions. To track these challenges, we propose a multi-task hybrid attention network (MHAN) to acquire high-quality representation for multimodal data. Specifically, our network incorporates hybrid attention of uni- and parallel cross-modal (HAUC) modules, which consists of a uni-modal attention block and a parallel cross-modal attention block, leveraging multimodal complementary and hidden information for better representation. Furthermore, we advocate for the use of a uni-modal visual task as auxiliary supervision to enhance the performance of multimodal tasks employing a multi-task learning strategy. Our proposed model has been proven to outperform the state-of-the-art results based on extensive experiments conducted on the AVE dataset.</p>","PeriodicalId":50911,"journal":{"name":"ACM Transactions on Internet Technology","volume":null,"pages":null},"PeriodicalIF":3.9000,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Internet Technology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3653018","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Human perception relies heavily on two primary senses, vision and hearing, which are closely interconnected and complement each other. Consequently, various multimodal learning tasks have emerged, with audio-visual event localization (AVEL) being a prominent example. AVEL is a popular task within multimodal learning whose primary objective is to identify the presence of events within each video segment and predict their respective categories. The task has significant utility in domains such as healthcare monitoring and surveillance, among others. Generally speaking, audio-visual co-learning offers a more comprehensive information landscape than single-modal learning, as it allows a more holistic perception of ambient information, in line with real-world applications. Nevertheless, the inherent heterogeneity of audio and visual data can introduce challenges related to event-semantics inconsistency, potentially leading to incorrect predictions. To tackle these challenges, we propose a multi-task hybrid attention network (MHAN) that acquires high-quality representations of multimodal data. Specifically, our network incorporates hybrid attention of uni- and parallel cross-modal (HAUC) modules, each consisting of a uni-modal attention block and a parallel cross-modal attention block, leveraging complementary and hidden multimodal information for better representations. Furthermore, we advocate the use of a uni-modal visual task as auxiliary supervision to enhance the performance of the multimodal task through a multi-task learning strategy. Extensive experiments on the AVE dataset show that the proposed model outperforms state-of-the-art results.
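As a rough illustration of the hybrid attention design described in the abstract, the following is a minimal PyTorch-style sketch of a uni-modal self-attention block and a parallel cross-modal attention block. The class names, feature dimensions, residual/normalization choices, and the example usage are assumptions made for illustration; they are not taken from the authors' implementation, which is not provided on this page.

# Illustrative sketch only (not the authors' code): a uni-modal attention block
# applied to each modality, and a parallel cross-modal block in which audio
# attends to visual features while visual attends to audio features.
import torch
import torch.nn as nn


class UniModalAttention(nn.Module):
    """Self-attention over the temporal segments of a single modality."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, segments, dim)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)  # residual connection


class ParallelCrossModalAttention(nn.Module):
    """Audio attends to visual features and vice versa, in parallel."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Queries come from one modality; keys and values from the other.
        a_out, _ = self.a2v(audio, visual, visual)
        v_out, _ = self.v2a(visual, audio, audio)
        return self.norm_a(audio + a_out), self.norm_v(visual + v_out)


if __name__ == "__main__":
    # Hypothetical shapes: 2 clips, 10 one-second segments, 256-d features.
    audio = torch.randn(2, 10, 256)
    visual = torch.randn(2, 10, 256)
    uni = UniModalAttention(256)
    cross = ParallelCrossModalAttention(256)
    a, v = cross(uni(audio), uni(visual))
    print(a.shape, v.shape)  # torch.Size([2, 10, 256]) for each modality

In a full AVEL pipeline, the refined audio and visual features would then feed segment-level event classifiers, with the uni-modal visual task supplying auxiliary supervision under a multi-task loss, as the abstract describes; those components are omitted here.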

Source Journal

ACM Transactions on Internet Technology (Engineering & Technology - Computer Science: Software Engineering)

CiteScore: 10.30
Self-citation rate: 1.90%
Articles published: 137
Review time: >12 weeks
Journal Description: ACM Transactions on Internet Technology (TOIT) brings together many computing disciplines, including computer software engineering, computer programming languages, middleware, database management, security, knowledge discovery and data mining, networking and distributed systems, communications, performance and scalability, etc. TOIT will cover the results and roles of the individual disciplines and the relationships among them.
Latest Articles in This Journal

Towards a Sustainable Blockchain: A Peer-to-Peer Federated Learning based Approach
Navigating the Metaverse: A Comprehensive Analysis of Consumer Electronics Prospects and Challenges
A Novel Point Cloud Registration Method for Multimedia Communication in Automated Driving Metaverse
Interpersonal Communication Interconnection in Media Convergence Metaverse
Using Reinforcement Learning and Error Models for Drone Precision Landing