AudioNet: Supervised Deep Hashing for Retrieval of Similar Audio Events

IF 4.1 2区计算机科学 Q1 ACOUSTICS IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-09-17 DOI:10.1109/TASLP.2024.3446232

Sagar Dutta;Vipul Arora

{"title":"AudioNet: Supervised Deep Hashing for Retrieval of Similar Audio Events","authors":"Sagar Dutta;Vipul Arora","doi":"10.1109/TASLP.2024.3446232","DOIUrl":null,"url":null,"abstract":"This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field, and highlighting its efficacy and effectiveness compare to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed which incorporates weighted contrastive and weighted pairwise loss along with hashcode balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method showcases promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4526-4536"},"PeriodicalIF":4.1000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10682122/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field, and highlighting its efficacy and effectiveness compare to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed which incorporates weighted contrastive and weighted pairwise loss along with hashcode balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method showcases promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

音频网：有监督的深度散列检索相似音频事件

这项研究提出了一种用于检索相似音频事件的有监督深度散列方法。所提出的方法名为 AudioNet，是一种基于深度学习的系统，可使用音频示例作为查询，对类似音频事件进行高效散列和检索。通过为相似音频事件生成二进制散列码，AudioNet 在多个标准数据集上实现了较高的检索性能，在该领域树立了新的标杆，并凸显了其与其他散列方法相比的功效和有效性。通过对标准数据集的全面实验，我们的研究在评估相似音频事件的检索性能方面做出了开创性的努力。我们提出了一种新的损失函数，它结合了加权对比损失和加权成对损失以及哈希码平衡，以提高音频事件检索的效率。该方法采用离散梯度传播，允许在反向传播过程中通过离散变量传播梯度。这样，网络就能使用通常用于连续变量的基于梯度的标准优化算法来优化离散散列码。实验结果表明，即使在处理不平衡数据集时，所提出的方法也能显示出良好的检索性能。本研究中进行的系统分析进一步证实了所提方法在多个数据集检索性能方面的显著优势。本研究的发现为今后利用深度音频嵌入高效检索相似音频事件的研究奠定了基础。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

期刊最新文献

List of Reviewers IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation Online Neural Speaker Diarization With Target Speaker Tracking Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach