FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow

IEEE/ACM Transactions on Audio, Speech, and Language Processing (IF 4.1; Zone 2, Computer Science; Q1, Acoustics). Pub Date: 2024-11-04. DOI: 10.1109/TASLP.2024.3486227

Anup Singh;Kris Demuynck;Vipul Arora

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4961-4970. https://ieeexplore.ieee.org/document/10741572/

Citations: 0

Abstract

Nearest neighbor search on context representation vectors is a formidable task due to challenges posed by high dimensionality, scalability issues, and potential noise within query vectors. Our novel approach leverages normalizing flow within a self-supervised learning framework to effectively tackle these challenges, specifically in the context of audio fingerprinting tasks. Audio fingerprinting systems incorporate two key components: audio encoding and indexing. The existing systems consider these components independently, resulting in suboptimal performance. Our approach optimizes the interplay between these components, facilitating the adaptation of vectors to the indexing structure. Additionally, we distribute vectors in the latent $\mathbb{R}^{K}$ space using normalizing flow, resulting in balanced $K$-bit hash codes. This allows indexing vectors using a balanced hash table, where vectors are uniformly distributed across all possible $2^{K}$ hash buckets. This significantly accelerates retrieval, achieving speedups of up to 2× and 1.4× compared to the Locality-Sensitive Hashing (LSH) and Product Quantization (PQ), respectively. We empirically demonstrate that our system is scalable, highly effective, and efficient in identifying short audio queries ($\leq$ 2 s), particularly at high noise and reverberation levels.
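To make the idea of balanced $K$-bit hash codes concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each latent vector in $\mathbb{R}^{K}$ is binarized by the sign of its coordinates, and it uses i.i.d. standard normal samples as a stand-in for the output of a trained normalizing flow. The names `hash_code`, `build_table`, and `lookup`, and the values of `K` and `N`, are hypothetical. Under these assumptions every bit is 0 or 1 with equal probability, so $N$ vectors spread over the $2^{K}$ buckets with about $N / 2^{K}$ entries each.

```python
import random
from collections import defaultdict

K = 8        # hypothetical code length, giving 2**K = 256 buckets
N = 20000    # hypothetical number of indexed vectors

def hash_code(vec):
    """Pack the sign pattern of a vector into an integer, one bit per coordinate."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code

def build_table(vectors):
    """Map each hash code to the list of indices of vectors that fall in that bucket."""
    table = defaultdict(list)
    for idx, vec in enumerate(vectors):
        table[hash_code(vec)].append(idx)
    return table

def lookup(table, query):
    """Return candidate indices that share the query's bucket (exact-code match only)."""
    return table.get(hash_code(query), [])

rng = random.Random(0)
# Assumption: stand-in for flow outputs, coordinates are i.i.d. standard normal.
vectors = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(N)]
table = build_table(vectors)

# Bucket occupancy across all 2**K possible codes; close to N / 2**K when balanced.
sizes = [len(table.get(b, [])) for b in range(2 ** K)]
print("mean bucket size:", sum(sizes) / len(sizes))
print("min / max bucket size:", min(sizes), max(sizes))
```

With balanced buckets, a query only has to be compared against roughly $N / 2^{K}$ candidates, which is the intuition behind the retrieval speedup. A practical system would also need to probe neighboring buckets to tolerate bit flips caused by noisy queries; that step is omitted here.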


Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (categories: Acoustics; Engineering, Electrical & Electronic)

CiteScore: 11.30

Self-citation rate: 11.10%

Articles published: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

Latest articles from this journal

List of Reviewers

IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization

MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation

Online Neural Speaker Diarization With Target Speaker Tracking

Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach