PHAIN: Audio Inpainting via Phase-Aware Optimization With Instantaneous Frequency
Tomoro Tanaka; Kohei Yatabe; Yasuhiro Oikawa
Pub Date: 2024-09-18 | DOI: 10.1109/TASLP.2024.3463415
Audio inpainting restores locally corrupted parts of digital audio signals. Sparsity-based methods achieve this by promoting sparsity in the time-frequency (T-F) domain, assuming short-time audio segments consist of a few sinusoids. However, such sparsity promotion reduces the magnitudes of the resulting waveforms; moreover, it often ignores the temporal connections of sinusoidal components. To address these problems, we propose a novel phase-aware audio inpainting method. Our method minimizes the time variations of a particular T-F representation calculated using the time derivative of the phase. This promotes sinusoidal components that coherently fit in the corrupted parts without directly suppressing the magnitudes. Both objective and subjective experiments confirmed the superiority of the proposed method compared with state-of-the-art methods.
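As a rough illustration of the idea (a minimal NumPy/SciPy sketch, not the authors' PHAIN objective; the function name, STFT settings, and the specific phase-correction convention are assumptions), the penalty below measures the frame-to-frame variation of a phase-corrected T-F representation, which stays small for sinusoids that evolve coherently over time:

```python
import numpy as np
from scipy.signal import stft

def time_variation_penalty(x, fs, nperseg=512, hop=128):
    """Frame-to-frame variation of a phase-corrected (instantaneous-frequency-aware)
    T-F representation; coherent, steady sinusoids contribute little to this penalty."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    k = np.arange(X.shape[0])[:, None]                 # frequency-bin indices
    m = np.arange(X.shape[1])[None, :]                 # frame indices
    # Remove the phase rotation expected from each bin's centre frequency, so the
    # remaining phase change between frames reflects the time derivative of the
    # phase (the instantaneous-frequency deviation) rather than the hop advance.
    demod = X * np.exp(-2j * np.pi * k * hop * m / nperseg)
    # Penalise how much this representation changes along time, instead of
    # shrinking magnitudes as a plain sparsity prior would.
    return np.sum(np.abs(np.diff(demod, axis=1)))
```

In an inpainting setting, a penalty of this kind would be minimized over the missing samples while the reliably observed samples are held fixed, so the reconstruction favors coherent sinusoidal continuations without directly suppressing their magnitudes.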
{"title":"PHAIN: Audio Inpainting via Phase-Aware Optimization With Instantaneous Frequency","authors":"Tomoro Tanaka;Kohei Yatabe;Yasuhiro Oikawa","doi":"10.1109/TASLP.2024.3463415","DOIUrl":"10.1109/TASLP.2024.3463415","url":null,"abstract":"Audio inpainting restores locally corrupted parts of digital audio signals. Sparsity-based methods achieve this by promoting sparsity in the time-frequency (T-F) domain, assuming short-time audio segments consist of a few sinusoids. However, such sparsity promotion reduces the magnitudes of the resulting waveforms; moreover, it often ignores the temporal connections of sinusoidal components. To address these problems, we propose a novel phase-aware audio inpainting method. Our method minimizes the time variations of a particular T-F representation calculated using the time derivative of the phase. This promotes sinusoidal components that coherently fit in the corrupted parts without directly suppressing the magnitudes. Both objective and subjective experiments confirmed the superiority of the proposed method compared with state-of-the-art methods.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4471-4485"},"PeriodicalIF":4.1,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AudioNet: Supervised Deep Hashing for Retrieval of Similar Audio Events
Sagar Dutta; Vipul Arora
Pub Date: 2024-09-17 | DOI: 10.1109/TASLP.2024.3446232
This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field and highlighting its efficacy and effectiveness compared to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed that combines weighted contrastive and weighted pairwise losses with hash-code balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method shows promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.
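The following is a minimal PyTorch sketch of how such a system can be trained end to end; it is not the paper's implementation, and the class name, the straight-through binarization, and the particular weighting and balance terms are illustrative assumptions consistent with the description above:

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Maps audio embeddings to binary hash codes while remaining trainable by SGD."""
    def __init__(self, embed_dim: int, n_bits: int):
        super().__init__()
        self.proj = nn.Linear(embed_dim, n_bits)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        h = torch.tanh(self.proj(emb))      # continuous relaxation in (-1, 1)
        b = torch.sign(h)                   # discrete {-1, +1} codes
        # Straight-through estimator: the forward pass uses the discrete codes,
        # the backward pass uses the gradient of the continuous relaxation,
        # i.e. gradients are propagated "through" the discrete variables.
        return h + (b - h).detach()

def weighted_contrastive_loss(codes, labels, margin=2.0, pos_weight=1.0, neg_weight=1.0):
    """Pulls codes of same-class audio events together, pushes different classes apart,
    and adds a balance term encouraging each bit to be +1 and -1 equally often."""
    d = torch.cdist(codes, codes)                        # pairwise distances (∝ Hamming for ±1 codes)
    same = (labels[:, None] == labels[None, :]).float()
    pos = pos_weight * same * d.pow(2)
    neg = neg_weight * (1 - same) * torch.clamp(margin - d, min=0).pow(2)
    balance = codes.mean(dim=0).pow(2).mean()            # hash-code balancing term
    return (pos + neg).mean() + balance
```

At retrieval time, the resulting codes would simply be compared by Hamming distance between the query's hash code and those stored for the indexed audio events.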
{"title":"AudioNet: Supervised Deep Hashing for Retrieval of Similar Audio Events","authors":"Sagar Dutta;Vipul Arora","doi":"10.1109/TASLP.2024.3446232","DOIUrl":"10.1109/TASLP.2024.3446232","url":null,"abstract":"This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field, and highlighting its efficacy and effectiveness compare to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed which incorporates weighted contrastive and weighted pairwise loss along with hashcode balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method showcases promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4526-4536"},"PeriodicalIF":4.1,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ying Mo; Jiahao Liu; Hongyin Tang; Qifan Wang; Zenglin Xu; Jingang Wang; Xiaojun Quan; Wei Wu; Zhoujun Li
Pub Date: 2024-09-12 | DOI: 10.1109/TASLP.2024.3458796
Most previous sequence labeling models are task-specific, while recent years have witnessed the rise of generative models owing to their ability to unify all named entity recognition (NER) tasks within the encoder-decoder framework. Although these models achieve promising performance, our pilot studies demonstrate that they are ineffective at detecting entity boundaries and estimating entity types. In this paper, we propose a multi-task Transformer that incorporates an entity boundary detection task into the named entity recognition task. More concretely, we achieve entity boundary detection by classifying the relations between tokens within the sentence. To improve the accuracy of entity-type mapping during decoding, we adopt an external knowledge base to calculate the prior entity-type distributions and then incorporate this information into the model via the self- and cross-attention mechanisms. We perform experiments on extensive NER benchmarks, including flat, nested, and discontinuous NER datasets involving long entities. Our model substantially improves $F_1$ by nearly $+0.3 \sim +1.5$ across these benchmarks.
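As a concrete illustration of boundary detection via token-pair relation classification, here is a hedged PyTorch sketch; it is not the paper's architecture, the class name, the bilinear scorer, and the two-way relation set are assumptions, and the knowledge-base prior injected through attention is omitted:

```python
import torch
import torch.nn as nn

class TokenPairBoundaryHead(nn.Module):
    """Auxiliary head that scores every (i, j) token pair, e.g. as
    {no relation, belongs to the same entity span}; its cross-entropy loss
    is added to the main generative NER objective."""
    def __init__(self, hidden: int, n_relations: int = 2):
        super().__init__()
        self.head = nn.Linear(hidden, hidden)   # representation of token i
        self.tail = nn.Linear(hidden, hidden)   # representation of token j
        self.bilinear = nn.Bilinear(hidden, hidden, n_relations)

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        # enc: (batch, seq_len, hidden) encoder states from the Transformer.
        B, T, H = enc.shape
        hi = self.head(enc).unsqueeze(2).expand(B, T, T, H)
        hj = self.tail(enc).unsqueeze(1).expand(B, T, T, H)
        # (batch, seq_len, seq_len, n_relations) pairwise relation logits.
        return self.bilinear(hi.reshape(-1, H), hj.reshape(-1, H)).view(B, T, T, -1)
```

Training this head jointly with the sequence-to-sequence NER decoder is one way to give the shared encoder an explicit signal about where entity spans start and end.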