
Latest publications from the 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

Estimation of Angular Power Spectrum Using Multikernel Adaptive Filtering
Eiji Ninomiya, M. Yukawa, Renato L. G. Cavalcante, Lorenzo Miretti
This paper addresses the problem of estimating the angular power spectrum (APS) of massive multiple-input multiple-output wireless channels. Estimating the APS is useful, for instance, for simplifying the downlink channel estimation problem in frequency division duplex systems. We propose an efficient online algorithm that estimates the APS from the channel spatial covariance matrix. The proposed algorithm approximates the APS as a sum of Gaussian functions and leverages the framework of multikernel adaptive filtering.
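As a rough illustration of the sum-of-Gaussians idea (not the authors' algorithm), the sketch below fits fixed Gaussian kernels to samples of a toy spectrum with an online LMS-style update; the kernel centers, width, and step size are all hypothetical choices:

```python
import math

def gaussian(x, center, width):
    """One Gaussian kernel evaluated at angle x."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def aps_model(x, centers, width, coeffs):
    """Approximate the APS as a weighted sum of Gaussian kernels."""
    return sum(c * gaussian(x, m, width) for c, m in zip(coeffs, centers))

def lms_update(coeffs, centers, width, x, target, mu=0.1):
    """One online LMS-style update of the kernel coefficients."""
    err = target - aps_model(x, centers, width, coeffs)
    return [c + mu * err * gaussian(x, m, width) for c, m in zip(coeffs, centers)]

# Fit a toy "spectrum": a single Gaussian peak at angle 0.5.
centers = [i / 10 for i in range(11)]
coeffs = [0.0] * len(centers)
for _ in range(200):
    for x in centers:
        target = gaussian(x, 0.5, 0.1)
        coeffs = lms_update(coeffs, centers, 0.1, x, target)
```

After a few hundred sweeps, the model reproduces the peak at 0.5; an online update like this is what makes the approach usable as covariance samples arrive over time.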
Citations: 0
C-CycleTransGAN: A Non-parallel Controllable Cross-gender Voice Conversion Model with CycleGAN and Transformer
Changzeng Fu, Chaoran Liu, C. Ishi, H. Ishiguro
In this study, we propose a conversion-intensity-controllable model for cross-gender voice conversion (VC); a demo page is available at https://cz26.github.io/DemoPage-c-CycleTransGAN-VoiceConversion/. In particular, we combine the CycleGAN and transformer modules and build a condition embedding network as an intensity controller. The model is first pre-trained with self-supervised learning on a single-gender voice reconstruction task, with the condition set to male-to-male or female-to-female. After pre-training, we fine-tune the model on the cross-gender voice conversion task, with the condition set to male-to-female or female-to-male. At test time, the condition is employed as a controllable parameter (scale) to adjust the conversion intensity. The proposed method was evaluated on the Voice Conversion Challenge dataset and compared to two baselines (CycleGAN, CycleTransGAN) with objective and subjective evaluations. The results show that the proposed approach adds cross-gender controllability without hurting voice conversion performance.
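Purely as a hypothetical illustration of how a scalar condition can control conversion intensity (the paper's condition embedding network is a learned module, not this), one can blend source features and fully converted features:

```python
def controlled_conversion(src_feat, converted_feat, intensity):
    """Blend source and fully-converted feature vectors by a scalar
    intensity in [0, 1]; 0 keeps the source, 1 gives full conversion.
    A hypothetical stand-in for a learned condition-embedding controller."""
    return [(1.0 - intensity) * s + intensity * c
            for s, c in zip(src_feat, converted_feat)]

half = controlled_conversion([0.0, 2.0], [1.0, 4.0], 0.5)
```

A learned controller can realize far more than linear interpolation, but the testing-time usage described above (one scalar adjusting conversion strength) has the same interface.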
Citations: 0
Highly Robust Action Retrieval using View-invariant Pose Feature and Simple yet Effective Query Expansion Method
Noboru Yoshida, Jianquan Liu
Action retrieval and detection utilizing view-invariant pose-based features achieve high precision. However, the technology suffers from low recall because of large individual differences in how actions are performed. Query-expansion (QE) methods are well known as effective ways to improve recall in object detection and retrieval tasks, but little research has adapted them to action retrieval. We focus on query expansion and propose a new query generation method in which two queries containing missing points complement each other's missing points, enabling high-recall action retrieval. Experimental results show that our method outperforms state-of-the-art methods on a simulated dataset with annotated multi-view 2D poses and on a real-world video dataset.
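A minimal sketch of the complementary-query idea, assuming each query is a list of 2D keypoints with `None` marking missing points (the representation is hypothetical, not the paper's):

```python
def merge_queries(q1, q2):
    """Combine two pose queries so that keypoints missing (None) in one
    query are filled from the other; average when both are present."""
    merged = []
    for a, b in zip(q1, q2):
        if a is None:
            merged.append(b)
        elif b is None:
            merged.append(a)
        else:
            merged.append(((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0))
    return merged

# Two partial pose queries: each is missing a different keypoint.
q1 = [(0.0, 0.0), None, (2.0, 2.0)]
q2 = [None, (1.0, 1.0), (4.0, 4.0)]
full = merge_queries(q1, q2)
```

The merged query has no missing points, so matching against it is less likely to reject true positives, which is the recall gain the abstract describes.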
Citations: 0
Neural Network Based Watermarking Trained with Quantized Activation Function
Shingo Yamauchi, Masaki Kawamura
We propose a watermarking method that incorporates a quantized activation function to provide robustness against quantization. Zhu et al. showed that the introduction of a noise layer between the encoder and decoder can increase the robustness against attacks. Although there are various attacks on stego-images, these images are often JPEG-compressed. As the process of JPEG compression includes quantization, the watermark decoder must be able to estimate watermarks from compressed images. Hence, we propose a quantization layer that introduces a quantized activation function consisting of the hyperbolic tangent function. The proposed neural network is based on that proposed by Hamamoto and Kawamura. By simulating the quantization of JPEG compression, the quantization layer is expected to improve the robustness against JPEG compression. The robustness was evaluated by the bit error rate (BER), and the stego-image quality was evaluated by the peak signal-to-noise ratio (PSNR). The proposed network achieved a high image quality of more than 35 dB, and it could extract watermarks with a BER of less than 0.1 for Q-values of 30 or higher in JPEG compression. It was thus more robust against JPEG compression than Hamamoto and Kawamura's model.
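One plausible reading of a tanh-based quantized activation (hedged, since the paper's exact construction is not given here) is tanh followed by uniform rounding to a fixed number of levels, mimicking the rounding step of JPEG-style quantization:

```python
import math

def quantized_tanh(x, levels=8):
    """tanh squashes x into (-1, 1); the result is then rounded to one of
    `levels` uniformly spaced values in [-1, 1], simulating quantization."""
    y = math.tanh(x)
    step = 2.0 / (levels - 1)
    return round((y + 1.0) / step) * step - 1.0
```

Training through such an activation exposes the network to quantization error, which is the stated motivation for the quantization layer.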
Citations: 0
Neural Vocoder Feature Estimation for Dry Singing Voice Separation
Jae-Yeol Im, Soonbeom Choi, Sangeon Yong, Juhan Nam
Singing voice separation (SVS) is the task of separating singing voice audio from its mixture with instrumental audio. Previous SVS studies have mainly employed the spectrogram masking method, which requires a large dimensionality when predicting the binary masks. In addition, they focused on extracting a vocal stem that retains the wet sound with the reverberation effect. This may hinder the reusability of the isolated singing voice. This paper addresses these issues by predicting the mel-spectrogram of the dry singing voice from the mixed audio as neural vocoder features and synthesizing the singing voice waveform with the neural vocoder. We experimented with two separation methods: one predicts binary masks in the mel-spectrogram domain, and the other directly predicts the mel-spectrogram. Furthermore, we add a singing voice detector to identify the singing voice segments over time more explicitly. We measured model performance in terms of audio, dereverberation, separation, and overall quality. The results show that our proposed model outperforms state-of-the-art singing voice separation models in both objective and subjective evaluation, except for audio quality.
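A toy sketch of the binary-mask variant in the mel domain, assuming per-bin vocal and accompaniment magnitude estimates (the names and shapes are illustrative, not the paper's network outputs):

```python
def apply_binary_mask(mix_mel, vocal_mel_est, accomp_mel_est):
    """Binary masking per mel bin: keep a mixture bin when the vocal
    estimate dominates the accompaniment estimate, otherwise zero it."""
    return [m if v >= a else 0.0
            for m, v, a in zip(mix_mel, vocal_mel_est, accomp_mel_est)]

masked = apply_binary_mask([1.0, 2.0, 3.0],
                           [0.9, 0.1, 0.5],
                           [0.1, 0.9, 0.5])
```

The direct-prediction variant skips the mask entirely and regresses the dry-vocal mel-spectrogram itself, which the vocoder then turns into a waveform.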
Citations: 0
Syllable Analysis Data Augmentation for Khmer Ancient Palm leaf Recognition
Nimol Thuon, Jun Du, Jianshu Zhang
The unique forms and physical condition of Khmer palm leaf manuscripts are receiving increasing attention from researchers working on recognition systems. In state-of-the-art systems, data augmentation is commonly used for training; however, grammatical mistakes and limited data availability during training can constrain the accuracy rate. The two significant challenges lie in (1) grammar complexity and (2) wording similarity. This paper therefore presents the Syllable Analysis Data Augmentation (SADA) technique, which aims at boosting the accuracy of a text recognition system for one of Southeast Asia's historical manuscripts from Cambodia. SADA comprises two fundamental modules: (1) formulating a collection of syllables/words to structure glyph patterns and (2) generating patterns from existing data through augmentation techniques, utilizing flexible geometric image transformations to increase the number of similar word/text images. Initially, image collections are established, whereby datasets are interpreted according to the reordered grammatical structures to construct multiple glyph images. Next, we conduct experiments with a text/word recognition system before tuning an attention-based encoder-decoder to improve transcription of low- and high-resolution images. Finally, the experiments center on datasets from various sources, including public datasets from the ICFHR 2018 contest and our new augmentation datasets, to demonstrate and evaluate the accuracy of the findings.
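A minimal, hypothetical example of the geometric-transformation side of the augmentation: translating glyph keypoints by a small random offset to synthesize additional similar word images (the paper's transformations are applied to images and are richer than this):

```python
import random

def jitter_glyph(points, max_shift=1, seed=0):
    """Apply one random translation (up to max_shift in each axis) to all
    keypoints of a glyph, producing a geometrically perturbed copy."""
    rng = random.Random(seed)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return [(x + dx, y + dy) for x, y in points]

pts = [(0, 0), (3, 4)]
aug = jitter_glyph(pts, max_shift=2, seed=1)
```

Generating several such perturbed copies per glyph grows the training set without collecting new manuscripts.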
Citations: 0
DBR: A Depth-Branch-Resorting Algorithm for Locality Exploration in Graph Processing
Lin Jiang, Ru Feng, Junjie Wang, Junyong Deng
Unstructured and irregular graph data causes strong randomness and poor locality of data access in graph processing. To alleviate this problem, this paper proposes a Depth-Branch-Resorting (DBR) algorithm for locality exploration in graph processing, together with the corresponding graph data compression format, DBR_DCSR. The DBR algorithm and DBR_DCSR format are tested and verified on the GraphBIG framework. The results show that, in terms of execution time, the DBR algorithm and DBR_DCSR format reduce GraphBIG execution time by 55.6% compared with the original GraphBIG framework, and by 71.7% and 11.46% compared with the Ligra and Gemini frameworks, respectively. Compared with the original GraphBIG framework, the optimized framework in DBR_DCSR format achieves a maximum reduction of 87.9% in data movement and 52.3% in data computation. Compared to Ligra and Gemini, data movement is reduced by 33.5% and 49.7%, and data computation by 54.3% and 43.9%, respectively.
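The DBR_DCSR format itself is not specified in this abstract; as background, a plain CSR (compressed sparse row) layout, the kind of compact adjacency that DCSR-style formats refine, can be built as follows:

```python
def to_csr(num_nodes, edges):
    """Build CSR arrays (row_ptr, col) from an edge list of (src, dst) pairs."""
    edges = sorted(edges)                  # group edges by source vertex
    row_ptr = [0] * (num_nodes + 1)
    col = []
    for src, dst in edges:
        row_ptr[src + 1] += 1              # count out-degree per vertex
        col.append(dst)
    for i in range(num_nodes):
        row_ptr[i + 1] += row_ptr[i]       # prefix sum -> row offsets
    return row_ptr, col

def neighbors(row_ptr, col, v):
    """Neighbors of v sit contiguously in col, which is what gives locality."""
    return col[row_ptr[v]:row_ptr[v + 1]]

row_ptr, col = to_csr(3, [(0, 1), (1, 2), (0, 2)])
```

Because each vertex's neighbor list is a contiguous slice, traversals touch memory sequentially per vertex; reordering vertices (as DBR does) then controls how those slices are laid out relative to each other.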
Citations: 0
Nonlinear Residual Echo Suppression Based on Gated Dual Signal Transformation LSTM Network
Kai Xie, Ziye Yang, Jie Chen
Although adaptive filters play a vital role in acoustic echo cancellation systems, multiple factors prevent them from completely eliminating the echo signal. Consequently, an additional suppression module is required and is crucial for enhancing echo cancellation performance. In this work, we propose a gated dual signal transformation LSTM network (Gated DTLN) that improves upon the recently developed Dual Signal Transformation LSTM Network for AEC (DTLN-aec). Gated convolution units are inserted to enhance filtering features in the time-domain part of the model, while the echo reference signal is removed from the input of this part to reduce the complexity of the mask generator. Experimental results on datasets with different signal-to-echo ratios (SERs) demonstrate the superiority of the proposed method.
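A sketch of the gating idea behind gated convolution units: one path's output is modulated element-wise by a sigmoid gate (shown here on plain vectors rather than actual convolution outputs, so this is an illustration, not the paper's layer):

```python
import math

def glu(linear_out, gate_out):
    """Gated linear unit: element-wise product of a linear path and a
    sigmoid-activated gate path. The gate decides how much of each
    feature passes through."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    return [a * sigmoid(b) for a, b in zip(linear_out, gate_out)]

gated = glu([1.0, 2.0], [0.0, 0.0])
```

With a zero gate pre-activation, exactly half of each feature passes; large positive gate values pass the feature nearly unchanged, and large negative values suppress it.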
Citations: 0
Adapted Spectrogram Transformer for Unsupervised Cross-Domain Acoustic Anomaly Detection
Gilles Van De Vyver, Zhaoyi Liu, Koustabh Dolui, D. Hughes, Sam Michiels
Anomaly detection models can help to automatically and proactively detect faults in industrial machines. Microphones are appealing as they are generally inexpensive and, unlike visual inspection, recorded sound samples can give information about the internals of a machine. However, conventional methods based on an AutoEncoder (AE) structure learned from scratch generally struggle to learn how to robustly reconstruct samples with limited available data. This paper addresses this problem by presenting a method for unsupervised Acoustic Anomaly Detection (AAD) that adapts intermediate embeddings from a pretrained, self-attention-based spectrogram transformer. Transfer learning from a large, successful model offers a solution to learning with limited data by reusing external knowledge; for AAD, this can help to recognize subtle anomalies. This work proposes two method variants that take advantage of Intermediate Feature Embeddings (IFEs) from the Audio Spectrogram Transformer (AST). The first fits a Gaussian Mixture Model (GMM) on the IFEs produced by intermediate layers of the AST; we call this ADIFAST: Anomaly Detection from Intermediate Features extracted from AST. The second uses the IFEs in a different, more effective way by adapting the AST to an AE structure; we call it TELD: Transformer Encoder Linear Decoder network. Evaluating TELD on task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge gives an average improvement in the Area Under Curve (AUC) score of 3.9% for binary labeling of normal and anomalous samples in the target domain.
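As a simplified, single-component stand-in for fitting a GMM on intermediate embeddings (the paper uses a full mixture), one can fit a diagonal Gaussian to the embeddings of normal samples and score new samples by negative log-likelihood:

```python
import math

def fit_gaussian(embeddings):
    """Fit a diagonal Gaussian to normal-sample embeddings: a
    1-component simplification of GMM-based density scoring."""
    n, d = len(embeddings), len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(d)]
    var = [sum((e[j] - mean[j]) ** 2 for e in embeddings) / n + 1e-6
           for j in range(d)]
    return mean, var

def anomaly_score(x, mean, var):
    """Negative log-likelihood under the fitted Gaussian; higher = more anomalous."""
    return 0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                     for xi, m, v in zip(x, mean, var))

normal = [[0.0, 0.0], [0.2, -0.2], [-0.2, 0.2], [0.0, 0.0]]
mean, var = fit_gaussian(normal)
```

Embeddings far from the normal cluster receive a much higher score, which is exactly the decision statistic a density model over IFEs provides.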
Adapted Spectrogram Transformer for Unsupervised Cross-Domain Acoustic Anomaly Detection
Gilles Van De Vyver, Zhaoyi Liu, Koustabh Dolui, D. Hughes, Sam Michiels
Anomaly detection models can help to automatically and proactively detect faults in industrial machines. Microphones are appealing because they are generally inexpensive and, unlike visual inspection, recorded sound samples carry information about the internals of a machine. However, conventional methods based on an AutoEncoder (AE) structure learned from scratch generally struggle to reconstruct samples robustly when only limited data is available. This paper addresses the problem by presenting a method for unsupervised Acoustic Anomaly Detection (AAD) that adapts intermediate embeddings from a pretrained, self-attention-based spectrogram transformer. Transfer learning from a large, successful model offers a way to learn from limited data by reusing external knowledge; for AAD, this can help to recognize subtle anomalies. This work proposes two method variants that take advantage of Intermediate Feature Embeddings (IFEs) from the Audio Spectrogram Transformer (AST). The first fits a Gaussian Mixture Model (GMM) on the IFEs produced by intermediate layers of the AST; we call this ADIFAST: Anomaly Detection from Intermediate Features extracted from AST. The second uses the IFEs in a different, more effective way by adapting the AST to an AE structure; we call it TELD: Transformer Encoder Linear Decoder network. Both variants build on the IFEs extracted by the AST. Evaluating TELD on task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge gives an average improvement of 3.9% in the Area Under Curve (AUC) score for binary labeling of normal and anomalous samples in the target domain.
doi:10.23919/APSIPAASC55919.2022.9980266
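The ADIFAST variant (fit a GMM on intermediate feature embeddings of normal sounds, then flag low-likelihood samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for IFEs already extracted from an intermediate AST layer, and a small hand-rolled diagonal-covariance GMM trained with EM replaces whatever GMM setup the paper used. The anomaly score is the negative log-likelihood of an embedding under the mixture fitted to normal data.

```python
import numpy as np

def fit_diag_gmm(X, k=2, iters=50, seed=0):
    """Fit a diagonal-covariance Gaussian mixture to the rows of X with plain EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]   # init means from data points
    var = np.tile(X.var(axis=0) + 1e-6, (k, 1))    # per-component variances
    pi = np.full(k, 1.0 / k)                       # mixture weights
    for _ in range(iters):
        # E-step: per-component log densities -> responsibilities
        logp = (-0.5 * (np.log(2 * np.pi * var)[None]
                        + (X[:, None] - mu[None]) ** 2 / var[None]).sum(-1)
                + np.log(pi)[None])
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances (with a small floor)
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
    return pi, mu, var

def anomaly_score(X, pi, mu, var):
    """Negative log-likelihood under the fitted GMM; higher means more anomalous."""
    logp = (-0.5 * (np.log(2 * np.pi * var)[None]
                    + (X[:, None] - mu[None]) ** 2 / var[None]).sum(-1)
            + np.log(pi)[None])
    m = logp.max(axis=1, keepdims=True)
    return -(m[:, 0] + np.log(np.exp(logp - m).sum(axis=1)))  # log-sum-exp

# Toy stand-ins for AST embeddings: "healthy machine" vs shifted "anomalous" sounds
rng = np.random.default_rng(1)
normal_ifes = rng.normal(0.0, 1.0, size=(500, 8))
anomalous_ifes = rng.normal(4.0, 1.0, size=(50, 8))
pi, mu, var = fit_diag_gmm(normal_ifes, k=2)
scores_normal = anomaly_score(normal_ifes, pi, mu, var)
scores_anom = anomaly_score(anomalous_ifes, pi, mu, var)
```

A deployment would replace the toy arrays with IFEs from a chosen AST layer and calibrate a decision threshold on `scores_normal` using held-out recordings of a healthy machine.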
Citations: 0
Intelligibility prediction of enhanced speech using recognition accuracy of end-to-end ASR systems
Kenichi Arai, A. Ogawa, S. Araki, K. Kinoshita, T. Nakatani, Naoyuki Kamo, T. Irino
We propose speech intelligibility (SI) prediction methods that use the recognition accuracy of an end-to-end (E2E) automatic speech recognition (ASR) system, whose performance has become comparable to the human auditory system thanks to recent significant progress. Such predictors will fuel the development of speech enhancement methods for human listeners. In this paper, we evaluate the proposed methods' performance in predicting the intelligibility of enhanced noisy speech signals. Our experiments show that when the ASR systems are trained with various noisy speech data, our proposed methods, which require no clean reference signals, predict SI more accurately than the existing "intrusive" methods, short-time objective intelligibility (STOI) and extended STOI (eSTOI), as well as our previously proposed methods based on deep neural network-hidden Markov model hybrid ASR systems. Our experiments also show that our method that additionally uses clean speech to determine the speech region of the evaluation signals further improves the prediction accuracy over the existing methods.
doi:10.23919/APSIPAASC55919.2022.9980257
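The core idea of this abstract, using the recognition accuracy of an ASR system as an intelligibility proxy that needs no clean reference signal, can be illustrated with a plain word-accuracy computation. The sketch below assumes E2E ASR decoding has already produced hypothesis strings (the transcripts here are hypothetical stand-ins for real ASR output), and it uses word accuracy derived from a word-level edit distance as one concrete accuracy measure, not necessarily the exact measure of the paper.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - WER, floored at 0, from a word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming Levenshtein distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    wer = d[len(ref)][len(hyp)] / max(len(ref), 1)
    return max(0.0, 1.0 - wer)

def predict_intelligibility(pairs):
    """Proxy SI score: mean word accuracy over (reference, ASR hypothesis) pairs."""
    return sum(word_accuracy(r, h) for r, h in pairs) / len(pairs)

# Hypothetical decoded outputs for clean vs heavily degraded speech
clean = [("please stop the machine", "please stop the machine")]
degraded = [("please stop the machine", "least stop machine")]
```

The intuition is that ASR accuracy degrades on the same utterances that humans find hard to understand, so the proxy score tracks listener intelligibility without access to the clean signal.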
Citations: 1
Journal
2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)