
2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC): Latest Publications

End-to-end Visual-guided Audio Source Separation with Enhanced Losses
D. Pham, Quang-Anh Do, Thanh Thi Hien Duong, Thi-Lan Le, Phi-Le Nguyen
Visual-guided Audio Source Separation (VASS) refers to separating individual sound sources from an audio mixture of multiple simultaneous sound sources by using additional visual features that guide the separation process. For the VASS task, visual features and the correlation between audio and visual information play an important role; based on these, we estimate better audio masks to improve separation performance. In this paper, we propose an approach to jointly train the components of a cross-modal retrieval framework with video data, enabling the network to find more optimal features. This end-to-end framework is trained with three loss functions: 1) a separation loss to limit the discrepancy of the separated magnitude spectrograms, 2) an object-consistency loss to enforce the consistency of the separated audio with the visual information, and 3) a cross-modal loss to maximize the correlation between the audio and its corresponding visual sounding object while also maximizing the difference between the audio and visual information of different objects. The proposed VASS model was evaluated on the benchmark MUSIC dataset, which contains a large number of videos of people playing instruments in different combinations. Experimental results confirmed the advantages of our model over previous VASS models.
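For illustration, the three training objectives can be sketched as follows (a minimal pure-Python sketch; the function names, the triplet form of the cross-modal term, and the loss weights are assumptions for exposition, not the paper's exact formulation):

```python
import math

def l1_separation_loss(pred_mag, target_mag):
    """Separation loss: mean absolute discrepancy between the separated and
    ground-truth magnitude spectrograms (flattened to 1-D here for brevity)."""
    return sum(abs(p - t) for p, t in zip(pred_mag, target_mag)) / len(pred_mag)

def object_consistency_loss(audio_logits, object_label):
    """Object-consistency loss: cross-entropy pushing the separated audio to be
    classified as the visually detected sounding object."""
    m = max(audio_logits)
    exps = [math.exp(x - m) for x in audio_logits]
    return -math.log(exps[object_label] / sum(exps))

def cross_modal_triplet_loss(a, v_pos, v_neg, margin=0.5):
    """Cross-modal loss: pull an audio embedding toward the visual embedding of
    its own sounding object, push it away from a different object's embedding."""
    def cos(x, y):
        num = sum(i * j for i, j in zip(x, y))
        den = math.sqrt(sum(i * i for i in x)) * math.sqrt(sum(j * j for j in y))
        return num / den
    return max(0.0, margin - cos(a, v_pos) + cos(a, v_neg))

def total_loss(sep, obj, xmod, w_obj=1.0, w_xmod=1.0):
    # Weighted sum of the three objectives; the weights are illustrative.
    return sep + w_obj * obj + w_xmod * xmod
```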
DOI: 10.23919/APSIPAASC55919.2022.9980162 (published 2022-11-07)
Citations: 2
Continuous authentication for smartphones using face images and touch-screen operation
Shuto Kinoshita, Yuka Watanabe, Y. Yamazaki
Conventional user authentication methods for smartphones, such as PINs, passwords, and pattern locks, share a problem: authentication is not performed continuously after the first success, so an authenticated smartphone risks being used improperly by unauthorized individuals. We propose a novel continuous authentication method for smartphones that uses face images and touch-screen operation, and we evaluate its effectiveness.
DOI: 10.23919/APSIPAASC55919.2022.9980045 (published 2022-11-07)
Citations: 0
A Deep Proximal-Unfolding Method for Monaural Speech Dereverberation
Meihuang Wang, Minmin Yuan, Andong Li, C. Zheng, Xiaodong Li
Speech is often distorted by reverberation in an enclosure when the microphone is placed far away from the speech source, reducing speech quality and intelligibility. Recent years have witnessed the development of deep neural networks, and many deep learning-based methods have been proposed for dereverberation. Most of them remove the reverberation by directly mapping the reverberant speech to the target speech, which often lacks adequate interpretability and limits the performance upper bound. This paper proposes a deep unfolding method with an interpretable network structure. First, the dereverberation problem is reformulated based on the maximum posterior criterion, and an iterative optimization algorithm is devised using proximal operators. Second, we unfold the iterative optimization algorithm into a multi-stage deep neural network, where each stage corresponds to a specific operation of the iterative procedure. Experiments were conducted on the WSJ0-SI84 corpus, and the results on both simulated and real RIRs showed that the proposed model outperformed previous models and achieved state-of-the-art performance in terms of PESQ, ESTOI, and frequency-weighted segmental SNR.
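The abstract does not specify the exact operators, but the general deep-unfolding idea can be illustrated with a classic proximal-gradient (ISTA-style) iteration, where each unrolled stage would carry its own learnable step size and threshold (the sparsity prior and all names below are illustrative assumptions, not the paper's model):

```python
def soft_threshold(v, theta):
    """Proximal operator of the L1 norm (promotes sparsity)."""
    if v > theta:
        return v - theta
    if v < -theta:
        return v + theta
    return 0.0

def unfolded_ista(y, stages):
    """Run a fixed number of unfolded proximal-gradient stages.
    `stages` is a list of (step_size, threshold) pairs; in a deep-unfolding
    network each pair would be a learned parameter of one network stage."""
    x = [0.0] * len(y)
    for eta, theta in stages:
        # Gradient step on the data-fidelity term 0.5 * ||x - y||^2 ...
        x = [xi - eta * (xi - yi) for xi, yi in zip(x, y)]
        # ... followed by the proximal step enforcing the prior.
        x = [soft_threshold(xi, theta) for xi in x]
    return x
```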
DOI: 10.23919/APSIPAASC55919.2022.9979935 (published 2022-11-07)
Citations: 0
New Methods for Fast Detection for Embedded Cognitive Radio
Grégoire De Broglie, Louis Morge-Rollet, D. L. Jeune, F. Roy, C. Roland, Charles Canaff, J. Diguet
Spectrum sensing is an important part of the Cognitive Radio (CR) process. It can be used to determine whether a Primary User (PU), i.e. a licensed user, is emitting in the communication channel. This paper presents and compares three types of FFT-based detection algorithms for LTE-Advanced (LTE-A) cellular networks at the Orthogonal Frequency Division Multiple Access (OFDMA) level. These detectors sense the usage of the minimum time-frequency unit, called a Resource Block (RB). They are also low-latency detectors: they need only one particular Orthogonal Frequency Division Multiplexing (OFDM) symbol to detect the usage of one RB. The three new detectors are based respectively on energy, on correlation, and on a combination of the two, which we call eogration. We analyze them with Fisher's ratio and hypothesis-test simulations. The computational complexity of these detectors is also analyzed theoretically to provide guidance for future implementations.
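As a rough sketch of the first two detector families (the combined eogration statistic is not specified in the abstract, and the pilot pattern below is hypothetical), per-RB energy and correlation metrics over FFT bins might look like this:

```python
import cmath
import math

def dft(x):
    """Naive DFT (a stand-in for an FFT) returning complex subcarrier bins."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def rb_energy(bins, rb):
    """Energy detector: total power on the subcarriers of one Resource Block."""
    return sum(abs(bins[k]) ** 2 for k in rb)

def rb_correlation(bins, rb, pilot):
    """Correlation detector: magnitude of the correlation between the received
    subcarriers and a known reference pattern (hypothetical pilot here)."""
    return abs(sum(bins[k] * pilot[i].conjugate() for i, k in enumerate(rb)))

def detect(metric, threshold):
    # Hypothesis test: declare the RB occupied if the metric exceeds a threshold.
    return metric > threshold
```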
DOI: 10.23919/APSIPAASC55919.2022.9980109 (published 2022-11-07)
Citations: 0
Effect of Noise on the Perceptual Contribution of Cochlea-Scaled Entropy and Speech Level in Mandarin Sentence Understanding
Weikang Wu, Shangdi Liao, Fei Chen
Many studies have investigated the impact of various speech segments on speech intelligibility in order to identify important information-bearing regions for the design of new speech processing methods, e.g., speech enhancement. Early findings suggested that cochlea-scaled entropy (CSE) and speech level were important indicators of speech intelligibility in quiet conditions. This study further compared the perceptual contributions of CSE and speech level under noisy conditions. Mandarin sentences were masked by steady-state noise and two-talker babble and edited to generate high-entropy-only and high-level-only stimuli, preserving the segments with the largest CSEs or the highest levels in clean sentences, respectively, and replacing the rest with noise; the stimuli were then played to normal-hearing listeners for recognition. Results showed that high-entropy-only stimuli were more intelligible than high-level-only stimuli under noisy conditions.
This intelligibility benefit may be attributed to the amount of vowel-consonant transitions, rather than to differences in effective signal-to-noise ratios, between the two types of stimuli.
DOI: 10.23919/APSIPAASC55919.2022.9979873 (published 2022-11-07)
Citations: 0
Novel Smart Sectoring and Beam Designs in mmWave Broadcast Channels
Yang He, S. Tsai, Jen-Ming Wu
This work proposes a smart sectoring scheme for mmWave broadcast systems to enhance throughput by overcoming the inefficient power consumption caused, in traditional sectoring systems, by transmitting power toward undesired user directions. We optimize the beam pattern of the proposed scheme so that multiple users can be served simultaneously by only one RF chain. As a result, the hardware complexity can be greatly reduced. Simulation results show that the proposed sectoring scheme significantly outperforms traditional ones under the same numbers of RF chains and antennas. In addition, the proposed scheme also has an advantage in complexity: even with only one RF chain, it can still achieve performance close to that of traditional systems with multiple RF chains (the benchmark).
DOI: 10.23919/APSIPAASC55919.2022.9979922 (published 2022-11-07)
Citations: 0
A Study on Low-Latency Recognition-Synthesis-Based Any-to-One Voice Conversion
Yi-Yang Ding, Li-Juan Liu, Yu Hu, Zhenhua Ling
Some application scenarios of voice conversion, such as identity disguise in voice communication, require low-latency generation of the converted speech. In traditional conversion methods, both history and future information in the input speech are utilized to predict the converted acoustic features at each frame, which leads to long conversion latency. Therefore, this paper proposes a low-latency recognition-synthesis-based any-to-one voice conversion method. Bottleneck (BN) features are extracted by an automatic speech recognition (ASR) acoustic model for frame-by-frame phoneme classification. A minimum mutual information (MMI) loss is introduced to reduce the speaker information in the BN features caused by the low-latency configuration. The BN features are fed into a speaker-dependent low-latency LSTM-based acoustic feature predictor, and the speech waveforms are reconstructed from the predicted acoustic features by an LPCNet vocoder. The total latency of our proposed voice conversion method is 190 ms, which is less than the delay requirement for comfortable communication in ITU-T G.114.
The naturalness of converted speech is comparable with the upper-bound model trained without low-latency constraints.
DOI: 10.23919/APSIPAASC55919.2022.9980091 (published 2022-11-07)
Citations: 0
Flow-Based Variational Sequence Autoencoder
Jen-Tzung Chien, Tien-Ching Luo
Posterior collapse, also known as Kullback-Leibler (KL) vanishing, is a long-standing problem in the variational recurrent autoencoder (VRAE), which is essentially developed for sequence generation. To alleviate the vanishing problem, a complicated latent variable is required instead of assuming a standard Gaussian. Normalizing flows were proposed to build bijective neural networks that convert a simple distribution into a complex one. The resulting approximate posterior is closer to the true posterior, allowing better sequence generation. The KL divergence in the learning objective is accordingly preserved to enrich the capability of generating diverse sequences. This paper presents a flow-based VRAE to build a disentangled latent representation for sequence generation. KL-preserving flows are exploited for the conditional VRAE and evaluated for text representation as well as dialogue generation. In the implementation, amortized regularization and skip connections are further imposed to strengthen the embedding and prediction.
Experiments on different tasks show the merit of this latent variable representation for language modeling, sentiment classification and dialogue generation.
DOI: 10.23919/APSIPAASC55919.2022.9979970 (published 2022-11-07)
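The change-of-variables mechanics behind such flows can be illustrated with a single affine bijection (a toy stand-in; real normalizing flows stack nonlinear invertible layers and learn their parameters):

```python
import math

def standard_normal_logpdf(z):
    """Log-density of the N(0, 1) base distribution."""
    return -0.5 * (z * z + math.log(2 * math.pi))

class AffineFlow:
    """One bijective affine layer x = a*z + b (a != 0). Stacking such layers,
    with nonlinear couplings in practice, turns a simple base distribution
    into a more complex approximate posterior."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def forward(self, z):
        return self.a * z + self.b

    def inverse(self, x):
        return (x - self.b) / self.a

    def log_prob(self, x):
        # Change of variables: log p(x) = log p(z) - log |det df/dz|.
        z = self.inverse(x)
        return standard_normal_logpdf(z) - math.log(abs(self.a))
```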
Citations: 1
Dialect-aware Semi-supervised Learning for End-to-End Multi-dialect Speech Recognition
Sayaka Shiota, Ryo Imaizumi, Ryo Masumura, H. Kiya
In this paper, we propose dialect-aware semi-supervised learning for end-to-end automatic speech recognition (ASR) models considering multi-dialect speech. Some multi-domain ASR tasks require a large amount of training data containing additional information (e.g., language or dialect), whereas it is difficult to prepare such data with accurate transcriptions. Semi-supervised learning is a method of using a massive amount of untranscribed data effectively, and it can be applied to multi-domain ASR tasks to mitigate the missing-transcription problem. However, semi-supervised learning has usually used only the generated pseudo-transcriptions. The problem is that simply combining a multi-domain model with semi-supervised learning makes use of no additional information even though that information can be obtained. Therefore, in this paper, we focus on semi-supervised learning based on a multi-domain model that takes additional domain information into account. Since the accuracy of pseudo-transcriptions can be improved by using the multi-domain model and additional information, the proposed semi-supervised learning is expected to provide a reliable ASR model. In experiments, we performed Japanese multi-dialect ASR as one type of multi-domain ASR.
From the results, a model trained with the proposed method yielded the lowest character error rate compared with other models trained with the conventional semi-supervised method.
DOI: 10.23919/APSIPAASC55919.2022.9980139 (published 2022-11-07)
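A confidence-filtered pseudo-labeling step that attaches a predicted dialect tag to each pseudo-transcription might look like the following (entirely illustrative; the callables, tuple layout, and threshold are our assumptions, not the paper's implementation):

```python
def make_pseudo_labels(utterances, asr, dialect_clf, conf_threshold=0.8):
    """For each untranscribed utterance, keep the ASR hypothesis only if its
    confidence is high enough, and attach the predicted dialect tag so a
    multi-domain model can condition on it (all callables are hypothetical)."""
    labeled = []
    for u in utterances:
        text, conf = asr(u)          # pseudo-transcription + confidence score
        if conf >= conf_threshold:   # discard unreliable hypotheses
            labeled.append((u, dialect_clf(u), text))
    return labeled
```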
Citations: 0
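The dialect-aware pseudo-labeling idea described in the abstract above can be sketched roughly as follows. This is a minimal pure-Python illustration with entirely hypothetical names (`transcribe`, `make_pseudo_labeled_set`, the toy lookup-table "model"), not the paper's implementation: a multi-domain model conditioned on a dialect ID decodes untranscribed audio, and the dialect tag is kept alongside the pseudo-transcription so the ASR model can be retrained with both.

```python
# Hypothetical sketch of dialect-aware pseudo-labeling; a real system
# would decode with a trained end-to-end ASR model, not a lookup table.

def transcribe(model, audio, dialect_id):
    """Decode audio with a multi-domain model conditioned on a dialect ID."""
    # Toy "model": a dict keyed by (dialect, audio) pairs.
    return model.get((dialect_id, audio), "")

def make_pseudo_labeled_set(model, untranscribed, min_len=1):
    """Generate (dialect, audio, pseudo-transcription) triples,
    keeping only non-degenerate decodes."""
    out = []
    for dialect_id, audio in untranscribed:
        hyp = transcribe(model, audio, dialect_id)
        if len(hyp) >= min_len:  # crude stand-in for a confidence filter
            out.append((dialect_id, audio, hyp))
    return out

toy_model = {("kansai", "utt1"): "ookini", ("tokyo", "utt2"): "arigatou"}
pool = [("kansai", "utt1"), ("tokyo", "utt2"), ("tokyo", "utt3")]
pseudo = make_pseudo_labeled_set(toy_model, pool)
# utt3 yields no decode, so only two pseudo-labeled utterances survive
```

The point of keeping the dialect ID in each triple is that retraining can then exploit the domain information instead of the pseudo-transcriptions alone, which is the gap the paper identifies in plain semi-supervised learning.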
Self-Consistency Training with Hierarchical Temporal Aggregation for Sound Event Detection
Yunlong Li, Xiujuan Zhu, Mingyu Wang, Ying Hu
In this paper, we propose a sound event detection (SED) method, named SCT-HTA, based on a self-consistency training (SCT) strategy and a hierarchical temporal aggregation (HTA) module. The method adopts the Mean Teacher (MT) semi-supervised learning approach, exploiting a dual-branch convolutional recurrent neural network (CRNN) structure consisting of a main branch and an auxiliary branch. The SCT strategy applies a self-consistency regularization, in addition to the MT loss, to maintain consistency between the outputs of the auxiliary and main branches. Furthermore, the HTA module aggregates information at different temporal resolutions. We also explore three aggregators for the HTA module and four combinations of pooling methods in the localization modules of the two branches. Experimental results demonstrate that the proposed SCT-HTA method outperforms the four compared methods, that the max pooling aggregator better highlights the locations of sound events, and that the “linear softmax + attention” pooling combination achieves the best performance.
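The Mean Teacher ingredient can be illustrated with a generic sketch (not the paper's implementation; weights are shown as flat lists of floats for simplicity): the teacher's weights follow the student's via an exponential moving average (EMA), and a consistency loss penalizes disagreement between the two models' frame-level predictions.

```python
# Generic Mean Teacher sketch in plain Python.

def ema_update(teacher_w, student_w, alpha=0.999):
    """Teacher weights track the student via an exponential moving average."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher_w, student_w)]

def consistency_loss(student_probs, teacher_probs):
    """Mean squared error between student and teacher frame predictions."""
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n

teacher = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.9)   # -> [0.9, 0.1]
loss = consistency_loss([0.2, 0.8], [0.2, 0.8])           # identical preds -> 0.0
```

In the paper's dual-branch setup an analogous consistency term is additionally applied between the auxiliary and main branches (the SCT regularization), on top of the standard MT loss between teacher and student.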
{"title":"Self-Consistency Training with Hierarchical Temporal Aggregation for Sound Event Detection","authors":"Yunlong Li, Xiujuan Zhu, Mingyu Wang, Ying Hu","doi":"10.23919/APSIPAASC55919.2022.9980285","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980285","url":null,"abstract":"In this paper, we propose a sound event detection (SED) method based on the self-consistency training (SCT) strategy and a hierarchical temporal aggregation (HTA) module, named SCT-HTA. This method adopts Mean Teacher (MT) semi-supervised learning method, exploiting a dual-branch convolutional recurrent neural network (CRNN) structure including the main branch and auxiliary branch. We adopt an SCT strategy to apply the self-consistency regularization in addition to the MT loss to maintain the consistency between the outputs of the auxiliary and main branches. Furthermore, an HTA module is designed to aggregate the information at different temporal resolutions. We also explored three aggregators to be applied in the HTA module and four kinds of combinations of pooling methods in the localization modules of two branches. Experimental results demonstrate that our proposed SCT-HTA method outperforms the four compared methods. The results show that the max pooling aggregator has a better ability to highlight the location of sound events. 
And the “linear softmax + attention” combination of the pooling method achieves the best performance.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121025782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
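The pooling choices compared in the abstract above can be illustrated with two standard clip-level pooling functions for SED (a minimal sketch only; the paper's localization modules additionally combine such pooling with attention): max pooling takes the strongest frame, while linear softmax pooling weights each frame probability by itself, giving a clip-level score of Σp² / Σp.

```python
# Standard SED pooling functions over frame-level probabilities (sketch).

def max_pool(frame_probs):
    """Clip-level score = strongest frame; a sharp localization cue."""
    return max(frame_probs)

def linear_softmax_pool(frame_probs):
    """Clip-level score = sum(p^2) / sum(p); each frame weights itself."""
    total = sum(frame_probs)
    return sum(p * p for p in frame_probs) / total if total > 0 else 0.0

probs = [0.1, 0.9, 0.2]
# max_pool(probs) -> 0.9
# linear_softmax_pool(probs) = (0.01 + 0.81 + 0.04) / 1.2 ≈ 0.7167
```

Max pooling is the more aggressive localizer (only the peak frame carries gradient), which is consistent with the abstract's finding that the max pooling aggregator better highlights event locations.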