
Latest publications: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Universal Acoustic Modeling Using Neural Mixture Models
Amit Das, Jinyu Li, Changliang Liu, Y. Gong
Acoustic models are domain dependent and do not perform well if there is a mismatch between training and test conditions. As an alternative, the Mixture of Experts (MoE) model was introduced for multi-domain modeling. It combines the outputs of several domain-specific models (or experts) using a gating network. However, one drawback is that the gating network directly uses raw features and is unaware of the state of the experts. In this work, we propose several alternatives to improve the MoE model. First, to make our MoE model state-aware, we use outputs of experts as inputs to the gating network. Then we show that vector-based interpolation of the mixture weights is more effective than scalar interpolation. Second, we show that directly learning the mixture weights without using any complex gating is still effective. Finally, we introduce a hybrid attention model that uses the logits and mixture weights from the previous time step to generate the mixture weights at the current time. Our best proposed model outperforms a baseline model using LSTM-based gating, achieving about a 20.48% relative reduction in word error rate (WER). Moreover, it beats an oracle model which picks the best expert for a given test condition.
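A minimal NumPy sketch of the state-aware gating idea described above: the gate takes the experts' outputs (not the raw acoustic features) as input and produces a vector of per-class mixture weights. The dimensions, the random gate matrix, and the single-layer softmax gate are illustrative assumptions, not the authors' LSTM- or attention-based gating.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_experts, num_classes = 3, 500          # hypothetical: 3 domain experts, 500 senones

# Per-frame logits from each (pre-trained) domain expert; random stand-ins here.
expert_logits = rng.standard_normal((num_experts, num_classes))

# State-aware gating: the gate sees the experts' outputs rather than raw features.
gate_input = expert_logits.reshape(-1)
W_gate = 0.01 * rng.standard_normal((num_experts * num_classes, num_experts * num_classes))

# Vector interpolation: one mixture weight per expert *and* per output class
# (softmax over the expert axis), instead of a single scalar weight per expert.
mixture_weights = softmax((W_gate @ gate_input).reshape(num_experts, num_classes), axis=0)

# Class-wise weighted combination of the expert posteriors.
combined = (mixture_weights * softmax(expert_logits, axis=-1)).sum(axis=0)
print(mixture_weights.shape, combined.shape)   # (3, 500) (500,)
```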
{"title":"Universal Acoustic Modeling Using Neural Mixture Models","authors":"Amit Das, Jinyu Li, Changliang Liu, Y. Gong","doi":"10.1109/ICASSP.2019.8682403","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682403","url":null,"abstract":"Acoustic models are domain dependent and do not perform well if there is a mismatch between training and test conditions. As an alternative, the Mixture of Experts (MoE) model was introduced for multi-domain modeling. It combines the outputs of several domain specific models (or experts) using a gating network. However, one drawback is that the gating network directly uses raw features and is unaware of the state of the experts. In this work, we propose several alternatives to improve the MoE model. First, to make our MoE model state-aware, we use outputs of experts as inputs to the gating network. Then we show that vector based interpolation of the mixture weights is more effective than scalar interpolation. Second, we show that directly learning the mixture weights without using any complex gating is still effective. Finally, we introduce a hybrid attention model that uses the logits and mixture weights from the previous time step to generate the mixture weights at the current time. Our best proposed model outperforms a baseline model using LSTM based gating achieving about 20.48% relative reduction in word error rate (WER). Moreover, it beats an oracle model which picks the best expert for a given test condition.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"30 1","pages":"5681-5685"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88462698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Speech Landmark Bigrams for Depression Detection from Naturalistic Smartphone Speech
Zhaocheng Huang, J. Epps, Dale Joachim
Detection of depression from speech has attracted significant research attention in recent years but remains a challenge, particularly for speech from diverse smartphones in natural environments. This paper proposes two sets of novel features based on speech landmark bigrams associated with abrupt speech articulatory events for depression detection from smartphone audio recordings. Combined with techniques adapted from natural language text processing, the proposed features further exploit landmark bigrams by discovering latent articulatory events. Experimental results on a large, naturalistic corpus containing various spoken tasks recorded from diverse smartphones suggest that speech landmark bigram features provide a 30.1% relative improvement in F1 (depressed) relative to an acoustic feature baseline system. As might be expected, a key finding was the importance of tailoring the choice of landmark bigrams to each elicitation task, revealing that different aspects of speech articulation are elicited by different tasks, which can be effectively captured by the landmark approaches.
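A toy sketch of the bigram feature idea: a detected sequence of landmark symbols is turned into normalized bigram counts, in the same spirit as word bigrams in text processing. The landmark labels and the normalization below are illustrative assumptions, not the paper's exact feature set.

```python
from collections import Counter

# Hypothetical landmark sequence detected from one recording; the labels
# (e.g. 'g+' = glottal onset, 'b+' = burst onset) are illustrative only.
landmarks = ['g+', 'b+', 's+', 's-', 'b-', 'g-', 'g+', 'b+', 's+', 'g-']

# Bigram features: counts of consecutive landmark pairs, analogous to word
# bigrams in natural language text processing.
bigram_counts = Counter(zip(landmarks, landmarks[1:]))

# Normalize to relative frequencies so recordings of different lengths are comparable.
total = sum(bigram_counts.values())
bigram_features = {bigram: count / total for bigram, count in bigram_counts.items()}
print(bigram_features)
```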
{"title":"Speech Landmark Bigrams for Depression Detection from Naturalistic Smartphone Speech","authors":"Zhaocheng Huang, J. Epps, Dale Joachim","doi":"10.1109/ICASSP.2019.8682916","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682916","url":null,"abstract":"Detection of depression from speech has attracted significant research attention in recent years but remains a challenge, particularly for speech from diverse smartphones in natural environments. This paper proposes two sets of novel features based on speech landmark bigrams associated with abrupt speech articulatory events for depression detection from smartphone audio recordings. Combined with techniques adapted from natural language text processing, the proposed features further exploit landmark bigrams by discovering latent articulatory events. Experimental results on a large, naturalistic corpus containing various spoken tasks recorded from diverse smartphones suggest that speech landmark bigram features provide a 30.1% relative improvement in F1 (depressed) relative to an acoustic feature baseline system. As might be expected, a key finding was the importance of tailoring the choice of landmark bigrams to each elicitation task, revealing that different aspects of speech articulation are elicited by different tasks, which can be effectively captured by the landmark approaches.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"3 10 1","pages":"5856-5860"},"PeriodicalIF":0.0,"publicationDate":"2019-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81377779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Robust M-estimation Based Matrix Completion
Michael Muma, W. Zeng, A. Zoubir
Conventional approaches to matrix completion are sensitive to outliers and impulsive noise. This paper develops robust and computationally efficient M-estimation based matrix completion algorithms. By appropriately arranging the observed entries, and then applying alternating minimization, the robust matrix completion problem is converted into a set of regression M-estimation problems. Making use of differentiable loss functions, the proposed algorithm overcomes a weakness of the ℓp-loss (p ≤ 1), which easily gets stuck in an inferior point. We prove that our algorithm converges to a stationary point of the nonconvex problem. Huber’s joint M-estimate of regression and scale can be used as a robust starting point for Tukey’s redescending M-estimator of regression based on an auxiliary scale. Numerical experiments on synthetic and real-world data demonstrate its superiority over state-of-the-art approaches.
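A hedged sketch of the kind of regression M-estimation subproblem that such an alternating-minimization scheme would solve for one factor, here with Huber's loss, a plain IRLS solver and a MAD scale estimate; it is a stand-in under those assumptions, not the authors' algorithm or their Huber/Tukey initialization strategy.

```python
import numpy as np

def huber_weights(r, c=1.345):
    """IRLS weights for Huber's loss: 1 inside the threshold c, c/|r| outside."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def huber_regression(A, y, iters=50, c=1.345):
    """Solve min_x sum_i rho_Huber(y_i - a_i^T x) by iteratively reweighted least squares."""
    x = np.linalg.lstsq(A, y, rcond=None)[0]        # plain LS start (a stand-in initialization)
    for _ in range(iters):
        r = y - A @ x
        scale = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12   # MAD scale estimate
        w = huber_weights(r / scale, c)
        Aw = A * w[:, None]
        x = np.linalg.solve(A.T @ Aw, Aw.T @ y)     # weighted normal equations
    return x

# Toy use: recover one factor vector from observations contaminated by impulsive outliers,
# as one inner step of an alternating-minimization matrix completion scheme might do.
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))                   # known factor rows at the observed entries
x_true = rng.standard_normal(5)
y = A @ x_true + 0.01 * rng.standard_normal(100)
y[::10] += 10.0                                     # impulsive noise on 10% of the entries
print(np.linalg.norm(huber_regression(A, y) - x_true))   # small compared to plain LS
```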
{"title":"Robust M-estimation Based Matrix Completion","authors":"Michael Muma, W. Zeng, A. Zoubir","doi":"10.1109/ICASSP.2019.8682657","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682657","url":null,"abstract":"Conventional approaches to matrix completion are sensitive to outliers and impulsive noise. This paper develops robust and computationally efficient M-estimation based matrix completion algorithms. By appropriately arranging the observed entries, and then applying alternating minimization, the robust matrix completion problem is converted into a set of regression M-estimation problems. Making use of differentiable loss functions, the proposed algorithm overcomes a weakness of the ℓp-loss (p ≤ 1), which easily gets stuck in an inferior point. We prove that our algorithm converges to a stationary point of the nonconvex problem. Huber’s joint M-estimate of regression and scale can be used as a robust starting point for Tukey’s redescending M-estimator of regression based on an auxiliary scale. Numerical experiments on synthetic and real-world data demonstrate the superiority to state-of-the-art approaches.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"31 1","pages":"5476-5480"},"PeriodicalIF":0.0,"publicationDate":"2019-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81231635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
When Can a System of Subnetworks Be Registered Uniquely?
A. V. Singh, K. Chaudhury
Consider a network with N nodes in d dimensions, and M overlapping subsets P1, ⋯, PM (subnetworks). Assume that the nodes in a given Pi are observed in a local coordinate system. We wish to register the subnetworks using the knowledge of the observed coordinates. More precisely, we want to compute the positions of the N nodes in a global coordinate system, given P1, ⋯, PM and the corresponding local coordinates. Among other applications, this problem arises in divide-and-conquer algorithms for localization of ad hoc sensor networks. The network is said to be uniquely registrable if the global coordinates can be computed uniquely (up to a rigid transform). Clearly, if the network is not uniquely registrable, then any registration algorithm whatsoever is bound to fail. We formulate a necessary and sufficient condition for unique registrability in arbitrary dimensions. This condition leads to a randomized polynomial-time test for unique registrability in arbitrary dimensions, and a combinatorial linear-time test in two dimensions.
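For concreteness, the registration model implied by the abstract can be written as follows; the notation (R_i, t_i) for the per-subnetwork rigid transform is generic notation assumed here, and the paper's actual necessary-and-sufficient condition is not reproduced.

```latex
% Node j in subnetwork P_i is observed in local coordinates y_{ij}, related to its
% unknown global position x_j by an unknown rigid transform (R_i, t_i) of that subnetwork.
\[
  y_{ij} = R_i^{\top}\bigl(x_j - t_i\bigr), \qquad j \in P_i, \quad i = 1, \dots, M,
  \qquad R_i \in \mathrm{O}(d), \; t_i \in \mathbb{R}^d .
\]
% Unique registrability: the x_j are determined by the observed y_{ij}
% up to one global rigid transform applied to the whole network.
```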
{"title":"When Can a System of Subnetworks Be Registered Uniquely?","authors":"A. V. Singh, K. Chaudhury","doi":"10.1109/ICASSP.2019.8682680","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682680","url":null,"abstract":"Consider a network with N nodes in d dimensions, and M overlapping subsets P1, ⋯,PM (subnetworks). Assume that the nodes in a given Pi are observed in a local coordinate system. We wish to register the subnetworks using the knowledge of the observed coordinates. More precisely, we want to compute the positions of the N nodes in a global coordinate system, given P1, ⋯, PM and the corresponding local coordinates. Among other applications, this problem arises in divide-and-conquer algorithms for localization of adhoc sensor networks. The network is said to be uniquely registrable if the global coordinates can be computed uniquely (up to a rigid transform). Clearly, if the network is not uniquely registrable, then any registration algorithm whatsoever is bound to fail. We formulate a necessary and sufficient condition for uniquely registra-bility in arbitrary dimensions. This condition leads to a randomized polynomial-time test for unique registrability in arbitrary dimensions, and a combinatorial linear-time test in two dimensions.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"64 1","pages":"4564-4568"},"PeriodicalIF":0.0,"publicationDate":"2019-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84018914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Learning Search Path for Region-level Image Matching
Onkar Krishna, Go Irie, Xiaomeng Wu, T. Kawanishi, K. Kashino
Finding a region of an image which matches a query from a large number of candidates is a fundamental problem in image processing. The exhaustive nature of the sliding window approach has encouraged works that can reduce the run time by skipping unnecessary windows or pixels that do not play a substantial role in search results. However, such a pruning-based approach still needs to evaluate a non-ignorable number of candidates, which leads to a limited efficiency improvement. We propose an approach to learn efficient search paths from data. Our model is based on a CNN-LSTM architecture which is designed to sequentially determine a prospective location to be searched next based on the history of the locations attended. We propose a reinforcement learning algorithm to train the model in an end-to-end manner, which allows the search paths and deep image features for matching to be learned jointly. These properties together significantly reduce the number of windows to be evaluated and make the model robust to background clutter. Our model gives remarkable matching accuracy with a reduced number of windows and reduced run time on the MNIST and FlickrLogos-32 datasets.
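A schematic sketch of the sequential search loop described above: attend to a window, update a recurrent state, predict the next location. Every function below is a deliberately simple placeholder (window statistics instead of a CNN, a tanh update instead of an LSTM, a hand-wired policy instead of the learned one), meant only to show the data flow, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(2)

def crop_features(image, loc, size=32):
    """Placeholder 'CNN': crop a window around loc and summarize it with simple statistics."""
    r, c = loc
    win = image[r:r + size, c:c + size]
    return np.array([win.mean(), win.std()])

def recurrent_step(state, feat, W):
    """Placeholder recurrent update standing in for the LSTM."""
    return np.tanh(W @ np.concatenate([state, feat]))

def next_location(state, image_shape, size=32):
    """Placeholder policy: map the recurrent state to the next window location."""
    h, w = image_shape
    frac = (np.tanh(state[:2]) + 1) / 2
    return int(frac[0] * (h - size)), int(frac[1] * (w - size))

image = rng.random((256, 256))
state = np.zeros(8)
W = 0.1 * rng.standard_normal((8, 8 + 2))
loc = (0, 0)
for step in range(5):                     # a short search path instead of a full sliding window
    feat = crop_features(image, loc)
    state = recurrent_step(state, feat, W)
    loc = next_location(state, image.shape)
    print(step, loc)
```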
{"title":"Learning Search Path for Region-level Image Matching","authors":"Onkar Krishna, Go Irie, Xiaomeng Wu, T. Kawanishi, K. Kashino","doi":"10.1109/ICASSP.2019.8682714","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682714","url":null,"abstract":"Finding a region of an image which matches to a query from a large number of candidates is a fundamental problem in image processing. The exhaustive nature of the sliding window approach has encouraged works that can reduce the run time by skipping unnecessary windows or pixels that do not play a substantial role in search results. However, such a pruning-based approach still needs to evaluate the non-ignorable number of candidates, which leads to a limited efficiency improvement. We propose an approach to learn efficient search paths from data. Our model is based on a CNN-LSTM architecture which is designed to sequentially determine a prospective location to be searched next based on the history of the locations attended. We propose a reinforcement learning algorithm to train the model in an end-to-end manner, which allows to jointly learn the search paths and deep image features for matching. These properties together significantly reduce the number of windows to be evaluated and makes it robust to background clutters. Our model gives remarkable matching accuracy with the reduced number of windows and run time on MNIST and FlickrLogos-32 datasets.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"1967-1971"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88674951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Improving Children Speech Recognition through Feature Learning from Raw Speech Signal
Selen Hande Kabil, Mathew Magimai Doss
Children speech recognition based on short-term spectral features is a challenging task. One of the reasons is that children speech has high fundamental frequency that is comparable to formant frequency values. Furthermore, as children grow, their vocal apparatus also undergoes changes. This presents difficulties in extracting standard short-term spectral-based features reliably for speech recognition. In recent years, novel acoustic modeling methods have emerged that learn both the feature and phone classifier in an end-to-end manner from the raw speech signal. Through an investigation on PF-STAR corpus we show that children speech recognition can be improved using end-to-end acoustic modeling methods.
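A minimal illustration, under stated assumptions, of a learned filterbank applied directly to raw samples: the filters here are random rather than trained, and the frame and hop sizes are arbitrary choices; in an end-to-end system such filters would be learned jointly with the phone classifier.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical raw speech input: 1 second at 16 kHz (random samples as a stand-in).
fs = 16000
waveform = rng.standard_normal(fs)

# A bank of 1-D filters applied directly to raw samples. They are random here;
# in an end-to-end system they would be trained together with the phone classifier.
num_filters, kernel, hop = 40, 400, 160             # ~25 ms filters, 10 ms shift (assumed)
filters = 0.01 * rng.standard_normal((num_filters, kernel))

frames = np.lib.stride_tricks.sliding_window_view(waveform, kernel)[::hop]
features = np.log1p(np.abs(frames @ filters.T))     # frame-level learned-filterbank features
print(features.shape)                               # (number_of_frames, 40)
```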
{"title":"Improving Children Speech Recognition through Feature Learning from Raw Speech Signal","authors":"Selen Hande Kabil, Mathew Magimai Doss","doi":"10.1109/ICASSP.2019.8682826","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682826","url":null,"abstract":"Children speech recognition based on short-term spectral features is a challenging task. One of the reasons is that children speech has high fundamental frequency that is comparable to formant frequency values. Furthermore, as children grow, their vocal apparatus also undergoes changes. This presents difficulties in extracting standard short-term spectral-based features reliably for speech recognition. In recent years, novel acoustic modeling methods have emerged that learn both the feature and phone classifier in an end-to-end manner from the raw speech signal. Through an investigation on PF-STAR corpus we show that children speech recognition can be improved using end-to-end acoustic modeling methods.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"135 3 1","pages":"5736-5740"},"PeriodicalIF":0.0,"publicationDate":"2019-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82389424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Beamformer Design under Time-correlated Interference and Online Implementation: Brain-activity Reconstruction from EEG
Takehiro Kono, M. Yukawa, Tomasz Piotrowski
We present a convexly-constrained beamformer design for brain activity reconstruction from non-invasive electroencephalography (EEG) signals. We highlight an intrinsic gap between the output variance and the mean squared errors that occurs due to the presence of interfering activities correlated with the desired activity. The key idea of the proposed beamformer is to reduce this gap without amplifying the noise by imposing, together with the distortionless constraint, a quadratic constraint that bounds the total power of interference leakage. The proposed beamformer can be implemented efficiently by the multi-domain adaptive filtering algorithm. Numerical examples show the clear advantages of the proposed beamformer over the minimum-variance distortionless response (MVDR) and nulling beamformers.
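The abstract's two constraints can be made concrete with the generic quadratically constrained, distortionless formulation below; the symbols (covariance R, steering/leadfield vector a, interference covariance C, bound δ) are standard notation assumed here, not necessarily the paper's.

```latex
\[
  \min_{\mathbf{w}} \ \mathbf{w}^{\mathsf{H}} \mathbf{R}\, \mathbf{w}
  \quad \text{s.t.} \quad
  \mathbf{w}^{\mathsf{H}} \mathbf{a} = 1, \qquad
  \mathbf{w}^{\mathsf{H}} \mathbf{C}\, \mathbf{w} \le \delta .
\]
% R: measurement covariance, a: steering (leadfield) vector of the desired activity,
% C: covariance of the interfering activities, so w^H C w is the interference-leakage power,
% delta: user-chosen bound. Dropping the quadratic constraint recovers the MVDR beamformer.
```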
{"title":"Beamformer Design under Time-correlated Interference and Online Implementation: Brain-activity Reconstruction from EEG","authors":"Takehiro Kono, M. Yukawa, Tomasz Piotrowski","doi":"10.1109/ICASSP.2019.8682614","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682614","url":null,"abstract":"We present a convexly-constrained beamformer design for brain activity reconstruction from non-invasive electroencephalography (EEG) signals. An intrinsic gap between the output variance and the mean squared errors is highlighted that occurs due to the presence of interfering activities correlated with the desired activity. The key idea of the proposed beamformer is reducing this gap without amplifying the noise by imposing a quadratic constraint that bounds the total power of interference leakage together with the distortionless constraint. The proposed beamformer can be implemented efficiently by the multi-domain adaptive filtering algorithm. Numerical examples show the clear advantages of the proposed beamformer over the minimum-variance distortionless response (MVDR) and nulling beamformers.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"35 1","pages":"1070-1074"},"PeriodicalIF":0.0,"publicationDate":"2019-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83722559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Event-driven Pipeline for Low-latency Low-compute Keyword Spotting and Speaker Verification System
Enea Ceolini, Jithendar Anumula, Stefan Braun, Shih-Chii Liu
This work presents an event-driven acoustic sensor processing pipeline to power a low-resource voice-activated smart assistant. The pipeline includes four major steps, namely localization, source separation, keyword spotting (KWS) and speaker verification (SV). The pipeline is driven by a front-end binaural spiking silicon cochlea sensor. The timing information carried by the output spikes of the cochlea provides spatial cues for localization and source separation. Spike features are generated with low latencies from the separated source spikes and are used by both KWS and SV, which rely on state-of-the-art deep recurrent neural network architectures with a small memory footprint. Evaluation on a self-recorded event dataset based on TIDIGITS shows accuracies of over 93% and 88% on KWS and SV respectively, with minimum system latency of 5 ms on a limited resource device.
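A schematic of the four-stage data flow named above, with every function a placeholder stub; it shows only how the stages would be chained, not the authors' event-driven implementation.

```python
# Schematic data flow of the four stages named in the abstract; every function
# here is a placeholder stub, not the authors' implementation.

def localize(spike_events):
    """Estimate source direction from binaural spike timing differences."""
    return 0.0                              # e.g. azimuth in radians

def separate(spike_events, direction):
    """Return the spike stream attributed to the source at `direction`."""
    return spike_events

def spike_features(source_spikes):
    """Low-latency feature vectors computed from the separated spikes."""
    return [[0.0] * 64 for _ in source_spikes]

def keyword_spotter(features):
    """Recurrent-network keyword detector (stub)."""
    return "keyword_detected"

def speaker_verifier(features):
    """Recurrent-network speaker verification (stub)."""
    return True

def pipeline(spike_events):
    direction = localize(spike_events)
    source = separate(spike_events, direction)
    feats = spike_features(source)
    return keyword_spotter(feats), speaker_verifier(feats)

print(pipeline(spike_events=[("ear_L", 0.001), ("ear_R", 0.0012)]))
```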
{"title":"Event-driven Pipeline for Low-latency Low-compute Keyword Spotting and Speaker Verification System","authors":"Enea Ceolini, Jithendar Anumula, Stefan Braun, Shih-Chii Liu","doi":"10.1109/ICASSP.2019.8683669","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683669","url":null,"abstract":"This work presents an event-driven acoustic sensor processing pipeline to power a low-resource voice-activated smart assistant. The pipeline includes four major steps; namely localization, source separation, keyword spotting (KWS) and speaker verification (SV). The pipeline is driven by a front-end binaural spiking silicon cochlea sensor. The timing information carried by the output spikes of the cochlea provide spatial cues for localization and source separation. Spike features are generated with low latencies from the separated source spikes and are used by both KWS and SV which rely on state-of-the-art deep recurrent neural network architectures with a small memory footprint. Evaluation on a self-recorded event dataset based on TIDIGITS shows accuracies of over 93% and 88% on KWS and SV respectively, with minimum system latency of 5 ms on a limited resource device.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"7953-7957"},"PeriodicalIF":0.0,"publicationDate":"2019-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73201418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Maximally Smooth Dirichlet Interpolation from Complete and Incomplete Sample Points on the Unit Circle
Stephan Weiss, M. Macleod
This paper introduces a cost function for the smoothness of a continuous periodic function, of which only some samples are given. This cost function is important e.g. when associating samples in frequency bins for problems such as analytic singular or eigenvalue decompositions. We demonstrate the utility of the cost function, and study some of its complexity and conditioning issues.
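As background for the complete-sample case, the sketch below performs standard band-limited (Dirichlet) interpolation of uniformly spaced samples on the unit circle by zero-padding the DFT; it illustrates the interpolation the title refers to, not the paper's smoothness cost or its treatment of incomplete samples.

```python
import numpy as np

# Complete, uniformly spaced samples of a periodic function on the unit circle.
M = 16
theta = 2 * np.pi * np.arange(M) / M
samples = np.cos(theta) + 0.5 * np.sin(3 * theta)

# Dirichlet (band-limited trigonometric) interpolation via zero-padding in the DFT domain.
K = 256                                             # dense evaluation grid on the circle
spectrum = np.fft.fft(samples)
padded = np.zeros(K, dtype=complex)
half = M // 2
padded[:half] = spectrum[:half]                     # non-negative frequencies
padded[-half:] = spectrum[-half:]                   # negative frequencies
dense = np.real(np.fft.ifft(padded)) * (K / M)

# The interpolant passes through the samples and matches the underlying band-limited signal.
theta_dense = 2 * np.pi * np.arange(K) / K
truth = np.cos(theta_dense) + 0.5 * np.sin(3 * theta_dense)
print(np.max(np.abs(dense - truth)))                # tiny interpolation error
```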
{"title":"Maximally Smooth Dirichlet Interpolation from Complete and Incomplete Sample Points on the Unit Circle","authors":"Stephan Weiss, M. Macleod","doi":"10.1109/ICASSP.2019.8683366","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683366","url":null,"abstract":"This paper introduces a cost function for the smoothness of a continuous periodic function, of which only some samples are given. This cost function is important e.g. when associating samples in frequency bins for problems such as analytic singular or eigenvalue decompositions. We demonstrate the utility of the cost function, and study some of its complexity and conditioning issues.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"84 1","pages":"8053-8057"},"PeriodicalIF":0.0,"publicationDate":"2019-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83857024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Importance of Analytic Phase of the Speech Signal for Detecting Replay Attacks in Automatic Speaker Verification Systems
B. M. Rafi, K. Murty
In this paper, the importance of analytic phase of the speech signal in automatic speaker verification systems is demonstrated in the context of replay spoof attacks. In order to accurately detect the replay spoof attacks, effective feature representations of speech signals are required to capture the distortion introduced due to the intermediate playback/recording devices, which is convolutive in nature. Since the convolutional distortion in the time domain translates to additive distortion in the phase domain, we propose to use IFCC features extracted from the analytic phase of the speech signal. The IFCC features contain information from both clean speech and distortion components. The clean speech component has to be subtracted in order to highlight the distortion component introduced by the playback/recording devices. In this work, a dictionary learned from the IFCCs extracted from clean speech data is used to remove the clean speech component. The residual distortion component is used as a feature to build a binary classifier for replay spoof detection. The proposed phase-based features delivered a 9% absolute improvement over the baseline system built using magnitude-based CQCC features.
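A small sketch of how the analytic phase and instantaneous frequency can be obtained with the Hilbert transform (scipy.signal.hilbert); the toy narrowband signal is an assumption, and the dictionary-based removal of the clean-speech component and the IFCC cepstral processing are not shown.

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(fs) / fs
# Toy narrowband signal standing in for one filterbank channel of speech.
x = np.cos(2 * np.pi * 300 * t + 0.5 * np.sin(2 * np.pi * 5 * t))

analytic = hilbert(x)                               # analytic signal: x + j * H{x}
phase = np.unwrap(np.angle(analytic))               # analytic (instantaneous) phase
inst_freq = np.diff(phase) * fs / (2 * np.pi)       # instantaneous frequency in Hz

print(inst_freq.mean())                             # close to 300 Hz for this toy signal
```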
{"title":"Importance of Analytic Phase of the Speech Signal for Detecting Replay Attacks in Automatic Speaker Verification Systems","authors":"B. M. Rafi, K. Murty","doi":"10.1109/ICASSP.2019.8683500","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683500","url":null,"abstract":"In this paper, the importance of analytic phase of the speech signal in automatic speaker verification systems is demonstrated in the context of replay spoof attacks. In order to accurately detect the replay spoof attacks, effective feature representations of speech signals are required to capture the distortion introduced due to the intermediate playback/recording devices, which is convolutive in nature. Since the convolutional distortion in time-domain translates to additive distortion in the phase-domain, we propose to use IFCC features extracted from the analytic phase of the speech signal. The IFCC features contain information from both clean speech and distortion components. The clean speech component has to be subtracted in order to highlight the distortion component introduced by the playback/recording devices. In this work, a dictionary learned from the IFCCs extracted from clean speech data is used to remove the clean speech component. The residual distortion component is used as a feature to build binary classifier for replay spoof detection. The proposed phase-based features delivered a 9% absolute improvement over the baseline system built using magnitude-based CQCC features.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"16 1","pages":"6306-6310"},"PeriodicalIF":0.0,"publicationDate":"2019-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85227808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7