
Latest publications from the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Global Optimality in Inductive Matrix Completion
Mohsen Ghassemi, A. Sarwate, Naveen Goela
Inductive matrix completion (IMC) is a model for incorporating side information, in the form of “features” of the row and column entities of an unknown matrix, into the matrix completion problem. As side information, features can substantially reduce the number of observed entries required for reconstructing an unknown matrix from its given entries. The IMC problem can be formulated as a low-rank matrix recovery problem where the observed entries are seen as measurements of a smaller matrix that models the interaction between the column and row features. We take advantage of this property to study the optimization landscape of the factorized IMC problem. In particular, we show that the critical points of the objective function of this problem are either global minima that correspond to the true solution or are “escapable” saddle points. This result implies that any minimization algorithm with guaranteed convergence to a local minimum can be used for solving the factorized IMC problem.
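As a rough illustration of the factorized formulation discussed above, the sketch below runs plain gradient descent on f(U, V) = 0.5 * ||P_Omega(X U V^T Y^T - M)||_F^2, normalized by the number of observed entries. All dimensions, the step size, and the helper name factorized_imc are illustrative choices, not taken from the paper:

```python
import numpy as np

def factorized_imc(M, mask, X, Y, rank, lr=0.01, iters=5000, seed=0):
    """Gradient descent on f(U, V) = 0.5/|Omega| * ||mask * (X U V^T Y^T - M)||_F^2.
    X: (n1, d1) row features, Y: (n2, d2) column features,
    M: (n1, n2) partially observed matrix, mask: boolean observation pattern."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((X.shape[1], rank))
    V = 0.1 * rng.standard_normal((Y.shape[1], rank))
    for _ in range(iters):
        E = mask * (X @ U @ V.T @ Y.T - M) / mask.sum()   # residual on observed entries
        U, V = U - lr * (X.T @ E @ Y @ V), V - lr * (Y.T @ E.T @ X @ U)
    return U, V

# toy problem: the unknown matrix is exactly X A B Y^T for some low-rank A B
rng = np.random.default_rng(1)
n1, n2, d1, d2, r = 40, 30, 6, 5, 2
X, Y = rng.standard_normal((n1, d1)), rng.standard_normal((n2, d2))
M = X @ rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2)) @ Y.T
mask = rng.random((n1, n2)) < 0.4
U, V = factorized_imc(M, mask, X, Y, rank=r)
print("relative error on unobserved entries:",
      np.linalg.norm((~mask) * (X @ U @ V.T @ Y.T - M)) / np.linalg.norm((~mask) * M))
```

Because the paper's landscape result says the critical points are either global minima or escapable saddles, a first-order method of this kind is, in principle, sufficient for the factorized problem.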
{"title":"Global Optimality in Inductive Matrix Completion","authors":"Mohsen Ghassemi, A. Sarwate, Naveen Goela","doi":"10.1109/ICASSP.2018.8462250","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462250","url":null,"abstract":"Inductive matrix completion (IMC) is a model for incorporating side information in form of “features” of the row and column entities of an unknown matrix in the matrix completion problem. As side information, features can substantially reduce the number of observed entries required for reconstructing an unknown matrix from its given entries. The IMC problem can be formulated as a low-rank matrix recovery problem where the observed entries are seen as measurements of a smaller matrix that models the interaction between the column and row features. We take advantage of this property to study the optimization landscape of the factorized IMC problem. In particular, we show that the critical points of the objective function of this problem are either global minima that correspond to the true solution or are “escapable” saddle points. This result implies that any minimization algorithm with guaranteed convergence to a local minimum can be used for solving the factorized IMC problem.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"44 1","pages":"2226-2230"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87030175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Novel Realizations of Speech-Driven Head Movements with Generative Adversarial Networks
Najmeh Sadoughi, C. Busso
Head movement is an integral part of face-to-face communication. It is important to investigate methodologies to generate naturalistic movements for conversational agents (CAs). The predominant method for head movement generation uses rules based on the meaning of the message. However, the variations of head movements produced by these methods are bounded by the predefined dictionary of gestures. Speech-driven methods offer an alternative approach, learning the relationship between speech and head movements from real recordings. However, previous studies do not generate novel realizations for a repeated speech signal. A conditional generative adversarial network (GAN) provides a framework to generate multiple realizations of head movements for each speech segment by sampling from a conditioned distribution. We build a conditional GAN with bidirectional long short-term memory (BLSTM), which is suitable for capturing the long- and short-term dependencies of time-continuous signals. This model learns the distribution of head movements conditioned on speech prosodic features. We compare this model with a dynamic Bayesian network (DBN) and with BLSTM models optimized to reduce mean squared error (MSE) or to increase concordance correlation. Objective and subjective evaluations of the results show better performance for the conditional GAN model compared with these baseline systems.
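A minimal PyTorch sketch of the conditional-GAN-with-BLSTM idea described above. The layer sizes, the 4-dimensional prosody features, the 10-dimensional noise, and the 3-dimensional head-pose output are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps per-frame prosodic features plus noise to a head-pose sequence."""
    def __init__(self, prosody_dim=4, noise_dim=10, hidden=64, pose_dim=3):
        super().__init__()
        self.blstm = nn.LSTM(prosody_dim + noise_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, pose_dim)

    def forward(self, prosody, noise):
        h, _ = self.blstm(torch.cat([prosody, noise], dim=-1))
        return self.out(h)

class Discriminator(nn.Module):
    """Scores (prosody, head-pose) sequence pairs as real or generated."""
    def __init__(self, prosody_dim=4, pose_dim=3, hidden=64):
        super().__init__()
        self.blstm = nn.LSTM(prosody_dim + pose_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, prosody, pose):
        h, _ = self.blstm(torch.cat([prosody, pose], dim=-1))
        return self.out(h.mean(dim=1))        # pool over time, one logit per sequence

G, D = Generator(), Discriminator()
prosody = torch.randn(1, 200, 4)              # 200 frames of prosodic features
pose_a = G(prosody, torch.randn(1, 200, 10))  # one realization
pose_b = G(prosody, torch.randn(1, 200, 10))  # a different realization for the same speech
score = D(prosody, pose_a)                    # discriminator logit for the pair
```

Feeding the same prosody with two different noise sequences yields two distinct head-movement realizations, which is exactly the multiple-realizations property the paper targets.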
{"title":"Novel Realizations of Speech-Driven Head Movements with Generative Adversarial Networks","authors":"Najmeh Sadoughi, C. Busso","doi":"10.1109/ICASSP.2018.8461967","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461967","url":null,"abstract":"Head movement is an integral part of face-to-face communications. It is important to investigate methodologies to generate naturalistic movements for conversational agents (CAs). The predominant method for head movement generation is using rules based on the meaning of the message. However, the variations of head movements by these methods are bounded by the predefined dictionary of gestures. Speech-driven methods offer an alternative approach, learning the relationship between speech and head movements from real recordings. However, previous studies do not generate novel realizations for a repeated speech signal. Conditional generative adversarial network (GAN) provides a framework to generate multiple realizations of head movements for each speech segment by sampling from a conditioned distribution. We build a conditional GAN with bidirectional long-short term memory (BLSTM), which is suitable for capturing the long-short term dependencies of time-continuous signals. This model learns the distribution of head movements conditioned on speech prosodic features. We compare this model with a dynamic Bayesian network (DBN) and BLSTM models optimized to reduce mean squared error (MSE) or to increase concordance correlation. The objective evaluations and subjective evaluations of the results showed better performance for the conditional GAN model compared with these baseline systems.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"10 1","pages":"6169-6173"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85109420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 49
Unlimited Sampling of Sparse Signals
A. Bhandari, F. Krahmer, R. Raskar
In a recent paper [1], we introduced the concept of “Unlimited Sampling”. This unique approach circumvents the clipping or saturation problem in conventional analog-to-digital converters (ADCs) by considering a radically different ADC architecture which resets the input voltage before saturation. Such ADCs, also known as Self-Reset ADCs (SR-ADCs), allow for sensing modulo samples. In analogy to Shannon's sampling theorem, the unlimited sampling theorem proves that a bandlimited signal can be recovered from modulo samples provided that a certain sampling density criterion, which is independent of the ADC threshold, is satisfied. In this way, our result allows for perfect recovery of a bandlimited function whose amplitude exceeds the ADC threshold by orders of magnitude. By capitalizing on this result, in this paper we consider the inverse problem of recovering a sparse signal from its low-pass filtered version. This problem frequently arises in several areas of science and engineering and, in the context of signal processing, it is studied in several flavors, namely sparse or FRI sampling, super-resolution and sparse deconvolution. By considering the SR-ADC architecture, we develop a sampling theory for modulo sampling of low-pass filtered spikes. Our main result consists of a new sparse sampling theorem and an algorithm which stably recovers a $K$-sparse signal from low-pass, modulo samples. We validate our results using numerical experiments.
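The self-reset ADC behaviour at the heart of unlimited sampling can be summarized by a centered modulo map. The short NumPy sketch below (with an illustrative folding threshold lam) only models the folding nonlinearity, not the recovery algorithms developed in the paper:

```python
import numpy as np

def centered_modulo(x, lam):
    """Self-reset ADC model: folds amplitudes into the range [-lam, lam)."""
    return np.mod(x + lam, 2 * lam) - lam

# a signal whose amplitude far exceeds the ADC range still yields bounded (folded) samples
t = np.linspace(0, 1, 512, endpoint=False)
x = 10.0 * np.sin(2 * np.pi * 3 * t)      # amplitude 10, well above the threshold
y = centered_modulo(x, lam=1.0)           # folded samples in [-1, 1)
assert np.all(np.abs(y) <= 1.0)
```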
{"title":"Unlimited Sampling of Sparse Signals","authors":"A. Bhandari, F. Krahmer, R. Raskar","doi":"10.1109/ICASSP.2018.8462231","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462231","url":null,"abstract":"In a recent paper [1], we introduced the concept of “Unlimited Sampling”. This unique approach circumvents the clipping or saturation problem in conventional analog-to-digital converters (ADCs) by considering a radically different ADC architecture which resets the input voltage before saturation. Such ADCs, also known as Self-Reset ADCs (SR-ADCs), allow for sensing modulo samples. In analogy to Shannon's sampling theorem, the unlimited sampling theorem proves that a bandlimited signal can be recovered from modulo samples provided that a certain sampling density criterion, that is independent of the ADC threshold, is satisfied. In this way, our result allows for perfect recovery of a bandlimited function whose amplitude exceeds the ADC threshold by orders of magnitude. By capitalizing on this result, in this paper, we consider the inverse problem of recovering a sparse signal from its low-pass filtered version. This problem frequently arises in several areas of science and engineering and in context of signal processing, it is studied in several flavors, namely, sparse or FRI sampling, super-resolution and sparse deconvolution. By considering the SR-ADC architecture, we develop a sampling theory for modulo sampling of lowpass filtered spikes. Our main result consists of a new sparse sampling theorem and an algorithm which stably recovers a $K$ -sparse signal from low-pass, modulo samples. We validate our results using numerical experiments.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"49 1","pages":"4569-4573"},"PeriodicalIF":0.0,"publicationDate":"2018-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86844752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 46
Sub-Diffraction Imaging Using Fourier Ptychography and Structured Sparsity
Gauri Jagatap, Zhengyu Chen, C. Hegde, Namrata Vaswani
We consider the problem of super-resolution for sub-diffraction imaging. We adapt conventional Fourier ptychographic approaches for the case where the images to be acquired have an underlying structured sparsity. We propose some sub-sampling strategies which can be easily adapted to existing ptychographic setups. We then use a novel technique called CoPRAM, with some modifications, to recover sparse (and block-sparse) images from sub-sampled ptychographic measurements. We demonstrate experimentally that this algorithm performs better than existing phase retrieval techniques, in terms of quality of reconstruction, using fewer samples.
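A much-simplified sketch of the sparse, magnitude-only measurement model and of the support-estimation statistic used by CoPRAM-style solvers. The signal length, number of measurements and sparsity level below are arbitrary, and the full algorithm additionally alternates phase and coefficient updates, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 256, 400, 5                      # signal length, measurements, sparsity

# block-sparse ground truth: one contiguous block of s nonzero coefficients
x = np.zeros(n)
start = rng.integers(0, n - s)
x[start:start + s] = rng.uniform(1.0, 2.0, s) * rng.choice([-1.0, 1.0], s)

# magnitude-only (phaseless) measurements of the patch: y_i = |<a_i, x>|^2
A = rng.standard_normal((m, n))
y = (A @ x) ** 2

# CoPRAM-style support estimation: the marginal sum_i y_i * A[i, j]^2 has expectation
# m * (||x||^2 + 2 * x_j^2), so its largest entries flag the likely support
marginals = (y[:, None] * A ** 2).sum(axis=0)
support_estimate = set(np.argsort(-marginals)[:s])
print("support indices recovered:",
      len(support_estimate & set(range(start, start + s))), "of", s)
```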
{"title":"Sub-Diffraction Imaging Using Fourier Ptychography and Structured Sparsity","authors":"Gauri Jagatap, Zhengyu Chen, C. Hegde, Namrata Vaswani","doi":"10.1109/ICASSP.2018.8461302","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461302","url":null,"abstract":"We consider the problem of super-resolution for sub-diffraction imaging. We adapt conventional Fourier ptychographic approaches, for the case where the images to be acquired have an underlying structured sparsity. We propose some sub-sampling strategies which can be easily adapted to existing ptychographic setups. We then use a novel technique called CoPRAM with some modifications, to recover sparse (and block sparse) images from sub-sampled pty-chographic measurements. We demonstrate experimentally that this algorithm performs better than existing phase retrieval techniques, in terms of quality of reconstruction, using fewer number of samples.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"71 1","pages":"6493-6497"},"PeriodicalIF":0.0,"publicationDate":"2018-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73530120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Study of Dense Network Approaches for Speech Emotion Recognition
Mohammed Abdel-Wahab, C. Busso
Deep neural networks have been proven to be very effective in various classification problems and show great promise for emotion recognition from speech. Studies have proposed various architectures that further improve the performance of emotion recognition systems. However, there are still various open questions regarding the best approach to building a speech emotion recognition system. Would the system's performance improve if we had more labeled data? How much do we benefit from data augmentation? What activation and regularization schemes are more beneficial? How does the depth of the network affect the performance? We are collecting the MSP-Podcast corpus, a large dataset with over 30 hours of data, which provides an ideal resource to address these questions. This study explores various dense architectures to predict arousal, valence and dominance scores. We investigate varying the training set size, width, and depth of the network, as well as the activation functions used during training. We also study the effect of data augmentation on the network's performance. We find that a bigger training set improves performance. Batch normalization is crucial to achieving a good performance for deeper networks. We do not observe significant differences in the performance of residual networks compared to dense networks.
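A small PyTorch sketch of the kind of dense (fully connected) regressor with batch normalization whose width and depth the study varies. The input dimension, width and depth below are stand-ins for a fixed-length acoustic feature vector, not the paper's settings:

```python
import torch
import torch.nn as nn

def dense_regressor(in_dim=6373, width=256, depth=4, use_batchnorm=True):
    """Fully connected regressor producing arousal/valence/dominance scores.
    in_dim stands in for a fixed-length acoustic feature vector per utterance;
    width and depth are the hyper-parameters being varied."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers.append(nn.Linear(d, width))
        if use_batchnorm:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.ReLU())
        d = width
    layers.append(nn.Linear(d, 3))           # arousal, valence, dominance
    return nn.Sequential(*layers)

model = dense_regressor()
scores = model(torch.randn(8, 6373))         # batch of 8 feature vectors -> shape (8, 3)
```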
{"title":"Study of Dense Network Approaches for Speech Emotion Recognition","authors":"Mohammed Abdel-Wahab, C. Busso","doi":"10.1109/ICASSP.2018.8461866","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461866","url":null,"abstract":"Deep neural networks have been proven to be very effective in various classification problems and show great promise for emotion recognition from speech. Studies have proposed various architectures that further improve the performance of emotion recognition systems. However, there are still various open questions regarding the best approach to building a speech emotion recognition system. Would the system's performance improve if we have more labeled data? How much do we benefit from data augmentation? What activation and regularization schemes are more beneficial? How does the depth of the network affect the performance? We are collecting the MSP-Podcast corpus, a large dataset with over 30 hours of data, which provides an ideal resource to address these questions. This study explores various dense architectures to predict arousal, valence and dominance scores. We investigate varying the training set size, width, and depth of the network, as well as the activation functions used during training. We also study the effect of data augmentation on the network's performance. We find that bigger training set improves the performance. Batch normalization is crucial to achieving a good performance for deeper networks. We do not observe significant differences in the performance in residual networks compared to dense networks.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"59 1","pages":"5084-5088"},"PeriodicalIF":0.0,"publicationDate":"2018-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79464478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 35
Foreground Harmonic Noise Reduction for Robust Audio Fingerprinting
Matthew C. McCallum
Audio fingerprinting systems are often well designed to cope with a range of broadband noise types; however, they cope less well when presented with additive noise containing sinusoidal components. This is largely due to the fact that in a short-time signal representation (over periods of ≈ 20 ms) these noise components are largely indistinguishable from salient components of the desirable signal that is to be fingerprinted. In this paper a front-end sinusoidal noise reduction procedure is introduced that is able to remove the most detrimental of the sinusoidal noise components, thereby improving the audio fingerprinting system's performance. This is achieved by grouping short-time sinusoidal components into pitch contours via magnitude, frequency and phase characteristics, and identifying noisy contours as those with characteristics that are outliers in the distribution of all pitch contours in the signal. With this paper's contribution, the recognition rate in an industrial-scale fingerprinting system is increased by up to 8.4%.
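A loosely simplified SciPy/NumPy sketch of the front end described above: per-frame spectral peaks are linked into contours by frequency continuity, and contours whose mean magnitude is an outlier are suppressed before fingerprinting. The actual method also uses phase and finer frequency estimates; the peak threshold, linking tolerance and outlier rule here are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft, find_peaks

def suppress_foreground_sinusoids(x, fs, tol_bins=2, k=3.0):
    """Pick per-frame spectral peaks, link them across frames into contours by
    frequency continuity, and zero out contours whose mean magnitude exceeds
    median + k*MAD over all contours (a crude stand-in for the paper's outlier test)."""
    _, _, Z = stft(x, fs=fs, nperseg=1024)
    mag = np.abs(Z)
    contours, active = [], {}                    # active: peak bin -> contour index
    for frame in range(mag.shape[1]):
        peaks, _ = find_peaks(mag[:, frame], height=mag[:, frame].max() * 0.05)
        new_active = {}
        for p in peaks:
            prev = [b for b in active if abs(b - p) <= tol_bins]
            idx = active[prev[0]] if prev else len(contours)
            if not prev:
                contours.append([])
            contours[idx].append((p, frame))
            new_active[p] = idx
        active = new_active
    if not contours:
        return x
    means = np.array([np.mean([mag[p, fr] for p, fr in c]) for c in contours])
    med = np.median(means)
    mad = np.median(np.abs(means - med)) + 1e-12
    for c, mu in zip(contours, means):
        if mu > med + k * mad:                   # suspected foreground sinusoid
            for p, fr in c:
                Z[p, fr] = 0.0
    _, x_clean = istft(Z, fs=fs, nperseg=1024)
    return x_clean
```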
{"title":"Foreground Harmonic Noise Reduction for Robust Audio Fingerprinting","authors":"Matthew C. McCallum","doi":"10.1109/ICASSP.2018.8462636","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462636","url":null,"abstract":"Audio fingerprinting systems are often well designed to cope with a range of broadband noise types however they cope less well when presented with additive noise containing sinusoidal components. This is largely due to the fact that in a short-time signal representation (over periods of ≈ 20ms) these noise components are largely indistinguishable from salient components of the desirable signal that is to be fingerprinted. In this paper a front -end sinusoidal noise reduction procedure is introduced that is able to remove the most detrimental of the sinusoidal noise components thereby improving the audio fingerprinting system's performance. This is achievable by grouping short-time sinusoidal components into pitch contours via magnitude, frequency and phase characteristics, and identifying noisy contours as those with characteristics that are outliers in the distribution of all pitch contours in the signal. With this paper's contribution, the recognition rate in an industrial scale fingerprinting system is increased by up to 8.4%.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"15 2","pages":"3146-3150"},"PeriodicalIF":0.0,"publicationDate":"2018-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91400852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Speech Prediction Using an Adaptive Recurrent Neural Network with Application to Packet Loss Concealment
Reza Lotfidereshgi, P. Gournay
This paper proposes a novel approach for speech signal prediction based on a recurrent neural network (RNN). Unlike existing RNN-based predictors, which operate on parametric features and are trained offline on a large collection of such features, the proposed predictor operates directly on speech samples and is trained online on the recent past of the speech signal. Optionally, the network can be pre-trained offline to speed up convergence at start-up. The proposed predictor is a single end-to-end network that captures all sorts of dependencies between samples, and therefore has the potential to outperform classical linear/non-linear and short-term/long-term speech predictor structures. We apply it to the packet loss concealment (PLC) problem and show that it outperforms the standard ITU G.711 Appendix I PLC technique.
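A minimal PyTorch sketch of the sample-domain, online-adapted predictor idea: adapt a small LSTM on the recently received samples, then run it recursively to fill a lost packet. The context length, network size, step counts and helper names are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SamplePredictor(nn.Module):
    """Predicts the next speech sample from a window of past samples."""
    def __init__(self, context=64, hidden=128):
        super().__init__()
        self.context = context
        self.rnn = nn.LSTM(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, window):                   # window: (batch, context, 1)
        h, _ = self.rnn(window)
        return self.out(h[:, -1, :])             # next-sample estimate

def conceal_lost_packet(model, history, lost_len, steps=200, lr=1e-3):
    """Online adaptation on the recent past, then free-running prediction to fill
    a lost packet. 'history' is a 1-D tensor of the most recent received samples."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    c = model.context
    # online training: predict each sample of the recent past from its context
    windows = history.unfold(0, c, 1)[:-1].unsqueeze(-1)    # (N - c, c, 1)
    targets = history[c:].unsqueeze(-1)                     # (N - c, 1)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(windows), targets)
        loss.backward()
        opt.step()
    # concealment: recursively predict the missing samples
    buf = history.clone()
    with torch.no_grad():
        for _ in range(lost_len):
            nxt = model(buf[-c:].view(1, c, 1))
            buf = torch.cat([buf, nxt.view(1)])
    return buf[-lost_len:]

model = SamplePredictor()
history = torch.randn(800)            # stand-in for recent received speech samples
concealed = conceal_lost_packet(model, history, lost_len=160)
```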
{"title":"Speech Prediction Using an Adaptive Recurrent Neural Network with Application to Packet Loss Concealment","authors":"Reza Lotfidereshgi, P. Gournay","doi":"10.1109/ICASSP.2018.8462185","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462185","url":null,"abstract":"This paper proposes a novel approach for speech signal prediction based on a recurrent neural network (RNN). Unlike existing RNN-based predictors, which operate on parametric features and are trained offline on a large collection of such features, the proposed predictor operates directly on speech samples and is trained online on the recent past of the speech signal. Optionally, the network can be pre-trained offline to speed-up convergence at start-up. The proposed predictor is a single end-to-end network that captures all sorts of dependencies between samples, and therefore has the potential to outperform classicallinear/non-linear and short-termllong-term speech predictor structures. We apply it to the packet loss concealment (PLC) problem and show that it outperforms the standard ITU G.711 Appendix I PLC technique.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"16 1","pages":"5394-5398"},"PeriodicalIF":0.0,"publicationDate":"2018-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84175903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Cyborg Speech: Deep Multilingual Speech Synthesis for Generating Segmental Foreign Accent with Natural Prosody
G. Henter, Jaime Lorenzo-Trueba, Xin Wang, M. Kondo, J. Yamagishi
We describe a new application of deep-learning-based speech synthesis, namely multilingual speech synthesis for generating controllable foreign accent. Specifically, we train a DBLSTM-based acoustic model on non-accented multilingual speech recordings from a speaker native in several languages. By copying durations and pitch contours from a pre-recorded utterance of the desired prompt, natural prosody is achieved. We call this paradigm “cyborg speech” as it combines human and machine speech parameters. Segmentally accented speech is produced by interpolating specific quinphone linguistic features towards phones from the other language that represent non-native mispronunciations. Experiments on synthetic American-English-accented Japanese speech show that subjective synthesis quality matches monolingual synthesis, that natural pitch is maintained, and that naturalistic phone substitutions generate output that is perceived as having an American foreign accent, even though only non-accented training data was used.
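A hypothetical NumPy helper illustrating only the prosody-copy step: the reference utterance's phone durations are reused and its F0 contour is resampled phone by phone onto the synthesized frame grid. This is a sketch of the idea, not the paper's DBLSTM pipeline, and the function name and arguments are invented for illustration:

```python
import numpy as np

def copy_natural_prosody(ref_f0, ref_phone_frames, synth_phone_frames):
    """Resample the reference F0 contour segment by segment so that each phone's
    pitch movement is carried over onto the synthesized frame grid."""
    out, start = [], 0
    for ref_len, synth_len in zip(ref_phone_frames, synth_phone_frames):
        seg = ref_f0[start:start + ref_len]
        x_old = np.linspace(0.0, 1.0, len(seg))
        x_new = np.linspace(0.0, 1.0, synth_len)
        out.append(np.interp(x_new, x_old, seg))   # per-phone resampling of F0
        start += ref_len
    return np.concatenate(out)

# toy usage: two phones, rising then falling pitch, durations copied unchanged
ref_f0 = np.r_[np.linspace(120, 180, 40), np.linspace(180, 110, 60)]
f0 = copy_natural_prosody(ref_f0, ref_phone_frames=[40, 60], synth_phone_frames=[40, 60])
```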
{"title":"Cyborg Speech: Deep Multilingual Speech Synthesis for Generating Segmental Foreign Accent with Natural Prosody","authors":"G. Henter, Jaime Lorenzo-Trueba, Xin Wang, M. Kondo, J. Yamagishi","doi":"10.1109/ICASSP.2018.8462470","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462470","url":null,"abstract":"We describe a new application of deep-learning-based speech synthesis, namely multilingual speech synthesis for generating controllable foreign accent. Specifically, we train a DBLSTM-based acoustic model on non-accented multilingual speech recordings from a speaker native in several languages. By copying durations and pitch contours from a pre-recorded utterance of the desired prompt, natural prosody is achieved. We call this paradigm “cyborg speech” as it combines human and machine speech parameters. Segmentally accented speech is produced by interpolating specific quinphone linguistic features towards phones from the other language that represent non-native mispronunciations. Experiments on synthetic American-English-accented Japanese speech show that subjective synthesis quality matches monolingual synthesis, that natural pitch is maintained, and that naturalistic phone substitutions generate output that is perceived as having an American foreign accent, even though only non-accented training data was used.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"157 1","pages":"4799-4803"},"PeriodicalIF":0.0,"publicationDate":"2018-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82463412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Learned Forensic Source Similarity for Unknown Camera Models
O. Mayer, M. Stamm
Information about an image's source camera model is important knowledge in many forensic investigations. In this paper we propose a system that compares two image patches to determine if they were captured by the same camera model. To do this, we first train a CNN-based feature extractor to output generic, high-level features which encode information about the source camera model of an image patch. Then, we learn a similarity measure that maps pairs of these features to a score indicating whether the two image patches were captured by the same or different camera models. We show that our proposed system accurately determines if two patches were captured by the same or different camera models, even when the camera models are unknown to the investigator. We also demonstrate the utility of this approach for image splicing detection and localization.
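A compact PyTorch sketch of the two-stage structure described above: a CNN feature extractor applied to each patch, followed by a learned similarity network over pairs of features. The layer sizes, patch size and feature dimension are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Small CNN producing a camera-model feature vector for a 64x64 patch."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, patch):
        return self.fc(self.conv(patch).flatten(1))

class SimilarityNet(nn.Module):
    """Maps a pair of patch features to the probability that both patches
    come from the same camera model."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, f1, f2):
        return torch.sigmoid(self.mlp(torch.cat([f1, f2], dim=-1)))

extractor, similarity = FeatureExtractor(), SimilarityNet()
p1, p2 = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
same_model_prob = similarity(extractor(p1), extractor(p2))    # shape (4, 1)
```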
{"title":"Learned Forensic Source Similarity for Unknown Camera Models","authors":"O. Mayer, M. Stamm","doi":"10.1109/ICASSP.2018.8462585","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462585","url":null,"abstract":"Information about an image's source camera model is important knowledge in many forensic investigations. In this paper we propose a system that compares two image patches to determine if they were captured by the same camera model. To do this, we first train a CNN based feature extractor to output generic, high level features which encode information about the source camera model of an image patch. Then, we learn a similarity measure that maps pairs of these features to a score indicating whether the two image patches were captured by the same or different camera models. We show that our proposed system accurately determines if two patches were captured by the same or different camera models, even when the camera models are unknown to the investigator. We also demonstrate the utility of this approach for image splicing detection and localization.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"35 1","pages":"2012-2016"},"PeriodicalIF":0.0,"publicationDate":"2018-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76238280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57
Effective Noise Removal and Unified Model of Hybrid Feature Space Optimization for Automated Cardiac Anomaly Detection Using Phonocardiogram Signals
A. Ukil, S. Bandyopadhyay, Chetanya Puri, Rituraj Singh, A. Pal
In this paper, we present completely automated cardiac anomaly detection for remote screening of cardiovascular abnormality using phonocardiogram (PCG), or heart sound, signals. Even though the PCG contains significant and vital cardiac health information and signatures of cardiac abnormality, the presence of substantial noise hinders highly effective analysis of the cardiac condition. Our proposed method intelligently identifies and eliminates noisy PCG signals and consequently detects pathological abnormality conditions. We further present a unified model of our hybrid feature selection method. Our feature selection model is diversity-optimized and cost-sensitive over the conditional likelihood of the training and validation examples, which maximizes classification model performance. We employ a multi-stage hybrid feature selection process involving a first-level filter method and a second-level wrapper method. We achieve 85% detection accuracy on the publicly available MIT-PhysioNet Challenge 2016 dataset consisting of more than 3000 annotated PCG signals.
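A small scikit-learn sketch of a two-stage (filter, then wrapper) feature selection pipeline of the kind described above, run on stand-in data. The specific selectors and classifier below are generic examples; the paper's own filter and wrapper criteria, cost-sensitive objective and PCG feature extraction are not reproduced here:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, RFE, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in data: rows are recordings, columns are extracted features
# (the real system uses annotated PhysioNet 2016 PCG recordings and its own features).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 60))
y = (X[:, :3].sum(axis=1) + 0.5 * rng.standard_normal(300) > 0).astype(int)

# Stage 1: fast univariate filter; Stage 2: wrapper that evaluates features
# jointly with the classifier; final stage: the classifier itself.
pipeline = Pipeline([
    ("filter", SelectKBest(f_classif, k=30)),
    ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print("held-in accuracy:", pipeline.score(X, y))
```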
{"title":"Effective Noise Removal and Unified Model of Hybrid Feature Space Optimization for Automated Cardiac Anomaly Detection Using Phonocardiogarm Signals","authors":"A. Ukil, S. Bandyopadhyay, Chetanya Puri, Rituraj Singh, A. Pal","doi":"10.1109/ICASSP.2018.8461765","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461765","url":null,"abstract":"In this paper, we present completely automated cardiac anomaly detection for remote screening of cardio-vascular abnormality using Phonocardiogram (PCG) or heart sound signal. Even though PCG contains significant and vital cardiac health information and cardiac abnormality signature, the presence of substantial noise does not guarantee highly effective analysis of cardiac condition. Our proposed method intelligently identifies and eliminates noisy PCG signal and consequently detects pathological abnormality condition. We further present a unified model of hybrid feature selection method. Our feature selection model is diversity optimized and cost-sensitive over conditional likelihood of the training and validation examples that maximizes classification model performance. We employ multi-stage hybrid feature selection process involving first level filter method and second level wrapper method. We achieve 85% detection accuracy by using publicly available MIT-Physionet challenge 2016 datasets consisting of more than 3000 annotated PCG signals.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"62 1","pages":"866-870"},"PeriodicalIF":0.0,"publicationDate":"2018-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74399479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1