
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): Latest Publications

Convolutional Dropout and Wordpiece Augmentation for End-to-End Speech Recognition
Hainan Xu, Yinghui Huang, Yun Zhu, Kartik Audhkhasi, B. Ramabhadran
Regularization and data augmentation are crucial to training end-to-end automatic speech recognition systems. Dropout is a popular regularization technique, which operates on each neuron independently by multiplying it with a Bernoulli random variable. We propose a generalization of dropout, called "convolutional dropout", where each neuron’s activation is replaced with a randomly-weighted linear combination of neuron values in its neighborhood. We believe that this formulation combines the regularizing effect of dropout with the smoothing effects of the convolution operation. In addition to convolutional dropout, this paper also proposes using random word-piece segmentations as a data augmentation scheme during training, inspired by results in neural machine translation. We adopt both these methods during the training of transformer-transducer speech recognition models, and show consistent WER improvements on Librispeech as well as across different languages.
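The abstract states the idea but not the exact weighting or normalization scheme, so the following is a minimal NumPy sketch of convolutional dropout as described: during training, each activation is replaced by a randomly weighted combination of its neighbours, with a Bernoulli gate playing the role of the standard dropout mask. The function name, kernel size, and uniform weighting are assumptions for illustration.

```python
import numpy as np

def convolutional_dropout(x, kernel_size=3, keep_prob=0.5, training=True):
    """Hypothetical sketch: replace each activation x[..., i] by a randomly
    weighted combination of the values in its neighbourhood along the last axis."""
    if not training:
        return x
    pad = kernel_size // 2
    x_pad = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(pad, pad)])
    out = np.empty_like(x)
    for i in range(x.shape[-1]):
        window = x_pad[..., i:i + kernel_size]
        # Bernoulli gate times random weights; the gate alone, with
        # kernel_size == 1, would reduce to ordinary (unscaled) dropout.
        weights = np.random.rand(kernel_size) * np.random.binomial(1, keep_prob, kernel_size)
        out[..., i] = window @ weights
    return out

# Toy usage: a batch of 2 feature vectors with 8 activations each.
y = convolutional_dropout(np.random.randn(2, 8), kernel_size=3, keep_prob=0.5)
```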
Cited by: 3
Sparsity in Max-Plus Algebra and Applications in Multivariate Convex Regression
Nikos Tsilivis, Anastasios Tsiamis, P. Maragos
In this paper, we study concepts of sparsity in the max-plus algebra and apply them to the problem of multivariate convex regression. We show how to efficiently find sparse (containing many −∞ elements) approximate solutions to max-plus equations by leveraging notions from submodular optimization. Subsequently, we propose a novel method for piecewise-linear surface fitting of convex multivariate functions, with optimality guarantees for the model parameters and an approximately minimum number of affine regions.
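For readers unfamiliar with the max-plus setting, the NumPy sketch below shows the max-plus matrix-vector product, the classical greatest sub-solution of A ⊗ x ≤ b, and a naive greedy way of keeping only k finite entries while setting the rest to −∞ (a sparse max-plus vector). The greedy loop is only an illustrative stand-in for the submodular-optimization machinery used in the paper.

```python
import numpy as np

def maxplus_matvec(A, x):
    """Max-plus product: (A (x) x)_i = max_j (A[i, j] + x[j])."""
    return np.max(A + x[None, :], axis=1)

def principal_solution(A, b):
    """Greatest sub-solution of A (x) x <= b: x*_j = min_i (b[i] - A[i, j])."""
    return np.min(b[:, None] - A, axis=0)

def greedy_sparse_solution(A, b, k):
    """Keep at most k finite entries of the principal solution, chosen greedily
    to reduce the sup-norm error (illustrative stand-in for the paper's method)."""
    x_star = principal_solution(A, b)
    x = np.full_like(x_star, -np.inf)
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in np.where(np.isinf(x))[0]:
            trial = x.copy()
            trial[j] = x_star[j]
            err = np.max(np.abs(b - maxplus_matvec(A, trial)))
            if err < best_err:
                best_j, best_err = j, err
        if best_j is None:
            break
        x[best_j] = x_star[best_j]
    return x

# Toy usage with a random system.
A = np.random.rand(5, 8)
b = maxplus_matvec(A, np.random.rand(8))
x_sparse = greedy_sparse_solution(A, b, k=3)
```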
Cited by: 2
Improving Identification of System-Directed Speech Utterances by Deep Learning of ASR-Based Word Embeddings and Confidence Metrics
Vilayphone Vilaysouk, Amr H. Nour-Eldin, Dermot Connolly
In this paper, we extend our previous work on the detection of system-directed speech utterances. This type of binary classification can be used by virtual assistants to create a more natural and fluid interaction between the system and the user. We explore two methods that both improve the Equal-Error-Rate (EER) performance of the previous model. The first exploits the supplementary information independently captured by ASR models through integrating ASR decoder-based features as additional inputs to the final classification stage of the model. This relatively improves EER performance by 13%. The second proposed method further integrates word embeddings into the architecture and, when combined with the first method, achieves a significant EER performance improvement of 48%, relative to that of the baseline.
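As a rough illustration of the fusion described above, the PyTorch sketch below concatenates an utterance-level acoustic embedding with ASR decoder/confidence features and a pooled word-embedding vector before a small binary classification head. All layer sizes, feature dimensions, and class names are invented for the example and are not taken from the paper.

```python
import torch
import torch.nn as nn

class DirectednessClassifier(nn.Module):
    """Illustrative late-fusion head: acoustic embedding + ASR decoder/confidence
    features + pooled word embeddings -> system-directed vs. background speech."""
    def __init__(self, d_acoustic=256, d_asr=16, d_word=300, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_acoustic + d_asr + d_word, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 2),  # two classes: device-directed / not directed
        )

    def forward(self, acoustic_emb, asr_feats, word_embs):
        # word_embs: (batch, num_words, d_word) -> mean-pool over the words
        pooled_words = word_embs.mean(dim=1)
        fused = torch.cat([acoustic_emb, asr_feats, pooled_words], dim=-1)
        return self.net(fused)

# Example forward pass with random tensors.
model = DirectednessClassifier()
logits = model(torch.randn(4, 256), torch.randn(4, 16), torch.randn(4, 10, 300))
```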
Cited by: 0
Image Coding with Neural Network-Based Colorization
Diogo Lopes, J. Ascenso, Catarina Brites, Fernando Pereira
Automatic colorization is a process with the objective of inferring the color of grayscale images. This process is frequently used for artistic purposes and to restore the color in old or damaged images. Motivated by the excellent results obtained with deep learning-based solutions in the area of automatic colorization, this paper proposes an image coding solution integrating a deep learning-based colorization process to estimate the chrominance components based on the decoded luminance, which is regularly encoded with a conventional image coding standard. In this case, the chrominance components are not coded and transmitted as usual, notably after some subsampling, as only some color hints, i.e. chrominance values for specific pixel locations, may be sent to the decoder to help it create more accurate colorizations. To boost the colorization and final compression performance, intelligent ways to select the color hints are proposed. Experimental results show performance improvements with the increased level of intelligence in the color hint extraction process and a good subjective quality of the final decoded (and colorized) images.
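A minimal sketch of the hint mechanism under stated assumptions: the decoder side receives the decoded luminance plus a sparse chrominance "hint" tensor (values at a few pixel locations and a mask marking where they are valid), which a colorization network would consume. The uniform-grid selection below is only a placeholder for the intelligent hint-selection strategies the paper actually proposes.

```python
import numpy as np

def make_color_hints(chroma, stride=32):
    """Build a sparse hint tensor from ground-truth chrominance (H, W, 2):
    keep values on a coarse grid and a binary mask marking valid positions."""
    h, w, _ = chroma.shape
    hints = np.zeros_like(chroma)
    mask = np.zeros((h, w, 1), dtype=chroma.dtype)
    hints[::stride, ::stride] = chroma[::stride, ::stride]
    mask[::stride, ::stride] = 1.0
    return hints, mask

# The colorization network would then see [luminance, hints, mask] stacked
# channel-wise, i.e. an input of shape (H, W, 1 + 2 + 1).
luma = np.random.rand(256, 256, 1)
chroma = np.random.rand(256, 256, 2)
hints, mask = make_color_hints(chroma)
net_input = np.concatenate([luma, hints, mask], axis=-1)
```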
Cited by: 0
ATVIO: Attention Guided Visual-Inertial Odometry
Li Liu, Ge Li, Thomas H. Li
Visual-inertial odometry (VIO) aims to predict trajectory by ego-motion estimation. In recent years, end-to-end VIO has made great progress. However, how to handle visual and inertial measurements and make full use of the complementarity of cameras and inertial sensors remains a challenge. In this paper, we propose a novel attention guided deep framework for visual-inertial odometry (ATVIO) to improve the performance of VIO. Specifically, we focus in particular on the effective utilization of the Inertial Measurement Unit (IMU) information. Therefore, we carefully design a one-dimensional inertial feature encoder for IMU data processing. The network can extract inertial features quickly and effectively. Meanwhile, we need to prevent the inconsistency problem that arises when fusing inertial and visual features. Hence, we explore a novel cross-domain channel attention block to combine the extracted features in a more adaptive manner. Extensive experiments demonstrate that our method achieves competitive performance against state-of-the-art VIO methods.
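A hedged PyTorch sketch of the two ingredients named in the abstract: a 1-D convolutional encoder over the IMU stream, and an SE-style channel-attention gate that re-weights the fused visual/inertial channels. Layer widths, kernel sizes, and the 6-channel IMU layout are assumptions; the actual ATVIO architecture may differ.

```python
import torch
import torch.nn as nn

class InertialEncoder(nn.Module):
    """1-D conv encoder over an IMU window of shape (batch, 6, T)
    (3-axis accelerometer + 3-axis gyroscope assumed)."""
    def __init__(self, d_out=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(6, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, d_out, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, imu):
        return self.net(imu).squeeze(-1)  # (batch, d_out)

class ChannelAttention(nn.Module):
    """SE-style gate that re-weights channels of the fused visual+inertial feature."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, fused):
        return fused * self.gate(fused)

# Toy usage: 512-D visual feature fused with a 128-D inertial feature.
imu_feat = InertialEncoder()(torch.randn(4, 6, 100))
fused = torch.cat([torch.randn(4, 512), imu_feat], dim=-1)
weighted = ChannelAttention(512 + 128)(fused)
```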
Cited by: 8
Segmental DTW: A Parallelizable Alternative to Dynamic Time Warping
T. Tsai
In this work we explore parallelizable alternatives to DTW for globally aligning two feature sequences. One of the main practical limitations of DTW is its quadratic computation and memory cost. Previous works have sought to reduce the computational cost in various ways, such as imposing bands in the cost matrix or using a multiresolution approach. In this work, we utilize the fact that computation is an abundant resource and focus instead on exploring alternatives that approximate the inherently sequential DTW algorithm with one that is parallelizable. We describe two variations of an algorithm called Segmental DTW, in which the global cost matrix is broken into smaller sub-matrices, subsequence DTW is performed on each sub-matrix, and the results are used to solve a segment-level dynamic programming problem that specifies a globally optimal alignment path. We evaluate the proposed alignment algorithms on an audio-audio alignment task using the Chopin Mazurka dataset, and we show that they closely match the performance of regular DTW. We further demonstrate that almost all of the computations in Segmental DTW are parallelizable, and that one of the variants is unilaterally better than the other for both empirical and theoretical reasons.
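The building block described above is subsequence DTW, where the query must be matched in full but the path may start and end anywhere in the reference segment; a minimal NumPy version is sketched below. The segment-level dynamic program that stitches the sub-matrix results into one global alignment is omitted, so this illustrates the idea rather than the full Segmental DTW algorithm.

```python
import numpy as np

def subsequence_dtw_cost(C):
    """Subsequence DTW over one sub-matrix C (query frames x reference frames):
    the path may start and end at any reference column, which is what lets
    Segmental DTW process sub-matrices independently and in parallel."""
    n, m = C.shape
    D = np.empty((n, m))
    D[0, :] = C[0, :]                 # free start anywhere along the reference
    for i in range(1, n):
        D[i, 0] = D[i - 1, 0] + C[i, 0]
        for j in range(1, m):
            D[i, j] = C[i, j] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D

# Toy example: random cost matrix for one sub-matrix; the best subsequence
# match ends at the cheapest cell of the last row.
C = np.abs(np.random.randn(30, 50))
D = subsequence_dtw_cost(C)
best_end = int(np.argmin(D[-1]))
# In Segmental DTW, each sub-matrix would be processed like this in parallel,
# and a segment-level dynamic program over the sub-matrix boundary costs would
# then recover a single globally optimal alignment (omitted here).
```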
Cited by: 2
On the Detection of Pitch-Shifted Voice: Machines and Human Listeners
D. Looney, N. Gaubitch
We present a performance comparison between human listeners and a simple algorithm for the task of speech anomaly detection. The algorithm utilises an intentionally small set of features derived from the source-filter model, with the aim of validating that key components of source-filter theory characterise how humans perceive anomalies. We furthermore recognise that humans are adept at detecting anomalies without prior exposure to a given anomaly class. To that end, we also consider the algorithm performance when operating via the principle of unsupervised learning where a null model is derived from normal speech recordings. We evaluate both the algorithm and human listeners for pitch-shift detection where the pitch of a speech sample is intentionally modified using software, a phenomenon of relevance to the fields of fraud detection and forensics. Our results show that humans can only detect pitch-shift reliably at more extreme levels, and that the performance of the algorithm matches closely with that of humans.
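The abstract does not list the exact source-filter features, so the sketch below only illustrates the unsupervised null-model idea in generic form: fit a Gaussian to feature vectors extracted from normal speech and score new utterances by Mahalanobis distance, flagging large scores as anomalous. The feature dimensionality and the Gaussian choice are assumptions, not details from the paper.

```python
import numpy as np

class GaussianNullModel:
    """Unsupervised null model: fit a Gaussian to feature vectors extracted
    from normal (unmodified) speech, then score new utterances by their
    Mahalanobis distance; large distances suggest manipulation."""
    def fit(self, normal_features):
        self.mu = normal_features.mean(axis=0)
        self.cov_inv = np.linalg.inv(np.cov(normal_features, rowvar=False)
                                     + 1e-6 * np.eye(normal_features.shape[1]))
        return self

    def score(self, features):
        d = features - self.mu
        return np.einsum("ij,jk,ik->i", d, self.cov_inv, d)

# Features would come from a source-filter analysis (e.g. pitch statistics,
# spectral-envelope descriptors); random numbers stand in here.
null = GaussianNullModel().fit(np.random.randn(500, 4))
scores = null.score(np.random.randn(10, 4))
```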
Cited by: 0
Multi-Task Estimation of Age and Cognitive Decline from Speech
Yilin Pan, Venkata Srikanth Nallanthighal, D. Blackburn, H. Christensen, Aki Härmä
Speech is a common physiological signal that can be affected by both ageing and cognitive decline. Often the effect can be confounding, as would be the case for people at, e.g., very early stages of cognitive decline due to dementia. Despite this, the automatic predictions of age and cognitive decline based on cues found in the speech signal are generally treated as two separate tasks. In this paper, multi-task learning is applied for the joint estimation of age and the Mini-Mental Status Evaluation criteria (MMSE) commonly used to assess cognitive decline. To explore the relationship between age and MMSE, two neural network architectures are evaluated: a SincNet-based end-to-end architecture, and a system comprising a feature extractor followed by a shallow neural network. Both are trained with single-task or multi-task targets. For comparison, an SVM-based regressor is trained in a single-task setup. i-vector, x-vector and ComParE features are explored. Results are obtained on systems trained on the DementiaBank dataset and tested on an in-house dataset as well as the ADReSS dataset. The results show that both age and MMSE estimation are improved by applying multi-task learning, with state-of-the-art results achieved on the ADReSS dataset acoustic-only task.
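A minimal PyTorch sketch of the multi-task setup in generic form: a shared encoder over some utterance-level feature (e.g. an x-vector) feeds two regression heads, one for age and one for MMSE, trained with a weighted joint loss. The dimensions, the loss weighting, and the plain-MSE objective are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class AgeMMSERegressor(nn.Module):
    """Shared encoder over an utterance-level feature (e.g. an x-vector),
    with two regression heads for age and MMSE trained jointly."""
    def __init__(self, d_in=512, d_hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.age_head = nn.Linear(d_hidden, 1)
        self.mmse_head = nn.Linear(d_hidden, 1)

    def forward(self, x):
        h = self.shared(x)
        return self.age_head(h).squeeze(-1), self.mmse_head(h).squeeze(-1)

def multitask_loss(age_pred, mmse_pred, age, mmse, alpha=0.5):
    # Weighted sum of the two regression losses; alpha is a tuning choice.
    return alpha * nn.functional.mse_loss(age_pred, age) + \
           (1 - alpha) * nn.functional.mse_loss(mmse_pred, mmse)

# Toy usage with random targets in plausible ranges.
model = AgeMMSERegressor()
age_pred, mmse_pred = model(torch.randn(8, 512))
loss = multitask_loss(age_pred, mmse_pred, torch.rand(8) * 80, torch.rand(8) * 30)
```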
语言是一种常见的生理信号,会受到衰老和认知能力下降的影响。通常情况下,这种影响是令人困惑的,例如,对于那些由于痴呆症而认知能力下降的非常早期阶段的人来说。尽管如此,基于语音信号中发现的线索对年龄和认知能力下降的自动预测通常被视为两个独立的任务。本文将多任务学习应用于年龄的联合估计和常用的认知衰退评估标准——最小心智状态评估标准(MMSE)。为了探索年龄和MMSE之间的关系,我们评估了两种神经网络架构:基于sincnet的端到端架构,以及由特征提取器和浅神经网络组成的系统。两者都接受过单任务或多任务目标的训练。为了进行比较,在单任务设置中训练基于svm的回归器。探索了i向量、x向量和ComParE特征。在DementiaBank数据集上训练的系统上获得结果,并在内部数据集和address数据集上进行测试。结果表明,通过应用多任务学习,年龄和MMSE估计都得到了改善,在address数据集声学任务上取得了最先进的结果。
{"title":"Multi-Task Estimation of Age and Cognitive Decline from Speech","authors":"Yilin Pan, Venkata Srikanth Nallanthighal, D. Blackburn, H. Christensen, Aki Härmä","doi":"10.1109/ICASSP39728.2021.9414642","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414642","url":null,"abstract":"Speech is a common physiological signal that can be affected by both ageing and cognitive decline. Often the effect can be confounding, as would be the case for people at, e.g., very early stages of cognitive decline due to dementia. Despite this, the automatic predictions of age and cognitive decline based on cues found in the speech signal are generally treated as two separate tasks. In this paper, multi-task learning is applied for the joint estimation of age and the Mini-Mental Status Evaluation criteria (MMSE) commonly used to assess cognitive decline. To explore the relationship between age and MMSE, two neural network architectures are evaluated: a SincNet-based end-to-end architecture, and a system comprising of a feature extractor followed by a shallow neural network. Both are trained with single-task or multi-task targets. To compare, an SVM-based regressor is trained in a single-task setup. i-vector, x-vector and ComParE features are explored. Results are obtained on systems trained on the DementiaBank dataset and tested on an in-house dataset as well as the ADReSS dataset. The results show that both the age and MMSE estimation is improved by applying multitask learning, with state-of-the-art results achieved on the ADReSS dataset acoustic-only task.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125175690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 5
GDTW: A Novel Differentiable DTW Loss for Time Series Tasks
Xiang Liu, Naiqi Li, Shutao Xia
Dynamic time warping (DTW) is one of the most successful methods for measuring the discrepancy between two series; it is robust to shift and distortion along the time axis of the sequence. Based on DTW, we propose a novel loss function for time series data called Gumbel-Softmin based fast DTW (GDTW). To the best of our knowledge, this is the first differentiable DTW loss for series data that scales linearly with the sequence length. The proposed Gumbel-Softmin replaces the simple minimization operator in DTW so as to better integrate the acceleration technology. We also design a deep learning model incorporating GDTW as a feature extractor. Thorough experiments over a broad range of time series analysis tasks were performed, showing the efficiency and effectiveness of our method.
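The abstract does not spell out the Gumbel-Softmin operator, so the sketch below shows one plausible formulation: Gumbel noise plus a softmax over negated inputs replaces the hard min inside the DTW recursion, making the accumulated cost usable as a training loss. Note that this naive quadratic recursion does not reproduce the linear-time scaling claimed for GDTW.

```python
import torch

def gumbel_softmin(values, tau=1.0):
    """Differentiable, stochastic relaxation of min(values) over the last dim:
    Gumbel noise plus a softmax over the negated inputs (assumed formulation)."""
    gumbel = -torch.log(-torch.log(torch.rand_like(values).clamp_min(1e-10)))
    weights = torch.softmax(-values / tau + gumbel, dim=-1)
    return (weights * values).sum(dim=-1)

def dtw_with_gumbel_softmin(cost, tau=1.0):
    """DTW recursion with the hard min replaced by gumbel_softmin.
    cost: (n, m) pairwise distance matrix; returns a scalar alignment cost."""
    n, m = cost.shape
    LARGE = 1e6  # finite stand-in for +inf so the soft-min stays well defined
    R = torch.zeros(n + 1, m + 1)
    R[0, 1:] = LARGE
    R[1:, 0] = LARGE
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = torch.stack([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            R[i, j] = cost[i - 1, j - 1] + gumbel_softmin(prev, tau)
    return R[n, m]

# Toy usage: two random feature sequences and their Euclidean cost matrix.
a, b = torch.randn(20, 4), torch.randn(25, 4)
loss = dtw_with_gumbel_softmin(torch.cdist(a, b), tau=0.1)
```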
Cited by: 1
CMIM: Cross-Modal Information Maximization For Medical Imaging
Tristan Sylvain, Francis Dutil, T. Berthier, Lisa Di-Jorio, M. Luck, R. Devon Hjelm, Y. Bengio
In hospitals, data are siloed to specific information systems that make the same information available under different modalities such as the different medical imaging exams the patient undergoes (CT scans, MRI, PET, Ultrasound, etc.) and their associated radiology reports. This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time. In this paper, we propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time, using recent advances in mutual information maximization. By maximizing cross-modal information at train time, we are able to outperform several state-of-the-art baselines in two different settings: medical image classification and segmentation. In particular, our method is shown to have a strong impact on the inference-time performance of weaker modalities.
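"Recent advances in mutual information maximization" usually refers to contrastive lower bounds such as InfoNCE, so the PyTorch sketch below uses an InfoNCE-style cross-modal loss between paired embeddings (e.g. image and report) as an assumed stand-in for the paper's exact objective; matched pairs in a batch are positives, all other pairings are negatives.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_a, z_b, temperature=0.1):
    """InfoNCE-style lower bound on mutual information between two modalities:
    matched (image, report) pairs in a batch are positives, all other pairings
    are negatives. An assumed stand-in for the paper's exact MI objective."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(z_a.size(0))
    # Symmetric cross-entropy: modality A -> B and B -> A.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random 128-D embeddings for a batch of 16 paired samples.
loss = cross_modal_infonce(torch.randn(16, 128), torch.randn(16, 128))
```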
Cited by: 3