
Latest publications from ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

A Novel Convolutional Neural Network Based on Adaptive Multi-Scale Aggregation and Boundary-Aware for Lateral Ventricle Segmentation on MR images
Fei Ye, Zhiqiang Wang, Sheng Zhu, Xuanya Li, Kai Hu
In this paper, we propose a novel convolutional neural network based on adaptive multi-scale feature aggregation and boundary awareness for lateral ventricle segmentation (MB-Net), which mainly includes three parts: an adaptive multi-scale feature aggregation module (AMSFM), an embedded boundary refinement module (EBRM), and a local feature extraction module (LFM). Specifically, the AMSFM extracts multi-scale features through different receptive fields to effectively handle the widely varying target regions on magnetic resonance (MR) images. The EBRM extracts boundary information to effectively resolve blurred boundaries. The LFM extracts local information using spatial and channel attention mechanisms to cope with irregular shapes. Finally, extensive experiments are conducted from different perspectives to evaluate the performance of the proposed MB-Net. Furthermore, we also verify the robustness of the model on other public datasets, i.e., COVID-SemiSeg and CHASE DB1. The results show that our MB-Net achieves competitive results compared with state-of-the-art methods.
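To make the multi-scale aggregation idea concrete, here is a minimal PyTorch sketch of an adaptive multi-scale block in the spirit of the AMSFM; the module name, dilation rates, and softmax branch weighting are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class MultiScaleAggregation(nn.Module):
    """Fuses features from several receptive fields (dilated 3x3 convolutions)
    with learned, softmax-normalized per-branch weights."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.scale_logits = nn.Parameter(torch.zeros(len(dilations)))  # adaptive weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.scale_logits, dim=0)
        feats = [w * branch(x) for w, branch in zip(weights, self.branches)]
        return torch.stack(feats, dim=0).sum(dim=0)

x = torch.randn(1, 16, 64, 64)             # (batch, channels, H, W)
print(MultiScaleAggregation(16)(x).shape)   # torch.Size([1, 16, 64, 64])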
{"title":"A Novel Convolutional Neural Network Based on Adaptive Multi-Scale Aggregation and Boundary-Aware for Lateral Ventricle Segmentation on MR images","authors":"Fei Ye, Zhiqiang Wang, Sheng Zhu, Xuanya Li, Kai Hu","doi":"10.1109/icassp43922.2022.9747266","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9747266","url":null,"abstract":"In this paper, we propose a novel convolutional neural network based on adaptive multi-scale feature aggregation and boundary-aware for lateral ventricle segmentation (MB-Net), which mainly includes three parts, i.e., an adaptive multi-scale feature aggregation module (AMSFM), an embedded boundary refinement module (EBRM), and a local feature extraction module (LFM). Specifically, the AMSFM is used to extract multi-scale features through the different receptive fields to effectively solve the problem of distinct target regions on magnetic resonance (MR) images. The EBRM is intended to extract boundary information to effectively solve blurred boundary problems. The LFM can make the extraction of local information based on spatial and channel attention mechanisms to solve the problem of irregular shapes. Finally, extensive experiments are conducted from different perspectives to evaluate the performance of the proposed MB-Net. Furthermore, we also verify the robustness of the model on other public datasets, i.e., COVID-SemiSeg and CHASE DB1. The results show that our MB-Net can achieve competitive results when compared with state-of-the-art methods.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"559 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114791894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Modeling The Detection Capability Of High-Speed Spiking Cameras
Junwei Zhao, Zhaofei Yu, Lei Ma, Ziluo Ding, Shiliang Zhang, Yonghong Tian, Tiejun Huang
The novel working principle enables spiking cameras to capture high-speed moving objects. However, the applications of spiking cameras can be affected by many factors, such as brightness intensity, detectable distance, and the maximum speed of moving targets. Improper settings, such as weak ambient brightness or too short an object-camera distance, will cause such cameras to fail in practice. To address this issue, this paper proposes a modeling algorithm that characterizes the detection capability of spiking cameras. The algorithm deduces the maximum detectable speed of a spiking camera for different scenario settings (e.g., brightness intensity, camera lens, and object-camera distance) from the camera's basic technical parameters (e.g., pixel size, spatial and temporal resolution). Thereby, the proper camera settings for various applications can be determined. Extensive experiments verify the effectiveness of the modeling algorithm. To the best of our knowledge, this is the first work to investigate the detection capability of spiking cameras.
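As a rough illustration of how such a model can link scene settings to a maximum detectable speed, the sketch below uses simple pinhole-camera geometry (the image of the object may move at most one pixel per sampling interval, so v * f / D <= pixel_size / dt); both the relation and the example numbers are illustrative assumptions, not the paper's derivation.

def max_detectable_speed(pixel_size_m: float, distance_m: float,
                         focal_length_m: float, sample_interval_s: float) -> float:
    """Largest object speed (m/s) whose projected image moves at most one
    pixel per sampling interval under a pinhole-camera model."""
    return pixel_size_m * distance_m / (focal_length_m * sample_interval_s)

# Hypothetical numbers: 10 um pixels, object 10 m away, 25 mm lens,
# 25 us sampling interval.
print(max_detectable_speed(10e-6, 10.0, 25e-3, 25e-6))  # 160.0 m/s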
{"title":"Modeling The Detection Capability Of High-Speed Spiking Cameras","authors":"Junwei Zhao, Zhaofei Yu, Lei Ma, Ziluo Ding, Shiliang Zhang, Yonghong Tian, Tiejun Huang","doi":"10.1109/ICASSP43922.2022.9747018","DOIUrl":"https://doi.org/10.1109/ICASSP43922.2022.9747018","url":null,"abstract":"The novel working principle enables spiking cameras to capture high-speed moving objects. However, the applications of spiking cameras can be affected by many factors, such as brightness intensity, detectable distance, and the maximum speed of moving targets. Improper settings such as weak ambient brightness and too short object-camera distance, will lead to failure in the application of such cameras. To address the issue, this paper proposes a modeling algorithm that studies the detection capability of spiking cameras. The algorithm deduces the maximum detectable speed of spiking cameras corresponding to different scenario settings (e.g., brightness intensity, camera lens, and object-camera distance) based on the basic technical parameters of cameras (e.g., pixel size, spatial and temporal resolution). Thereby, the proper camera settings for various applications can be determined. Extensive experiments verify the effectiveness of the modeling algorithm. To our best knowledge, it is the first work to investigate the detection capability of spiking cameras.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124337961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models
A. Ogawa, Naohiro Tawara, Marc Delcroix, S. Araki
We investigate the effectiveness of using a large ensemble of advanced neural language models (NLMs) for lattice rescoring of automatic speech recognition (ASR) hypotheses. Previous studies have reported the effectiveness of combining a small number of NLMs. In contrast, in this study, we combine up to eight NLMs, i.e., forward/backward long short-term memory/Transformer-LMs trained with two different random initialization seeds. We combine these NLMs through iterative lattice generation. Since these NLMs complement each other, combining them one by one at each rescoring iteration gradually refines the language scores attached to the lattice arcs. Consequently, the errors in the ASR hypotheses are gradually reduced. We also investigate the effectiveness of carrying over contextual information (previous rescoring results) across the lattice sequence of a long speech such as a lecture. In experiments on a lecture speech corpus, combining the eight NLMs with context carry-over yielded a 24.4% relative word error rate reduction from the ASR 1-best baseline. For further comparison, we performed simultaneous (i.e., non-iterative) NLM combination and 100-best rescoring using the large ensemble of NLMs, which confirmed the advantage of lattice rescoring with iterative NLM combination.
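Schematically, the iterative combination can be sketched as below; the lattice and language-model interfaces (lattice.arcs(), nlm.score(...), lattice.regenerate()) are hypothetical placeholders rather than a real toolkit API, and the fixed interpolation weight is an assumption.

def iterative_rescore(lattice, nlms, weight=0.5, context=None):
    """Refine the language score on each lattice arc one NLM at a time,
    regenerating the lattice after every rescoring pass."""
    for nlm in nlms:
        for arc in lattice.arcs():
            new_score = nlm.score(arc.word_seq, context)
            # Interpolate the current arc score with this NLM's score.
            arc.lm_score = (1 - weight) * arc.lm_score + weight * new_score
        lattice = lattice.regenerate()  # re-expand/prune around the best paths
    return lattice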
{"title":"Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models","authors":"A. Ogawa, Naohiro Tawara, Marc Delcroix, S. Araki","doi":"10.1109/ICASSP43922.2022.9747745","DOIUrl":"https://doi.org/10.1109/ICASSP43922.2022.9747745","url":null,"abstract":"We investigate the effectiveness of using a large ensemble of advanced neural language models (NLMs) for lattice rescoring on automatic speech recognition (ASR) hypotheses. Previous studies have reported the effectiveness of combining a small number of NLMs. In contrast, in this study, we combine up to eight NLMs, i.e., forward/backward long short-term memory/Transformer-LMs that are trained with two different random initialization seeds. We combine these NLMs through iterative lattice generation. Since these NLMs work complementarily with each other, by combining them one by one at each rescoring iteration, language scores attached to given lattice arcs can be gradually refined. Consequently, errors of the ASR hypotheses can be gradually reduced. We also investigate the effectiveness of carrying over contextual information (previous rescoring results) across a lattice sequence of a long speech such as a lecture speech. In experiments using a lecture speech corpus, by combining the eight NLMs and using context carry-over, we obtained a 24.4% relative word error rate reduction from the ASR 1-best baseline. For further comparison, we performed simultaneous (i.e., non-iterative) NLM combination and 100-best rescoring using the large ensemble of NLMs, which confirmed the advantage of lattice rescoring with iterative NLM combination.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127627202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
HiFiDenoise: High-Fidelity Denoising Text to Speech with Adversarial Networks
Lichao Zhang, Yi Ren, Liqun Deng, Zhou Zhao
Building a high-fidelity speech synthesis system from noisy speech data is a challenging but valuable task that could significantly reduce the cost of data collection. Existing methods usually train speech synthesis systems on speech denoised with an enhancement model, or feed noise information into the system as a condition. These methods do suppress noise to some extent, but the quality and prosody of their synthesized speech still fall far short of natural speech. In this paper, we propose HiFiDenoise, a speech synthesis system with adversarial networks that can synthesize high-fidelity speech from low-quality, noisy speech data. Specifically, 1) to tackle the difficulty of noise modeling, we introduce multi-length adversarial training in the noise condition module; 2) to handle inaccurate pitch extraction caused by noise, we remove the pitch predictor in the acoustic model and add discriminators on the mel-spectrogram generator; 3) in addition, we also apply HiFiDenoise to singing voice synthesis with a noisy singing dataset. Experiments show that our model outperforms the baseline by 0.36 and 0.44 MOS on speech and singing, respectively.
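A minimal sketch of the multi-length adversarial idea is shown below: several discriminators, each judging mel-spectrogram segments of a different length. The segment lengths, the least-squares GAN loss, and the assumption that inputs have at least 128 frames are all illustrative choices, not the paper's exact setup.

import torch

def multi_length_d_loss(discriminators, real_mel, fake_mel, lengths=(32, 64, 128)):
    """Discriminator loss summed over segment lengths.
    real_mel/fake_mel: (batch, n_mels, frames) with frames >= max(lengths)."""
    loss = 0.0
    for disc, seg_len in zip(discriminators, lengths):
        start = torch.randint(0, real_mel.size(-1) - seg_len + 1, (1,)).item()
        real_seg = real_mel[..., start:start + seg_len]
        fake_seg = fake_mel[..., start:start + seg_len]
        # Least-squares GAN objective: real -> 1, fake -> 0.
        loss = loss + ((disc(real_seg) - 1) ** 2).mean() + (disc(fake_seg) ** 2).mean()
    return loss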
{"title":"HiFiDenoise: High-Fidelity Denoising Text to Speech with Adversarial Networks","authors":"Lichao Zhang, Yi Ren, Liqun Deng, Zhou Zhao","doi":"10.1109/icassp43922.2022.9747155","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9747155","url":null,"abstract":"Building a high-fidelity speech synthesis system with noisy speech data is a challenging but valuable task, which could significantly reduce the cost of data collection. Existing methods usually train speech synthesis systems based on the speech denoised with an enhancement model or feed noise information as a condition into the system. These methods certainly have some effect on inhibiting noise, but the quality and the prosody of their synthesized speech are still far away from natural speech. In this paper, we propose HiFiDenoise, a speech synthesis system with adversarial networks that can synthesize high-fidelity speech with low-quality and noisy speech data. Specifically, 1) to tackle the difficulty of noise modeling, we introduce multi-length adversarial training in the noise condition module. 2) To handle the problem of inaccurate pitch extraction caused by noise, we remove the pitch predictor in the acoustic model and also add discriminators on the mel-spectrogram generator. 3) In addition, we also apply HiFiDenoise to singing voice synthesis with a noisy singing dataset. Experiments show that our model outperforms the baseline by 0.36 and 0.44 in terms of MOS on speech and singing respectively.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127761644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
PMP-NET: Rethinking Visual Context for Scene Graph Generation
Xuezhi Tong, Rui Wang, Chuan Wang, Sanyi Zhang, Xiaochun Cao
Scene graph generation aims to describe the contents of a scene by identifying the objects and their relationships. In previous works, visual context is widely utilized in message passing networks to generate representations for classification. However, noisy estimation of visual context limits model performance. In this paper, we revisit the concept of incorporating visual context via a randomly ordered bidirectional Long Short-Term Memory (biLSTM)-based baseline, and show that noisy estimation is worse than random ordering. To alleviate the problem, we propose a new method, dubbed Progressive Message Passing Network (PMP-Net), that estimates the visual context in a coarse-to-fine manner. Specifically, we first estimate the visual context with a randomly initiated scene graph, then refine it with multi-head attention. Experimental results on the benchmark dataset Visual Genome show that PMP-Net achieves better or comparable performance on all three tasks: scene graph generation (SGGen), scene graph classification (SGCls), and predicate classification (PredCls).
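The coarse-to-fine refinement can be pictured with the short PyTorch sketch below, where a randomly initialized coarse context is repeatedly refined by multi-head self-attention over all objects; the dimensions, step count, and residual update are illustrative assumptions.

import torch
import torch.nn as nn

class ProgressiveRefiner(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        """node_feats: (batch, num_objects, dim) coarse context estimates."""
        ctx = node_feats
        for _ in range(self.steps):
            refined, _ = self.attn(ctx, ctx, ctx)  # attend over all objects
            ctx = ctx + refined                    # residual refinement step
        return ctx

print(ProgressiveRefiner()(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 10, 256])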
{"title":"PMP-NET: Rethinking Visual Context for Scene Graph Generation","authors":"Xuezhi Tong, Rui Wang, Chuan Wang, Sanyi Zhang, Xiaochun Cao","doi":"10.1109/icassp43922.2022.9747415","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9747415","url":null,"abstract":"Scene graph generation aims to describe the contents in scenes by identifying the objects and their relationships. In previous works, visual context is widely utilized in message passing networks to generate the representations for classification. However, the noisy estimation of visual context limits model performance. In this paper, we revisit the concept of incorporating visual context via a randomly ordered bidirectional Long Short Temporal Memory (biLSTM) based baseline, and show that noisy estimation is worse than random. To alleviate the problem, we propose a new method, dubbed Progressive Message Passing Network (PMP-Net) that better estimates the visual context in a coarse to fine manner. Specifically, we first estimate the visual context with a random initiated scene graph, then refine it with multi-head attention. The experimental results on the benchmark dataset Visual Genome show that PMP-Net achieves better or comparable performance on all three tasks: scene graph generation (SGGen), scene graph classification (SGCls), and predicate classification (PredCls).","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126297600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Cramer-Rao Bound for the Time-Varying Poisson
Xinhui Rong, V. Solo
Point processes are finding increasing applications in neuroscience, genomics, and social media, but their basic modelling properties are little studied. Here we consider a periodic time-varying Poisson model and develop the asymptotic Cramer-Rao bound. We also develop, for the first time, a maximum likelihood algorithm for parameter estimation.
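For context, the Fisher information that a Cramer-Rao bound inverts has a standard form for an inhomogeneous Poisson process with parametric intensity \lambda(t;\theta) observed on [0,T]; the formulas below state this generic form, and a periodic intensity such as \lambda(t) = a + b\cos(\omega t + \phi) is only one illustrative parameterization, not necessarily the paper's exact model.

\ell(\theta) = \int_0^T \log\lambda(t;\theta)\, dN(t) - \int_0^T \lambda(t;\theta)\, dt,
\qquad
I(\theta) = \int_0^T \frac{\nabla_\theta\lambda(t;\theta)\,\nabla_\theta\lambda(t;\theta)^{\top}}{\lambda(t;\theta)}\, dt,
\qquad
\mathrm{cov}(\hat\theta) \succeq I(\theta)^{-1}.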
{"title":"Cramer-Rao Bound for the Time-Varying Poisson","authors":"Xinhui Rong, V. Solo","doi":"10.1109/ICASSP43922.2022.9746658","DOIUrl":"https://doi.org/10.1109/ICASSP43922.2022.9746658","url":null,"abstract":"Point processes are finding increasing applications in neuroscience, genomics, and social media. But basic modelling properties are little studied. Here we consider a periodic time-varying Poisson model and develop the asymptotic Cramer-Rao bound. We also develop, for the first time, a maximum likelihood algorithm for parameter estimation.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126306329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Commonsense Knowledge Enhanced Network with Retrospective Loss for Emotion Recognition in Spoken Dialog
Yunhe Xie, Chengjie Sun, Zhenzhou Ji
Recent surges in open conversational data have brought much attention to Emotion Recognition in Spoken Dialog (ERSD). However, the limited scale of existing ERSD datasets constrains a model's reasoning. Moreover, an artificial dialog agent should ideally be able to reference past dialog experiences. This paper proposes a Commonsense Knowledge Enhanced Network with a retrospective loss, namely CKE-Net, to hierarchically perform dialog modeling, external knowledge integration, and historical state retrospection. Specifically, we first adopt a transformer-based encoder to model context from multiple views by elaborating different mask matrices. Then, a graph attention network is used to introduce commonsense knowledge, which benefits complex emotional reasoning. Finally, a retrospective loss is added to exploit the model's prior experience during training. Experiments on the IEMOCAP and MELD datasets demonstrate that every designed module is consistently beneficial to performance. Extensive experimental results show that our model outperforms the state-of-the-art models on both benchmark datasets.
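The "different mask matrices" for multi-view context modeling can be sketched as below: one view restricts attention to utterances from the same speaker, another allows the full dialog. The two-view choice and the mask convention are illustrative assumptions rather than the paper's exact design.

import torch

def build_view_masks(speaker_ids: torch.Tensor):
    """speaker_ids: (seq_len,) integer speaker label per utterance.
    Returns boolean masks where True BLOCKS attention, matching the
    attn_mask convention of torch.nn.MultiheadAttention."""
    same_speaker = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    intra_view_mask = ~same_speaker                       # block cross-speaker attention
    global_view_mask = torch.zeros_like(intra_view_mask)  # block nothing (full dialog)
    return intra_view_mask, global_view_mask

intra, glob = build_view_masks(torch.tensor([0, 1, 0, 1, 1]))
print(intra.shape, glob.any().item())  # torch.Size([5, 5]) False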
{"title":"A Commonsense Knowledge Enhanced Network with Retrospective Loss for Emotion Recognition in Spoken Dialog","authors":"Yunhe Xie, Chengjie Sun, Zhenzhou Ji","doi":"10.1109/icassp43922.2022.9746909","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9746909","url":null,"abstract":"The recent surges in the open conversational data caused Emotion Recognition in Spoken Dialog (ERSD) to gain much attention. However, the existing ERSD datasets’ scale limits the model’s complete reasoning. Moreover, the artificial dialogue agent is ideally able to reference past dialogue experiences. This paper proposes a Commonsense Knowledge Enhanced Network with a retrospective loss, namely CKE-Net, to hierarchically perform dialog modeling, external knowledge integration, and historical state retrospect. Specifically, we first adopt a transformer-based encoder to model context in multi-view by elaborating different mask matrices. Then, the graph attention network is used to introduce commonsense knowledge, which benefits the complex emotional reasoning. Finally, a retrospective loss is added to utilize the model’s prior experience during training. Experiments on IEMOCAP and MELD datasets demonstrate that every designed module is consistently beneficial to the performance. Extensive experimental results show that our model outperforms the state-of-the-art models across the two benchmark datasets.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126306338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Category-Adapted Sound Event Enhancement with Weakly Labeled Data
Guangwei Li, Xuenan Xu, Heinrich Dinkel, Mengyue Wu, K. Yu
Previous audio enhancement training usually requires clean signals with additive noise, and hence commonly focuses on speech enhancement, where clean speech is easy to obtain. This paper extends enhancement to broader sound events by using a weakly supervised approach via sound event detection (SED) to approximate the location and presence of a specific sound event. We propose a category-adapted system that enables enhancement of any selected sound category: we first familiarize the model with all common sound classes, then apply a category-specific fine-tuning procedure to enhance the targeted sound class. Evaluation is conducted on ten common sound classes, with comparisons to traditional and weakly supervised enhancement methods. Results indicate an average 2.86 dB SDR increase, with more significant improvements on speech (9.15 dB), music (5.01 dB), and typewriter (3.68 dB) at an SNR of 0 dB. All enhancement metrics outperform previous weakly supervised methods and achieve results comparable to the state-of-the-art method that requires clean signals.
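The two-stage recipe can be summarized by the schematic training skeleton below; the data format (noisy mixture, SED probabilities, enhancement target) and the model.update(...) interface are hypothetical placeholders, not the paper's code.

def train_category_adapted(model, all_class_data, target_class_data,
                           pretrain_epochs=50, finetune_epochs=10):
    # Stage 1: familiarize the model with all common sound classes.
    for _ in range(pretrain_epochs):
        for noisy_mix, sed_probs, target in all_class_data:
            model.update(noisy_mix, sed_probs, target)
    # Stage 2: category-specific fine-tuning on the selected class only.
    for _ in range(finetune_epochs):
        for noisy_mix, sed_probs, target in target_class_data:
            model.update(noisy_mix, sed_probs, target)
    return model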
{"title":"Category-Adapted Sound Event Enhancement with Weakly Labeled Data","authors":"Guangwei Li, Xuenan Xu, Heinrich Dinkel, Mengyue Wu, K. Yu","doi":"10.1109/ICASSP43922.2022.9747722","DOIUrl":"https://doi.org/10.1109/ICASSP43922.2022.9747722","url":null,"abstract":"Previous audio enhancement training usually requires clean signals with additive noises; hence commonly focuses on speech enhancement, where clean speech is easy to access. This paper goes beyond a broader sound event enhancement by using a weakly supervised approach via sound event detection (SED) to approximate the location and presence of a specific sound event. We propose a category-adapted system to enable enhancement on any selected sound category, where we first familiarize the model to all common sound classes and followed by a category-specific fine-tune procedure to enhance the targeted sound class. Evaluation is conducted on ten common sound classes, with a comparison to traditional and weakly supervised enhancement methods. Results indicate an average 2.86 dB SDR increase, with more significant improvement on speech (9.15 dB), music (5.01 dB), and typewriter (3.68 dB) under SNR of 0 dB. All enhancement metrics outperform previous weakly supervised methods and achieve comparable results to the state-of-the-art method that requires clean signals.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126313138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Improving BCI-based Color Vision Assessment Using Gaussian Process Regression
Hadi Habibzadeh, Kevin J. Long, Allyson Atkins, Daphney-Stavroula Zois, James J. S. Norton
We present metamer identification plus (metaID+), an algorithm that enhances the performance of brain-computer interface (BCI)-based color vision assessment. BCI-based color vision assessment uses steady-state visual evoked potentials (SSVEPs), elicited during a grid search over colors, to identify metamers: light sources with different spectral distributions that appear to be the same color. Present BCI-based color vision assessment methods are slow; they require extensive data collection for each color in the grid search to reduce measurement noise. metaID+ suppresses measurement noise using Gaussian process regression (i.e., a covariance function is used to replace each measurement with a weighted sum of all of the measurements). Thus, metaID+ reduces the amount of data required for each measurement. We evaluated metaID+ using data collected from ten participants and compared the sum-of-squared errors (SSE, relative to each participant's average grid) between our algorithm and metaID (an existing algorithm). metaID+ significantly reduced the SSE. In addition, metaID+ achieved metaID's minimum SSE while using 61.3% less data. By using less data to achieve the same level of error, metaID+ improves the performance of BCI-based color vision assessment.
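A minimal sketch of the denoising step with scikit-learn's Gaussian process regression is shown below; the 8x8 color grid, the synthetic SSVEP responses, and the kernel hyperparameters are illustrative assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# Hypothetical 8x8 grid of color coordinates with noisy SSVEP amplitudes.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 8),
                            np.linspace(0, 1, 8)), axis=-1).reshape(-1, 2)
ssvep = np.exp(-10 * ((grid - 0.5) ** 2).sum(axis=1)) + 0.1 * rng.standard_normal(64)

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.2) + WhiteKernel(noise_level=0.01))
gpr.fit(grid, ssvep)
smoothed = gpr.predict(grid)  # each point becomes a covariance-weighted
print(smoothed.shape)         # combination of all measurements: (64,)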
{"title":"Improving BCI-based Color Vision Assessment Using Gaussian Process Regression","authors":"Hadi Habibzadeh, Kevin J. Long, Allyson Atkins, Daphney-Stavroula Zois, James J. S. Norton","doi":"10.1109/icassp43922.2022.9747015","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9747015","url":null,"abstract":"We present metamer identification plus (metaID+), an algorithm that enhances the performance of brain-computer interface (BCI)-based color vision assessment. BCI-based color vision assessment uses steady-state visual evoked potentials (SSVEPs) elicited during a grid search of colors to identify metamers—light sources with different spectral distributions that appear to be the same color. Present BCI-based color vision assessment methods are slow; they require extensive data collection for each color in the grid search to reduce measurement noise. metaID+ suppresses measurement noise using Gaussian process regression (i.e., a covariance function is used to replace each measurement with the weighted sum of all of the measurements). Thus, metaID+ reduces the amount of data required for each measurement. We evaluated metaID+ using data collected from ten participants and compared the sum-of-squared errors (SSE; relative to the average grid of each participant) between our algorithm and metaID (an existing algorithm). metaID+ significantly reduced the SSE. In addition, metaID+ achieved metaID’s minimum SSE while using 61.3% less data. By using less data to achieve the same level of error, metaID+ improves the performance of BCI-based color vision assessment.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126327092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
A Remedy For Distributional Shifts Through Expected Domain Translation
Jean-Christophe Gagnon-Audet, Soroosh Shahtalebi, Frank Rudzicz, I. Rish
Machine learning models often fail to generalize to unseen domains due to distributional shifts. One family of such shifts, "correlation shifts," is caused by spurious correlations in the data; it is studied under the overarching topic of "domain generalization." In this work, we employ multi-modal translation networks to tackle the correlation shifts that appear when data is sampled out of distribution. Learning a generative model from the training domains enables us to translate each training sample into the other possible domains, with their particular characteristics. We show that by training a predictor solely on the generated samples, the spurious correlations in the training domains average out, and the invariant features corresponding to true correlations emerge. Our proposed technique, Expected Domain Translation (EDT), is benchmarked on the Colored MNIST dataset and drastically improves the state-of-the-art classification accuracy by 38% with train-domain validation model selection.
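At a high level, a training step on translated samples can be sketched as follows, assuming a pretrained multi-domain translator translator(x, domain) -> x_translated; all names and the uniform averaging over domains are hypothetical.

def edt_training_step(predictor, translator, batch_x, batch_y, domains, loss_fn):
    """Translate each sample into every training domain, then fit the
    predictor only on the generated samples so that spurious,
    domain-specific correlations average out."""
    total = 0.0
    for d in domains:
        x_d = translator(batch_x, d)  # re-render the batch in domain d
        total = total + loss_fn(predictor(x_d), batch_y)
    return total / len(domains)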
{"title":"A Remedy For Distributional Shifts Through Expected Domain Translation","authors":"Jean-Christophe Gagnon-Audet, Soroosh Shahtalebi, Frank Rudzicz, I. Rish","doi":"10.1109/icassp43922.2022.9746434","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9746434","url":null,"abstract":"Machine learning models often fail to generalize to unseen domains due to the distributional shifts. A family of such shifts, “correlation shifts,” is caused by spurious correlations in the data. It is studied under the overarching topic of “domain generalization.” In this work, we employ multi-modal translation networks to tackle the correlation shifts that appear when data is sampled out-of-distribution. Learning a generative model from training domains enables us to translate each training sample under the special characteristics of other possible domains. We show that by training a predictor solely on the generated samples, the spurious correlations in training domains average out, and the invariant features corresponding to true correlations emerge. Our proposed technique, Expected Domain Translation (EDT), is benchmarked on the Colored MNIST dataset and drastically improves the state-of-the-art classification accuracy by 38% with train-domain validation model selection.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125631903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0