
Latest publications: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Agent-Environment Network for Temporal Action Proposal Generation
Viet-Khoa Vo-Ho, Ngan T. H. Le, Kashu Yamazaki, A. Sugimoto, Minh-Triet Tran
Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most existing approaches are unable to follow the human cognitive process of understanding video context because they lack an attention mechanism to express the concept of an action, the agent who performs it, or the interaction between the agent and the environment. Based on the definition of an action, in which a human, known as an agent, interacts with the environment and performs an action that affects it, we propose a contextual Agent-Environment Network (AEN). Our proposed contextual AEN involves (i) an agent pathway, operating at a local level to identify which humans/agents are acting, and (ii) an environment pathway, operating at a global level to describe how the agents interact with the environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e., C3D and SlowFast, show that our method robustly outperforms state-of-the-art methods regardless of the backbone network employed.
{"title":"Agent-Environment Network for Temporal Action Proposal Generation","authors":"Viet-Khoa Vo-Ho, Ngan T. H. Le, Kashu Yamazaki, A. Sugimoto, Minh-Triet Tran","doi":"10.1109/ICASSP39728.2021.9415101","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9415101","url":null,"abstract":"Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most of existing approaches are unable to follow the human cognitive process of understanding the video context due to lack of attention mechanism to express the concept of an action or an agent who performs the action or the interaction between the agent and the environment. Based on the action definition that a human, known as an agent, interacts with the environment and performs an action that affects the environment, we propose a contextual Agent-Environment Network. Our proposed contextual AEN involves (i) agent pathway, operating at a local level to tell about which humans/agents are acting and (ii) environment pathway operating at a global level to tell about how the agents interact with the environment. Comprehensive evaluations on 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e C3D and SlowFast, show that our method robustly exhibits outperformance against state-of-the-art methods regardless of the employed backbone network.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114101872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
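The two-pathway design lends itself to a compact illustration. Below is a minimal PyTorch sketch of the local-agent/global-environment idea only, not the paper's actual architecture: the module names, feature shapes, and the GRU/attention choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AgentEnvironmentBlock(nn.Module):
    """Fuses local agent features with global environment features.

    Assumed shapes (hypothetical): agent features (B, N_agents, D) pooled
    from human boxes per snippet; environment features (B, T, D) from a
    video backbone such as C3D or SlowFast.
    """
    def __init__(self, dim=512):
        super().__init__()
        self.agent_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.env_encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(3 * dim, dim)
        self.score = nn.Linear(dim, 1)  # per-snippet actionness score

    def forward(self, agent_feats, env_feats):
        # Agent pathway (local): attend over detected agents per snippet.
        a, _ = self.agent_attn(agent_feats, agent_feats, agent_feats)
        a = a.mean(dim=1, keepdim=True)                     # (B, 1, D)
        # Environment pathway (global): temporal context over the video.
        e, _ = self.env_encoder(env_feats)                  # (B, T, 2D)
        # Fuse the agent summary with every temporal position.
        fused = torch.cat([e, a.expand(-1, e.size(1), -1)], dim=-1)
        return torch.sigmoid(self.score(torch.relu(self.fuse(fused)))).squeeze(-1)

model = AgentEnvironmentBlock()
scores = model(torch.randn(2, 4, 512), torch.randn(2, 100, 512))
print(scores.shape)  # (2, 100): actionness per temporal snippet
```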
An Improved Mean Teacher Based Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection
Xu Zheng, Yan Song, I. Mcloughlin, Lin Liu, Lirong Dai
This paper presents an improved mean teacher (MT) based method for large-scale weakly labeled semi-supervised sound event detection (SED), focusing on learning a better student model. Two main improvements are proposed over the authors' previous perturbation-based MT method. Firstly, an event-aware module is designed to allow multiple branches with different kernel sizes to be fused via an attention mechanism. By inserting this module after the convolutional layer, each neuron can adaptively adjust its receptive field to suit different sound events. Secondly, instead of using the teacher model to provide a consistency cost term, we propose using stochastic inference on unlabeled examples to generate high-quality pseudo-targets by averaging multiple predictions from the perturbed student model. MixUp of both labeled and unlabeled data is further exploited to improve the effectiveness of the student model. Finally, the teacher model is obtained via an exponential moving average (EMA) of the student model and generates the final SED predictions during inference. Experiments on the DCASE2018 Task 4 dataset demonstrate the effectiveness of the proposed method. Specifically, an F1-score of 42.1% is achieved, significantly outperforming the 32.4% of the winning system and the 39.3% of the previous perturbation-based method.
{"title":"An Improved Mean Teacher Based Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection","authors":"Xu Zheng, Yan Song, I. Mcloughlin, Lin Liu, Lirong Dai","doi":"10.1109/ICASSP39728.2021.9414931","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414931","url":null,"abstract":"This paper presents an improved mean teacher (MT) based method for large-scale weakly labeled semi-supervised sound event detection (SED), by focusing on learning a better student model. Two main improvements are proposed based on the authors’ previous perturbation based MT method. Firstly, an event-aware module is de-signed to allow multiple branches with different kernel sizes to be fused via an attention mechanism. By inserting this module after the convolutional layer, each neuron can adaptively adjust its receptive field to suit different sound events. Secondly, instead of using the teacher model to provide a consistency cost term, we propose using a stochastic inference of unlabeled examples to generate high quality pseudo-targets by averaging multiple predictions from the perturbed student model. MixUp of both labeled and unlabeled data is further exploited to improve the effectiveness of student model. Finally, the teacher model can be obtained via exponential moving average (EMA) of the student model, which generates final predictions for SED during inference. Experiments on the DCASE2018 task4 dataset demonstrate the ability of the proposed method. Specifically, an F1-score of 42.1% is achieved, significantly outperforming the 32.4% achieved by the winning system, or the 39.3% by the previous perturbation based method.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"55 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114114106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
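The EMA teacher and the averaged stochastic pseudo-targets are easy to show concretely. The following is a minimal PyTorch sketch of those two pieces under assumed shapes; the toy classifier and all hyperparameters are placeholders, not the paper's model.

```python
import copy
import torch
import torch.nn as nn

def ema_update(teacher: nn.Module, student: nn.Module, alpha: float = 0.999):
    """teacher = alpha * teacher + (1 - alpha) * student, per parameter."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1 - alpha)

def pseudo_targets(student: nn.Module, x: torch.Tensor, n_passes: int = 4):
    """Average predictions of the perturbed student (here the perturbation is
    dropout kept active) over several stochastic passes to form targets."""
    student.train()  # keep dropout/perturbations on
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(student(x)) for _ in range(n_passes)])
    return preds.mean(dim=0)

# Toy example with a hypothetical clip-level event classifier.
student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                        nn.Dropout(0.5), nn.Linear(128, 10))
teacher = copy.deepcopy(student)
x_unlabeled = torch.randn(8, 64)
targets = pseudo_targets(student, x_unlabeled)   # (8, 10) soft event targets
ema_update(teacher, student)                     # teacher used at inference
```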
Improved Atomic Norm Based Channel Estimation for Time-Varying Narrowband Leaked Channels
Jianxiu Li, U. Mitra
In this paper, improved channel gain and delay estimation strategies are investigated for practical pulse shapes with finite block length and transmission bandwidth. Pilot-aided channel estimation with an improved atomic norm based approach is proposed to promote the low-rank structure of the channel. All the channel parameters, i.e., delays, Doppler shifts, and channel gains, are recovered. Design choices that ensure unique estimates of the channel parameters for root-raised-cosine pulse shapes are examined. Furthermore, a perturbation analysis is conducted. Finally, numerical results verify the theoretical analysis and show performance improvements over the previously proposed method.
{"title":"Improved Atomic Norm Based Channel Estimation for Time-Varying Narrowband Leaked Channels","authors":"Jianxiu Li, U. Mitra","doi":"10.1109/ICASSP39728.2021.9413804","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413804","url":null,"abstract":"In this paper, improved channel gain delay estimation strategies are investigated when practical pulse shapes with finite block length and transmission bandwidth are employed. Pilot-aided channel estimation with an improved atomic norm based approach is proposed to promote the low rank structure of the channel. All the channel parameters, i.e., delays, Doppler shifts and channel gains are recovered. Design choices which ensure unique estimates of channel parameters for root-raised-cosine pulse shapes are examined. Furthermore, a perturbation analysis is conducted. Finally, numerical results verify the theoretical analysis and show performance improvements over the previously proposed method.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114331335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
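The paper's estimator solves a gridless atomic-norm convex program, which needs a semidefinite solver; as a deliberately simplified stand-in, the NumPy sketch below recovers delays, Doppler shifts, and gains of a toy two-path channel by an on-grid matched-filter search with successive peeling. The signal model, grid sizes, and all names are assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 64, 32                                    # pilot tones x time snapshots
paths = [(0.20, 0.10, 1.0), (0.63, 0.35, 0.6)]   # (delay, Doppler, gain)

n, m = np.arange(N), np.arange(M)

def atom(tau, nu):
    """Rank-one delay-Doppler atom with unit Frobenius norm."""
    return np.outer(np.exp(-2j*np.pi*n*tau), np.exp(2j*np.pi*m*nu)) / np.sqrt(N*M)

Y = sum(g * atom(tau, nu) for tau, nu, g in paths)
Y = Y + 0.01 * (rng.standard_normal((N, M)) + 1j*rng.standard_normal((N, M)))

# On-grid search: correlate Y with every candidate atom, pick strongest peaks.
taus = np.linspace(0, 1, 256, endpoint=False)
nus = np.linspace(0, 1, 128, endpoint=False)
corr = np.array([[abs(np.vdot(atom(t, v), Y)) for v in nus] for t in taus])
for _ in range(len(paths)):
    i, j = np.unravel_index(corr.argmax(), corr.shape)
    tau_hat, nu_hat = taus[i], nus[j]
    g_hat = np.vdot(atom(tau_hat, nu_hat), Y)    # LS gain for a unit-norm atom
    print(f"delay={tau_hat:.3f} doppler={nu_hat:.3f} |gain|={abs(g_hat):.2f}")
    Y -= g_hat * atom(tau_hat, nu_hat)           # peel this path off, continue
    corr = np.array([[abs(np.vdot(atom(t, v), Y)) for v in nus] for t in taus])
```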
Fast Inverse Mapping of Face GANs
N. Bayat, Vahid Reza Khazaie, Y. Mohsenzadeh
Generative adversarial networks (GANs) synthesize realistic images from random latent vectors. While many studies have explored various training configurations and architectures for GANs, the problem of inverting the generator of a GAN has been inadequately investigated. We train a ResNet architecture to map given faces to latent vectors that can be used to generate faces nearly identical to the targets. We use a perceptual loss to embed face details in the recovered latent vector while maintaining visual quality using a pixel loss. The vast majority of studies on latent vector recovery are very slow and perform well only on generated images. We argue that our method can be used to determine a fast mapping between real human faces and latent-space vectors that contain most of the important face style details. Finally, we demonstrate the performance of our approach on both real and generated faces.
{"title":"Fast Inverse Mapping of Face GANs","authors":"N. Bayat, Vahid Reza Khazaie, Y. Mohsenzadeh","doi":"10.1109/ICASSP39728.2021.9413532","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413532","url":null,"abstract":"Generative adversarial networks (GANs) synthesize realistic images from random latent vectors. While many studies have explored various training configurations and architectures for GANs, the problem of inverting the generator of GANs has been inadequately investigated. We train a ResNet architecture to map given faces to latent vectors that can be used to generate faces nearly identical to the target. We use a perceptual loss to embed face details in the recovered latent vector while maintaining visual quality using a pixel loss. The vast majority of studies on latent vector recovery are very slow and perform well only on generated images. We argue that our method can be used to determine a fast mapping between real human faces and latent-space vectors that contain most of the important face style details. At last, we demonstrate the performance of our approach on both real and generated faces.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114374961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
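The encoder-training objective, a pixel loss plus a perceptual loss against a frozen generator, can be sketched compactly. Below is a minimal PyTorch sketch under assumptions: the toy generator, the VGG feature cut, and the loss weights are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn
import torchvision.models as models

encoder = models.resnet18(num_classes=512)           # image -> latent z (512-d)
generator = nn.Sequential(                           # stand-in for a face GAN
    nn.Linear(512, 3 * 64 * 64), nn.Tanh(), nn.Unflatten(1, (3, 64, 64)))
for p in generator.parameters():
    p.requires_grad_(False)                          # the generator stays frozen

vgg = models.vgg16(weights=None).features[:9].eval() # perceptual feature net
for p in vgg.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
x = torch.rand(4, 3, 64, 64)                         # batch of target faces

for step in range(2):                                # toy training loop
    z = encoder(x)
    x_rec = generator(z)
    pixel_loss = nn.functional.l1_loss(x_rec, x)
    perc_loss = nn.functional.mse_loss(vgg(x_rec), vgg(x))
    loss = pixel_loss + 0.1 * perc_loss              # assumed weighting
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"step {step}: loss={loss.item():.4f}")
```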
What And Where To Focus In Person Search
Tong Zhou, Kun Tian
Person search aims to locate and identify a query person in a gallery of original scene images. Almost all previous methods consider only a single level of high-level semantic information, ignoring that the essence of the identification task is to learn rich and expressive features. Additionally, large pose variations and occlusions of the target person significantly increase the difficulty of the search task. Motivated by these two findings, we first propose a multilevel semantic aggregation algorithm for more discriminative feature descriptors. Then, a pose-assisted attention module is designed to highlight fine-grained areas of the target and simultaneously capture valuable clues for identification. Extensive experiments confirm that our framework can coordinate the multilevel semantics of persons and effectively alleviate the adverse effects of occlusion and pose variation. We also achieve state-of-the-art performance on two challenging datasets, CUHK-SYSU and PRW.
{"title":"What And Where To Focus In Person Search","authors":"Tong Zhou, Kun Tian","doi":"10.1109/ICASSP39728.2021.9414439","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414439","url":null,"abstract":"Person search aims to locate and identify the query person from a gallery of original scene images. Almost all previous methods only consider single high-level semantic information, ignoring that the essence of identification task is to learn rich and expressive features. Additionally, large pose variations and occlusions of the target person significantly increase the difficulty of search task. For these two findings, we first propose multilevel semantic aggregation algorithm for more discriminative feature descriptors. Then, a pose-assisted attention module is designed to highlight fine-grained area of the target and simultaneously capture valuable clues for identification. Extensive experiments confirm that our framework can coordinate multilevel semantics of persons and effectively alleviate the adverse effects of occlusion and various pose. We also achieve state-of-the-art performance on two challenging datasets CUHK-SYSU and PRW.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114533795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
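The two ideas in the abstract, multilevel semantic aggregation and pose-assisted attention, can be illustrated in a few lines. The PyTorch sketch below is one assumed realization; the feature shapes, the number of keypoints, and how the pose heatmaps are produced are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseAssistedAttention(nn.Module):
    def __init__(self, channels=256, num_keypoints=17, levels=3):
        super().__init__()
        # 1x1 convs project each backbone level to a common width.
        self.proj = nn.ModuleList(nn.Conv2d(channels * 2**i, channels, 1)
                                  for i in range(levels))
        self.attn = nn.Conv2d(num_keypoints, 1, kernel_size=3, padding=1)

    def forward(self, feats, pose_heatmaps):
        h, w = feats[0].shape[-2:]
        # Multilevel semantic aggregation: resize and sum all levels.
        agg = sum(F.interpolate(p(f), size=(h, w), mode="bilinear",
                                align_corners=False)
                  for p, f in zip(self.proj, feats))
        # Pose-assisted attention: keypoint heatmaps -> spatial gate.
        gate = torch.sigmoid(self.attn(
            F.interpolate(pose_heatmaps, size=(h, w), mode="bilinear",
                          align_corners=False)))
        return agg * gate        # highlight fine-grained person regions

feats = [torch.randn(2, 256, 32, 16), torch.randn(2, 512, 16, 8),
         torch.randn(2, 1024, 8, 4)]
pose = torch.rand(2, 17, 64, 32)
out = PoseAssistedAttention()(feats, pose)
print(out.shape)    # torch.Size([2, 256, 32, 16])
```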
A New Framework Based on Transfer Learning for Cross-Database Pneumonia Detection
Xinxin Shan, Y. Wen
Cross-database classification means that a model must cope with severely mismatched data distributions: it is trained on one database and tested on another. Thus, cross-database pneumonia detection is a challenging task. In this paper, we propose a new framework based on transfer learning for cross-database pneumonia detection. First, based on transfer learning, we fine-tune a backbone pre-trained on non-medical data using a small amount of pneumonia images, which improves detection performance on a homogeneous dataset. Then, to make the fine-tuned model applicable to cross-database classification, an adaptation layer combined with a self-learning strategy is proposed to retrain the model. The adaptation layer brings the heterogeneous data distributions closer together, and the self-learning strategy tweaks the model by generating pseudo-labels. Experiments on three pneumonia databases show that our proposed model accomplishes cross-database detection of pneumonia with good performance.
{"title":"A New Framework Based on Transfer Learning for Cross-Database Pneumonia Detection","authors":"Xinxin Shan, Y. Wen","doi":"10.1109/ICASSP39728.2021.9414997","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414997","url":null,"abstract":"Cross-database classification means that the model is able to apply to the serious disequilibrium of data distributions, and it is trained by one database while tested by another database. Thus, cross-database pneumonia detection is a challenging task. In this paper, we proposed a new framework based on transfer learning for cross-database pneumonia detection. First, based on transfer learning, we fine-tune a backbone that pre-trained on non-medical data by using a small amount of pneumonia images, which improves the detection performance on homogeneous dataset. Then in order to make the fine-tuned model applicable to cross-database classification, the adaptation layer combined with a self-learning strategy is proposed to retrain the model. The adaptation layer is to make the heterogeneous data distributions approximate and the self-learning strategy helps to tweak the model by generating pseudo-labels. Experiments on three pneumonia databases show that our proposed model completes the cross-database detection of pneumonia and shows good performance.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121486059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
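The self-learning step with pseudo-labels can be sketched directly. Below is a minimal PyTorch sketch of one plausible pseudo-labeling loop; the backbone, the confidence threshold, and the omission of the adaptation layer are simplifying assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=None, num_classes=2)   # pneumonia vs. normal
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
ce = nn.CrossEntropyLoss()

target_images = torch.rand(16, 3, 224, 224)            # unlabeled target DB

for round_ in range(2):                                # self-learning rounds
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(target_images), dim=1)
    conf, pseudo = probs.max(dim=1)
    keep = conf > 0.6              # illustrative threshold: trust confident ones
    if keep.sum() == 0:
        break
    model.train()                  # retrain on the pseudo-labeled subset
    loss = ce(model(target_images[keep]), pseudo[keep])
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"round {round_}: {int(keep.sum())} pseudo-labels, loss={loss.item():.3f}")
```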
Multi-Scale and Multi-Region Facial Discriminative Representation for Automatic Depression Level Prediction
Mingyue Niu, J. Tao, B. Liu
Physiological studies have shown that differences in facial activity between depressed patients and normal individuals are manifested in different local facial regions, and that the durations of these activities are not the same. However, most previous works extract features from the entire facial region at a fixed time scale to predict the individual depression level and are therefore inadequate in capturing dynamic facial changes. For these reasons, we propose a multi-scale and multi-region facial dynamic representation method to improve prediction performance. In particular, we first use multiple time scales to divide the original long-term video into segments containing different facial regions. Secondly, segment-level features are extracted by a 3D convolutional neural network to characterize facial activities of different durations in different facial regions. Thirdly, we adopt eigen evolution pooling and gradient boosting decision trees to aggregate these segment-level features and select discriminative elements to generate the video-level feature. Finally, the depression level is predicted using support vector regression. Experiments are conducted on AVEC2013 and AVEC2014. The results demonstrate that our method achieves better performance than previous works.
{"title":"Multi-Scale and Multi-Region Facial Discriminative Representation for Automatic Depression Level Prediction","authors":"Mingyue Niu, J. Tao, B. Liu","doi":"10.1109/ICASSP39728.2021.9413504","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413504","url":null,"abstract":"Physiological studies have shown that differences in facial activities between depressed patients and normal individuals are manifested in different local facial regions and the durations of these activities are not the same. But most previous works extract features from the entire facial region at a fixed time scale to predict the individual depression level. Thus, they are inadequate in capturing dynamic facial changes. For these reasons, we propose a multi-scale and multi-region fa-cial dynamic representation method to improve the prediction performance. In particular, we firstly use multiple time scales to divide the original long-term video into segments containing different facial regions. Secondly, the segment-level feature is extracted by 3D convolution neural network to characterize the facial activities with different durations in different facial regions. Thirdly, this paper adopts eigen evolution pooling and gradient boosting decision tree to aggregate these segment-level features and select discriminative elements to generate the video-level feature. Finally, the depression level is predicted using support vector regression. Experiments are conducted on AVEC2013 and AVEC2014. The results demonstrate that our method achieves better performance than the previous works.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121496866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
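Eigen evolution pooling admits a short worked example. The NumPy sketch below summarizes a sequence of segment-level features by the leading right singular vectors of the centered feature matrix, which capture how the features evolve over time; treating top-k SVD as the exact pooling formulation is an assumption.

```python
import numpy as np

def eigen_evolution_pool(X: np.ndarray, k: int = 1) -> np.ndarray:
    """X: (T, D) matrix of T segment features; returns a (k*D,) pooled vector."""
    X = X - X.mean(axis=0, keepdims=True)        # center over time
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[:k]                                   # top-k right singular vectors
    # Fix the sign so the representation is deterministic.
    v *= np.sign(v[np.arange(k), np.abs(v).argmax(axis=1)])[:, None]
    return v.reshape(-1)

segments = np.random.default_rng(0).normal(size=(20, 128))  # 20 segments, 128-d
video_feature = eigen_evolution_pool(segments, k=2)
print(video_feature.shape)   # (256,) video-level descriptor fed to GBDT/SVR
```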
Teacher-Student Learning for Low-Latency Online Speech Enhancement Using Wave-U-Net
Sotaro Nakaoka, Li Li, S. Inoue, S. Makino
In this paper, we propose a low-latency online extension of Wave-U-Net for single-channel speech enhancement, which utilizes teacher-student learning to reduce system latency while keeping enhancement performance high. Wave-U-Net is a recently proposed end-to-end source separation method that has achieved remarkable performance in singing voice separation and speech enhancement tasks. Since the enhancement is performed in the time domain, Wave-U-Net can efficiently model phase information and avoid the domain transformation normally required by time-frequency approaches. In this paper, we apply Wave-U-Net to face-to-face applications such as hearing aids and in-car communication systems, where a strict latency of less than 10 ms is required. To this end, we investigate online versions of Wave-U-Net and propose the use of teacher-student learning to prevent the performance degradation caused by the reduction in input segment length, such that the system delay on a CPU is less than 10 ms. The experimental results revealed that the proposed model can run in real time with low latency and high performance, achieving a signal-to-distortion ratio improvement of about 8.73 dB.
{"title":"Teacher-Student Learning for Low-Latency Online Speech Enhancement Using Wave-U-Net","authors":"Sotaro Nakaoka, Li Li, S. Inoue, S. Makino","doi":"10.1109/ICASSP39728.2021.9414280","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414280","url":null,"abstract":"In this paper, we propose a low-latency online extension of wave-U-net for single-channel speech enhancement, which utilizes teacher-student learning to reduce the system latency while keeping the enhancement performance high. Wave-U-net is a recently proposed end-to-end source separation method, which achieved remarkable performance in singing voice separation and speech enhancement tasks. Since the enhancement is performed in the time domain, wave-U-net can efficiently model phase information and address the domain transformation limitation, where the time-frequency domain is normally adopted. In this paper, we apply wave-U-net to face-to-face applications such as hearing aids and in-car communication systems, where a strictly low-latency of less than 10 ms is required. To this end, we investigate online versions of wave-U-net and propose the use of teacher-student learning to prevent the performance degradation caused by the reduction in input segment length such that the system delay in a CPU is less than 10 ms. The experimental results revealed that the proposed model could perform in real-time with low-latency and high performance, achieving a signal-to-distortion ratio improvement of about 8.73 dB.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121570125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
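The distillation setup, a short-segment student matching a long-context teacher, is easy to sketch. In the PyTorch sketch below, tiny 1-D conv nets stand in for Wave-U-Net, and the segment lengths and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

def tiny_enhancer():
    return nn.Sequential(nn.Conv1d(1, 16, 15, padding=7), nn.ReLU(),
                         nn.Conv1d(16, 1, 15, padding=7))

teacher, student = tiny_enhancer(), tiny_enhancer()
for p in teacher.parameters():
    p.requires_grad_(False)                 # teacher is pre-trained and frozen

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
long_seg = torch.randn(8, 1, 4096)          # teacher sees the full context
clean = torch.randn(8, 1, 4096)             # stand-in clean reference
short = long_seg[..., -512:]                # student: last 512 samples only

with torch.no_grad():
    t_out = teacher(long_seg)[..., -512:]   # teacher output on the same region

s_out = student(short)
loss = nn.functional.l1_loss(s_out, clean[..., -512:]) \
     + nn.functional.l1_loss(s_out, t_out)  # supervision + distillation terms
opt.zero_grad(); loss.backward(); opt.step()
print(f"distillation step done, loss={loss.item():.4f}")
```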
Coughwatch: Real-World Cough Detection using Smartwatches
D. Liaqat, S. Liaqat, Jun Lin Chen, Tina Sedaghat, Moshe Gabel, Frank Rudzicz, E. D. Lara
Continuous monitoring of cough may provide insights into the health of individuals as well as the effectiveness of treatments. Smartwatches, in particular, are highly promising for such monitoring: they are inexpensive, unobtrusive, programmable, and have a variety of sensors. However, current mobile cough detection systems are not designed for smartwatches and perform poorly when applied to real-world smartwatch data, since they are often evaluated on data collected in the lab. In this work we propose CoughWatch, a lightweight cough detector for smartwatches that uses audio and movement data for in-the-wild cough detection. On our in-the-wild data, CoughWatch achieves a precision of 82% and recall of 55%, compared to 6% precision and 19% recall achieved by the current state-of-the-art approach. Furthermore, by incorporating gyroscope and accelerometer data, CoughWatch improves precision by up to 15.5 percentage points compared to an audio-only model.
{"title":"Coughwatch: Real-World Cough Detection using Smartwatches","authors":"D. Liaqat, S. Liaqat, Jun Lin Chen, Tina Sedaghat, Moshe Gabel, Frank Rudzicz, E. D. Lara","doi":"10.1109/ICASSP39728.2021.9414881","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414881","url":null,"abstract":"Continuous monitoring of cough may provide insights into the health of individuals as well as the effectiveness of treatments. Smart-watches, in particular, are highly promising for such monitoring: they are inexpensive, unobtrusive, programmable, and have a variety of sensors. However, current mobile cough detection systems are not designed for smartwatches, and perform poorly when applied to real-world smartwatch data since they are often evaluated on data collected in the lab.In this work we propose CoughWatch, a lightweight cough detector for smartwatches that uses audio and movement data for in-the-wild cough detection. On our in-the-wild data, CoughWatch achieves a precision of 82% and recall of 55%, compared to 6% precision and 19% recall achieved by the current state-of-the-art approach. Furthermore, by incorporating gyroscope and accelerometer data, CoughWatch improves precision by up to 15.5 percentage points compared to an audio-only model.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114711564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
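Fusing audio with accelerometer/gyroscope streams can be sketched as a two-branch classifier. The PyTorch sketch below is one plausible layout; the window sizes, feature dimensions, and architecture are assumptions, not the paper's pipeline.

```python
import torch
import torch.nn as nn

class CoughDetector(nn.Module):
    def __init__(self, n_mels=40, imu_dim=6):   # 6 = 3-axis accel + 3-axis gyro
        super().__init__()
        self.audio_net = nn.Sequential(         # encodes a log-mel window
            nn.Conv1d(n_mels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.imu_net = nn.Sequential(           # encodes the raw IMU window
            nn.Conv1d(imu_dim, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.head = nn.Linear(32 + 16, 1)       # cough / not-cough logit

    def forward(self, log_mel, imu):
        return self.head(torch.cat([self.audio_net(log_mel),
                                    self.imu_net(imu)], dim=1))

det = CoughDetector()
log_mel = torch.randn(4, 40, 100)   # (batch, mel bins, frames), ~1 s window
imu = torch.randn(4, 6, 50)         # (batch, axes, samples), same window
print(torch.sigmoid(det(log_mel, imu)).shape)   # (4, 1) cough probabilities
```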
History Utterance Embedding Transformer LM for Speech Recognition
Keqi Deng, Gaofeng Cheng, Haoran Miao, Pengyuan Zhang, Yonghong Yan
History utterances contain rich contextual information; however, effectively extracting information from history utterances and using it to improve the language model (LM) remains challenging. In this paper, we propose the history utterance embedding Transformer LM (HTLM), which includes an embedding generation network for extracting the contextual information contained in history utterances and a main Transformer LM for the current prediction. In addition, a two-stage attention (TSA) mechanism is proposed to encode richer contextual information into the embedding of history utterances (h-emb) while supporting GPU-parallel training. Furthermore, we combine the extracted h-emb with the embedding of the current utterance (c-emb) through dot-product attention and a fusion method for HTLM's current prediction. Experiments conducted on the HKUST dataset achieve a 23.4% character error rate (CER) on the test set. Compared with the baseline, the proposed method yields a 12.86 absolute perplexity reduction and a 0.8% absolute CER reduction.
{"title":"History Utterance Embedding Transformer LM for Speech Recognition","authors":"Keqi Deng, Gaofeng Cheng, Haoran Miao, Pengyuan Zhang, Yonghong Yan","doi":"10.1109/ICASSP39728.2021.9414575","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414575","url":null,"abstract":"History utterances contain rich contextual information; however, better extracting information from the history utterances and using it to improve the language model (LM) is still challenging. In this paper, we propose the history utterance embedding Transformer LM (HTLM), which includes an embedding generation network for extracting contextual information contained in the history utterances and a main Transformer LM for current prediction. In addition, the two-stage attention (TSA) is proposed to encode richer contextual information into the embedding of history utterances (h-emb) while supporting GPU parallel training. Furthermore, we combine the extracted h-emb and embedding of current utterance (c-emb) through the dot-product attention and a fusion method for HTLM's current prediction. Experiments are conducted on the HKUST dataset and achieve a 23.4% character error rate (CER) on the test set. Compared with the baseline, the proposed method yields 12.86 absolute perplexity reduction and 0.8% absolute CER reduction.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114763095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
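The dot-product-attention fusion of h-emb and c-emb can be shown in a few lines. The PyTorch sketch below is an assumed realization with a gated fusion; the dimensions and the gating choice are placeholders, since the paper's exact fusion method is not specified here.

```python
import torch
import torch.nn as nn

class HistoryFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, c_emb, h_emb):
        """c_emb: (B, T, D) current tokens; h_emb: (B, H, D) history embeddings."""
        scores = c_emb @ h_emb.transpose(1, 2) / c_emb.size(-1) ** 0.5  # (B, T, H)
        ctx = torch.softmax(scores, dim=-1) @ h_emb                     # (B, T, D)
        gate = torch.sigmoid(self.fuse(torch.cat([c_emb, ctx], dim=-1)))
        return gate * c_emb + (1 - gate) * ctx   # context-enriched token states

fusion = HistoryFusion()
c = torch.randn(2, 20, 256)    # current utterance, 20 tokens
h = torch.randn(2, 5, 256)     # embeddings of 5 history utterances
out = fusion(c, h)             # (2, 20, 256), fed to the LM output layer
print(out.shape)
```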