Agent-Environment Network for Temporal Action Proposal Generation
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9415101
Viet-Khoa Vo-Ho, Ngan T. H. Le, Kashu Yamazaki, A. Sugimoto, Minh-Triet Tran
Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most existing approaches are unable to follow the human cognitive process of understanding the video context, as they lack an attention mechanism to express the concept of an action, the agent who performs it, or the interaction between the agent and the environment. Based on the definition of an action, in which a human (the agent) interacts with the environment and performs an action that affects it, we propose a contextual Agent-Environment Network (AEN). The proposed contextual AEN involves (i) an agent pathway, operating at a local level to identify which humans/agents are acting, and (ii) an environment pathway, operating at a global level to describe how the agents interact with the environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e., C3D and SlowFast, show that our method consistently outperforms state-of-the-art methods regardless of the employed backbone network.
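For illustration, below is a minimal PyTorch sketch of how a two-pathway design with attention-based fusion could be wired; the layer sizes, attention layout, and actionness head are hypothetical placeholders, not the authors' exact AEN architecture.

import torch
import torch.nn as nn

class TwoPathwayFusion(nn.Module):
    """Toy two-pathway block: a local (agent) branch and a global (environment)
    branch whose outputs are fused with attention. Hypothetical sizes only."""
    def __init__(self, agent_dim=256, env_dim=512, hidden=256):
        super().__init__()
        self.agent_proj = nn.Linear(agent_dim, hidden)   # local agent features
        self.env_proj = nn.Linear(env_dim, hidden)       # global environment features
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(hidden, 1)                 # per-frame actionness score

    def forward(self, agent_feats, env_feats):
        # agent_feats: (B, T, agent_dim), env_feats: (B, T, env_dim)
        q = self.env_proj(env_feats)                     # environment queries the agents
        kv = self.agent_proj(agent_feats)
        fused, _ = self.attn(q, kv, kv)                  # agent-environment interaction
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # (B, T) actionness

model = TwoPathwayFusion()
scores = model(torch.randn(2, 100, 256), torch.randn(2, 100, 512))
print(scores.shape)  # torch.Size([2, 100])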
{"title":"Agent-Environment Network for Temporal Action Proposal Generation","authors":"Viet-Khoa Vo-Ho, Ngan T. H. Le, Kashu Yamazaki, A. Sugimoto, Minh-Triet Tran","doi":"10.1109/ICASSP39728.2021.9415101","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9415101","url":null,"abstract":"Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most of existing approaches are unable to follow the human cognitive process of understanding the video context due to lack of attention mechanism to express the concept of an action or an agent who performs the action or the interaction between the agent and the environment. Based on the action definition that a human, known as an agent, interacts with the environment and performs an action that affects the environment, we propose a contextual Agent-Environment Network. Our proposed contextual AEN involves (i) agent pathway, operating at a local level to tell about which humans/agents are acting and (ii) environment pathway operating at a global level to tell about how the agents interact with the environment. Comprehensive evaluations on 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e C3D and SlowFast, show that our method robustly exhibits outperformance against state-of-the-art methods regardless of the employed backbone network.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114101872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Improved Mean Teacher Based Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9414931
Xu Zheng, Yan Song, I. Mcloughlin, Lin Liu, Lirong Dai
This paper presents an improved mean teacher (MT) based method for large-scale weakly labeled semi-supervised sound event detection (SED), focusing on learning a better student model. Two main improvements are proposed over the authors’ previous perturbation-based MT method. First, an event-aware module is designed to allow multiple branches with different kernel sizes to be fused via an attention mechanism. By inserting this module after the convolutional layer, each neuron can adaptively adjust its receptive field to suit different sound events. Second, instead of using the teacher model to provide a consistency cost term, we propose using stochastic inference on unlabeled examples to generate high-quality pseudo-targets by averaging multiple predictions from the perturbed student model. MixUp of both labeled and unlabeled data is further exploited to improve the effectiveness of the student model. Finally, the teacher model is obtained via an exponential moving average (EMA) of the student model and generates the final predictions for SED during inference. Experiments on the DCASE2018 Task 4 dataset demonstrate the effectiveness of the proposed method. Specifically, an F1-score of 42.1% is achieved, significantly outperforming the 32.4% of the winning system and the 39.3% of the previous perturbation-based method.
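As an illustration of the two training ingredients described above, here is a minimal PyTorch sketch of an EMA teacher update and of pseudo-target generation by averaging several perturbed student predictions; the decay rate, number of passes, and sigmoid outputs are assumptions, not the paper's exact settings.

import torch

def ema_update(teacher, student, alpha=0.999):
    # Teacher weights are an exponential moving average of the student weights.
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def pseudo_targets(student, unlabeled_batch, n_passes=4):
    # Average several stochastic (e.g. dropout-perturbed) student predictions
    # to obtain higher-quality targets for the consistency/self-training loss.
    student.train()  # keep dropout active so each forward pass is perturbed
    with torch.no_grad():
        preds = [torch.sigmoid(student(unlabeled_batch)) for _ in range(n_passes)]
    return torch.stack(preds).mean(dim=0)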
{"title":"An Improved Mean Teacher Based Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection","authors":"Xu Zheng, Yan Song, I. Mcloughlin, Lin Liu, Lirong Dai","doi":"10.1109/ICASSP39728.2021.9414931","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414931","url":null,"abstract":"This paper presents an improved mean teacher (MT) based method for large-scale weakly labeled semi-supervised sound event detection (SED), by focusing on learning a better student model. Two main improvements are proposed based on the authors’ previous perturbation based MT method. Firstly, an event-aware module is de-signed to allow multiple branches with different kernel sizes to be fused via an attention mechanism. By inserting this module after the convolutional layer, each neuron can adaptively adjust its receptive field to suit different sound events. Secondly, instead of using the teacher model to provide a consistency cost term, we propose using a stochastic inference of unlabeled examples to generate high quality pseudo-targets by averaging multiple predictions from the perturbed student model. MixUp of both labeled and unlabeled data is further exploited to improve the effectiveness of student model. Finally, the teacher model can be obtained via exponential moving average (EMA) of the student model, which generates final predictions for SED during inference. Experiments on the DCASE2018 task4 dataset demonstrate the ability of the proposed method. Specifically, an F1-score of 42.1% is achieved, significantly outperforming the 32.4% achieved by the winning system, or the 39.3% by the previous perturbation based method.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"55 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114114106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved Atomic Norm Based Channel Estimation for Time-Varying Narrowband Leaked Channels
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9413804
Jianxiu Li, U. Mitra
In this paper, improved channel gain and delay estimation strategies are investigated when practical pulse shapes with finite block length and transmission bandwidth are employed. Pilot-aided channel estimation with an improved atomic-norm-based approach is proposed to promote the low-rank structure of the channel. All of the channel parameters, i.e., delays, Doppler shifts, and channel gains, are recovered. Design choices that ensure unique estimates of the channel parameters for root-raised-cosine pulse shapes are examined. Furthermore, a perturbation analysis is conducted. Finally, numerical results verify the theoretical analysis and show performance improvements over the previously proposed method.
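For reference, the generic atomic-norm denoising formulation (not the paper's improved variant) for a signal x built from atoms a(\theta) in a dictionary \mathcal{A} can be written as

\|x\|_{\mathcal{A}} = \inf\Big\{ \sum_k |c_k| \;:\; x = \sum_k c_k\, a(\theta_k),\ a(\theta_k) \in \mathcal{A} \Big\},
\qquad
\hat{x} = \arg\min_{x}\ \tfrac{1}{2}\,\|y - x\|_2^2 + \tau\,\|x\|_{\mathcal{A}},

where y is the noisy observation and \tau trades data fidelity against the atomic norm that promotes a sparse combination of atoms.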
{"title":"Improved Atomic Norm Based Channel Estimation for Time-Varying Narrowband Leaked Channels","authors":"Jianxiu Li, U. Mitra","doi":"10.1109/ICASSP39728.2021.9413804","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413804","url":null,"abstract":"In this paper, improved channel gain delay estimation strategies are investigated when practical pulse shapes with finite block length and transmission bandwidth are employed. Pilot-aided channel estimation with an improved atomic norm based approach is proposed to promote the low rank structure of the channel. All the channel parameters, i.e., delays, Doppler shifts and channel gains are recovered. Design choices which ensure unique estimates of channel parameters for root-raised-cosine pulse shapes are examined. Furthermore, a perturbation analysis is conducted. Finally, numerical results verify the theoretical analysis and show performance improvements over the previously proposed method.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114331335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Inverse Mapping of Face GANs
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9413532
N. Bayat, Vahid Reza Khazaie, Y. Mohsenzadeh
Generative adversarial networks (GANs) synthesize realistic images from random latent vectors. While many studies have explored various training configurations and architectures for GANs, the problem of inverting the generator of a GAN has been inadequately investigated. We train a ResNet architecture to map given faces to latent vectors that can be used to generate faces nearly identical to the targets. We use a perceptual loss to embed face details in the recovered latent vector while maintaining visual quality with a pixel loss. Most existing latent-vector-recovery methods are very slow and perform well only on generated images. We argue that our method provides a fast mapping between real human faces and latent-space vectors that retain most of the important face style details. Finally, we demonstrate the performance of our approach on both real and generated faces.
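A minimal sketch of the combined loss described above, assuming PyTorch; the generator, encoder, and feature extractor are placeholders, and the L1/MSE choices and equal weights are assumptions rather than the paper's exact configuration.

import torch
import torch.nn.functional as F

def inversion_loss(generator, feature_extractor, encoder, faces,
                   pixel_weight=1.0, perceptual_weight=1.0):
    """Combined pixel + perceptual loss for training an encoder that maps
    faces to latent vectors (a sketch; all networks are placeholders)."""
    z = encoder(faces)                       # predicted latent vectors
    recon = generator(z)                     # faces regenerated from the latents
    pixel = F.l1_loss(recon, faces)          # low-level visual fidelity
    perceptual = F.mse_loss(feature_extractor(recon),
                            feature_extractor(faces))  # high-level face details
    return pixel_weight * pixel + perceptual_weight * perceptual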
{"title":"Fast Inverse Mapping of Face GANs","authors":"N. Bayat, Vahid Reza Khazaie, Y. Mohsenzadeh","doi":"10.1109/ICASSP39728.2021.9413532","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413532","url":null,"abstract":"Generative adversarial networks (GANs) synthesize realistic images from random latent vectors. While many studies have explored various training configurations and architectures for GANs, the problem of inverting the generator of GANs has been inadequately investigated. We train a ResNet architecture to map given faces to latent vectors that can be used to generate faces nearly identical to the target. We use a perceptual loss to embed face details in the recovered latent vector while maintaining visual quality using a pixel loss. The vast majority of studies on latent vector recovery are very slow and perform well only on generated images. We argue that our method can be used to determine a fast mapping between real human faces and latent-space vectors that contain most of the important face style details. At last, we demonstrate the performance of our approach on both real and generated faces.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114374961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What And Where To Focus In Person Search
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9414439
Tong Zhou, Kun Tian
Person search aims to locate and identify a query person in a gallery of original scene images. Almost all previous methods consider only single high-level semantic information, ignoring that the essence of the identification task is to learn rich and expressive features. Additionally, large pose variations and occlusions of the target person significantly increase the difficulty of the search task. Motivated by these two observations, we first propose a multilevel semantic aggregation algorithm for more discriminative feature descriptors. Then, a pose-assisted attention module is designed to highlight fine-grained areas of the target and simultaneously capture valuable clues for identification. Extensive experiments confirm that our framework can coordinate multilevel semantics of persons and effectively alleviate the adverse effects of occlusion and pose variation. We also achieve state-of-the-art performance on two challenging datasets, CUHK-SYSU and PRW.
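As a rough illustration of aggregating semantics from several backbone levels with attention, here is a hedged PyTorch sketch; the channel widths, 1x1 projections, and softmax weighting are hypothetical and do not reproduce the authors' multilevel semantic aggregation algorithm in detail.

import torch
import torch.nn as nn

class MultiLevelAggregation(nn.Module):
    """Toy aggregation of feature maps from several backbone stages: each level
    is projected to a common width, globally pooled, and the levels are combined
    with learned attention weights (hypothetical sizes)."""
    def __init__(self, level_dims=(256, 512, 1024), out_dim=256):
        super().__init__()
        self.projs = nn.ModuleList([nn.Conv2d(d, out_dim, kernel_size=1) for d in level_dims])
        self.attn = nn.Linear(out_dim, 1)

    def forward(self, feature_maps):
        # feature_maps: list of (B, C_l, H_l, W_l) tensors, one per level
        pooled = [p(f).mean(dim=(2, 3)) for p, f in zip(self.projs, feature_maps)]  # (B, out_dim) each
        stacked = torch.stack(pooled, dim=1)                  # (B, L, out_dim)
        weights = torch.softmax(self.attn(stacked), dim=1)    # (B, L, 1) attention over levels
        return (weights * stacked).sum(dim=1)                 # (B, out_dim) person descriptor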
{"title":"What And Where To Focus In Person Search","authors":"Tong Zhou, Kun Tian","doi":"10.1109/ICASSP39728.2021.9414439","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414439","url":null,"abstract":"Person search aims to locate and identify the query person from a gallery of original scene images. Almost all previous methods only consider single high-level semantic information, ignoring that the essence of identification task is to learn rich and expressive features. Additionally, large pose variations and occlusions of the target person significantly increase the difficulty of search task. For these two findings, we first propose multilevel semantic aggregation algorithm for more discriminative feature descriptors. Then, a pose-assisted attention module is designed to highlight fine-grained area of the target and simultaneously capture valuable clues for identification. Extensive experiments confirm that our framework can coordinate multilevel semantics of persons and effectively alleviate the adverse effects of occlusion and various pose. We also achieve state-of-the-art performance on two challenging datasets CUHK-SYSU and PRW.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114533795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Framework Based on Transfer Learning for Cross-Database Pneumonia Detection
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9414997
Xinxin Shan, Y. Wen
In cross-database classification, a model is trained on one database and tested on another, so it must cope with severely mismatched data distributions; cross-database pneumonia detection is therefore a challenging task. In this paper, we propose a new framework based on transfer learning for cross-database pneumonia detection. First, based on transfer learning, we fine-tune a backbone pre-trained on non-medical data using a small amount of pneumonia images, which improves detection performance on the homogeneous dataset. Then, to make the fine-tuned model applicable to cross-database classification, an adaptation layer combined with a self-learning strategy is proposed to retrain the model. The adaptation layer brings the heterogeneous data distributions closer together, and the self-learning strategy helps tweak the model by generating pseudo-labels. Experiments on three pneumonia databases show that our proposed model accomplishes cross-database detection of pneumonia with good performance.
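A minimal sketch of the self-learning idea, assuming PyTorch: confident predictions on target-database images become pseudo-labels used to retrain the model. The confidence threshold is an assumption, and the paper's adaptation layer is not modeled here.

import torch

def self_learning_step(model, optimizer, criterion, target_images, threshold=0.9):
    """One self-learning step on unlabeled target-database images."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(target_images), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf > threshold                 # only trust confident samples
    if keep.sum() == 0:
        return None
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(target_images[keep]), pseudo[keep])  # pseudo-labels as targets
    loss.backward()
    optimizer.step()
    return loss.item()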
{"title":"A New Framework Based on Transfer Learning for Cross-Database Pneumonia Detection","authors":"Xinxin Shan, Y. Wen","doi":"10.1109/ICASSP39728.2021.9414997","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414997","url":null,"abstract":"Cross-database classification means that the model is able to apply to the serious disequilibrium of data distributions, and it is trained by one database while tested by another database. Thus, cross-database pneumonia detection is a challenging task. In this paper, we proposed a new framework based on transfer learning for cross-database pneumonia detection. First, based on transfer learning, we fine-tune a backbone that pre-trained on non-medical data by using a small amount of pneumonia images, which improves the detection performance on homogeneous dataset. Then in order to make the fine-tuned model applicable to cross-database classification, the adaptation layer combined with a self-learning strategy is proposed to retrain the model. The adaptation layer is to make the heterogeneous data distributions approximate and the self-learning strategy helps to tweak the model by generating pseudo-labels. Experiments on three pneumonia databases show that our proposed model completes the cross-database detection of pneumonia and shows good performance.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121486059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Scale and Multi-Region Facial Discriminative Representation for Automatic Depression Level Prediction
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9413504
Mingyue Niu, J. Tao, B. Liu
Physiological studies have shown that differences in facial activities between depressed patients and normal individuals are manifested in different local facial regions, and that the durations of these activities are not the same. However, most previous works extract features from the entire facial region at a fixed time scale to predict an individual's depression level, and are thus inadequate for capturing dynamic facial changes. For these reasons, we propose a multi-scale and multi-region facial dynamic representation method to improve prediction performance. In particular, we first use multiple time scales to divide the original long-term video into segments containing different facial regions. Second, segment-level features are extracted by a 3D convolutional neural network to characterize facial activities of different durations in different facial regions. Third, we adopt eigen evolution pooling and gradient boosting decision trees to aggregate these segment-level features and select discriminative elements to generate the video-level feature. Finally, the depression level is predicted using support vector regression. Experiments are conducted on AVEC2013 and AVEC2014. The results demonstrate that our method achieves better performance than previous works.
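To make the aggregation step concrete, here is a hedged numpy/scikit-learn sketch that summarizes a sequence of segment-level features with its leading singular vector (a simplified stand-in for eigen evolution pooling, omitting the gradient-boosting selection step) and then fits support vector regression; all data and dimensions are synthetic placeholders.

import numpy as np
from sklearn.svm import SVR

def pool_top_component(segment_feats):
    """Summarize a (num_segments, feat_dim) sequence by its first right singular
    vector - a simplified stand-in for eigen evolution pooling."""
    centered = segment_feats - segment_feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]                               # (feat_dim,) video-level feature

# Synthetic placeholders: one pooled feature vector per video, one score each.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(30, 64)) for _ in range(20)]
X = np.stack([pool_top_component(v) for v in videos])
y = rng.uniform(0, 45, size=20)                # placeholder depression scores
svr = SVR(kernel="rbf").fit(X, y)
print(svr.predict(X[:3]))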
{"title":"Multi-Scale and Multi-Region Facial Discriminative Representation for Automatic Depression Level Prediction","authors":"Mingyue Niu, J. Tao, B. Liu","doi":"10.1109/ICASSP39728.2021.9413504","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413504","url":null,"abstract":"Physiological studies have shown that differences in facial activities between depressed patients and normal individuals are manifested in different local facial regions and the durations of these activities are not the same. But most previous works extract features from the entire facial region at a fixed time scale to predict the individual depression level. Thus, they are inadequate in capturing dynamic facial changes. For these reasons, we propose a multi-scale and multi-region fa-cial dynamic representation method to improve the prediction performance. In particular, we firstly use multiple time scales to divide the original long-term video into segments containing different facial regions. Secondly, the segment-level feature is extracted by 3D convolution neural network to characterize the facial activities with different durations in different facial regions. Thirdly, this paper adopts eigen evolution pooling and gradient boosting decision tree to aggregate these segment-level features and select discriminative elements to generate the video-level feature. Finally, the depression level is predicted using support vector regression. Experiments are conducted on AVEC2013 and AVEC2014. The results demonstrate that our method achieves better performance than the previous works.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121496866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Teacher-Student Learning for Low-Latency Online Speech Enhancement Using Wave-U-Net
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9414280
Sotaro Nakaoka, Li Li, S. Inoue, S. Makino
In this paper, we propose a low-latency online extension of Wave-U-Net for single-channel speech enhancement, which utilizes teacher-student learning to reduce system latency while keeping enhancement performance high. Wave-U-Net is a recently proposed end-to-end source separation method that has achieved remarkable performance in singing voice separation and speech enhancement tasks. Since enhancement is performed in the time domain, Wave-U-Net can efficiently model phase information and avoid the limitations of the time-frequency domain transformation that is normally adopted. In this paper, we apply Wave-U-Net to face-to-face applications such as hearing aids and in-car communication systems, where a latency of strictly less than 10 ms is required. To this end, we investigate online versions of Wave-U-Net and propose the use of teacher-student learning to prevent the performance degradation caused by the reduction in input segment length, such that the system delay on a CPU is less than 10 ms. The experimental results revealed that the proposed model can run in real time with low latency and high performance, achieving a signal-to-distortion ratio improvement of about 8.73 dB.
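A minimal sketch of the teacher-student objective, assuming PyTorch: the low-latency student is trained against both the clean target and the output of a higher-latency teacher. The L1 losses, equal weighting, and single-tensor interface are assumptions; segment handling and the Wave-U-Net definitions are omitted.

import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, noisy_audio, clean_audio, alpha=0.5):
    """Combine a supervised loss against the clean signal with a distillation
    loss against the teacher's estimate (a sketch, not the paper's exact setup)."""
    with torch.no_grad():
        teacher_out = teacher(noisy_audio)     # high-quality but higher-latency estimate
    student_out = student(noisy_audio)         # low-latency estimate
    supervised = F.l1_loss(student_out, clean_audio)
    distill = F.l1_loss(student_out, teacher_out)
    return alpha * supervised + (1.0 - alpha) * distill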
{"title":"Teacher-Student Learning for Low-Latency Online Speech Enhancement Using Wave-U-Net","authors":"Sotaro Nakaoka, Li Li, S. Inoue, S. Makino","doi":"10.1109/ICASSP39728.2021.9414280","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414280","url":null,"abstract":"In this paper, we propose a low-latency online extension of wave-U-net for single-channel speech enhancement, which utilizes teacher-student learning to reduce the system latency while keeping the enhancement performance high. Wave-U-net is a recently proposed end-to-end source separation method, which achieved remarkable performance in singing voice separation and speech enhancement tasks. Since the enhancement is performed in the time domain, wave-U-net can efficiently model phase information and address the domain transformation limitation, where the time-frequency domain is normally adopted. In this paper, we apply wave-U-net to face-to-face applications such as hearing aids and in-car communication systems, where a strictly low-latency of less than 10 ms is required. To this end, we investigate online versions of wave-U-net and propose the use of teacher-student learning to prevent the performance degradation caused by the reduction in input segment length such that the system delay in a CPU is less than 10 ms. The experimental results revealed that the proposed model could perform in real-time with low-latency and high performance, achieving a signal-to-distortion ratio improvement of about 8.73 dB.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121570125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coughwatch: Real-World Cough Detection using Smartwatches
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9414881
D. Liaqat, S. Liaqat, Jun Lin Chen, Tina Sedaghat, Moshe Gabel, Frank Rudzicz, E. D. Lara
Continuous monitoring of cough may provide insights into the health of individuals as well as the effectiveness of treatments. Smartwatches, in particular, are highly promising for such monitoring: they are inexpensive, unobtrusive, programmable, and have a variety of sensors. However, current mobile cough detection systems are not designed for smartwatches and perform poorly when applied to real-world smartwatch data, since they are often evaluated only on data collected in the lab. In this work we propose CoughWatch, a lightweight cough detector for smartwatches that uses audio and movement data for in-the-wild cough detection. On our in-the-wild data, CoughWatch achieves a precision of 82% and a recall of 55%, compared to the 6% precision and 19% recall achieved by the current state-of-the-art approach. Furthermore, by incorporating gyroscope and accelerometer data, CoughWatch improves precision by up to 15.5 percentage points compared to an audio-only model.
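As a rough illustration of fusing audio and motion features for per-window cough classification (not the authors' actual pipeline or classifier), here is a hedged scikit-learn sketch with synthetic placeholder features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic per-window features: audio descriptors plus gyroscope/accelerometer
# statistics, concatenated before classification (all values are placeholders).
rng = np.random.default_rng(1)
audio_feats = rng.normal(size=(500, 40))
motion_feats = rng.normal(size=(500, 12))
labels = rng.integers(0, 2, size=500)          # 1 = cough window, 0 = other

X = np.concatenate([audio_feats, motion_feats], axis=1)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict_proba(X[:5])[:, 1])          # per-window cough probabilities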
{"title":"Coughwatch: Real-World Cough Detection using Smartwatches","authors":"D. Liaqat, S. Liaqat, Jun Lin Chen, Tina Sedaghat, Moshe Gabel, Frank Rudzicz, E. D. Lara","doi":"10.1109/ICASSP39728.2021.9414881","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414881","url":null,"abstract":"Continuous monitoring of cough may provide insights into the health of individuals as well as the effectiveness of treatments. Smart-watches, in particular, are highly promising for such monitoring: they are inexpensive, unobtrusive, programmable, and have a variety of sensors. However, current mobile cough detection systems are not designed for smartwatches, and perform poorly when applied to real-world smartwatch data since they are often evaluated on data collected in the lab.In this work we propose CoughWatch, a lightweight cough detector for smartwatches that uses audio and movement data for in-the-wild cough detection. On our in-the-wild data, CoughWatch achieves a precision of 82% and recall of 55%, compared to 6% precision and 19% recall achieved by the current state-of-the-art approach. Furthermore, by incorporating gyroscope and accelerometer data, CoughWatch improves precision by up to 15.5 percentage points compared to an audio-only model.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114711564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
History Utterance Embedding Transformer LM for Speech Recognition
Pub Date: 2021-06-06  DOI: 10.1109/ICASSP39728.2021.9414575
Keqi Deng, Gaofeng Cheng, Haoran Miao, Pengyuan Zhang, Yonghong Yan
History utterances contain rich contextual information; however, effectively extracting this information and using it to improve the language model (LM) remains challenging. In this paper, we propose the history utterance embedding Transformer LM (HTLM), which includes an embedding generation network for extracting the contextual information contained in the history utterances and a main Transformer LM for the current prediction. In addition, two-stage attention (TSA) is proposed to encode richer contextual information into the embedding of the history utterances (h-emb) while supporting GPU-parallel training. Furthermore, we combine the extracted h-emb and the embedding of the current utterance (c-emb) through dot-product attention and a fusion method for HTLM's current prediction. Experiments are conducted on the HKUST dataset and achieve a 23.4% character error rate (CER) on the test set. Compared with the baseline, the proposed method yields an absolute perplexity reduction of 12.86 and an absolute CER reduction of 0.8%.
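A minimal PyTorch sketch of dot-product attention of the current-utterance embedding over history embeddings, followed by a simple additive fusion; the scaling, additive combination, and dimensions are assumptions rather than HTLM's exact fusion method.

import torch
import torch.nn.functional as F

def fuse_history(c_emb, h_emb):
    """Attend from the current-utterance embedding over history embeddings and
    fuse the attended context back in (a sketch, not the paper's exact fusion)."""
    # c_emb: (B, D), h_emb: (B, N_hist, D)
    scores = torch.einsum('bd,bnd->bn', c_emb, h_emb) / (c_emb.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)                # attention over history utterances
    context = torch.einsum('bn,bnd->bd', weights, h_emb)
    return c_emb + context                             # fused representation

fused = fuse_history(torch.randn(4, 256), torch.randn(4, 10, 256))
print(fused.shape)  # torch.Size([4, 256])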
{"title":"History Utterance Embedding Transformer LM for Speech Recognition","authors":"Keqi Deng, Gaofeng Cheng, Haoran Miao, Pengyuan Zhang, Yonghong Yan","doi":"10.1109/ICASSP39728.2021.9414575","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414575","url":null,"abstract":"History utterances contain rich contextual information; however, better extracting information from the history utterances and using it to improve the language model (LM) is still challenging. In this paper, we propose the history utterance embedding Transformer LM (HTLM), which includes an embedding generation network for extracting contextual information contained in the history utterances and a main Transformer LM for current prediction. In addition, the two-stage attention (TSA) is proposed to encode richer contextual information into the embedding of history utterances (h-emb) while supporting GPU parallel training. Furthermore, we combine the extracted h-emb and embedding of current utterance (c-emb) through the dot-product attention and a fusion method for HTLM's current prediction. Experiments are conducted on the HKUST dataset and achieve a 23.4% character error rate (CER) on the test set. Compared with the baseline, the proposed method yields 12.86 absolute perplexity reduction and 0.8% absolute CER reduction.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114763095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}