Full-Duplex Multifunction Transceiver with Joint Constant Envelope Transmission and Wideband Reception
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413725
Jaakko Marin, Micael Bernhardt, T. Riihonen
This paper introduces and justifies a novel system concept that consists of full-duplex transceivers and uses a multifunction signal for simultaneous two-way communication, jamming, and sensing. The proposed device structure and waveform enable simple yet effective interference suppression at the cost of being limited to constant-envelope transmission; this is a weakness only for the communication functionality, which is restricted to frequency-shift keying (FSK), while frequency-modulated continuous-wave (FMCW) waveforms remain effective for jamming and sensing. We show how transmission and reception, as well as different interference and distortion compensation procedures, are implemented in such multifunction transceivers. The system could also be applied to simultaneous spectrum monitoring alongside the above functions. Finally, we showcase the expected performance of such a system through numerical results.
{"title":"Full-Duplex Multifunction Transceiver with Joint Constant Envelope Transmission and Wideband Reception","authors":"Jaakko Marin, Micael Bernhardt, T. Riihonen","doi":"10.1109/ICASSP39728.2021.9413725","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413725","url":null,"abstract":"This paper introduces and justifies a novel system concept that consists of full-duplex transceivers and uses a multifunction signal for simultaneous two-way communication, jamming and sensing tasks. The proposed device structure and wave-form enable simple-yet-effective interference suppression at the cost of being limited to constant-envelope transmission— this is a weakness only for the communication functionality that becomes limited to frequency-shift keying (FSK) while frequency-modulated continuous wave (FMCW) waveforms are effective for jamming and sensing purposes. We show how the transmission and reception as well as different interference and distortion compensation procedures are implemented in such multifunction transceivers. The system could be also applied for simultaneous spectrum monitoring with the above functions. Finally, we showcase the expected performance of such a system through numerical results.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127918619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UTDN: An Unsupervised Two-Stream Dirichlet-Net for Hyperspectral Unmixing
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9414810
Qiwen Jin, Yong Ma, Xiaoguang Mei, Hao Li, Jiayi Ma
Recently, learning-based methods have received much attention in unsupervised hyperspectral unmixing, yet their ability to extract physically meaningful endmembers remains limited and their performance has not been satisfactory. In this paper, we propose a novel two-stream Dirichlet-net, termed uTDN, to address these problems. Its weight-sharing architecture makes it possible to transfer the intrinsic properties of the endmembers during unmixing, which helps steer the network towards a more accurate and interpretable unmixing solution. In addition, a stick-breaking process is adopted to encourage the latent representation to follow a Dirichlet distribution, so that the physical properties of the estimated abundances can be naturally incorporated. Extensive experiments on both synthetic and real hyperspectral data demonstrate that the proposed uTDN outperforms other state-of-the-art approaches.
{"title":"UTDN: An Unsupervised Two-Stream Dirichlet-Net for Hyperspectral Unmixing","authors":"Qiwen Jin, Yong Ma, Xiaoguang Mei, Hao Li, Jiayi Ma","doi":"10.1109/ICASSP39728.2021.9414810","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414810","url":null,"abstract":"Recently, the learning-based method has received much attention in the unsupervised hyperspectral unmixing, yet their ability to extract physically meaningful endmembers remains limited and the performance has not been satisfactory. In this paper, we propose a novel two-stream Dirichlet-net, termed as uTDN, to address the above problems. The weight-sharing architecture makes it possible to transfer the intrinsic properties of the endmembers during the process of unmixing, which can help to correct the network converging towards a more accurate and interpretable unmixing solution. Besides, the stick-breaking process is adopted to encourage the latent representation to follow a Dirichlet distribution, where the physical property of the estimated abundance can be naturally incorporated. Extensive experiments on both synthetic and real hyperspectral data demonstrate that the proposed uTDN can outperform the other state-of-the-art approaches.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128177823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Alzheimer's Disease from Speech Using Neural Networks with Bottleneck Features and Data Augmentation
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413566
Zhaoci Liu, Zhiqiang Guo, Zhenhua Ling, Yunxia Li
This paper presents a method of detecting Alzheimer’s disease (AD) from the spontaneous speech of subjects in a picture description task using neural networks. The method does not rely on manual transcriptions or annotations of a subject’s speech, but utilizes bottleneck features extracted from the audio with an ASR model. The neural network contains convolutional neural network (CNN) layers for local context modeling, bidirectional long short-term memory (BiLSTM) layers for global context modeling, and an attention pooling layer for classification. Furthermore, a masking-based data augmentation method is designed to deal with the data scarcity problem. Experiments on the DementiaBank dataset show that the detection accuracy of our proposed method is 82.59%, which is better than the baseline based on manually designed acoustic features and support vector machines (SVM), achieving state-of-the-art performance for detecting AD from audio data alone on this dataset.
{"title":"Detecting Alzheimer’s Disease from Speech Using Neural Networks with Bottleneck Features and Data Augmentation","authors":"Zhaoci Liu, Zhiqiang Guo, Zhenhua Ling, Yunxia Li","doi":"10.1109/ICASSP39728.2021.9413566","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413566","url":null,"abstract":"This paper presents a method of detecting Alzheimer’s disease (AD) from the spontaneous speech of subjects in a picture description task using neural networks. This method does not rely on the manual transcriptions and annotations of a subject’s speech, but utilizes the bottleneck features extracted from audio using an ASR model. The neural network contains convolutional neural network (CNN) layers for local context modeling, bidirectional long shortterm memory (BiLSTM) layers for global context modeling and an attention pooling layer for classification. Furthermore, a masking- based data augmentation method is designed to deal with the data scarcity problem. Experiments on the DementiaBank dataset show that the detection accuracy of our proposed method is 82.59%, which is better than the baseline method based on manually-designed acoustic features and support vector machines (SVM), and achieves the state-of-the-art performance of detecting AD using only audio data on this dataset.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128196130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Decomposing Textures using Exponential Analysis
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413909
Yuan Hou, A. Cuyt, Wen-shin Lee, Deepayan Bhowmik
Decomposition is integral to most image processing algorithms and is often required in texture analysis. We present a new approach using a recent 2-dimensional exponential analysis technique. Exponential analysis offers the advantages of sparsity in the model and continuity in the parameters, resulting in a much more compact representation of textures than traditional Fourier or wavelet transform techniques. Our experiments include synthetic as well as real texture images from standard benchmark datasets. The results outperform the FFT in representing texture patterns with significantly fewer terms while retaining comparable RMSE values after reconstruction. The underlying periodic complex exponential model works best for texture patterns that are homogeneous. We demonstrate the usefulness of the method in two common vision processing applications, namely texture classification and defect detection.
{"title":"Decomposing Textures using Exponential Analysis","authors":"Yuan Hou, A. Cuyt, Wen-shin Lee, Deepayan Bhowmik","doi":"10.1109/ICASSP39728.2021.9413909","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413909","url":null,"abstract":"Decomposition is integral to most image processing algorithms and often required in texture analysis. We present a new approach using a recent 2-dimensional exponential analysis technique. Exponential analysis offers the advantage of sparsity in the model and continuity in the parameters. This results in a much more compact representation of textures when compared to traditional Fourier or wavelet transform techniques. Our experiments include synthetic as well as real texture images from standard benchmark datasets. The results outperform FFT in representing texture patterns with significantly fewer terms while retaining RMSE values after reconstruction. The underlying periodic complex exponential model works best for texture patterns that are homogeneous. We demonstrate the usefulness of the method in two common vision processing application examples, namely texture classification and defect detection.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115856615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Periodic Signal Denoising: An Analysis-Synthesis Framework Based on Ramanujan Filter Banks and Dictionaries
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413689
Pranav Kulkarni, P. Vaidyanathan
Ramanujan filter banks (RFB) have in the past been used to identify periodicities in data. These are analysis filter banks with no synthesis counterpart for perfect reconstruction of the original signal, so they have not been useful for denoising periodic signals. This paper proposes a hybrid analysis-synthesis framework for denoising discrete-time periodic signals. The synthesis occurs via a pruned dictionary designed from the output energies of the RFB analysis filters. A unique property of the framework is that the denoised output signal is guaranteed to be periodic, unlike with other methods. For a large range of input noise levels, the proposed approach achieves a stable and high SNR gain, outperforming many traditional denoising techniques.
{"title":"Periodic Signal Denoising: An Analysis-Synthesis Framework Based on Ramanujan Filter Banks and Dictionaries","authors":"Pranav Kulkarni, P. Vaidyanathan","doi":"10.1109/ICASSP39728.2021.9413689","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413689","url":null,"abstract":"Ramanujan filter banks (RFB) have in the past been used to identify periodicities in data. These are analysis filter banks with no synthesis counterpart for perfect reconstruction of the original signal, so they have not been useful for denoising periodic signals. This paper proposes to use a hybrid analysis-synthesis framework for denoising discrete-time periodic signals. The synthesis occurs via a pruned dictionary designed based on the output energies of the RFB analysis filters. A unique property of the framework is that the denoised output signal is guaranteed to be periodic unlike any of the other methods. For a large range of input noise levels, the proposed approach achieves a stable and high SNR gain outperforming many traditional denoising techniques.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131985440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Large-Scale Chinese Long-Text Extractive Summarization Corpus
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9414946
Kai Chen, Guanyu Fu, Qingcai Chen, Baotian Hu
Recently, large-scale datasets have vastly facilitated development in nearly all domains of Natural Language Processing. However, the lack of a large-scale Chinese corpus is still a critical bottleneck for further research on deep text-summarization methods. In this paper, we publish a large-scale Chinese Long-text Extractive Summarization corpus named CLES. CLES contains about 104K pairs, originally collected from Sina Weibo. To verify the quality of the corpus, we also manually tagged the relevance scores of 5,000 pairs. Our benchmark models on the proposed corpus include conventional deep-learning-based extractive models and several pre-trained BERT-based algorithms. Their performance is reported and briefly analyzed to facilitate further research on the corpus. We will release the corpus for further research.
{"title":"A Large-Scale Chinese Long-Text Extractive Summarization Corpus","authors":"Kai Chen, Guanyu Fu, Qingcai Chen, Baotian Hu","doi":"10.1109/ICASSP39728.2021.9414946","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414946","url":null,"abstract":"Recently, large-scale datasets have vastly facilitated the development in nearly domains of Natural Language Processing. However, lacking large scale Chinese corpus is still a critical bottleneck for further research on deep text summarization methods. In this paper, we publish a large-scale Chinese Long-text Extractive Summarization corpus named CLES. The CLES contains about 104K pairs, which is originally collected from Sina Weibo1. To verify the quality of the corpus, we also manually tagged the relevance score of 5,000 pairs. Our benchmark models on the proposed corpus include conventional deep learning based extractive models and several pre-trained Bert-based algorithms. Their performances are reported and briefly analyzed to facilitate further research on the corpus. We will release this corpus for further research2.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132211851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Drawing Order Recovery from Trajectory Components
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413542
Minghao Yang, Xukang Zhou, Yangchang Sun, Jinglong Chen, Baohua Qiang
Although widely discussed, drawing order recovery (DOR) from static images is still a challenging task. Based on the idea that drawing trajectories can be recovered by connecting their trajectory components in the correct order, this work proposes a novel DOR method for static images. The method contains two steps: first, we adopt a convolutional neural network (CNN) to predict the next possible drawing components, which converts the components in images into plausible sequences; we denote this architecture Im2Seq-CNN. Second, since the sequences generated by the first step may contain errors, we construct a sequence-to-order structure (Seq2Order) to adjust these sequences into the correct orders. The main contributions are: (1) the Im2Seq-CNN step performs DOR over components instead of tracing pixels one by one along trajectories, which converts static images into component sequences; (2) the Seq2Order step adopts image position codes instead of traditional point coordinates in its encoder-decoder gated recurrent neural network (GRU-RNN). The proposed method is evaluated on two well-known open handwriting databases and yields robust, competitive results on handwriting DOR tasks compared to the state of the art.
{"title":"Drawing Order Recovery from Trajectory Components","authors":"Minghao Yang, Xukang Zhou, Yangchang Sun, Jinglong Chen, Baohua Qiang","doi":"10.1109/ICASSP39728.2021.9413542","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413542","url":null,"abstract":"In spite of widely discussed, drawing order recovery (DOR) from static images is still a great challenge task. Based on the idea that drawing trajectories are able to be recovered by connecting their trajectory components in correct orders, this work proposes a novel DOR method from static images. The method contains two steps: firstly, we adopt a convolution neural network (CNN) to predict the next possible drawing components, which is able to covert the components in images to their reasonable sequences. We denote this architecture as Im2Seq-CNN; secondly, considering possible errors exist in the reasonable sequences generated by the first step, we construct a sequence to order structure (Seq2Order) to adjust the sequences to the correct orders. The main contributions include: (1) the Img2Seq-CNN step considers DOR from components instead of traditional pixels one by one along trajectories, which contributes to static images to component sequences; (2) the Seq2Order step adopts image position codes instead of traditional points’ coordinates in its encoder-decoder gated recurrent neural network (GRU-RNN). The proposed method is experienced on two well-known open handwriting databases, and yields robust and competitive results on handwriting DOR tasks compared to the state-of-arts.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132349415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Structure-Guided and Sparse-Representation-Based 3D Seismic Inversion Method
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9415071
B. She, Yaojun Wang, Guang Hu
Existing seismic inversion methods are usually 1D, mainly focusing on improving the vertical resolution of the inversion results. The few 2D or 3D inversion techniques are either too simple, lacking consideration of stratigraphic structures, or too complicated, requiring dip information to be extracted and a complex constrained optimization problem to be solved. In this work, with the help of the gradient structure tensor (GST) and dictionary learning and sparse representation (DLSR), we propose a 3D inversion approach (GST-DLSR) that considers both vertical and horizontal structural constraints. In the vertical direction, we learn the structural features of subsurface models from well-log data by DLSR. In the horizontal direction, we obtain stratigraphic structural features from a 3D seismic image by GST. We then apply the acquired structural features to constrain the entire inversion procedure. The experiments show that GST-DLSR takes advantage of both techniques, producing inversion results with high resolution, good lateral continuity, and enhanced structural features.
{"title":"A Structure-Guided and Sparse-Representation-Based 3d Seismic Inversion Method","authors":"B. She, Yaojun Wang, Guang Hu","doi":"10.1109/ICASSP39728.2021.9415071","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9415071","url":null,"abstract":"Existing seismic inversion methods are usually 1D, mainly focusing on improving the vertical resolution of inversion results. A few 2D or 3D inversion techniques are either too simple and lack the consideration of stratigraphic structures, or are too complicated which need to extract dip information and solve a complex constrained optimization problem. In this work, with the help of gradient structure tensor (GST) and dictionary learning and sparse representation (DLSR) technologies, we propose a 3D inversion approach (GST-DLSR) that considers both vertical and horizontal structural constraints. In the vertical direction, we investigate the vertical structural features of subsurface models from well-log data by DLSR. In the horizontal direction, we obtain the stratigraphic structural features from a 3D seismic image by GST. We then apply the acquired structural features to constraint the entire inversion procedure. The experiments show that GST-DLSR takes good advantages of both techniques, enabling to produce inversion results with high resolution, good lateral continuity, and enhanced structural features.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132366413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evolving Quantized Neural Networks for Image Classification Using a Multi-Objective Genetic Algorithm
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413519
Yong Wang, Xiaojing Wang, Xiaoyu He
Recently, many model quantization approaches have been investigated to reduce the model size and improve the inference speed of convolutional neural networks (CNNs). However, these approaches usually lead to a decrease in classification accuracy. To address this problem, this paper proposes a mixed-precision quantization method combined with channel expansion of CNNs using a multi-objective genetic algorithm, called MOGAQNN. In MOGAQNN, each individual in the population encodes a mixed-precision quantization policy and a channel expansion policy. During the evolution process, the two policies are optimized simultaneously by the non-dominated sorting genetic algorithm II (NSGA-II). Finally, we choose the best individual in the last population and evaluate its performance on the test set as the final performance. Experimental results with five popular CNNs on two benchmark datasets demonstrate that MOGAQNN can greatly reduce model size and improve classification accuracy at the same time.
{"title":"Evolving Quantized Neural Networks for Image Classification Using A Multi-Objective Genetic Algorithm","authors":"Yong Wang, Xiaojing Wang, Xiaoyu He","doi":"10.1109/ICASSP39728.2021.9413519","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413519","url":null,"abstract":"Recently, many model quantization approaches have been investigated to reduce the model size and improve the inference speed of convolutional neural networks (CNNs). However, these approaches usually inevitably lead to a decrease in classification accuracy. To address this problem, this paper proposes a mixed precision quantization method combined with channel expansion of CNNs by using a multi-objective genetic algorithm, called MOGAQNN. In MOGAQNN, each individual in the population is used to encode a mixed precision quantization policy and a channel expansion policy. During the evolution process, the two polices are optimized simultaneously by the non-dominated sorting genetic algorithm II (NSGA-II). Finally, we choose the best individual in the last population and evaluate its performance on the test set as the final performance. The experimental results of five popular CNNs on two benchmark datasets demonstrate that MOGAQNN can greatly reduce the model size and improve the classification accuracy at the same time.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132416133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Investigation of Using Hybrid Modeling Units for Improving End-to-End Speech Recognition System
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9414598
Shunfei Chen, Xinhui Hu, Sheng Li, Xinkang Xu
The acoustic modeling unit is crucial for an end-to-end speech recognition system, especially for Mandarin. Until now, most studies on Mandarin speech recognition have focused on individual units, and few have paid attention to combinations of these units. This paper uses a hybrid of syllable, Chinese character, and subword units for an end-to-end speech recognition system based on CTC/attention multi-task learning. In this approach, the character-subword unit is used to train the transformer model in the main task, while the syllable unit is used to enhance the transformer’s shared encoder in the auxiliary task with the Connectionist Temporal Classification (CTC) loss function. Recognition experiments were conducted on AISHELL-1 and on an open 1200-hour Mandarin speech corpus collected from OpenSLR. The results demonstrate that the syllable-char-subword hybrid modeling unit achieves better performance than the conventional char-subword units, with a 6.6% relative CER reduction on our 1200-hour data. Substitution errors are also considerably reduced.
{"title":"An Investigation of Using Hybrid Modeling Units for Improving End-to-End Speech Recognition System","authors":"Shunfei Chen, Xinhui Hu, Sheng Li, Xinkang Xu","doi":"10.1109/ICASSP39728.2021.9414598","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414598","url":null,"abstract":"The acoustic modeling unit is crucial for an end-to-end speech recognition system, especially for the Mandarin language. Until now, most of the studies on Mandarin speech recognition focused on individual units, and few of them paid attention to using a combination of these units. This paper uses a hybrid of the syllable, Chinese character, and subword as the modeling units for the end-to-end speech recognition system based on the CTC/attention multi-task learning. In this approach, the character-subword unit is assigned to train the transformer model in the main task learning stage. In contrast, the syllable unit is assigned to enhance the transformer’s shared encoder in the auxiliary task stage with the Connectionist Temporal Classification (CTC) loss function. The recognition experiments were conducted on AISHELL-1 and an open data set of 1200-hour Mandarin speech corpus collected from the OpenSLR, respectively. The experimental results demonstrated that using the syllable-char-subword hybrid modeling unit can achieve better performances than the conventional units of char-subword, and 6.6% relative CER reduction on our 1200-hour data. The substitution error also achieves a considerable reduction.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132462606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}