Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282203
Title: "Image super-resolution based on error compensation with convolutional neural network"
Wei-Ting Lu, Chien-Wei Lin, Chih-Hung Kuo, Ying-Chan Tung
Convolutional neural networks (CNNs) have been widely studied for super-resolution (SR) and other image restoration tasks. In this paper, we propose an additional error-compensation convolutional neural network (EC-CNN) trained on the concept of iterative back projection (IBP). The residuals between interpolated images and ground-truth images are used to train the network, so the model can compensate the residual projection step in IBP more accurately. This CNN-based IBP can be further combined with the super-resolution CNN (SRCNN). Experimental results show that our method, applied as a post-processing step, significantly enhances the quality of upscaled images: on average it outperforms SRCNN by 0.14 dB and SRCNN-EX by 0.08 dB in PSNR at a scaling factor of 3.
{"title":"Image super-resolution based on error compensation with convolutional neural network","authors":"Wei-Ting Lu, Chien-Wei Lin, Chih-Hung Kuo, Ying-Chan Tung","doi":"10.1109/APSIPA.2017.8282203","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282203","url":null,"abstract":"Convolutional Neural Networks have been widely studied for the super-resolution (SR) and other image restoration tasks. In this paper, we propose an additional error-compensational convolutional neural network (EC-CNN) that is trained based on the concept of iterative back projection (IBP). The residuals between interpolation images and ground truth images are used to train the network. This CNN model can compensate the residual projection in the IBP more accurately. This CNN- based IBP can be further combined with the super-resolution CNN(SRCNN). Experimental results show that our method can significantly enhance the quality of scale images as a post-processing method. The approach can averagely outperform SRCNN by 0.14 dB and SRCNN-EX by 0.08 dB in PSNR with scaling factor 3.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124972181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282109
Title: "Importance of non-uniform prosody modification for speech recognition in emotion conditions"
Vishnu Vidyadhara Raju Vegesna, Hari Krishna Vydana, S. Gangashetty, A. Vuppala
A mismatch between training and operating environments degrades the performance of automatic speech recognition (ASR) systems. One major source of this mismatch is the presence of expressive (emotive) speech in operational environments. Emotions in speech mainly manifest as changes in the prosodic parameters of pitch, duration, and energy. This work aims to improve recognition performance in the presence of emotive speech without modifying the existing ASR system. Pitch, duration, and energy are modified by tuning the modification factor values according to the relative differences between the neutral and emotional data sets, and a neutral version of each emotive utterance is generated using uniform and non-uniform prosody modification methods. The IITKGP-SESC corpus is used to build the ASR system, which is evaluated on three emotions (anger, happiness, and compassion). Recognition performance improves when the prosody-modified emotive utterance is used in place of the original emotive utterance, with an average accuracy improvement of around 5% from the non-uniform prosody modification methods.
{"title":"Importance of non-uniform prosody modification for speech recognition in emotion conditions","authors":"Vishnu Vidyadhara Raju Vegesna, Hari Krishna Vydana, S. Gangashetty, A. Vuppala","doi":"10.1109/APSIPA.2017.8282109","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282109","url":null,"abstract":"A mismatch in training and operating environments causes a performance degradation in speech recognition systems (ASR). One major reason for this mismatch is due to the presence of expressive (emotive) speech in operational environments. Emotions in speech majorly inflict the changes in the prosody parameters of pitch, duration and energy. This work is aimed at improving the performance of speech recognition systems in the presence of emotive speech. This work focuses on improving the speech recognition performance without disturbing the existing ASR system. The prosody modification of pitch, duration and energy is achieved by tuning the modification factors values for the relative differences between the neutral and emotional data sets. The neutral version of emotive speech is generated using uniform and non-uniform prosody modification methods for speech recognition. During the study, IITKGP-SESC corpus is used for building the ASR system. The speech recognition system for the emotions (anger, happy and compassion) is evaluated. An improvement in the performance of ASR is observed when the prosody modified emotive utterance is used for speech recognition in place of original emotive utterance. An average improvement around 5% in accuracy is observed due to the use of non-uniform prosody modification methods.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126837194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282299
Title: "A deep learning architecture for classifying medical images of anatomy object"
S. Khan, S. Yong
Deep learning architectures, particularly convolutional neural networks (CNNs), have shown an intrinsic ability to automatically extract high-level representations from big data. CNNs have produced impressive results in natural image classification, but a major hurdle to their deployment in the medical domain is the relative lack of training data compared to general imaging benchmarks such as ImageNet. In this paper we present a comparative evaluation of three milestone architectures, i.e., LeNet, AlexNet, and GoogLeNet, and propose our own CNN architecture for classifying medical anatomy images. Experiments show that the proposed architecture outperforms the three milestone architectures in classifying medical images of anatomy objects.
{"title":"A deep learning architecture for classifying medical images of anatomy object","authors":"S. Khan, S. Yong","doi":"10.1109/APSIPA.2017.8282299","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282299","url":null,"abstract":"Deep learning architectures particularly Convolutional Neural Network (CNN) have shown an intrinsic ability to automatically extract the high level representations from big data. CNN has produced impressive results in natural image classification, but there is a major hurdle to their deployment in medical domain because of the relatively lack of training data as compared to general imaging benchmarks such as ImageNet. In this paper we present a comparative evaluation of the three milestone architectures i.e. LeNet, AlexNet and GoogLeNet and propose our CNN architecture for classifying medical anatomy images. Based on the experiments, it is shown that the proposed Convolutional Neural Network architecture outperforms the three milestone architectures in classifying medical images of anatomy object.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129072598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282146
Title: "MSE-optimized CP-based CFO estimation in OFDM systems over multipath channels"
Tzu-Chiao Lin, See-May Phoong
Carrier frequency offset (CFO) is an important issue in the study of orthogonal frequency division multiplexing (OFDM) systems. It is well known that CFO destroys the orthogonality of the subcarriers and significantly degrades the bit error rate (BER) performance of OFDM systems. In this paper, an algorithm based on the cyclic prefix (CP) is proposed for blind CFO estimation in OFDM transmission over multipath channels. The proposed method minimizes the theoretical mean square error (MSE), and a closed-form formula is derived. Simulation results show that the proposed method performs very well.
{"title":"MSE-optimized CP-based CFO estimation in OFDM systems over multipath channels","authors":"Tzu-Chiao Lin, See-May Phoong","doi":"10.1109/APSIPA.2017.8282146","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282146","url":null,"abstract":"Carrier frequency offset (CFO) is an important issue in the study of orthogonal frequency division multiplexing (OFDM) systems. It is well known that CFO destroys the orthogonality of the subcarriers and it significantly degrades the bit error rate (BER) performance of OFDM systems. In this paper, an algorithm based on cyclic prefix (CP) is proposed for blind CFO estimation in OFDM transmission over multipath channels. The proposed method minimizes the theoretical mean square error (MSE). A closed form formula is derived. Simulation results show that the proposed method performs very well.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121177553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282097
Title: "Electrolaryngeal speech modification towards singing aid system for laryngectomees"
Kazuho Morikawa, T. Toda
Towards the development of a singing aid system for laryngectomees, we propose a method for converting electrolaryngeal (EL) speech, produced using an electrolarynx, into more natural-sounding singing voices. Singing with an electrolarynx is inflexible because the pitch of EL speech is determined by the mechanically produced source excitation signal; the melodies of the songs to be sung must therefore be embedded into the electrolarynx in advance. In addition, the sound quality of singing voices produced with the electrolarynx is severely degraded by its mechanical excitation sounds, which leak outside as noise. To address these problems, the proposed conversion method uses 1) pitch control by playing a musical instrument and 2) noise suppression. In the pitch control, the pitch patterns of the music sounds played while singing with the electrolarynx are modified to exhibit characteristics typically observed in singing voices, and the modified pitch patterns serve as the targets in the conversion from EL speech into singing voices. In the noise suppression, spectral subtraction is used to suppress the leaked excitation sounds. Experimental results demonstrate that 1) the naturalness of the singing voices is significantly improved by the noise suppression, and 2) the pitch pattern modification is not necessarily effective in the conversion from EL speech into singing voices.
{"title":"Electrolaryngeal speech modification towards singing aid system for laryngectomees","authors":"Kazuho Morikawa, T. Toda","doi":"10.1109/APSIPA.2017.8282097","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282097","url":null,"abstract":"Towards the development of a singing aid system for laryngectomees, we propose a method for converting electro-laryngeal (EL) speech produced by using an electrolarynx into more naturally sounding singing voices. Singing by using the electrolarynx is less flexible because the pitch of EL speech is determined by the source excitation signal mechanically produced by the electrolarynx, and therefore, it is necessary to embed melodies of songs to be sung in advance to the electrolarynx. In addition, sound quality of singing voices produced by the electrolarynx is severely degraded by an adverse effect of its mechanical excitation sounds emitted outside as noise. To address these problems, the proposed conversion method uses 1) pitch control by playing a musical instrument and 2) noise suppression. In the pitch control, pitch patterns of music sounds played simultaneously in singing with the electrolaryx are modified so that they have specific characteristics usually observed in singing voices, and then, the modified pitch patterns are used as the target pitch patterns in the conversion from EL speech into singing voices. In the noise suppression, spectral subtraction is used to suppress the leaked excitation sounds. The experimental results demonstrate that 1) naturalness of singing voices is significantly improved by the noise suppression and 2) the pitch pattern modification is not necessarily effective in the conversion from EL speech into singing voices.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114162684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282044
Title: "Sliced voxel representations with LSTM and CNN for 3D shape recognition"
R. Miyagi, Masaki Aono
We propose a sliced voxel representation, which we call Sliced Square Voxels (SSV), based on an LSTM (Long Short-Term Memory) network and a CNN (Convolutional Neural Network) for three-dimensional shape recognition. Given an arbitrary 3D model, we first convert it into a binary voxel grid of size 32×32×32. Then, after fixing a view position, we slice the binary voxel data vertically along the depth direction. A CNN is applied to each slice to exploit its 2D projected shape information, and the CNN outputs are fed into an LSTM; the key idea is that the LSTM can exploit the spatial ordering of the slices. Our experiments show that the proposed method is superior to a 3D-CNN baseline we prepared, and it also outperforms related previous methods on the large-scale 3D model datasets ModelNet10 and ModelNet40.
{"title":"Sliced voxel representations with LSTM and CNN for 3D shape recognition","authors":"R. Miyagi, Masaki Aono","doi":"10.1109/APSIPA.2017.8282044","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282044","url":null,"abstract":"We propose a sliced voxel representation, which we call Sliced Square Voxels (SSV), based on LSTM (Long Short-Term Memory) and CNN (Convolutional Neural Network) for three-dimensional shape recognition. Given an arbitrary 3D model, we first convert it into binary voxel of size 32×32×32. Then, after a view position is fixed, we slice the binary voxel data vertically in the depth direction. To utilize the 2D projected shape information of the sliced voxels, CNN has been applied. The output of CNN is fed into LSTM, which is our main idea, where the spatial topology is supposed to be favored with LSTM. From our experiments, our proposed method turns out to be superior to the baseline method which we prepared using 3DCNN. We further compared with related previous methods, using large-scale 3D model dataset (ModelNet10 and ModelNet40), and our proposed methods outperformed them.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124067827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282315
Title: "Speech emotion recognition using convolutional long short-term memory neural network and support vector machines"
Nattapong Kurpukdee, Tomoki Koriyama, Takao Kobayashi, S. Kasuriya, C. Wutiwiwatchai, P. Lamsrichan
In this paper, we propose a speech emotion recognition technique that uses a convolutional long short-term memory recurrent neural network (ConvLSTM-RNN) as a phoneme-based feature extractor operating on the raw input speech signal. The ConvLSTM-RNN outputs phoneme-based emotion probabilities for every frame of an input utterance. These probabilities are then converted into utterance-level statistical features and used as input to a support vector machine (SVM) or linear discriminant analysis (LDA) classifier that predicts the utterance-level emotion. To assess the effectiveness of the proposed technique, we conducted classification experiments on four emotions (anger, happiness, sadness, and neutral) on the IEMOCAP database. The results show that the proposed technique, with either the SVM or the LDA classifier, outperforms the conventional ConvLSTM-based approach.
{"title":"Speech emotion recognition using convolutional long short-term memory neural network and support vector machines","authors":"Nattapong Kurpukdee, Tomoki Koriyama, Takao Kobayashi, S. Kasuriya, C. Wutiwiwatchai, P. Lamsrichan","doi":"10.1109/APSIPA.2017.8282315","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282315","url":null,"abstract":"In this paper, we propose a speech emotion recognition technique using convolutional long short-term memory (LSTM) recurrent neural network (ConvLSTM-RNN) as a phoneme-based feature extractor from raw input speech signal. In the proposed technique, ConvLSTM-RNN outputs phoneme- based emotion probabilities to every frame of an input utterance. Then these probabilities are converted into statistical features of the input utterance and used for the input features of support vector machines (SVMs) or linear discriminant analysis (LDA) system to classify the utterance-level emotions. To assess the effectiveness of the proposed technique, we conducted experiments in the classification of four emotions (anger, happiness, sadness, and neutral) on IEMOCAP database. The result showed that the proposed technique with either of SVM or LDA classifier outperforms the conventional ConvLSTM-based one.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128006101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282008
Title: "Nonuniform sampling theorems for random signals in the offset linear canonical transform domain"
Y. Bao, Yan-Na Zhang, Yu-E. Song, Bingzhao Li, P. Dang
With the rapid development of the offset linear canonical transform (OLCT) in optics and signal processing, it is necessary to consider nonuniform sampling associated with the OLCT. The analysis and applications of nonuniform sampling of deterministic signals in the OLCT domain have been well studied, but no results on reconstructing random signals from nonuniform samples in the OLCT domain have been reported so far. In this paper, the nonuniform sampling and reconstruction of random signals in the OLCT domain are investigated. First, a brief introduction to the fundamentals of the OLCT and to some special nonuniform sampling models is given. Then, reconstruction theorems for random signals from nonuniform samples in the OLCT domain are derived for the different nonuniform sampling models. Finally, simulation results are given to verify the accuracy of the theoretical results.
{"title":"Nonuniform sampling theorems for random signals in the offset linear canonical transform domain","authors":"Y. Bao, Yan-Na Zhang, Yu-E. Song, Bingzhao Li, P. Dang","doi":"10.1109/APSIPA.2017.8282008","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282008","url":null,"abstract":"With the rapid development of the offset linear canonical transform (OLCT) in the fields of optics and signal processing, it is necessary to consider the nonuniform sampling associated with the OLCT. Nowadays, the analysis and applications of the nonuniform sampling for deterministic signals in the OLCT domain have been well published and studied. However, none of the results about the reconstruction of nonuniform sampling for random signals in the OLCT domain have been proposed until now. In this paper, the nonuniform sampling and reconstruction of random signals in the OLCT domain are investigated. Firstly, a brief introduction to the fundamental knowledge of the OLCT and some special nonuniform sampling models are given. Then, the reconstruction theorems for random signals from nonuniform samples in the OLCT domain have been derived for different nonuniform sampling models. Finally, the simulation results are given to verify the accuracy of theoretical results.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128080719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282168
Title: "A new pool control method for Boolean compressed sensing based adaptive group testing"
Yujia Lu, K. Hayashi
In adaptive group testing, the pool (the set of items to be tested) used in the next test is determined from past test results, and performance depends heavily on how the pool is controlled. This paper proposes a new pool control method for adaptive group testing based on Boolean compressed sensing. The proposed method first selects the pool size for the next test by minimizing the expected (approximate) number of tests required after the next test, based on the estimated number of remaining positive items. If the selected pool size is one, the item with the highest probability of being positive is selected as the pool; otherwise, a pool of the selected size is constructed by selecting items at random. In addition, we propose a new method for estimating the number of positive items that can run in parallel with the proposed pool control method. Computer simulations show that adaptive group testing with the proposed method outperforms conventional methods, both with and without prior knowledge of the number of positive items.
{"title":"A new pool control method for Boolean compressed sensing based adaptive group testing","authors":"Yujia Lu, K. Hayashi","doi":"10.1109/APSIPA.2017.8282168","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282168","url":null,"abstract":"In the adaptive group testing, the pool (a set of items to be tested) used in the next test is determined based on past test results, and its performance heavily depends on the control method of the pool. This paper proposes a new pool control method for Boolean compressed sensing based adaptive group testing. The proposed method firstly selects a pool size of the next test by minimizing the expectation of the approximated required number of tests after the next test based on the estimated number of remaining positive items. Then, when the selected pool size is one, an item having the highest probability of being positive will be selected as a pool, otherwise a pool with the selected size will be constructed by randomly selecting items. In addition, a new cardinality estimation method of positive items, that can be implemented in parallel with the proposed pool control method, is also proposed. Computer simulation results reveal that the adaptive group testing with the proposed method has better performance than that with the conventional methods for both with and without the information of cardinality of positive items.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125476602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2017-12-01. DOI: 10.1109/APSIPA.2017.8282279
Title: "Improving N-gram language modeling for code-switching speech recognition"
Zhiping Zeng, Haihua Xu, Tze Yuang Chong, Chng Eng Siong, Haizhou Li
Code-switching language modeling is challenging because the statistics of each individual language, as well as the cross-lingual statistics, are insufficient. To compensate for this statistical insufficiency, we propose a word-class n-gram language modeling approach in which only infrequent words are clustered into classes, while the most frequent words are kept as singleton classes. We first demonstrate the effectiveness of the proposed method, in terms of perplexity, on our English-Mandarin code-switching SEAME data. Compared with conventional word n-gram language models, as well as word-class n-gram language models in which the entire vocabulary is clustered, the proposed approach yields lower perplexity on our SEAME dev sets. Interpolating the word n-gram language models with the proposed word-class n-gram language models reduces perplexity further. We also built word-class n-gram language models from third-party text data with the proposed method and observed similar perplexity improvements on the SEAME dev sets when these models are interpolated with the word n-gram language models. Finally, to examine the contribution of the proposed language modeling approach to code-switching speech recognition, we conducted lattice-based n-best rescoring.
{"title":"Improving N-gram language modeling for code-switching speech recognition","authors":"Zhiping Zeng, Haihua Xu, Tze Yuang Chong, Chng Eng Siong, Haizhou Li","doi":"10.1109/APSIPA.2017.8282279","DOIUrl":"https://doi.org/10.1109/APSIPA.2017.8282279","url":null,"abstract":"Code-switching language modeling is challenging due to statistics of each individual language, as well as statistics of cross-lingual language are insufficient. To compensate for the issue of statistical insufficiency, in this paper we propose a word-class n-gram language modeling approach of which only infrequent words are clustered while most frequent words are treated as singleton classes themselves. We first demonstrate the effectiveness of the proposed method on our English-Mandarin code-switching SEAME data in terms of perplexity. Compared with the conventional word n-gram language models, as well as the word-class n-gram language models of which entire vocabulary words are clustered, the proposed word-class n- gram language modeling approach can yield lower perplexity on our SEAME dev data sets. Additionally, we observed further perplexity reduction by interpolating the word n-gram language models with the proposed word-class n-gram language models. We also attempted to build word-class n-gram language models using third-party text data with our proposed method, and similar perplexity performance improvement was obtained on our SEAME dev data sets when they are interpolated with the word n-gram language models. Finally, to examine the contribution of the proposed language modeling approach to code-switching speech recognition, we conducted lattice based n-best rescoring.","PeriodicalId":142091,"journal":{"name":"2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132007961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}