Catastrophic Forgetting Avoidance Method for a Classification Model by Model Synthesis and Introduction of Background Data
Hirayama Akari, Kimura Masaomi
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980154
Animals, including humans, continuously acquire knowledge and skills throughout their lives. However, many machine learning models cannot learn new tasks without forgetting past knowledge. In neural networks, it is common to train a single network on successive tasks, and successive training reduces the accuracy on previous tasks. This problem is called catastrophic forgetting, and research on continual learning is being conducted to solve it. In this paper, we propose a method to reduce catastrophic forgetting in which new tasks are trained without retaining the data used for previous tasks. Our method assumes that the tasks are classification tasks. It combines models trained separately on each task and adds random background data to the training data so that each model does not over-generalize into regions of the input domain where no training data exist. In evaluation experiments, we confirmed that our method reduces forgetting on an original two-dimensional dataset and on the MNIST dataset.
{"title":"Catastrophic forgetting avoidance method for a Classification Model by Model Synthesis and Introduction of Background Data","authors":"Hirayama Akari, Kimura Masaomi","doi":"10.23919/APSIPAASC55919.2022.9980154","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980154","url":null,"abstract":"Animals including humans, continuously acquire knowledge and skills throughout their lives. However, many machine learning models cannot learn new tasks without forgetting past knowledge. In neural networks, it is common to use one neural network for each training task, and successive training will reduce the accuracy of the previous task. This problem is called catastrophic forgetting, and research on continual learning is being conducted to solve it. In this paper, we proposed a method to reducing catastrophic forgetting, where new tasks are trained without retaining previously trained data. Our method assumes that tasks are classification. Our method adds random data to the training data in order to combine models trained on different tasks to avoid exceed generalization in the domain where train data do not exist combines models separately trained for each tasks. In the evaluation experiments, we confirmed that our method reduced forgetting for the original two-dimensional dataset and MNIST dataset.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128757045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective ASR Error Correction Leveraging Phonetic, Semantic Information and N-best Hypotheses
Hsin-Wei Wang, Bi-Cheng Yan, Yi-Cheng Wang, Berlin Chen
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979951
Automatic speech recognition (ASR) has recently achieved remarkable success and reached human parity, thanks to synergistic breakthroughs in neural model architectures and training algorithms. However, the performance of ASR in many real-world use cases is still far from perfect. There has been a surge of research interest in designing feasible post-processing modules that improve recognition performance by refining ASR output sentences; these fall roughly into two categories. The first category is ASR N-best hypothesis reranking, which aims to find the oracle hypothesis with the lowest word error rate from a given N-best hypothesis list. The other category takes inspiration from, for example, Chinese spelling correction (CSC) or English spelling correction (ESC), seeking to detect and correct text-level errors in ASR output sentences. In this paper, we integrate the above two methods into an ASR error correction (AEC) module and explore the impact of different kinds of features on AEC. Empirical experiments on the widely used AISHELL-1 dataset show that our proposed method significantly reduces the word error rate (WER) of the baseline ASR transcripts in comparison with some top-of-the-line AEC methods, demonstrating its effectiveness and practical feasibility.
{"title":"Effective ASR Error Correction Leveraging Phonetic, Semantic Information and N-best hypotheses","authors":"Hsin-Wei Wang, Bi-Cheng Yan, Yi-Cheng Wang, Berlin Chen","doi":"10.23919/APSIPAASC55919.2022.9979951","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979951","url":null,"abstract":"Automatic speech recognition (ASR) has recently achieved remarkable success and reached human parity, thanks to the synergistic breakthroughs in neural model architectures and training algorithms. However, the performance of ASR in many real-world use cases is still far from perfect. There has been a surge of research interest in designing and developing feasible post-processing modules to improve recognition performance by refining ASR output sentences, which fall roughly into two categories. The first category of methods is ASR N-best hypothesis reranking. ASR N-best hypothesis reranking aims to find the oracle hypothesis with the lowest word error rate from a given N-best hypothesis list. The other category of methods take inspiration from, for example, Chinese spelling correction (CSC) or English spelling correction (ESC), seeking to detect and correct text-level errors of ASR output sentences. In this paper, we attempt to integrate the above two methods into the ASR error correction (AEC) module and explore the impact of different kinds of features on AEC. Empirical experiments on the widely-used AISHELL-l dataset show that our proposed method can significantly reduce the word error rate (WER) of the baseline ASR transcripts in relation to some top-of-line AEC methods, thereby demonstrating its effectiveness and practical feasibility.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128071793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASGAN-VC: One-Shot Voice Conversion with Additional Style Embedding and Generative Adversarial Networks
Weicheng Li, Tzer-jen Wei
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979975
In this paper, we present a voice conversion (VC) system that significantly improves the quality of the generated voice and its similarity to the target voice style. Many VC systems use feature-disentanglement-based learning techniques to separate a speaker's voice from its linguistic content in order to translate a voice into another style; this is the approach we take. To prevent speaker-style information from obscuring the content embedding, some previous works quantize or reduce the dimension of the embedding. However, an imperfect disentanglement damages the quality and similarity of the converted sound. In this paper, to further improve quality and similarity in voice conversion, we propose a novel style transfer method within an autoencoder-based VC system that involves generative adversarial training. The conversion process was objectively evaluated using a fair third-party speaker verification system; the results show that ASGAN-VC outperforms VQVC+ and AGAINVC in terms of speaker similarity. Subjective evaluation also shows that our proposal outperforms VQVC+ and AGAINVC in terms of naturalness and speaker similarity.
{"title":"ASGAN-VC: One-Shot Voice Conversion with Additional Style Embedding and Generative Adversarial Networks","authors":"Weicheng Li, Tzer-jen Wei","doi":"10.23919/APSIPAASC55919.2022.9979975","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979975","url":null,"abstract":"In this paper, we present a voice conversion system that improves the quality of generated voice and its similarity to the target voice style significantly. Many VC systems use feature-disentangle-based learning techniques to separate speakers' voices from their linguistic content in order to translate a voice into another style. This is the approach we are taking. To prevent speaker-style information from obscuring the content embedding, some previous works quantize or reduce the dimension of the embedding. However, an imperfect disentanglement would damage the quality and similarity of the sound. In this paper, to further improve quality and similarity in voice conversion, we propose a novel style transfer method within an autoencoder-based VC system that involves generative adversarial training. The conversion process was objectively evaluated using the fair third-party speaker verification system, the results shows that ASGAN-VC outperforms VQVC + and AGAINVC in terms of speaker similarity. A subjectively observing that our proposal outperformed the VQVC + and AGAINVC in terms of naturalness and speaker similarity.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"34 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113987836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Beamformer with Automatic Detection of Notable Sounds for Acoustic Scene Classification
Sota Ichikawa, Takeshi Yamada, S. Makino
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980351
Recently, acoustic scene classification using an acoustic beamformer applied to a multichannel input signal has been proposed. Generally, prior information such as the direction of arrival of a target sound is necessary to generate a spatial filter for beamforming. However, it is not clear which sound is notable (i.e., useful for classification) in each individual sound scene, and thus in which direction the target sound is located. It is therefore difficult to simply apply a beamformer for preprocessing. To solve this problem, we propose a method using a neural beamformer composed of the neural networks of a spatial filter generator and a classifier, which are optimized in an end-to-end manner. The aim of the proposed method is to automatically find a notable sound in each individual sound scene and generate a spatial filter that emphasizes it, without requiring any prior information such as the direction of arrival or a reference signal of the target sound in either training or testing. The method uses four types of loss functions: one for classification, and the remaining three for beamforming, which help in obtaining a clear directivity pattern. To evaluate the performance of the proposed method, we conducted an experiment on classifying two scenes: one in which a male is speaking under noise and one in which a female is speaking under noise. The experimental results showed that the segmental SNR averaged over all the test data was improved by 10.7 dB. This indicates that the proposed method can successfully find speech as a notable sound in this classification task and generate a spatial filter that emphasizes it.
{"title":"Neural Beamformer with Automatic Detection of Notable Sounds for Acoustic Scene Classification","authors":"Sota Ichikawa, Takeshi Yamada, S. Makino","doi":"10.23919/APSIPAASC55919.2022.9980351","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980351","url":null,"abstract":"Recently, acoustic scene classification using an acoustic beamformer that is applied to a multichannel input signal has been proposed. Generally, prior information such as the direction of arrival of a target sound is necessary to generate a spatial filter for beamforming. However, it is not clear which sound is notable (i.e., useful for classification) in each individual sound scene and thus in which direction the target sound is located. It is therefore difficult to simply apply a beamformer for preprocessing. To solve this problem, we propose a method using a neural beamformer composed of the neural networks of a spatial filter generator and a classifier, which are optimized in an end-to-end manner. The aim of the proposed method is to automatically find a notable sound in each individual sound scene and generate a spatial filter to emphasize that notable sound, without requiring any prior information such as the direction of arrival and the reference signal of the target sound in both training and testing. The loss functions used in the proposed method are of four types: one is for classification and the remaining loss functions are for beamforming that help in obtaining a clear directivity pattern. To evaluate the performance of the proposed method, we conducted an experiment on classifying two scenes: one is a scene where a male is speaking under noise and another is a scene where a female is speaking under noise. The experimental results showed that the segmental SNR averaged over all the test data was improved by 10.7 dB. This indicates that the proposed method could successfully find speech as a notable sound in this classification task and generate the spatial filter to emphasize it.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114038260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DNN-Rule Hybrid Dyna-Q for Sample-Efficient Task-Oriented Dialog Policy Learning
Mingxin Zhang, T. Shinozaki
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980344
Reinforcement learning (RL) is a powerful strategy for building a flexible task-oriented dialog agent, but its learning speed is slow. Deep Dyna-Q augments the agent's experience to improve learning efficiency by internally simulating the user's behavior. It uses a deep neural network (DNN) based learnable user model to predict the user action, reward, and dialog termination from the dialog state and the agent's action. However, it still needs many agent-user interactions to train the user model. We propose a DNN-Rule hybrid user model for Dyna-Q, in which the DNN simulates only the user action, while a rule-based function infers the reward and the dialog termination. We also investigate training with rollout to further enhance learning efficiency. Experiments on a movie-ticket booking task demonstrate that our approach significantly improves learning efficiency.
{"title":"DNN-Rule Hybrid Dyna-Q for Sample-Efficient Task-Oriented Dialog Policy Learning","authors":"Mingxin Zhang, T. Shinozaki","doi":"10.23919/APSIPAASC55919.2022.9980344","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980344","url":null,"abstract":"Reinforcement learning (RL) is a powerful strategy for making a flexible task-oriented dialog agent, but it is weak in learning speed. Deep Dyna-Q augments the agent's experience to improve the learning efficiency by internally simulating the user's behavior. It uses a deep neural network (DNN) based learnable user model to predict user action, reward, and dialog termination from the dialog state and the agent's action. However, it still needs many agent-user interactions to train the user model. We propose a DNN-Rule hybrid user model for Dyna-Q, where the DNN only simulates the user action. Instead, a rule-based function infers the reward and the dialog termination. We also investigate the training with rollout to further enhance the learning efficiency. Experiments on a movie-ticket booking task demonstrate that our approach significantly improves learning efficiency.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133112838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design of Optimal FIR Digital Filter by Swarm Optimization Technique
Jin Wu, Yaqiong Gao, L. Yang, Zhengdong Su
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980121
Finite Impulse Response (FIR) digital filters are widely used in digital signal processing and other engineering fields because of their inherent stability and linear phase. To address the low accuracy and weak optimization ability of traditional digital filter design methods, this paper applies the recently proposed Grey Wolf Optimization (GWO) algorithm to the design of a linear-phase FIR filter: the optimal transition-band sample value in the frequency sampling method is sought so as to maximize the stop-band attenuation and thus improve the performance of the filter. The algorithm is further improved by embedding Lévy Flight (LF), yielding the modified Lévy-embedded GWO (LGWO). Finally, the performance of the traditional frequency sampling method and of the GWO and LGWO optimization algorithms is compared. When the number of sampling points is 65 and 97, the stop-band attenuation obtained by LGWO is improved by 0.2029 dB and 0.2454 dB, respectively, compared with the GWO algorithm, demonstrating the better performance of LGWO.
{"title":"Design of Optimal FIR Digital Filter by Swarm Optimization Technique","authors":"Jin Wu, Yaqiong Gao, L. Yang, Zhengdong Su","doi":"10.23919/APSIPAASC55919.2022.9980121","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980121","url":null,"abstract":"Finite Impulse Response (FIR) digital filters are widely used in digital signal processing and other engineering because of their strict stability and linear phase. Aiming at the problems of low accuracy and weak optimization ability of traditional method to design digital filter, the newly proposed Grey Wolf Optimization (GWO) algorithm is used in this paper to design a linear-phase FIR filter to obtain the optimal transition-band sample value in the frequency sampling method to obtain the minimum stop-band attenuation, so as to improve the performance of the filter. And improved by embedding Lévy Flight (LF), which is the modified Lévy-embedded GWO (LGWO). Finally, the performance of traditional frequency sampling methods and optimization algorithms GWO and LGWO are compared. When the number of sampling points is 65 and 97, the stopband attenuation of LGWO is improved by 0.2029 dB and 0.2454 dB respectively compared with GWO algorithm. The better performance of LGWO is shown in the results.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133292263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quality Enhancement of Screen Content Video using Dual-input CNN
Ziyin Huang, Yue Cao, Sik-Ho Tsang, Yui-Lam Chan, K. Lam
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979969
In recent years, video quality enhancement techniques have made significant breakthroughs, moving from traditional methods such as the deblocking filter (DF) and sample adaptive offset (SAO) to deep learning-based approaches. While screen content coding (SCC) has become an important extension of High Efficiency Video Coding (HEVC), existing approaches mainly focus on improving the quality of natural sequences in HEVC rather than the screen content (SC) sequences in SCC. Therefore, we propose a dual-input model for quality enhancement in SCC. One input is the main branch, which takes the image; the other is the mask branch, which takes side information extracted from the coded bitstream. Specifically, the mask branch is designed so that the coding unit (CU) information and the mode information are utilized as input, assisting the convolutional network in the main branch to further improve the video quality and thereby the coding efficiency. Moreover, due to the limited number of SC videos, a new SCC dataset, namely PolyUSCC, is established. With our proposed dual-input technique, compared with conventional SCC, BD-rates are further reduced by 3.81% and 3.07% when adding our mask branch onto two state-of-the-art models, DnCNN and DCAD, respectively.
{"title":"Quality Enhancement of Screen Content Video using Dual-input CNN","authors":"Ziyin Huang, Yue Cao, Sik-Ho Tsang, Yui-Lam Chan, K. Lam","doi":"10.23919/APSIPAASC55919.2022.9979969","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979969","url":null,"abstract":"In recent years, the video quality enhancement techniques have made a significant breakthrough, from the traditional methods, such as deblocking filter (DF) and sample additive offset (SAO), to deep learning-based approaches. While screen content coding (SCC) has become an important extension in High Efficiency Video Coding (HEVC), the existing approaches mainly focus on improving the quality of natural sequences in HEVC, not the screen content (SC) sequences in SCC. Therefore, we proposed a dual-input model for quality enhancement in SCC. One is the main branch with the image as input. Another one is the mask branch with side information extracted from the coded bitstream. Specifically, a mask branch is designed so that the coding unit (CU) information and the mode information are utilized as input, to assist the convolutional network at the main branch to further improve the video quality thereby the coding efficiency. Moreover, due to the limited number of SC videos, a new SCC dataset, namely PolyUSCC, is established. With our proposed dual-input technique, compared with the conventional SCC, BD-rates are further reduced 3.81% and 3.07%, by adding our mask branch onto two state-of-the-art models, DnCNN and DCAD, respectively.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123913961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech Intelligibility Prediction for Hearing Aids Using an Auditory Model and Acoustic Parameters
Benita Angela Titalim, Candy Olivia Mawalim, S. Okada, M. Unoki
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980000
Objective speech intelligibility (SI) metrics for hearing-impaired people play an important role in hearing aid development. Work on improving SI prediction also became the basis of the first Clarity Prediction Challenge (CPC1). This study investigates a physiological auditory model called EarModel together with acoustic parameters for SI prediction. EarModel is utilized because it provides advantages in estimating human hearing, both normal and impaired. The hearing-impaired condition is simulated in EarModel based on audiograms; thus, the SI perceived by hearing-impaired people is predicted more accurately. Moreover, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and WavLM are included as additional acoustic parameters for estimating the difficulty levels of given utterances, to achieve improved prediction accuracy. The proposed method is evaluated on the CPC1 database. The results show that the proposed method improves on the SI prediction performance of the baseline and of the hearing aid speech prediction index (HASPI). Additionally, an ablation test shows that incorporating eGeMAPS and WavLM contributes significantly to the prediction model, increasing the Pearson correlation coefficient by more than 15% and decreasing the root-mean-square error (RMSE) by more than 10.00 in both the closed-set and open-set tracks.
{"title":"Speech Intelligibility Prediction for Hearing Aids Using an Auditory Model and Acoustic Parameters","authors":"Benita Angela Titalim, Candy Olivia Mawalim, S. Okada, M. Unoki","doi":"10.23919/APSIPAASC55919.2022.9980000","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980000","url":null,"abstract":"Objective speech intelligibility (SI) metrics for hearing-impaired people play an important role in hearing aid development. The work on improving SI prediction also became the basis of the first Clarity Prediction Challenge (CPC1). This study investigates a physiological auditory model called EarModel and acoustic parameters for SI prediction. EarModel is utilized because it provides advantages in estimating human hearing, both normal and impaired. The hearing-impaired condition is simulated in EarModel based on audiograms; thus, the SI perceived by hearing-impaired people is more accurately predicted. Moreover, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and WavLM, as additional acoustic parameters for estimating the difficulty levels of given utterances, are included to achieve improved prediction accuracy. The proposed method is evaluated on the CPC1 database. The results show that the proposed method improves the SI prediction effects of the baseline and hearing aid speech prediction index (HASPI). Additionally, an ablation test shows that incorporating the eGeMAPS and WavLM can significantly contribute to the prediction model by increasing the Pearson correlation coefficient by more than 15% and decreasing the root-mean-square error (RMSE) by more than 10.00 in both closed-set and open-set tracks.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124028029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Obstructive Sleep Apnea Classification Using Snore Sounds Based on Deep Learning
Apichada Sillaparaya, A. Bhatranand, Chudanat Sudthongkong, K. Chamnongthai, Y. Jiraraksopakun
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979938
Early screening for Obstructive Sleep Apnea (OSA), especially at the first grade of the Apnea-Hypopnea Index (AHI), can reduce risk and improve the effectiveness of timely treatment. The current gold-standard technique for OSA diagnosis is polysomnography (PSG), but it must be performed in a specialized laboratory by an expert and requires many sensors attached to the patient. Hence, it is costly and inconvenient for self-testing. Characteristics of snore sounds have recently been used to screen for OSA and are likely to identify abnormal breathing conditions. Therefore, this study proposes a deep learning model to classify OSA based on snore sounds. The snore sound data of 5 OSA patients were selected from the open-source PSG-Audio data of the Sleep Study Unit of the Sismanoglio-Amalia Fleming General Hospital of Athens [1]. 2,439 snoring and breathing-related sound segments were extracted and divided into 3 groups: 1,020 normal snore sounds, 1,185 apnea or hypopnea snore sounds, and 234 non-snore sounds. All sound segments were split into 60% training, 20% validation, and 20% test sets. The mean of the Mel-Frequency Cepstral Coefficients (MFCC) of each sound segment was computed as the feature input of the deep learning model. Three fully connected layers were used in the model to classify segments into three groups: (1) normal snore sounds, (2) abnormal (apnea or hypopnea) snore sounds, and (3) non-snore sounds. The results showed that the model achieved a classification accuracy of 85.2459%. The model is therefore promising for using snore sounds to screen for OSA.
{"title":"Obstructive Sleep Apnea Classification Using Snore Sounds Based on Deep Learning","authors":"Apichada Sillaparaya, A. Bhatranand, Chudanat Sudthongkong, K. Chamnongthai, Y. Jiraraksopakun","doi":"10.23919/APSIPAASC55919.2022.9979938","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979938","url":null,"abstract":"Early screening for the Obstructive Sleep Apnea (OSA), especially the first grade of Apnea-Hypopnea Index (AHI), can reduce risk and improve the effectiveness of timely treatment. The current gold standard technique for OSA diagnosis is Polysomnography (PSG), but the technique must be performed in a specialized laboratory with an expert and requires many sensors attached to a patient. Hence, it is costly and may not be convenient for a self-test by the patient. The characteristic of snore sounds has recently been used to screen the OSA and more likely to identify the abnormality of breathing conditions. Therefore, this study proposes a deep learning model to classify the OSA based on snore sounds. The snore sound data of 5 OSA patients were selected from the opened-source PSG- Audio data by the Sleep Study Unit of the Sismanoglio-Amalia Fleming General Hospital of Athens [1]. 2,439 snoring and breathing-related sound segments were extracted and divided into 3 groups of 1,020 normal snore sounds, 1,185 apnea or hypopnea snore sounds, and 234 non-snore sounds. All sound segments were separated into 60% training, 20% validation, and 20% test sets, respectively. The mean of Mel-Frequency Cepstral Coefficients (MFCC) of a sound segment were computed as the feature inputs of the deep learning model. Three fully connected layers were used in this deep learning model to classify into three groups as (1) normal snore sounds, (2) abnormal (apnea or hypopnea) snore sounds, and (3) non-snore sounds. The result showed that the model was able to correctly classify 85.2459%. Therefore, the model is promising to use snore sounds for screening OSA.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121203955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of Cognitive Test Results Using Concentration Estimation from Facial Videos
Terumi Umematsu, M. Tsujikawa, H. Sawada
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980211
In this paper, we propose a method of discriminating between concentration and non-concentration on the basis of facial videos, and we confirm the usefulness of excluding cognitive test results obtained when a user has not been concentrating. In a preliminary experiment, we confirmed that the level of concentration has a strong impact on correct answer rates in memory tests. Our proposed concentration/non-concentration discrimination method uses 15 features extracted from facial videos, including blinking, gazing, and facial expressions (Action Units), and discriminates between concentration and non-concentration using binary labels derived from subjectively rated concentration levels. In the preliminary experiment, memory test scores during non-concentration states were lower than those during concentration states by an average of 18%. This difference has usually been included as measurement error; by excluding scores obtained during non-concentration states with the proposed method, the measurement error was reduced to 4%. The proposed method is thus capable of obtaining test results that reflect true cognitive function when people are concentrating, making possible a more accurate understanding of cognitive functions.
{"title":"Evaluation of Cognitive Test Results Using Concentration Estimation from Facial Videos","authors":"Terumi Umematsu, M. Tsujikawa, H. Sawada","doi":"10.23919/APSIPAASC55919.2022.9980211","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980211","url":null,"abstract":"In this paper, we propose a method of discriminating between concentration and non-concentration on the basis of facial videos, and we confirm the usefulness of excluding cognitive test results when a user has not been concentrating. In a preliminary experiment, we have confirmed that level of concentration has a strong impact on correct answer rates in memory tests. Our proposed concentration/non-concentration discrimination method uses 15 features extracted from facial videos, including blinking, gazing, and facial expressions (Action Units), and discriminates between concentration and non-concentration, which are reflected in terms of a binary correct answer label set based on subjectively rated concentration levels. In the preliminary experiment, memory test scores during non-concentration states were lower than those during concentration states by an average of 18%. This has usually been included as measurement error, and, by excluding scores during non-concentration states using the proposed method, measurement error was reduced to 4%. The proposed method is shown to be capable of obtaining test results that indicate true cognitive functions when people are concentrating, making possible a more accurate understanding of cognitive functions.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121336943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}