Catastrophic Forgetting Avoidance Method for a Classification Model by Model Synthesis and Introduction of Background Data
Hirayama Akari, Kimura Masaomi
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980154
Animals, including humans, continuously acquire knowledge and skills throughout their lives. However, many machine learning models cannot learn new tasks without forgetting past knowledge. In neural networks, it is common to train a single network on successive tasks, and successive training reduces the accuracy on previous tasks. This problem is called catastrophic forgetting, and research on continual learning is being conducted to solve it. In this paper, we propose a method to reduce catastrophic forgetting in which new tasks are trained without retaining the data used for previous tasks. Our method assumes that the tasks are classification tasks. It combines models trained separately on each task and adds random background data to the training data so that each model does not over-generalize into regions of the input domain where no training data exist. In evaluation experiments, we confirmed that our method reduces forgetting on an original two-dimensional dataset and on the MNIST dataset.
{"title":"Catastrophic forgetting avoidance method for a Classification Model by Model Synthesis and Introduction of Background Data","authors":"Hirayama Akari, Kimura Masaomi","doi":"10.23919/APSIPAASC55919.2022.9980154","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980154","url":null,"abstract":"Animals including humans, continuously acquire knowledge and skills throughout their lives. However, many machine learning models cannot learn new tasks without forgetting past knowledge. In neural networks, it is common to use one neural network for each training task, and successive training will reduce the accuracy of the previous task. This problem is called catastrophic forgetting, and research on continual learning is being conducted to solve it. In this paper, we proposed a method to reducing catastrophic forgetting, where new tasks are trained without retaining previously trained data. Our method assumes that tasks are classification. Our method adds random data to the training data in order to combine models trained on different tasks to avoid exceed generalization in the domain where train data do not exist combines models separately trained for each tasks. In the evaluation experiments, we confirmed that our method reduced forgetting for the original two-dimensional dataset and MNIST dataset.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128757045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective ASR Error Correction Leveraging Phonetic, Semantic Information and N-best Hypotheses
Hsin-Wei Wang, Bi-Cheng Yan, Yi-Cheng Wang, Berlin Chen
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979951
Automatic speech recognition (ASR) has recently achieved remarkable success and reached human parity, thanks to synergistic breakthroughs in neural model architectures and training algorithms. However, the performance of ASR in many real-world use cases is still far from perfect. There has been a surge of research interest in designing feasible post-processing modules that improve recognition performance by refining ASR output sentences; these fall roughly into two categories. The first category is ASR N-best hypothesis reranking, which aims to find the oracle hypothesis with the lowest word error rate from a given N-best hypothesis list. The other category takes inspiration from, for example, Chinese spelling correction (CSC) or English spelling correction (ESC), seeking to detect and correct text-level errors in ASR output sentences. In this paper, we integrate the above two methods into an ASR error correction (AEC) module and explore the impact of different kinds of features on AEC. Empirical experiments on the widely used AISHELL-1 dataset show that our proposed method significantly reduces the word error rate (WER) of the baseline ASR transcripts in comparison with some top-of-the-line AEC methods, demonstrating its effectiveness and practical feasibility.
{"title":"Effective ASR Error Correction Leveraging Phonetic, Semantic Information and N-best hypotheses","authors":"Hsin-Wei Wang, Bi-Cheng Yan, Yi-Cheng Wang, Berlin Chen","doi":"10.23919/APSIPAASC55919.2022.9979951","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979951","url":null,"abstract":"Automatic speech recognition (ASR) has recently achieved remarkable success and reached human parity, thanks to the synergistic breakthroughs in neural model architectures and training algorithms. However, the performance of ASR in many real-world use cases is still far from perfect. There has been a surge of research interest in designing and developing feasible post-processing modules to improve recognition performance by refining ASR output sentences, which fall roughly into two categories. The first category of methods is ASR N-best hypothesis reranking. ASR N-best hypothesis reranking aims to find the oracle hypothesis with the lowest word error rate from a given N-best hypothesis list. The other category of methods take inspiration from, for example, Chinese spelling correction (CSC) or English spelling correction (ESC), seeking to detect and correct text-level errors of ASR output sentences. In this paper, we attempt to integrate the above two methods into the ASR error correction (AEC) module and explore the impact of different kinds of features on AEC. Empirical experiments on the widely-used AISHELL-l dataset show that our proposed method can significantly reduce the word error rate (WER) of the baseline ASR transcripts in relation to some top-of-line AEC methods, thereby demonstrating its effectiveness and practical feasibility.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128071793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASGAN-VC: One-Shot Voice Conversion with Additional Style Embedding and Generative Adversarial Networks
Weicheng Li, Tzer-jen Wei
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979975
In this paper, we present a voice conversion (VC) system that significantly improves the quality of the generated voice and its similarity to the target voice style. Many VC systems use feature-disentanglement-based learning techniques to separate a speaker's voice from its linguistic content in order to translate a voice into another style; this is the approach we take. To prevent speaker-style information from obscuring the content embedding, some previous works quantize or reduce the dimension of the embedding. However, an imperfect disentanglement damages the quality and similarity of the converted sound. In this paper, to further improve quality and similarity in voice conversion, we propose a novel style transfer method within an autoencoder-based VC system that involves generative adversarial training. The conversion process was objectively evaluated using a fair third-party speaker verification system; the results show that ASGAN-VC outperforms VQVC+ and AGAINVC in terms of speaker similarity. Subjective evaluation also shows that our proposal outperforms VQVC+ and AGAINVC in terms of naturalness and speaker similarity.
{"title":"ASGAN-VC: One-Shot Voice Conversion with Additional Style Embedding and Generative Adversarial Networks","authors":"Weicheng Li, Tzer-jen Wei","doi":"10.23919/APSIPAASC55919.2022.9979975","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979975","url":null,"abstract":"In this paper, we present a voice conversion system that improves the quality of generated voice and its similarity to the target voice style significantly. Many VC systems use feature-disentangle-based learning techniques to separate speakers' voices from their linguistic content in order to translate a voice into another style. This is the approach we are taking. To prevent speaker-style information from obscuring the content embedding, some previous works quantize or reduce the dimension of the embedding. However, an imperfect disentanglement would damage the quality and similarity of the sound. In this paper, to further improve quality and similarity in voice conversion, we propose a novel style transfer method within an autoencoder-based VC system that involves generative adversarial training. The conversion process was objectively evaluated using the fair third-party speaker verification system, the results shows that ASGAN-VC outperforms VQVC + and AGAINVC in terms of speaker similarity. A subjectively observing that our proposal outperformed the VQVC + and AGAINVC in terms of naturalness and speaker similarity.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"34 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113987836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Beamformer with Automatic Detection of Notable Sounds for Acoustic Scene Classification
Sota Ichikawa, Takeshi Yamada, S. Makino
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980351
Recently, acoustic scene classification using an acoustic beamformer applied to a multichannel input signal has been proposed. Generally, prior information such as the direction of arrival of a target sound is necessary to generate a spatial filter for beamforming. However, it is not clear which sound is notable (i.e., useful for classification) in each individual sound scene, and thus in which direction the target sound is located. It is therefore difficult to simply apply a beamformer for preprocessing. To solve this problem, we propose a method using a neural beamformer composed of the neural networks of a spatial filter generator and a classifier, which are optimized in an end-to-end manner. The aim of the proposed method is to automatically find a notable sound in each individual sound scene and generate a spatial filter that emphasizes it, without requiring any prior information such as the direction of arrival or a reference signal of the target sound in either training or testing. The method uses four types of loss functions: one for classification, and the remaining three for beamforming, which help in obtaining a clear directivity pattern. To evaluate the performance of the proposed method, we conducted an experiment on classifying two scenes: one in which a male is speaking under noise and one in which a female is speaking under noise. The experimental results showed that the segmental SNR averaged over all the test data was improved by 10.7 dB. This indicates that the proposed method can successfully find speech as a notable sound in this classification task and generate a spatial filter that emphasizes it.
{"title":"Neural Beamformer with Automatic Detection of Notable Sounds for Acoustic Scene Classification","authors":"Sota Ichikawa, Takeshi Yamada, S. Makino","doi":"10.23919/APSIPAASC55919.2022.9980351","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980351","url":null,"abstract":"Recently, acoustic scene classification using an acoustic beamformer that is applied to a multichannel input signal has been proposed. Generally, prior information such as the direction of arrival of a target sound is necessary to generate a spatial filter for beamforming. However, it is not clear which sound is notable (i.e., useful for classification) in each individual sound scene and thus in which direction the target sound is located. It is therefore difficult to simply apply a beamformer for preprocessing. To solve this problem, we propose a method using a neural beamformer composed of the neural networks of a spatial filter generator and a classifier, which are optimized in an end-to-end manner. The aim of the proposed method is to automatically find a notable sound in each individual sound scene and generate a spatial filter to emphasize that notable sound, without requiring any prior information such as the direction of arrival and the reference signal of the target sound in both training and testing. The loss functions used in the proposed method are of four types: one is for classification and the remaining loss functions are for beamforming that help in obtaining a clear directivity pattern. To evaluate the performance of the proposed method, we conducted an experiment on classifying two scenes: one is a scene where a male is speaking under noise and another is a scene where a female is speaking under noise. The experimental results showed that the segmental SNR averaged over all the test data was improved by 10.7 dB. This indicates that the proposed method could successfully find speech as a notable sound in this classification task and generate the spatial filter to emphasize it.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114038260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DNN-Rule Hybrid Dyna-Q for Sample-Efficient Task-Oriented Dialog Policy Learning
Mingxin Zhang, T. Shinozaki
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980344
Reinforcement learning (RL) is a powerful strategy for building a flexible task-oriented dialog agent, but its learning speed is slow. Deep Dyna-Q augments the agent's experience to improve learning efficiency by internally simulating the user's behavior. It uses a deep neural network (DNN) based learnable user model to predict the user action, reward, and dialog termination from the dialog state and the agent's action. However, it still needs many agent-user interactions to train the user model. We propose a DNN-Rule hybrid user model for Dyna-Q, in which the DNN simulates only the user action, while a rule-based function infers the reward and the dialog termination. We also investigate training with rollout to further enhance learning efficiency. Experiments on a movie-ticket booking task demonstrate that our approach significantly improves learning efficiency.
{"title":"DNN-Rule Hybrid Dyna-Q for Sample-Efficient Task-Oriented Dialog Policy Learning","authors":"Mingxin Zhang, T. Shinozaki","doi":"10.23919/APSIPAASC55919.2022.9980344","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980344","url":null,"abstract":"Reinforcement learning (RL) is a powerful strategy for making a flexible task-oriented dialog agent, but it is weak in learning speed. Deep Dyna-Q augments the agent's experience to improve the learning efficiency by internally simulating the user's behavior. It uses a deep neural network (DNN) based learnable user model to predict user action, reward, and dialog termination from the dialog state and the agent's action. However, it still needs many agent-user interactions to train the user model. We propose a DNN-Rule hybrid user model for Dyna-Q, where the DNN only simulates the user action. Instead, a rule-based function infers the reward and the dialog termination. We also investigate the training with rollout to further enhance the learning efficiency. Experiments on a movie-ticket booking task demonstrate that our approach significantly improves learning efficiency.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133112838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design of Optimal FIR Digital Filter by Swarm Optimization Technique
Jin Wu, Yaqiong Gao, L. Yang, Zhengdong Su
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980121
Finite Impulse Response (FIR) digital filters are widely used in digital signal processing and other engineering fields because of their inherent stability and linear phase. To address the low accuracy and weak optimization ability of traditional digital filter design methods, this paper applies the recently proposed Grey Wolf Optimization (GWO) algorithm to the design of a linear-phase FIR filter: the optimal transition-band sample value in the frequency sampling method is sought so as to maximize the stop-band attenuation and thus improve the performance of the filter. The algorithm is further improved by embedding Lévy Flight (LF), yielding the modified Lévy-embedded GWO (LGWO). Finally, the performance of the traditional frequency sampling method and of the GWO and LGWO optimization algorithms is compared. When the number of sampling points is 65 and 97, the stop-band attenuation obtained by LGWO is improved by 0.2029 dB and 0.2454 dB, respectively, compared with the GWO algorithm, demonstrating the better performance of LGWO.
{"title":"Design of Optimal FIR Digital Filter by Swarm Optimization Technique","authors":"Jin Wu, Yaqiong Gao, L. Yang, Zhengdong Su","doi":"10.23919/APSIPAASC55919.2022.9980121","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980121","url":null,"abstract":"Finite Impulse Response (FIR) digital filters are widely used in digital signal processing and other engineering because of their strict stability and linear phase. Aiming at the problems of low accuracy and weak optimization ability of traditional method to design digital filter, the newly proposed Grey Wolf Optimization (GWO) algorithm is used in this paper to design a linear-phase FIR filter to obtain the optimal transition-band sample value in the frequency sampling method to obtain the minimum stop-band attenuation, so as to improve the performance of the filter. And improved by embedding Lévy Flight (LF), which is the modified Lévy-embedded GWO (LGWO). Finally, the performance of traditional frequency sampling methods and optimization algorithms GWO and LGWO are compared. When the number of sampling points is 65 and 97, the stopband attenuation of LGWO is improved by 0.2029 dB and 0.2454 dB respectively compared with GWO algorithm. The better performance of LGWO is shown in the results.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133292263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quality Enhancement of Screen Content Video using Dual-input CNN
Ziyin Huang, Yue Cao, Sik-Ho Tsang, Yui-Lam Chan, K. Lam
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979969
In recent years, video quality enhancement techniques have made significant breakthroughs, moving from traditional methods such as the deblocking filter (DF) and sample adaptive offset (SAO) to deep learning-based approaches. While screen content coding (SCC) has become an important extension of High Efficiency Video Coding (HEVC), existing approaches mainly focus on improving the quality of natural sequences in HEVC rather than the screen content (SC) sequences in SCC. Therefore, we propose a dual-input model for quality enhancement in SCC. One input is the main branch, which takes the image; the other is the mask branch, which takes side information extracted from the coded bitstream. Specifically, the mask branch is designed so that the coding unit (CU) information and the mode information are utilized as input, assisting the convolutional network in the main branch to further improve the video quality and thereby the coding efficiency. Moreover, due to the limited number of SC videos, a new SCC dataset, namely PolyUSCC, is established. With our proposed dual-input technique, compared with conventional SCC, BD-rates are further reduced by 3.81% and 3.07% when adding our mask branch onto two state-of-the-art models, DnCNN and DCAD, respectively.
{"title":"Quality Enhancement of Screen Content Video using Dual-input CNN","authors":"Ziyin Huang, Yue Cao, Sik-Ho Tsang, Yui-Lam Chan, K. Lam","doi":"10.23919/APSIPAASC55919.2022.9979969","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979969","url":null,"abstract":"In recent years, the video quality enhancement techniques have made a significant breakthrough, from the traditional methods, such as deblocking filter (DF) and sample additive offset (SAO), to deep learning-based approaches. While screen content coding (SCC) has become an important extension in High Efficiency Video Coding (HEVC), the existing approaches mainly focus on improving the quality of natural sequences in HEVC, not the screen content (SC) sequences in SCC. Therefore, we proposed a dual-input model for quality enhancement in SCC. One is the main branch with the image as input. Another one is the mask branch with side information extracted from the coded bitstream. Specifically, a mask branch is designed so that the coding unit (CU) information and the mode information are utilized as input, to assist the convolutional network at the main branch to further improve the video quality thereby the coding efficiency. Moreover, due to the limited number of SC videos, a new SCC dataset, namely PolyUSCC, is established. With our proposed dual-input technique, compared with the conventional SCC, BD-rates are further reduced 3.81% and 3.07%, by adding our mask branch onto two state-of-the-art models, DnCNN and DCAD, respectively.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123913961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech Intelligibility Prediction for Hearing Aids Using an Auditory Model and Acoustic Parameters
Benita Angela Titalim, Candy Olivia Mawalim, S. Okada, M. Unoki
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980000
Objective speech intelligibility (SI) metrics for hearing-impaired people play an important role in hearing aid development. Work on improving SI prediction also became the basis of the first Clarity Prediction Challenge (CPC1). This study investigates a physiological auditory model called EarModel together with acoustic parameters for SI prediction. EarModel is utilized because it provides advantages in estimating human hearing, both normal and impaired. The hearing-impaired condition is simulated in EarModel based on audiograms; thus, the SI perceived by hearing-impaired people is predicted more accurately. Moreover, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and WavLM are included as additional acoustic parameters for estimating the difficulty levels of given utterances, to achieve improved prediction accuracy. The proposed method is evaluated on the CPC1 database. The results show that the proposed method improves on the SI prediction performance of the baseline and of the hearing aid speech prediction index (HASPI). Additionally, an ablation test shows that incorporating eGeMAPS and WavLM contributes significantly to the prediction model, increasing the Pearson correlation coefficient by more than 15% and decreasing the root-mean-square error (RMSE) by more than 10.00 in both the closed-set and open-set tracks.
{"title":"Speech Intelligibility Prediction for Hearing Aids Using an Auditory Model and Acoustic Parameters","authors":"Benita Angela Titalim, Candy Olivia Mawalim, S. Okada, M. Unoki","doi":"10.23919/APSIPAASC55919.2022.9980000","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980000","url":null,"abstract":"Objective speech intelligibility (SI) metrics for hearing-impaired people play an important role in hearing aid development. The work on improving SI prediction also became the basis of the first Clarity Prediction Challenge (CPC1). This study investigates a physiological auditory model called EarModel and acoustic parameters for SI prediction. EarModel is utilized because it provides advantages in estimating human hearing, both normal and impaired. The hearing-impaired condition is simulated in EarModel based on audiograms; thus, the SI perceived by hearing-impaired people is more accurately predicted. Moreover, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and WavLM, as additional acoustic parameters for estimating the difficulty levels of given utterances, are included to achieve improved prediction accuracy. The proposed method is evaluated on the CPC1 database. The results show that the proposed method improves the SI prediction effects of the baseline and hearing aid speech prediction index (HASPI). Additionally, an ablation test shows that incorporating the eGeMAPS and WavLM can significantly contribute to the prediction model by increasing the Pearson correlation coefficient by more than 15% and decreasing the root-mean-square error (RMSE) by more than 10.00 in both closed-set and open-set tracks.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124028029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Obstructive Sleep Apnea Classification Using Snore Sounds Based on Deep Learning
Apichada Sillaparaya, A. Bhatranand, Chudanat Sudthongkong, K. Chamnongthai, Y. Jiraraksopakun
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9979938
Early screening for Obstructive Sleep Apnea (OSA), especially at the first grade of the Apnea-Hypopnea Index (AHI), can reduce risk and improve the effectiveness of timely treatment. The current gold-standard technique for OSA diagnosis is polysomnography (PSG), but it must be performed in a specialized laboratory by an expert and requires many sensors attached to the patient. Hence, it is costly and inconvenient for self-testing. Characteristics of snore sounds have recently been used to screen for OSA and are likely to identify abnormal breathing conditions. Therefore, this study proposes a deep learning model to classify OSA based on snore sounds. The snore sound data of 5 OSA patients were selected from the open-source PSG-Audio data of the Sleep Study Unit of the Sismanoglio-Amalia Fleming General Hospital of Athens [1]. 2,439 snoring and breathing-related sound segments were extracted and divided into 3 groups: 1,020 normal snore sounds, 1,185 apnea or hypopnea snore sounds, and 234 non-snore sounds. All sound segments were split into 60% training, 20% validation, and 20% test sets. The mean of the Mel-Frequency Cepstral Coefficients (MFCC) of each sound segment was computed as the feature input of the deep learning model. Three fully connected layers were used in the model to classify segments into three groups: (1) normal snore sounds, (2) abnormal (apnea or hypopnea) snore sounds, and (3) non-snore sounds. The results showed that the model achieved a classification accuracy of 85.2459%. The model is therefore promising for using snore sounds to screen for OSA.
{"title":"Obstructive Sleep Apnea Classification Using Snore Sounds Based on Deep Learning","authors":"Apichada Sillaparaya, A. Bhatranand, Chudanat Sudthongkong, K. Chamnongthai, Y. Jiraraksopakun","doi":"10.23919/APSIPAASC55919.2022.9979938","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979938","url":null,"abstract":"Early screening for the Obstructive Sleep Apnea (OSA), especially the first grade of Apnea-Hypopnea Index (AHI), can reduce risk and improve the effectiveness of timely treatment. The current gold standard technique for OSA diagnosis is Polysomnography (PSG), but the technique must be performed in a specialized laboratory with an expert and requires many sensors attached to a patient. Hence, it is costly and may not be convenient for a self-test by the patient. The characteristic of snore sounds has recently been used to screen the OSA and more likely to identify the abnormality of breathing conditions. Therefore, this study proposes a deep learning model to classify the OSA based on snore sounds. The snore sound data of 5 OSA patients were selected from the opened-source PSG- Audio data by the Sleep Study Unit of the Sismanoglio-Amalia Fleming General Hospital of Athens [1]. 2,439 snoring and breathing-related sound segments were extracted and divided into 3 groups of 1,020 normal snore sounds, 1,185 apnea or hypopnea snore sounds, and 234 non-snore sounds. All sound segments were separated into 60% training, 20% validation, and 20% test sets, respectively. The mean of Mel-Frequency Cepstral Coefficients (MFCC) of a sound segment were computed as the feature inputs of the deep learning model. Three fully connected layers were used in this deep learning model to classify into three groups as (1) normal snore sounds, (2) abnormal (apnea or hypopnea) snore sounds, and (3) non-snore sounds. The result showed that the model was able to correctly classify 85.2459%. Therefore, the model is promising to use snore sounds for screening OSA.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121203955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of Cognitive Test Results Using Concentration Estimation from Facial Videos
Terumi Umematsu, M. Tsujikawa, H. Sawada
Pub Date: 2022-11-07 | DOI: 10.23919/APSIPAASC55919.2022.9980211
In this paper, we propose a method of discriminating between concentration and non-concentration on the basis of facial videos, and we confirm the usefulness of excluding cognitive test results obtained when a user has not been concentrating. In a preliminary experiment, we confirmed that the level of concentration has a strong impact on correct answer rates in memory tests. Our proposed concentration/non-concentration discrimination method uses 15 features extracted from facial videos, including blinking, gazing, and facial expressions (Action Units), and discriminates between concentration and non-concentration using binary labels derived from subjectively rated concentration levels. In the preliminary experiment, memory test scores during non-concentration states were lower than those during concentration states by an average of 18%. This difference has usually been included as measurement error; by excluding scores obtained during non-concentration states with the proposed method, the measurement error was reduced to 4%. The proposed method is thus capable of obtaining test results that reflect true cognitive function when people are concentrating, making possible a more accurate understanding of cognitive functions.
{"title":"Evaluation of Cognitive Test Results Using Concentration Estimation from Facial Videos","authors":"Terumi Umematsu, M. Tsujikawa, H. Sawada","doi":"10.23919/APSIPAASC55919.2022.9980211","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980211","url":null,"abstract":"In this paper, we propose a method of discriminating between concentration and non-concentration on the basis of facial videos, and we confirm the usefulness of excluding cognitive test results when a user has not been concentrating. In a preliminary experiment, we have confirmed that level of concentration has a strong impact on correct answer rates in memory tests. Our proposed concentration/non-concentration discrimination method uses 15 features extracted from facial videos, including blinking, gazing, and facial expressions (Action Units), and discriminates between concentration and non-concentration, which are reflected in terms of a binary correct answer label set based on subjectively rated concentration levels. In the preliminary experiment, memory test scores during non-concentration states were lower than those during concentration states by an average of 18%. This has usually been included as measurement error, and, by excluding scores during non-concentration states using the proposed method, measurement error was reduced to 4%. The proposed method is shown to be capable of obtaining test results that indicate true cognitive functions when people are concentrating, making possible a more accurate understanding of cognitive functions.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121336943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}