Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children

IF 4.1 2区 计算机科学 Q1 ACOUSTICS IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-09-18 DOI:10.1109/TASLP.2024.3463503
Si-Ioi Ng;Cymie Wing-Yee Ng;Jiarui Wang;Tan Lee
{"title":"Automatic Detection of Speech Sound Disorder in Cantonese-Speaking Pre-School Children","authors":"Si-Ioi Ng;Cymie Wing-Yee Ng;Jiarui Wang;Tan Lee","doi":"10.1109/TASLP.2024.3463503","DOIUrl":null,"url":null,"abstract":"Speech sound disorder (SSD) is a type of developmental disorder in which children encounter persistent difficulties in correctly producing certain speech sounds. Conventionally, assessment of SSD relies largely on speech and language pathologists (SLPs) with appropriate language background. With the unsatisfied demand for qualified SLPs, automatic detection of SSD is highly desirable for assisting clinical work and improving the efficiency and quality of services. In this paper, methods and systems for fully automatic detection of SSD in young children are investigated. A microscopic approach and a macroscopic approach are developed. The microscopic system is based on detection of phonological errors in impaired child speech. A deep neural network (DNN) model is trained to learn the similarity and contrast between consonant segments. Phonological error is identified by contrasting a test speech segment to reference segments. The phone-level similarity scores are aggregated for speaker-level SSD detection. The macroscopic approach leverages holistic changes of speech characteristics related to disorders. Various types of speaker-level embeddings are investigated and compared. Experimental results show that the proposed microscopic system achieves unweighted average recall (UAR) from 84.0% to 91.9% on phone-level error detection. The proposed macroscopic approach can achieve a UAR of 89.0% on speaker-level SSD detection. The speaker embeddings adopted for macroscopic SSD detection can effectively discard the information related to speaker's personal identity.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4355-4368"},"PeriodicalIF":4.1000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10683876/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

Abstract

Speech sound disorder (SSD) is a type of developmental disorder in which children encounter persistent difficulties in correctly producing certain speech sounds. Conventionally, assessment of SSD relies largely on speech and language pathologists (SLPs) with appropriate language background. With the unsatisfied demand for qualified SLPs, automatic detection of SSD is highly desirable for assisting clinical work and improving the efficiency and quality of services. In this paper, methods and systems for fully automatic detection of SSD in young children are investigated. A microscopic approach and a macroscopic approach are developed. The microscopic system is based on detection of phonological errors in impaired child speech. A deep neural network (DNN) model is trained to learn the similarity and contrast between consonant segments. Phonological error is identified by contrasting a test speech segment to reference segments. The phone-level similarity scores are aggregated for speaker-level SSD detection. The macroscopic approach leverages holistic changes of speech characteristics related to disorders. Various types of speaker-level embeddings are investigated and compared. Experimental results show that the proposed microscopic system achieves unweighted average recall (UAR) from 84.0% to 91.9% on phone-level error detection. The proposed macroscopic approach can achieve a UAR of 89.0% on speaker-level SSD detection. The speaker embeddings adopted for macroscopic SSD detection can effectively discard the information related to speaker's personal identity.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
自动检测粤语学龄前儿童的语音障碍
言语发声障碍(SSD)是一种发育障碍,儿童在正确发出某些言语声音时会遇到持续性困难。传统上,对 SSD 的评估主要依赖于具有相应语言背景的言语和语言病理学家(SLPs)。由于对合格语言病理学家的需求得不到满足,自动检测 SSD 对辅助临床工作、提高服务效率和质量非常有必要。本文研究了全自动检测幼儿 SSD 的方法和系统。本文开发了一种微观方法和一种宏观方法。微观系统基于检测受损儿童语音中的语音错误。通过训练深度神经网络(DNN)模型来学习辅音片段之间的相似度和对比度。通过将测试语音片段与参考片段进行对比来识别语音错误。电话级的相似性分数被汇总到扬声器级的 SSD 检测中。宏观方法利用与障碍有关的语音特征的整体变化。研究并比较了各种类型的扬声器级嵌入。实验结果表明,所提出的微观系统在电话级错误检测方面实现了 84.0% 到 91.9% 的非加权平均召回率(UAR)。所提出的宏观方法在扬声器级 SSD 检测方面的 UAR 可达到 89.0%。宏观固态硬盘检测所采用的说话人嵌入可以有效地剔除与说话人个人身份相关的信息。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
IEEE/ACM Transactions on Audio, Speech, and Language Processing
IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore
11.30
自引率
11.10%
发文量
217
期刊介绍: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.
期刊最新文献
CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection ELSF: Entity-Level Slot Filling Framework for Joint Multiple Intent Detection and Slot Filling Proper Error Estimation and Calibration for Attention-Based Encoder-Decoder Models
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1