Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement

IF 3.1 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Computer Speech and Language · Pub Date: 2023-12-26 · DOI: 10.1016/j.csl.2023.101605
Vijay Ravi, Jinhan Wang, Jonathan Flint, Abeer Alwan
{"title":"Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement","authors":"Vijay Ravi ,&nbsp;Jinhan Wang ,&nbsp;Jonathan Flint ,&nbsp;Abeer Alwan","doi":"10.1016/j.csl.2023.101605","DOIUrl":null,"url":null,"abstract":"<div><p>Speech signals are valuable biomarkers for assessing an individual’s mental health, including identifying Major Depressive Disorder (MDD) automatically. A frequently used approach in this regard is to employ features related to speaker identity, such as speaker-embeddings. However, over-reliance on speaker identity features in mental health screening systems can compromise patient privacy. Moreover, some aspects of speaker identity may not be relevant for depression detection and could serve as a bias factor that hampers system performance. To overcome these limitations, we propose disentangling speaker-identity information from depression-related information. Specifically, we present four distinct disentanglement methods to achieve this — adversarial speaker identification (SID)-loss maximization (ADV), SID-loss equalization with variance (LEV), SID-loss equalization using Cross-Entropy (LECE) and SID-loss equalization using KL divergence (LEKLD). Our experiments, which incorporated diverse input features and model architectures, have yielded improved F1 scores for MDD detection and voice-privacy attributes, as quantified by Gain in Voice Distinctiveness (<span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>V</mi><mi>D</mi></mrow></msub></math></span>) and De-Identification Scores (DeID). On the DAIC-WOZ dataset (English), LECE using ComparE16 features results in the best F1-Scores of 80% which represents the audio-only SOTA depression detection F1-Score along with a <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>V</mi><mi>D</mi></mrow></msub></math></span> of −1.1 dB and a DeID of 85%. On the EATD dataset (Mandarin), ADV using raw-audio signal achieves an F1-Score of 72.38% surpassing multi-modal SOTA along with a <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>V</mi><mi>D</mi></mrow></msub></math></span> of −0.89 dB dB and a DeID of 51.21%. By reducing the dependence on speaker-identity-related features, our method offers a promising direction for speech-based depression detection that preserves patient privacy.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":3.1000,"publicationDate":"2023-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230823001249/pdfft?md5=7acff7dbe3c70a9a6ae6cde978bd02e2&pid=1-s2.0-S0885230823001249-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230823001249","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Speech signals are valuable biomarkers for assessing an individual's mental health, including automatically identifying Major Depressive Disorder (MDD). A frequently used approach is to employ features related to speaker identity, such as speaker embeddings. However, over-reliance on speaker-identity features in mental health screening systems can compromise patient privacy. Moreover, some aspects of speaker identity may not be relevant for depression detection and could act as a bias factor that hampers system performance. To overcome these limitations, we propose disentangling speaker-identity information from depression-related information. Specifically, we present four distinct disentanglement methods: adversarial speaker identification (SID)-loss maximization (ADV), SID-loss equalization with variance (LEV), SID-loss equalization using cross-entropy (LECE), and SID-loss equalization using KL divergence (LEKLD). Our experiments, which incorporated diverse input features and model architectures, yielded improved F1 scores for MDD detection and improved voice-privacy attributes, as quantified by Gain in Voice Distinctiveness (GVD) and De-Identification Score (DeID). On the DAIC-WOZ dataset (English), LECE with ComparE16 features achieves the best F1-Score of 80%, the audio-only state-of-the-art for depression detection, along with a GVD of −1.1 dB and a DeID of 85%. On the EATD dataset (Mandarin), ADV on the raw audio signal achieves an F1-Score of 72.38%, surpassing the multi-modal state-of-the-art, along with a GVD of −0.89 dB and a DeID of 51.21%. By reducing the dependence on speaker-identity-related features, our method offers a promising direction for speech-based depression detection that preserves patient privacy.
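The abstract names the four disentanglement strategies but, as an abstract, does not spell out their formulations. The PyTorch sketch below illustrates, under loose assumptions, how two of those families could be wired into a multi-task model: ADV via a gradient-reversal layer on an auxiliary speaker-identification (SID) head, and LECE/LEKLD-style equalization that pushes the SID posterior toward a chance-level (uniform) distribution. The encoder choice, layer sizes, the loss weight alpha, and the exact equalization targets are illustrative assumptions, not the paper's reported implementation.

```python
# Minimal sketch of speaker disentanglement for depression detection.
# All architectural details (GRU encoder, dimensions, alpha) and the exact
# equalization targets are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient in the
    backward pass, so minimizing the SID loss downstream effectively maximizes
    it with respect to the shared encoder (the ADV idea)."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DisentangledDepressionModel(nn.Module):
    def __init__(self, feat_dim=130, hidden_dim=128, num_speakers=100):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.depression_head = nn.Linear(hidden_dim, 2)           # MDD vs. control
        self.speaker_head = nn.Linear(hidden_dim, num_speakers)   # auxiliary SID head

    def forward(self, x, adv_lambda=1.0, use_grl=True):
        _, h = self.encoder(x)                   # h: (1, batch, hidden_dim)
        emb = h.squeeze(0)                       # utterance-level embedding
        dep_logits = self.depression_head(emb)
        sid_input = GradReverse.apply(emb, adv_lambda) if use_grl else emb
        sid_logits = self.speaker_head(sid_input)
        return dep_logits, sid_logits


def total_loss(dep_logits, sid_logits, dep_labels, sid_labels, mode="ADV", alpha=0.1):
    """Depression cross-entropy plus a speaker-disentanglement term chosen by `mode`."""
    dep_loss = F.cross_entropy(dep_logits, dep_labels)
    log_probs = F.log_softmax(sid_logits, dim=-1)

    if mode == "ADV":
        # Plain SID cross-entropy; the gradient-reversal layer flips its gradient
        # so the encoder is pushed to remove speaker information.
        sid_term = F.cross_entropy(sid_logits, sid_labels)
    elif mode == "LECE":
        # Cross-entropy against a uniform (chance-level) speaker posterior.
        sid_term = -log_probs.mean(dim=-1).mean()
    else:  # "LEKLD"-style: KL divergence from the uniform distribution.
        uniform = torch.full_like(log_probs, 1.0 / log_probs.size(-1))
        sid_term = F.kl_div(log_probs, uniform, reduction="batchmean")
    return dep_loss + alpha * sid_term


if __name__ == "__main__":
    # Hypothetical shapes: 4 utterances, 200 frames, 130-dimensional frame features.
    model = DisentangledDepressionModel()
    x = torch.randn(4, 200, 130)
    dep_y, sid_y = torch.randint(0, 2, (4,)), torch.randint(0, 100, (4,))
    dep_logits, sid_logits = model(x, use_grl=True)          # ADV configuration
    loss = total_loss(dep_logits, sid_logits, dep_y, sid_y, mode="ADV")
    loss.backward()
    print(float(loss))
```

In the equalization modes (LECE/LEKLD-style), the gradient-reversal path would typically be disabled (use_grl=False), since the equalization term itself already discourages speaker-discriminative encodings; whether the paper configures it exactly this way is not stated in the abstract.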

Source Journal
Computer Speech and Language (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 11.30
Self-citation rate: 4.70%
Articles per year: 80
Review time: 22.9 weeks
Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.
Latest articles in this journal
Editorial Board
Enhancing analysis of diadochokinetic speech using deep neural networks
Copiously Quote Classics: Improving Chinese Poetry Generation with historical allusion knowledge
Significance of chirp MFCC as a feature in speech and audio applications
Artificial disfluency detection, uh no, disfluency generation for the masses