Yan Pan, Jian Zhou, Huabin Wang, Wenming Zheng, Liang Tao, Hon Keung Kwan
Title: Enhancing bone-conducted speech with spectrum similarity metric in adversarial learning
Journal: Speech Communication, vol. 170, Article 103223 (JCR Q2, Acoustics; impact factor 3.0)
Published: 2025-03-13
DOI: 10.1016/j.specom.2025.103223
URL: https://www.sciencedirect.com/science/article/pii/S016763932500038X
Citations: 0
Abstract
Although bone-conducted (BC) speech offers the advantage of being insusceptible to background noise, its transmission path through bone tissue entails not only severe attenuation of high-frequency components but also speech distortion and the loss of unvoiced speech, resulting in a substantial degradation in both speech quality and intelligibility. Existing BC speech enhancement methods focus mainly on restoring high-frequency components but overlook the restoration of missing unvoiced speech and the mitigation of speech distortion, leaving a noticeable gap in speech quality and intelligibility compared to air-conducted (AC) speech. In this paper, an adversarial learning method based on a spectrum similarity metric is proposed for BC speech enhancement. The acoustic features corresponding to source excitation and filter response are disentangled using the WORLD vocoder and mapped to their AC speech counterparts with logarithmic Gaussian normalization and a vocal tract converter, respectively. To reconstruct unvoiced speech from BC speech and reduce its nonlinear distortion, the vocal tract converter predicts low-dimensional Mel-cepstral coefficients of AC speech using a generator that is supervised by a classification discriminator and a spectrum similarity discriminator. While the classification discriminator distinguishes between authentic AC speech and enhanced BC speech, the spectrum similarity discriminator evaluates the spectrum similarity between enhanced BC speech and its AC counterpart. To evaluate spectrum similarity, the correlation among time–frequency units over long-duration spectra is captured by a self-attention layer embedded in the spectrum similarity discriminator. Experimental results on various speech datasets show that the proposed method can restore unvoiced speech segments and reduce speech distortion, yielding accurate prediction of the fine-grained AC spectrum and thus significant improvements in speech quality and intelligibility.
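The abstract states that the source-excitation features are mapped to their AC counterparts with logarithmic Gaussian normalization, but gives no formula. The sketch below shows the form of that normalization commonly used in voice conversion, applied to a fundamental-frequency (F0) contour: voiced frames are shifted and scaled in the log domain so that their mean and standard deviation match those of the target. It is an illustration under that assumption, not the authors' implementation; the function names and the F0 values are hypothetical, and in practice the contours would come from a vocoder analysis such as WORLD.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0):
    """Return mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    log_f0 = [math.log(f) for f in f0 if f > 0]
    return mean(log_f0), pstdev(log_f0)


def convert_f0(f0_src, src_stats, tgt_stats):
    """Map a source F0 contour onto the target log-F0 distribution.

    Unvoiced frames (F0 == 0) are left at 0; voiced frames are
    normalized with the source statistics and rescaled with the
    target statistics in the log domain.
    """
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    converted = []
    for f in f0_src:
        if f > 0:
            z = (math.log(f) - mu_s) / sigma_s
            converted.append(math.exp(z * sigma_t + mu_t))
        else:
            converted.append(0.0)
    return converted


# Hypothetical F0 contours in Hz (0.0 marks an unvoiced frame).
bc_f0 = [0.0, 100.0, 110.0, 120.0, 0.0]
ac_f0 = [0.0, 105.0, 118.0, 131.0, 0.0]

converted = convert_f0(bc_f0, log_f0_stats(bc_f0), log_f0_stats(ac_f0))
```

After conversion, the log-F0 mean and standard deviation of the voiced frames in `converted` match those of `ac_f0`, while the voiced/unvoiced pattern of the BC input is unchanged. This only adjusts excitation statistics; the abstract assigns the recovery of unvoiced segments to the vocal tract converter.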
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.