Single-channel Speech Separation Based on Double-density Dual-tree CWT and SNMF
Md. Imran Hossain, Md. Abdur Rahim, Md. Najmul Hossain
Annals of Emerging Technologies in Computing, published 2024-01-01
DOI: 10.33166/aetic.2024.01.001 (https://doi.org/10.33166/aetic.2024.01.001)
Abstract
Speech is essential to human communication, so distinguishing it from noise is crucial. Speech separation becomes challenging in real-world conditions with background noise and overlapping speech. Moreover, speech separation based on the short-time Fourier transform (STFT) suffers from time and frequency resolution issues, and separation based on the discrete wavelet transform (DWT) suffers from time-variation issues. To address these issues, a new speech separation technique is presented based on the double-density dual-tree complex wavelet transform (DDDTCWT) and sparse non-negative matrix factorization (SNMF). The signal is first split into high-pass and low-pass frequency components using DDDTCWT decomposition. For this analysis, we consider only the low-pass components and zero out the high-pass ones. The STFT is then applied to each sub-band signal to generate a complex spectrogram. Next, SNMF factorizes the joint form of the magnitude and the absolute values of the real and imaginary (RI) components into basis and weight matrices. Most researchers enhance only the magnitude spectra, ignore the phase spectra, and reconstruct the separated speech using the noisy phase. As a result, some noise components remain in the estimated speech. In contrast, we process the signal's magnitude together with the RI components and estimate the phase from the RI parts. Finally, the separated speech signals are obtained using the inverse STFT (ISTFT) and the inverse DDDTCWT (IDDDTCWT). Separation performance improves because the phase component is estimated and because of the shift invariance, better directional selectivity, and design freedom of the DDDTCWT. On the TIMIT dataset, the proposed algorithm outperforms the NMF method with masking by 6.53–8.17 dB in signal-to-distortion ratio (SDR), 7.37–9.87 dB in signal-to-artifacts ratio (SAR), and 14.92–17.21 dB in signal-to-interference ratio (SIR).
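The factorization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `joint_features` and `sparse_nmf`, the Euclidean cost with an L1 penalty on the weights, the multiplicative update rules, and all parameter values are assumptions chosen for illustration, and a random complex matrix stands in for the STFT of one DDDTCWT sub-band.

```python
import numpy as np


def joint_features(X):
    """Stack |X|, |Re X| and |Im X| row-wise into one non-negative matrix.

    Hypothetical helper: one plausible reading of the "joint form" of
    magnitude and absolute RI components mentioned in the abstract.
    """
    return np.vstack([np.abs(X), np.abs(X.real), np.abs(X.imag)])


def sparse_nmf(V, rank, sparsity=0.1, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (non-negative) as W @ H with an L1 penalty on H.

    Assumed objective: 0.5 * ||V - W H||_F^2 + sparsity * sum(H),
    minimized with multiplicative updates (an assumption, not taken
    from the paper).
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps  # basis matrix
    H = rng.random((rank, n_cols)) + eps  # weight (activation) matrix
    for _ in range(n_iter):
        # The sparsity term in the denominator shrinks the weights.
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
        # Rescale so each basis column has (near) unit norm; the product
        # W @ H is unchanged, but W cannot grow to cancel the penalty.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H


# Stand-in for the complex spectrogram of one sub-band (16 bins, 40 frames).
rng = np.random.default_rng(1)
X = rng.standard_normal((16, 40)) + 1j * rng.standard_normal((16, 40))

V = joint_features(X)          # shape (48, 40): three stacked blocks
W, H = sparse_nmf(V, rank=4)   # W: (48, 4), H: (4, 40)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(V.shape, W.shape, H.shape, round(float(rel_err), 3))
```

In a full pipeline, `X` would come from the STFT of each low-pass sub-band, and the separated spectrograms would be passed back through the ISTFT and the inverse DDDTCWT; those stages are omitted here.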